How Can We Estimate RIR in Environments with Multiple Sound Sources?

2023.08.23 by Monica Lee






Greetings! I’m Monica from Gaudio Lab, a pioneering hub for audio and AI research.


The recent introduction of Apple's Vision Pro has sparked renewed enthusiasm for spatial technology. This development extends into the realm of audio: by understanding a user’s spatial surroundings, we can create a more immersive audio experience. Coincidentally, Gaudio Lab embarked on related research last year, and I am excited to share our insights.






Understanding Room Impulse Response (RIR): A Key Concept to Remember


If you’ve ever thought, “I want to make a sound seem as if it’s coming from a specific space!”, the key concept you need is the Room Impulse Response (RIR).


An RIR captures how an impulse signal, like the gunshot in Figure 1, reverberates within a given space. Any sound, when convolved with the RIR of a specific space, can be made to sound as if it originates from that space. Thus, we can describe an RIR as "data that contains invaluable information about that space".
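To make the convolution step concrete, here is a minimal sketch in plain Python. The values in `toy_rir` are made up for illustration (a direct sound followed by two weaker reflections) and are not measured data; a real RIR has thousands of samples, and in practice you would use an FFT-based convolution instead of nested loops.

```python
def convolve(signal, rir):
    """Full linear convolution of a dry signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(rir):
            out[n + k] += s * h
    return out


# Hypothetical toy RIR: direct sound, then two weaker, delayed reflections.
toy_rir = [1.0, 0.0, 0.5, 0.0, 0.25]

# Dry input: a single click, i.e. an idealized impulse.
click = [1.0, 0.0, 0.0]

wet = convolve(click, toy_rir)
print(wet)  # [1.0, 0.0, 0.5, 0.0, 0.25, 0.0, 0.0]
```

Because the input is an impulse, the output simply reproduces the RIR, which is exactly why an impulse-like sound is used to measure it. Any other dry signal would come out as a sum of delayed, scaled copies of itself.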



Figure 1 – Source: https://www.prosoundweb.com/what-is-an-impulse-response/  



So, how can we estimate RIR?


The most accurate method to obtain RIR data involves direct measurements with a microphone in the targeted area. However, this method, while precise, is often cumbersome, requiring both specialized equipment and considerable time. Moreover, physical barriers might restrict access to certain spaces. Fortunately, the evolution of machine learning presents alternative RIR prediction techniques. For instance, one line of research aims to predict the RIR of a space using only a sound recording (e.g., a human voice) made in that space.




Can TWS Earbuds Record Surrounding Sounds to Predict RIR in Real-Time?


For a genuinely immersive augmented reality (AR) experience, it’s essential that virtual auditory cues seamlessly blend with the user’s physical environment. This demands an accurate understanding of the user’s immediate surroundings. Our research explores the potential of using True Wireless Stereo (TWS) earbuds to record ambient sounds, which are then analyzed through machine learning.


When predicting real-time auditory experiences within a user’s environment, it’s crucial to consider that multiple sound sources, such as various individuals and objects, will contribute to the soundscape. In contrast, much of the previous research has been tailored towards predicting Room Impulse Response (RIR) from audio cues of a solitary entity, or a “single source.”


While on the surface, RIR prediction for single and multiple sources might appear analogous, they necessitate distinct approaches and considerations. This divergence is due to the inherent variability in RIR measurements, even within the same space, based on the exact positioning and orientation of individual sound sources. While commonalities exist due to the shared environment, specific nuances in measurements are inevitable.  



Let’s only predict the RIR in front of us!


Predicting the RIR in environments with multiple sound sources can be challenging. At Gaudio Lab, we tailored our approach to align with anticipated scenarios for our upcoming products. Recognizing the needs of TWS users, we determined that predicting the RIR for sources directly in front of the user should be prioritized. As depicted in Figure 2, even when multiple sounds are recorded, our primary objective remains to estimate the RIR for a sound source situated 1.5 meters directly in front of the user.




Figure 2 – The user is at the center, surrounded by diverse sound sources being emitted from different positions (grey circles).

Despite the presence of multiple sound sources, our model consistently assumes a virtual sound source (blue circle) located 1.5 meters ahead and aims to predict its RIR.



Our AI model’s architecture takes cues from a relatively recent study. At its essence, the model takes in sound from a designated environment and outputs the associated RIR. Most previous studies primarily used datasets from single-source sound environments (Figure 3, Top). In contrast, our approach incorporates multi-source datasets (Figure 3, Bottom).


We constructed our dataset by convolving selected RIRs from Room A with an anechoic speech signal. The model outputs a single monaural RIR corresponding to the user’s frontal perspective. We designed a loss function so that the model’s output closely matches the true RIR.
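As a rough illustration of how one multi-source training example could be assembled, the sketch below convolves each dry signal with the RIR of its source position, sums the results into one recording, and pairs that recording with the frontal RIR as the label. Everything here is an assumption for illustration: the function names, the toy numbers, the use of one dry signal per source, and the plain mean squared error, which only stands in for the loss function we actually used.

```python
def convolve(signal, rir):
    """Full linear convolution of a dry signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(rir):
            out[n + k] += s * h
    return out


def build_example(dry_signals, source_rirs, frontal_rir):
    """Return (mixture, target) for one hypothetical training example.

    dry_signals: anechoic signals, one per sound source
    source_rirs: RIRs of the same room, one per source position
    frontal_rir: RIR of the virtual source in front of the listener (the label)
    """
    wet = [convolve(s, h) for s, h in zip(dry_signals, source_rirs)]
    mixture = [0.0] * max(len(w) for w in wet)
    for w in wet:
        for i, value in enumerate(w):
            mixture[i] += value
    return mixture, list(frontal_rir)


def mse(predicted, target):
    """Mean squared error, a simple stand-in for the real training loss."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target RIRs must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Two toy sources in the same room, each with its own (made-up) RIR.
dry = [[1.0, 0.0], [0.0, 1.0]]
rirs = [[1.0, 0.5], [0.5, 0.25]]
frontal = [1.0, 0.3]

mixture, target = build_example(dry, rirs, frontal)
print(mixture)               # [1.0, 1.0, 0.25]
print(mse(target, frontal))  # 0.0
```

The point to notice is that the label does not depend on how many sources are in the mixture: the model is always asked for the same frontal RIR.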




Figure 3 – Traditional studies predominantly utilized single-source environment data for training (top figure), whereas our method emphasizes datasets from multi-source settings (bottom figure).



The essence of AI system development lies in data collection, a phase in which we invested significant time and effort. While RIR data is plentiful, datasets that capture multiple RIRs within a single environment are scarce. Directly recording from countless rooms was a daunting prospect. To address this, we turned to open-source tools, enabling us to generate and utilize synthetic datasets effectively.
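To give a feel for what "synthetic" RIR data can look like, here is a deliberately simple sketch. It is not the open-source tooling we used; it relies on a common statistical approximation in which the reverberant tail is modeled as noise under an exponentially decaying envelope that falls by 60 dB after the reverberation time (RT60). The function names and parameter values are hypothetical.

```python
import random


def decay_envelope(t, rt60):
    """Amplitude envelope that has dropped by 60 dB at t == rt60 (seconds)."""
    return 10.0 ** (-3.0 * t / rt60)


def synthetic_rir(rt60, sample_rate=16000, duration=0.5, seed=0):
    """Toy RIR: a direct-sound impulse followed by decaying Gaussian noise."""
    rng = random.Random(seed)
    n = int(sample_rate * duration)
    rir = [
        decay_envelope(i / sample_rate, rt60) * rng.gauss(0.0, 1.0)
        for i in range(n)
    ]
    rir[0] = 1.0  # direct sound arrives first
    return rir


def energy(samples):
    return sum(x * x for x in samples)


dry_room = synthetic_rir(rt60=0.2)   # short reverberation
live_room = synthetic_rir(rt60=0.8)  # long reverberation

# The more reverberant room keeps more energy in the late part of the RIR.
half = len(dry_room) // 2
print(energy(dry_room[half:]) < energy(live_room[half:]))  # True
```

A toy model like this ignores room geometry and source position, which is precisely the information a multi-source dataset needs, so a geometry-aware simulator is the more appropriate choice for real training data.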




The Outcome? 


Figure 4 – Comparative analysis between the conventionally studied Single Source model (SS model) and our proposed Multi-Source model (MS model).

An uptick in both Loss and Error values indicates a decrease in performance.



When comparing the traditionally trained model using a single sound source (SS model, illustrated in blue) to our approach, which incorporates multiple sound sources (MS model, depicted in pink), distinct performance differences emerge. As highlighted in Figure 4, the performance of the single-source model degrades as the number of sound sources increases from one to six. In stark contrast, our multi-source model maintains its RIR prediction performance regardless of the growing number of sound sources.


In real-world settings, it’s often challenging to determine the exact number of sound sources in a user’s vicinity. Therefore, a model like ours, capable of predicting the RIR of a space regardless of the number of sound sources, promises a more immersive auditory experience.



Curious about this research?


Based on our findings, I developed a system to predict RIRs in real-time from sounds recorded directly in three different office spaces of Gaudio Lab. Each space had its unique characteristics, but the model was able to reliably predict the RIR reflecting that space! During our auditory evaluation with the Gaudio team, the majority remarked, “It truly sounds like it’s coming from this space!” 😆


This research is also scheduled to be presented at the AES International Conference on Spatial and Immersive Audio in August 2023 (NOW!). If you’re interested, please check out the link for more details!





ICML Paper Preview: A demand-driven perspective on Generative Audio AI 

(Writer: Rio Oh)

Greetings. I am Rio, a researcher at Gaudio Lab’s AI Research Team, dedicated to the development of the sound generation model, FALL-E. I have a keen interest in generative models, and my recent work involves the exploration of how generative model strategies can be applied to diverse tasks.

Our team has been reflecting on the essential areas that require enhancements for real-world, industrial applications. We aim to present these insights at the upcoming ICML workshop, and I’m pleased to provide a sneak preview of our findings!

Introduction

Through our blog, we’ve previously highlighted Gaudio Lab’s achievements in the DCASE Challenge.

We’re thrilled to share another significant milestone from Gaudio Lab: the acceptance of our paper for presentation at the ICML workshop (Hooray🥰).

ICML, in conjunction with NeurIPS, is recognized as one of the world’s leading AI conferences, attracting considerable global attention. Over the final two days of the conference period, thematic workshops are organized. Only papers that have passed a stringent double-blind peer review process are selected for presentation at these sessions.

Generative Audio AI: Pioneering the Field with Gaudio Lab

Comparatively, audio generation (excluding speech) is still in its infancy when contrasted with the text and image sectors. Looking beyond the familiar domain of text into images, a range of commercial and non-commercial services are already utilizing Diffusion models such as DALL-E and are easily accessible to the public.

However, audio lags behind with no public services yet launched, primarily due to technological maturity and computational resource constraints. (While sharing of demos and models for disseminating experimental results from papers is gradually increasing, the availability of services that the general public can utilize is virtually non-existent.)

In this challenging environment, Gaudio Lab aspires to create AI products that transcend mere demos, potentially revolutionizing existing paradigms. We’ve embarked on a process to evaluate and address the prevailing challenges and limitations. This important endeavor, aimed at tuning into the industry’s voice, seeks to bring research-stage audio AI products into mainstream visibility.

Through this, Gaudio Lab intends to stay connected with the broader industry context and its processes (while maintaining a strong focus on research), striving to refine our future research directions with greater precision.

We are excited to present our insights and learnings at the 2023 Challenges in Deployable Generative AI workshop! (Scheduled for: Fri 28 Jul, 9 a.m. HST & Sat 29 Jul, 4 a.m. KST)

[image = workshop poster]

Hold on, what exactly is Gaudio Lab’s FALL-E mentioned in the "Motivations" section?

FALL-E is an innovative technology from Gaudio Lab, utilizing AI-based Text-to-Sound Generation to create sounds in response to text or image inputs. It can generate not only actual real-world sounds, such as a cat’s cry or thunder, but also limitless virtual sounds, for instance, the sound of a tiger smoking – an imaginative representation of an unreal scenario.

This ability significantly broadens the horizons for content creation through sound. The produced sounds can serve as sound effects and background noises during the development of content and virtual environments. Accordingly, FALL-E is projected to be an essential sound technology in all environments offering immersive experiences.

Allow me to explain more about FALL-E! Did the name give you a hint? It’s also an allusion to Foley sound.

Foley is a technique used in film post-production to recreate sound effects. For instance, creating the sound of horse hooves by alternately hitting two bowls on the ground. The term originated in the 1930s, named after Jack Foley.

Foley sound creation is vital in content production. However, reusing recorded sounds and associated economic inefficiencies remain challenges, resulting in a continued reliance on manual work.

As a result, addressing this issue with a generative model presents a promising approach – an approach Gaudio Lab is focused on.

So, what were the challenges we encountered during this research?

In preparing this paper, Gaudio Lab conducted a survey involving professionals from the film sound industry, and the responses were incorporated into our research. To provide a brief overview of our findings, the most significant hurdles were: 1) the need for superior sound quality, and 2) the ability to exercise detailed control. [Link to the full paper]

How exactly did Gaudio Lab overcome these challenges during FALL-E’s development?

One significant hurdle was the scarcity of clean, high-quality data. This issue was compounded by the large quantity of data needed by generative models. Gaudio Lab’s solution was to use both clean and relatively noisier data in tandem, incorporating specific conditions into the model.

Generative models draw from more than just the samples they aim to create; they also use a variety of supplementary information, such as text, categories, and videos, as learning inputs. We added another layer of context to this process by tagging each dataset based on its source.

This method enabled the model to decide whether to generate clean or noisy sounds during the production process. Indeed, during our participation in the DCASE Challenge, our model earned praise for its ability to produce a wide array of sounds while maintaining high audio quality.

The competition involved selecting the top four contenders per track using an objective evaluation metric (FAD), which was followed by a listening evaluation. As evidenced by the results, Gaudio Lab scored highly in categories assessing both clean sound quality and the ability to generate a diverse range of sounds.

These impressive results were achieved even though we only participated in certain categories of the competition with our FALL-E model, which is capable of generating all types of sounds.

[image = DCASE 2023 Challenge Task 7 Results] :: You can see more details here.

Although FALL-E could be considered the best-performing model in terms of sound quality among those currently available, we do not intend to stop here. Gaudio Lab is continuously exploring ways to develop models that can generate even more superior sounds.

In truth, the birth of FALL-E was not without its trials and tribulations.

When Gaudio Lab first conceived the idea of FALL-E and initiated the research in 2021, there were scarcely any scholarly papers addressing text-based AI Foley synthesis models. Furthermore, research into video-based sound effects was somewhat limited, and the performance of such models didn’t seem particularly promising. (It is worth mentioning that the landscape has significantly changed since then with a plethora of relevant research now available.)

At times, we found ourselves pondering the direction of our research. Yet, it was Keunwoo’s (Link) effective leadership that helped harness the team’s collective energy. He leveraged the accumulated knowledge and experiences we had gathered over time, leading us to press forward despite initial doubts and concerns. Looking back, I can’t help but think that this continuous process of defining and adjusting our direction was a crucial step in discovering the ‘right path.’

As a team, we enthusiastically threw ourselves into the project, pausing occasionally to fine-tune our trajectory. We remained flexible, adjusting our course in response to new challenges, and ultimately reaching our intended goal. This, I believe, is an apt representation of Gaudio Lab’s AI Research team’s approach to work.

Before we knew it, we were not only organizing the DCASE but also participating in it. We were thrilled to achieve excellent results, even though our participation was rather casual. Ultimately, I find myself in Hawaii. Initially, the decision to visit Hawaii was taken with the idea of expanding our horizons, even if our paper was not accepted. However, our paper’s acceptance at the ICML conference has made this trip all the more meaningful. I look forward to fully immersing myself in this exciting and fruitful journey before returning to Korea.

And on that note, I conclude my updates from Hawaii🏝!

Introducing the first audiobook featuring sound created by AI! Wait, are you telling me AI made this

Looking for that perfect sound? AI can now create it for you!

Hi, I'm Bright, a sound engineer at Gaudio Lab 🌟!

For sound engineers who need to produce a range of sounds, creating sound effects is often a tedious job that can eat up nearly a third of the workday. It’s like searching for a needle in a haystack, or in our case, trying to find a “warm iced coffee” or a design that’s both modern and classic. The odds of finding the ideal sound with just one search are pretty low.

So, people like me, let’s call us “sound nomads,” used to spend our days digging through extensive sound libraries to find the right effect for each scene. But that was then, and this is now.

In this age of advanced AI like ChatGPT, we asked ourselves, “Why can’t AI generate sound?” Now, we’re thrilled to introduce Generative Sound AI FALL-E! (*applause* 👏🏻)

And guess what? This year, FALL-E’s creations were featured in an audiobook for the first time. Curious about how that happened?

Introducing the first audiobook enhanced with sound effects created by generative AI!

Gaudio Lab is providing its technology to a special summer thriller collection called <Incident Reports>. The audiobook is already making waves, especially since it’s directed by Kang Soo-Jin, a famous voice actor known for iconic roles like Detective Conan and Sakuragi Hanamichi from the Slam Dunk series. We’ve used Gaudio Lab’s spatial audio tech to really bring out the thriller vibes, and reviews say it makes the experience super immersive and real. (I can’t help but be proud, I worked on this!)

But here’s the twist: this year’s <Incident Reports> doesn’t just feature Gaudio Lab’s audio technology. We’ve also thrown in sound effects generated by FALL-E, a state-of-the-art sound AI. That makes it the very first audiobook to use AI-produced sound effects!

Even as a seasoned sound engineer, I was amazed by the sound quality that the AI managed to deliver. I found myself wondering, “Can these AI-created sounds really match up to recorded ones?” And you know what? They absolutely can.

Curious? Want to hear for yourself? (Here are some generated sounds of thunder and lightning, for example.)

What did you think? I have to say, I was completely blown away when I first listened to the sample. Honestly, as a sound engineer, the thought of AI creating sounds had me a bit concerned about the potential quality. But you’ve heard it yourself, right? It delivers such high-quality sound that you can’t tell the difference from real recordings. And what sets FALL-E apart is its ability to create just about any sound, unlike other generative AIs that are limited to certain noises. (And can you believe it was developed with over 100,000 hours of training data?)

Don't miss the behind-the-scenes video!

Let me dive a bit deeper into the <Incident Reports> for you. (And don’t miss the behind-the-scenes details in the above video!)

Taking the piece titled <Baeksi (白視) - The End of the Snowstorm> as an example from this project, the narrative unfolds on a snowy mountain in the middle of a ferocious blizzard. Now, within this setting, you’ve got several elements, or let’s call them sound objects, like snow plows, the blizzard, and avalanches making an appearance. Under normal circumstances, I would’ve been searching libraries and comparing tons of sounds, spending countless hours to match these elements. Also, sounds of a snow plow or an avalanche aren’t something you come across every day, so creating these noises would have been quite difficult. I would likely have been digging through sound effect libraries and possibly blending various sounds from different sources to get it just right.

However, when I gave FALL-E a prompt to generate these sounds, it managed to create these sound effects for me.

Creating Unique Sounds with Generative AI

Using FALL-E, a generative AI model, we can craft brand new sounds just by entering a simple prompt. It’s not only saved me a massive amount of time (no more sifting through libraries for that perfect sound!), but it also generates a unique sound every time, helping me create that one sound in the world I’ve always wanted.

I have a soft spot for the sound of rain, and I’ve actually created several different versions of my favorite rain sounds. Want to hear how they change with each new prompt during the generation process? Give this a listen:

Wrapping up...

While working on sound projects, I find myself chatting with FALL-E, a remarkable AI that generates sounds, instead of spending tedious hours of mouse clicks searching through sound effect libraries. It’s honestly quite surprising to be living in this day and age, an idea I only dreamed about during my early days as a sound engineer. I’ve always hoped for a tool that would let me create the exact sounds I imagined instantly with each project.

Currently, as a sound engineer at Gaudio Lab, I’m bubbling with excitement every day, always looking forward to the day when many more people will get to experience FALL-E. ☺️ And don’t worry, we are constantly working on and refining FALL-E to make it even better for everyone. Hang tight, we’re excited to share it with you all very soon!