
DCASE 2023: Gaudio Lab paved the way as always in ‘AI Olympics for Sound Generation’

2023.06.12 by Luke Bae

Introduction to DCASE

 

DCASE is an esteemed international data challenge in the field of acoustics, with prestigious institutions from around the globe participating.

 

In the world's inaugural AI sound generation challenge, Gaudio Lab not only pioneered the Foley Sound Synthesis task of the DCASE (Detection and Classification of Acoustic Scenes and Events) Challenge as its organizer, but also managed to secure second place overall, despite participating light-heartedly.

 

Launched in 2013 and now in its ninth edition, the DCASE competition holds a stature equivalent to the 'Olympics' of the Sound AI field. Coinciding with the advent of the AI era, a sound generation task was introduced for the first time in this edition of the competition. The contest drew participation not just from global corporations like Gaudio Lab, Google, Sony, Nokia, and Hitachi, but also from renowned universities such as Carnegie Mellon University, the University of Tokyo, Seoul National University, and KAIST, making it a stage brimming with cutting-edge competition in the Sound AI sector. With 123 teams applying across seven tasks and a total of 428 submissions received, the intensity of the competition was clear.

 

 

World's First AI Sound Generation Challenge: Foley Sound Synthesis Challenge

 

The 'Foley Sound Synthesis' task, a new entrant in the field of generative AI, garnered substantial interest this year. The task involved generating sounds belonging to specific categories (such as cars or sneezes) using AI technology and data. Given our extensive experience in this area, Gaudio Lab served as the organizer, setting the direction for the task. Despite our light-hearted participation, we accomplished the feat of securing second place. Notably, in the evaluation of 'sound diversity', considered an essential metric from a commercialization perspective, Gaudio Lab scored significantly higher than our direct competitors.

 

[Figure 1] Overview and Organizers of the DCASE 2023 Foley Sound Synthesis Challenge

 

 

Remarks on DCASE 2023

 

You might be curious as to how Gaudio Lab, a small Korean startup, was able to organize this competition and even stand tall on the podium among renowned global corporations and world-class universities. This success can be attributed to Gaudio Lab's foresight and early beginnings in the field of generative AI research and development, coupled with the relentless efforts of our AI researchers working diligently behind the scenes. Now that we've boasted enough, let's turn the spotlight to the real heroes.

 

[Figure 2] DCASE Ranking Announcement Screen; 'Chon_Gaudio' denotes the results submitted by Gaudio Lab.

 

What does DCASE mean to Gaudio Lab?

 

Ben Chon: Gaudio Lab embarked on a journey of researching and developing sound generation AI with the ambitious goal of reproducing all sounds in the world as early as 2021, well before ChatGPT became a sensation (see [Figure 4] for reference). After extensive research, we achieved Category-to-Sound generation in June 2022, a concept akin to this DCASE challenge. Since then, we've dedicated ourselves to the more ambitious objectives of (arbitrary) Text-to-Sound and (arbitrary) Image-to-Sound research to attain a commercial-level implementation outside of the lab, where significant strides have already been made. We ultimately envision our Video-to-Sound generation model to become an indispensable solution in every scenario where sound is needed - not only in traditional media such as movies and games but also in next-generation media platforms like the metaverse, by creating an apt sound for any form of input.

 

[Figure 3] Evolution of Generative Sound AI, Gaudio Lab is at Difficulty Level 3

 

Compared to Gaudio Lab's AI, whose goal is to generate all conceivable sounds, the Category-to-Sound model mandated by DCASE is somewhat restrictive, narrowing the scope of sound generation to a handful of categories. As a result, this task felt like a small playground for Gaudio Lab's technological prowess. More than 30 systems were submitted to this task alone.

 

There were moments of solitude, questioning if we were the only ones spearheading this niche. However, being the organizers of the competition allowed us to stimulate research in this field, and in the process, reaffirm the global stature of our technology, which has been a meaningful experience. As we are at the forefront of commercialization, we intend to sustain our leadership in this market by learning from the research outcomes of other participants.

 

[Figure 4] The cover of the materials from the kickoff meeting for SSG (Sound Studio Gaudio), Gaudio Lab's sound generation AI project, a truly legendary beginning

 

What were the most significant challenges in preparing for DCASE?

 

Keunwoo Choi : As Gaudio Lab was the principal organizer in this field, the most challenging aspect was balancing the roles of international competition organizer and Gaudio Lab's Research Director. Given that Foley Sound Synthesis was being introduced to DCASE for the first time, we endeavored, as organizers, to set a positive benchmark by curating a fair and academically meaningful competition.

Concurrently, in my capacity as Gaudio Lab's Research Director, I had to orchestrate and implement a comprehensive research plan while allocating limited computational resources, a task that felt like solving an intricate puzzle. To distribute human and GPU resources efficiently, I developed detailed charts to optimize the workload. In retrospect, now that the competition has concluded successfully, the experience seems invaluable.

 

Rio Oh : Although the entire process was complex, training the LM (Language Model) based model was particularly difficult. The overall process was strenuous, mainly because the outcomes didn't always match the amount of effort we put in.

 

 

What was the most memorable moment during the preparation for DCASE?

 

Manuel Kang : The most memorable moment was when our AI successfully generated a realistic animal sound for the first time in June 2022. I was very proud to see our initial model, which at first could not produce any usable sound, gradually improve to that point.

 

Monica Lee : Indeed, the moment our model produced a genuine animal sound for the first time remains unforgettable. When I played the artificially generated puppy sound at home, my pet dog, Sabine, reacted by barking and appearing confused. It seems we effortlessly passed the puppy Turing test! (Haha)

 

Rio Oh : Our generation model underwent numerous updates during the preparation process. Each time the model operated as planned without any glitches, it was immensely satisfying. Among these moments, I remember most vividly when we became able to control aspects like background noise and the recording environment to our liking.

 

Devin Moon : Performing optimizations to capture subtle nuances in sound through prompt engineering was exciting. I distinctly remember when we generated the sound of someone running quickly across a creaky wooden floor in a resonant space. The generated sound was so realistic that it was hard to distinguish from the real thing.

 

 

What sets Gaudio Lab's generative AI apart?

 

Ben Chon : A key distinction of Gaudio Lab's AI lies in its capabilities that surpass the original parameters of the Category-to-Sound task. The AI can generate virtually any sound, extending from Text-to-Sound to Image-to-Sound. In simpler terms, while our model has the potential to create a vast array of sounds, it was constrained to generate sounds from certain categories in the context of the competition. This situation is similar to a marathon runner participating in a 100-meter sprint.

In reality, our AI can synthesize nearly every conceivable sound, ranging from various animal noises to the ambient sounds of an African savannah teeming with hundreds of species. Moreover, its ability to isolate and generate the sound of a single object without noise interference offers significant advantages when the technology is used directly in content production for films and games.

 

Keunwoo Choi : In order to develop a high-performing and versatile model, we dedicated significant effort from the outset to data collection, arguably the most crucial aspect of AI development. We systematically gathered all possible data worldwide and supplemented gaps in information with the assistance of AI tools, such as ChatGPT. Our aim was to accumulate the best quality data possible. A key initiative in our data collection strategy was the acquisition of 'Wavelab,' a top-tier film sound studio in South Korea. This step enabled us to secure high-quality data. Furthermore, our generative model's design sets Gaudio Lab's AI apart. Our model deviates from traditional AI models that specialize in music or voice, and is designed to create a wide spectrum of sounds or audio signals.

 

 

Would you mind sharing your thoughts on behalf of your team about this achievement?

 

Ben Chon: Gaudio Lab has transcended the constraints of the DCASE task to develop a Text-to-Sound model capable of producing virtually any sound. Recognition from DCASE, which operates within a select range of sound categories, is a strong testament to the maturity of Gaudio Lab's AI development capabilities, bringing us closer to a truly 'universal' sound model.

 

Our indirect validation of world-class quality across diverse sound categories, some not even covered by DCASE, gives us greater confidence for future research. I believe our team has achieved something truly remarkable. Kudos to all the researchers at Gaudio Lab for their hard work!

 

Keunwoo Choi : It is extremely rewarding to see the fruits of our continuous research and development in the vast field of generative audio AI. DCASE held its first generative audio challenge, which was relatively straightforward in terms of problem definition. However, our system was already performing well with far more intricate text prompts. I hope we can further develop and commercialize this technology, which possesses infinite possibilities, to cause a significant ripple effect in the audio industry.

 

 

Could you please share your future aspirations or vision?

 

Ben Chon : We believe it's crucial for Gaudio Lab's generative AI to secure practical use cases in the real industry, in addition to making an impact in the academic realm. Having participated in DCASE, our generative AI has grown beyond the Text-to-Sound capability to effectively handle Image-to-Sound tasks. We're also considering expanding our scope to include Video-to-Sound. With technology advancing at an astonishing pace, it's time for us to evolve our focus towards impacting people's lives directly by integrating our technology in real-world industry applications. In fact, our efforts are already bearing fruit, with ongoing discussions about potential collaborations with companies in forward-looking sectors such as film production and the metaverse.

I am eager to ensure that Gaudio Lab leads both technological advancement and commercialization, aiming for a future where we stand at the heart of global sound production. We appreciate your continued interest and support for Gaudio Lab's AI technology!

 

 

In closing,

 

I am truly delighted to announce that the relentless efforts of the Gaudio Lab researchers, who have silently navigated uncharted territories, can now be proudly showcased on the global stage. Until we realize our vision of "All sounds in the world originating from Gaudio Lab," we ask for your continued interest and support for Gaudio Lab's AI technology.

Previous post
Is there AI that creates sounds? : Sound and Generative AI

The Surge of Generative AI Brought by ChatGPT

(Writer: Keunwoo Choi)

The excitement surrounding generative AI is palpable, as evidenced by the widespread integration of tools like ChatGPT into our daily lives. It feels like we're witnessing the dawn of a new era, similar to the early days of smartphones. The enthusiasm for generative AI, initially sparked by ChatGPT, has since spread to other domains, including art and music.

Generative AI in the Field of Sound

Sound is no exception; generative AI has made significant advances, particularly in AI-based voice synthesis and music composition. However, when we consider the sounds that make up our everyday environments, voices and music are only a small part of the equation. It's the little sounds, like the clicking of a keyboard, the sound of someone breathing, or the hum of a refrigerator, that truly shape our auditory experiences. Without these sounds, even the most finely crafted voices and music would fail to fully capture the essence of our sonic world.

Despite their significance, we have yet to see an AI that can generate the multitude of sounds that make up our sonic environment, which we'll refer to as 'Foley sounds' for the sake of explanation. The reason for this is quite simple: it's dauntingly challenging. To generate every sound in the world, an AI must have access to data representing every sound in the world. This requires the consideration of numerous variables, making the task incredibly complex.

Knowing the difficulties involved, Gaudio Lab has taken up the challenge of generating all of the sounds in the world. In fact, we began developing an AI capable of doing so in 2021, even before generative AI became a hot topic. Without further ado, let's listen to the demo we have created.

AI-Generated Sounds vs. Real Sounds

How many right answers did you get? As you could hear in the demo, the quality of AI-generated sound now easily surpasses common expectations. Finally, it is time to uncover how these incredibly realistic sounds are generated.

Gaudio Lab generates sounds through AI

Visualizing Sounds: Waveform Graph

Before we can delve into the process of generating sounds with AI, it's essential to understand how sounds are represented. You may have come across images like this before:

[Figure: waveform of a sound over time]

The graph above illustrates the waveform of a sound over time, which enables us to estimate when and how loud a sound occurs, but not its specific characteristics.

Visualizing Sounds: Spectrogram

The spectrogram was developed to overcome these limitations.

[Figure: spectrogram of the same sound]

At first glance, the spectrogram already appears to contain more information than the previous graph. The x-axis represents time, the y-axis represents frequency, and the color of each pixel indicates the amplitude of the sound. Essentially, a spectrogram can be seen as the DNA of a sound, containing all the information about it. Therefore, if a tool can convert a spectrogram into an audio signal, creating sound becomes equivalent to generating an image. This simplifies many tasks, as it allows for the use of diffusion-based image generation algorithms similar to the one employed by OpenAI's DALL-E 2.

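To make the waveform-versus-spectrogram distinction concrete, here is a minimal sketch that computes both representations from one clip; it assumes the open-source librosa library and a placeholder file name, and it is an illustration only, not Gaudio Lab's actual tooling.

```python
# Compute the two representations discussed above for one audio clip.
# Assumes the open-source librosa library; "thunder.wav" is a placeholder.
import librosa
import numpy as np

waveform, sr = librosa.load("thunder.wav", sr=16000, mono=True)

# Waveform: amplitude over time. It shows when and how loud a sound is,
# but says little about its character.
print(f"waveform: {waveform.shape[0]} samples at {sr} Hz")

# Mel spectrogram: time on the x-axis, frequency on the y-axis, and each
# value giving the energy in that time-frequency bin.
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(f"spectrogram: {log_mel.shape[0]} mel bands x {log_mel.shape[1]} frames")
```
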
Now, do you see why we explained the spectrogram? Let's take a closer look at the actual process of creating sound using AI.

Creating Sound with AI: The Process Explained

Step 1: Generating Small Spectrograms from Text Input

The first step in creating sound with AI involves processing the input that describes the desired sound. For example, when given a text input such as "Roaring thunder", the diffusion model generates a small spectrogram from random noise. This spectrogram is a 16x64-pixel image, representing 16 frequency bands and 64 frames. Although it may appear too small to be useful, even a small spectrogram can contain significant information about a sound.

Step 2: Super Resolution

The image then undergoes a 'Super Resolution' phase, in which the diffusion model iteratively improves the resolution through multiple stages, resulting in a clear and detailed spectrogram like the one shown earlier.

Step 3: Vocoder

The final step involves converting the spectrogram into an audio signal using a vocoder. However, most vocoders available on the market are designed to work with voice signals, making them unsuitable for a wide range of sounds. To address this limitation, Gaudio Lab developed its own vocoder, which has achieved world-class performance. Furthermore, Gaudio Lab plans to release this vocoder as an open-source tool in the first half of 2023.

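The three steps above form a simple cascade: text prompt to coarse spectrogram, coarse spectrogram to detailed spectrogram, detailed spectrogram to waveform. The sketch below shows that flow with placeholder functions; the actual diffusion models and vocoder are not public, so every name and implementation here is hypothetical.

```python
import numpy as np

# Hypothetical stand-ins for the three stages described above; none of this
# reflects Gaudio Lab's actual models or APIs.

def generate_small_spectrogram(prompt: str) -> np.ndarray:
    """Step 1: a text-conditioned diffusion model denoises random noise into
    a coarse 16x64 spectrogram (16 frequency bands, 64 frames)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal((16, 64))              # placeholder output

def super_resolve(spec: np.ndarray, stages: int = 3) -> np.ndarray:
    """Step 2: successive diffusion stages refine the coarse spectrogram
    (here approximated by naive 2x upsampling per stage)."""
    for _ in range(stages):
        spec = spec.repeat(2, axis=0).repeat(2, axis=1)
    return spec

def vocode(spec: np.ndarray, hop_length: int = 256) -> np.ndarray:
    """Step 3: a vocoder maps the detailed spectrogram to a waveform
    (here a silent placeholder of the corresponding length)."""
    return np.zeros(spec.shape[1] * hop_length, dtype=np.float32)

coarse = generate_small_spectrogram("Roaring thunder")   # (16, 64)
detailed = super_resolve(coarse)                          # (128, 512)
audio = vocode(detailed)                                  # 512 * 256 samples
print(coarse.shape, detailed.shape, audio.shape)
```
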
Gaudio Lab's World-class Sound Generation AI

What makes developing sound-generation AI challenging?

While the process may appear straightforward at first glance, producing realistic sounds with AI requires addressing numerous challenges. In fact, AI is only a tool, and solving the actual problem demands the expertise of people in the field of audio.

Handling and managing large amounts of audio data is one of the biggest challenges in creating AI-generated sound. For instance, Gaudio Lab's training data is approximately 10 TB in size, corresponding to about 10,000 hours of audio. This requires a high level of expertise to collect and manage, as well as the ability to load the audio efficiently during training in order to minimize I/O overhead. In comparison, ChatGPT's training data is known to be around 570 GB, and ImageNet, the dataset that drove much of the progress of deep learning in computer vision, is only about 150 GB.

Evaluating AI models for audio is also difficult, because it requires listening to the generated audio in its entirety, which is time-consuming and can be influenced by the listening environment. This makes it challenging to assess the quality of the generated audio objectively.

When it comes to sound generation AI, expertly developed models produce better results. Having a team of experts in audio engineering is undoubtedly an advantage for Gaudio Lab. Our expertise and knowledge of audio help ensure that the AI models generate high-quality, realistic sounds, and our experience in audio research at globally renowned companies keeps us up to date with the latest audio technologies and trends. Our experts' participation in the listening evaluation process ensures that the generated audio meets high standards, making Gaudio Lab's sound generation AI a unique and valuable asset.

The Evolution of Sound Generation AI

Gaudio Lab's ultimate goal is to contribute to the development of the metaverse by filling it with sound using our AI technology. While the current performance of our sound generation AI is impressive, we acknowledge the need for further development to capture all the sounds in the world.

Participation in the DCASE 2023 Challenge

Participating in DCASE is a great opportunity for Gaudio Lab to showcase its exceptional sound generation AI and compete with other top audio research teams from around the world. The evaluation process in DCASE will likely involve objective metrics such as signal-to-noise ratio, perceptual evaluation of speech quality, and speech intelligibility, as well as subjective evaluations in which human listeners provide feedback on the generated sounds. The results will provide valuable insights and feedback for Gaudio Lab to continue improving its AI models and enhancing the quality of the generated sounds. Please wish Gaudio Lab the best of luck in DCASE, and look forward to hearing positive updates about our progress. We'll be releasing AI-generated sounds soon, so stay tuned!

2023.04.18
Next post
Integrating Audio AI SDK with WebRTC (1): A Look Inside WebRTC's Audio Pipeline

(Writer: Jack Noh)

Curious About WebRTC?

The MDN documentation describes WebRTC (Web Real-Time Communication) in the following manner. (It should be noted that the MDN documentation is essentially a standard reference that anyone engaged in web development will inevitably encounter.)

WebRTC (Web Real-Time Communication) is a technology enabling web applications and sites to capture and freely stream audio or video media between browsers, eliminating the need for an intermediary. In addition, it permits the exchange of arbitrary data. The series of standards composing WebRTC facilitate end-to-end data sharing and video conferencing, all without requiring plugins or the installation of third-party software.

To simplify, WebRTC is a technology that allows your browser to communicate in real time with only an internet connection, eliminating the need to install any extra software. Services exemplifying the use of WebRTC include Google Meet, a video conferencing service, and Discord, a voice communication service. (This technology also gained substantial attention during the outbreak of Covid-19!) As an open-source project and web standard, WebRTC's source code can also be freely accessed and modified.

Understanding WebRTC's Audio Pipeline

WebRTC is a comprehensive multimedia technology, encompassing audio, video, and data streams. In this article, I aim to delve into the aspects related to WebRTC's audio technology.

If you've used a WebRTC-enabled video conferencing or voice call web application (e.g., Google Meet), you might be intrigued to understand how the audio pipeline is structured. The audio pipeline can be separated into two distinct streams: 1) the stream of voice data captured from the microphone device and transmitted to the other party, and 2) the stream that receives the other party's voice data and outputs it through the speaker. These are referred to as the Near-end Stream (sending the microphone input signal to the other party) and the Far-end Stream (outputting the audio data received from the other party through the speaker). Each stream consists of five steps, which we'll look at below.

1) Near-end Stream (Transmitting the Microphone Input Signal to the Receiver)

1. Audio signals are captured from the microphone device. (ADM, Audio Device Module)
2. Enhancements are applied to the input audio signal to improve call quality. (APM, Audio Processing Module)
3. If there are other audio signals (e.g., file streams) to be transmitted concurrently, they are combined by an Audio Mixer.
4. The audio signal is then encoded. (ACM, Audio Coding Module)
5. The signal is converted into RTP packets and dispatched over UDP transport. (Sending)

2) Far-end Stream (Playing the Audio Data Received from the Sender through the Speaker)

1. Audio data in the form of RTP packets is received from the connected peers (possibly multiple peers). (Receiving)
2. Each RTP packet is decoded. (ACM, Audio Coding Module)
3. The decoded streams are merged into a single stream by an Audio Mixer.
4. Enhancements are applied to the output audio signal to improve call quality. (APM, Audio Processing Module)
5. The audio signal is finally output through the speaker device. (ADM, Audio Device Module)

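To keep the two streams straight, here is a purely conceptual sketch of the flow just listed. It models the data path in Python rather than in WebRTC's actual C++ classes, so every function below is an illustrative stub, not a real WebRTC API.

```python
# Conceptual data-flow model of WebRTC's two audio streams, as described above.
# NOT the real WebRTC API: every module function here is an illustrative stub.
from typing import List, Sequence

def adm_capture() -> List[float]:
    """ADM (Audio Device Module): capture one 10 ms frame from the microphone."""
    return [0.0] * 480                      # placeholder frame at 48 kHz

def apm_process(frame: List[float]) -> List[float]:
    """APM (Audio Processing Module): echo cancellation, noise suppression, gain..."""
    return frame                            # pass-through stub

def audio_mixer(frames: Sequence[List[float]]) -> List[float]:
    """Audio Mixer: combine several streams into one."""
    return [sum(samples) for samples in zip(*frames)]

def acm_encode(frame: List[float]) -> bytes:
    """ACM (Audio Coding Module): encode for transmission (e.g., Opus)."""
    return bytes(len(frame))                # placeholder payload

def acm_decode(payload: bytes) -> List[float]:
    """ACM: decode a received payload back to audio samples."""
    return [0.0] * len(payload)

def near_end_stream(extra_streams: Sequence[List[float]] = ()) -> bytes:
    """Microphone -> APM -> Audio Mixer -> ACM -> RTP/UDP (send)."""
    frame = apm_process(adm_capture())
    mixed = audio_mixer([frame, *extra_streams]) if extra_streams else frame
    return acm_encode(mixed)                # would then be packetized as RTP

def far_end_stream(rtp_payloads: Sequence[bytes]) -> List[float]:
    """RTP/UDP (receive) -> ACM -> Audio Mixer -> APM -> speaker."""
    decoded = [acm_decode(p) for p in rtp_payloads]
    return apm_process(audio_mixer(decoded))  # would then be rendered by the ADM

print(len(near_end_stream()), len(far_end_stream([bytes(480), bytes(480)])))
```
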
The names of the modules responsible for each stage are noted in parentheses at the end of each step above; WebRTC is modularized to this degree for every stage of the process.

Here are more detailed explanations of each module:

- ADM (Audio Device Module): Interfaces with the input/output hardware, capturing and rendering audio signals. It is implemented using APIs tailored to each platform (Windows, macOS, etc.).
- APM (Audio Processing Module): A set of audio signal processing filters designed to improve call quality, employed primarily on the client side.
- Audio Mixer: Consolidates multiple audio streams into one.
- ACM (Audio Coding Module): Encodes and decodes audio for transmission and reception.

The process described above can be visualized in the following diagram.

[Figure: WebRTC audio pipeline diagram]

As described, the audio pipeline in WebRTC is notably modular, with its functionalities neatly divided.

Enhancing WebRTC Audio Quality with Gaudio SDK

Gaudio Lab houses several impressive and practical audio SDKs, such as GSA (Gaudio Spatial Audio), GSMO (Gaudio Sol Music One), and LM1 (a volume normalization standard based on TTA). The idea of developing applications or services using these SDKs, and thus delivering superior auditory experiences to users, is indeed captivating.

(Did you know?) Gaudio Lab has an SDK that fits seamlessly with WebRTC: GSEP-LD, an AI-based noise reduction feature. It runs in real time with minimal computational demand, and it delivers top-tier performance on a global scale!

We often endure discomfort due to ambient noise while conducting video conferences. To alleviate such noise-related concerns, WebRTC incorporates a noise suppression filter rooted in signal processing. (As a point of interest, WebRTC already contains filters beyond noise suppression to improve call quality!) This noise suppression filter forms part of the previously mentioned APM (Audio Processing Module).

Imagine the potential improvements if we replaced the conventional signal processing-based noise suppression filter with Gaudio Lab's AI-driven one. However, despite the eagerness to substitute the existing filter with GSEP-LD right away, it is crucial to proceed with caution. A hasty integration (or replacement) in such a complex, large-scale project can cause complications, as the following considerations suggest:

- Does GSEP-LD's standalone performance measure up? → Its quality should be verified on its own first.
- Could there be side effects with the existing signal processing-based filters? → Its interaction with the other filters in WebRTC should be checked.
- Is the optimal point of integration the same as the location of the existing noise suppression filter? → Various integration points should be tested.
- Can performance be guaranteed across diverse user environments? → This requires a wide range of experimental data and consideration of platform-specific settings.

If one dives headfirst into the project driven solely by enthusiasm, these questions can quickly become overwhelming, and effective integration becomes increasingly difficult. To avoid this, the first step is to build a 'robust testing environment'. The larger the project, with its many interconnected technologies, the more this requirement matters.

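Before touching the WebRTC source at all, the first two questions can be approached with an offline A/B harness that runs the same noisy clips through each noise suppressor and compares a simple quality proxy. The sketch below is a toy version of that idea: `builtin_ns` and `gsep_ld_ns` are hypothetical stand-ins (GSEP-LD's real interface is not public), and SNR against a clean reference is only a crude proxy that would complement, not replace, listening tests.

```python
# Toy offline A/B harness: run the same noisy clip through two noise
# suppressors and compare a crude objective proxy (SNR vs. a clean reference).
# `builtin_ns` and `gsep_ld_ns` are hypothetical pass-through stubs.
import numpy as np

def builtin_ns(noisy: np.ndarray, sr: int) -> np.ndarray:
    return noisy                                   # stand-in for WebRTC's NS filter

def gsep_ld_ns(noisy: np.ndarray, sr: int) -> np.ndarray:
    return noisy                                   # stand-in for GSEP-LD

def snr_db(clean: np.ndarray, estimate: np.ndarray) -> float:
    residual = clean - estimate
    return 10.0 * np.log10(np.sum(clean**2) / (np.sum(residual**2) + 1e-12))

def ab_test(clean: np.ndarray, noisy: np.ndarray, sr: int = 16000) -> dict:
    return {
        "builtin_ns": snr_db(clean, builtin_ns(noisy, sr)),
        "gsep_ld_ns": snr_db(clean, gsep_ld_ns(noisy, sr)),
    }

# Synthetic example: a clean tone with added white noise.
sr = 16000
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 440.0 * t)
noisy = clean + 0.05 * np.random.default_rng(0).standard_normal(sr)
print(ab_test(clean, noisy, sr))
```
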
However, establishing a robust testing environment is not an easy undertaking. In this article, I have covered the audio technology of WebRTC. In the next article, I will share how I established a solid testing environment within WebRTC with relative ease, which gave me far greater confidence in the performance as I integrated GSEP-LD into the WebRTC audio pipeline.

2023.06.26