
Is there AI that creates sounds? : Sound and Generative AI

2023.04.18 by Luke Bae

The Surge of Generative AI Brought by ChatGPT

 

(Writer: Keunwoo Choi)

 

The excitement surrounding generative AI is palpable, as tools like ChatGPT work their way into our daily lives. It feels like we're witnessing the dawn of a new era, much like the early days of smartphones. The enthusiasm for generative AI, initially sparked by ChatGPT, has since spread to other domains, including art and music.

 

Generative AI in the Field of Sound

 

Sound is no exception: generative AI has made significant advances, particularly in voice synthesis and music composition. However, when we consider the sounds that make up our everyday environments, voices and music are only a small part of the equation. It's the little sounds, like the clicking of a keyboard, the sound of someone breathing, or the hum of a refrigerator, that truly shape our auditory experience. Without these sounds, even the most finely crafted voices and music would fail to fully capture the essence of our sonic world.

 

 

Despite their significance, we have yet to see an AI that can generate the multitude of sounds that make up our sonic environment, which we'll refer to as 'Foley Sounds' for the sake of explanation. The reason for this is quite simple: it's dauntingly challenging. To generate every sound in the world, an AI must have access to data representing every sound in the world. This requires the consideration of numerous variables, making the task incredibly complex.

 

Knowing the difficulties involved, Gaudio Lab has taken up the challenge of generating all of the sounds in the world. In fact, we began developing an AI capable of doing so back in 2021, well before generative AI became a hot topic.

 

Without further ado, let's listen to the demo we have created.

 

AI-Generated Sounds vs. Real Sounds

 

[Audio demo: can you tell which of these sounds are real recordings and which were generated by AI?]

 

How many right answers did you get? 

As you could hear in the demo, the quality of AI-generated sound now easily surpasses common expectations.

 

Finally, it is time to uncover how these incredibly realistic sounds are generated.

 

How Gaudio Lab Generates Sounds with AI

 

Visualizing Sounds: Waveform Graph

 

Before we can delve into the process of generating sounds with AI, it's essential to understand how sounds are represented. You may have come across images like this before:

 

[Image: waveform of an audio signal]

The graph above shows a sound's waveform over time. It lets us estimate when a sound occurs and how loud it is, but it reveals little about the sound's specific characteristics.
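To make this concrete, here is a small, self-contained Python sketch that plots a waveform like the one described above. The signal is synthesized with NumPy (a decaying tone plus a little noise) so the snippet runs without any audio file; it is purely illustrative and is not part of Gaudio Lab's pipeline.

```python
import numpy as np
import matplotlib.pyplot as plt

sr = 16000                                    # sample rate in Hz
t = np.linspace(0, 1.0, sr, endpoint=False)   # one second of time stamps
y = 0.6 * np.sin(2 * np.pi * 440 * t) * np.exp(-3 * t)  # a decaying 440 Hz tone
y += 0.05 * np.random.randn(sr)               # a bit of background noise

plt.figure(figsize=(8, 2))
plt.plot(t, y, linewidth=0.5)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Waveform: when and how loud, but not what it sounds like")
plt.tight_layout()
plt.show()
```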

 

Visualizing Sounds: Spectrogram

 

The spectrogram was developed to overcome these limitations.

 

[Image: spectrogram of an audio signal]

At first glance, the spectrogram already appears to contain more information than the previous graph.

 

The x-axis represents time, the y-axis represents frequency, and the color of each pixel indicates the amplitude of the sound at that time and frequency. Essentially, a spectrogram can be seen as the DNA of a sound, containing virtually all the information about it. Therefore, if we have a tool that can convert a spectrogram back into an audio signal, creating a sound becomes equivalent to generating an image. This simplifies many tasks, as it allows us to use diffusion-based image generation algorithms similar to the one behind OpenAI's DALL-E 2.
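As an illustration, the short Python sketch below turns the same kind of synthetic signal into a spectrogram image using librosa. The analysis parameters (n_fft=1024, hop_length=256) are arbitrary choices for the example, not the settings Gaudio Lab uses.

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.6 * np.sin(2 * np.pi * 440 * t) * np.exp(-3 * t) + 0.05 * np.random.randn(sr)

S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))  # magnitude spectrogram
S_db = librosa.amplitude_to_db(S, ref=np.max)            # log scale for display

plt.figure(figsize=(8, 3))
librosa.display.specshow(S_db, sr=sr, hop_length=256, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram: time vs. frequency, color = amplitude")
plt.tight_layout()
plt.show()
```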

 

Now, do you see why we explained the spectrogram? Let's take a closer look at the actual process of creating sound using AI.

 

Creating Sound with AI: The Process Explained

 

[Diagram: the three-step sound generation process]

 

Step 1: Generating Small Spectrograms from Text Input

 

The first step in creating sound with AI is processing the input that describes the desired sound. For example, given a text input such as "Roaring thunder," the diffusion model generates a small spectrogram from random noise. This spectrogram is a 16x64-pixel image, representing 16 frequency bands and 64 time frames. Although it may appear too small to be useful, even a spectrogram this small contains significant information about a sound.
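To show the shape of this step, here is a minimal, hypothetical sketch of a DDPM-style reverse-diffusion loop that turns random noise into a 16x64 "spectrogram" conditioned on a text embedding. The denoiser, text encoder, and noise schedule are untrained placeholders chosen only to illustrate the data flow; this is not Gaudio Lab's actual model.

```python
import torch
import torch.nn as nn

T = 50                                     # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)      # simple linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class TinyDenoiser(nn.Module):
    """Placeholder epsilon-predictor; real systems use a conditional U-Net
    that also embeds the timestep t."""
    def __init__(self, cond_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 1, 3, padding=1))
        self.cond_proj = nn.Linear(cond_dim, 16 * 64)

    def forward(self, x, t, cond):
        # Inject the text condition as an additive bias (illustrative only).
        bias = self.cond_proj(cond).view(-1, 1, 16, 64)
        return self.net(x + bias)

def embed_text(prompt: str, dim: int = 64) -> torch.Tensor:
    """Stand-in for a real text encoder (e.g., a pretrained language model)."""
    g = torch.Generator().manual_seed(abs(hash(prompt)) % (2**31))
    return torch.randn(1, dim, generator=g)

@torch.no_grad()
def sample_small_spectrogram(prompt: str) -> torch.Tensor:
    model = TinyDenoiser().eval()
    cond = embed_text(prompt)
    x = torch.randn(1, 1, 16, 64)           # start from pure noise
    for t in reversed(range(T)):             # iterative denoising
        eps = model(x, t, cond)
        a, ab = alphas[t], alpha_bars[t]
        x = (x - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                  # coarse 16x64 spectrogram

spec = sample_small_spectrogram("Roaring thunder")
print(spec.shape)  # torch.Size([1, 1, 16, 64])
```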

 

Step 2: Super Resolution

 

The image then undergoes a 'Super Resolution' phase, in which the diffusion model increases the resolution over multiple stages, ultimately producing a clear and detailed spectrogram like the one shown earlier.
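The cascade can be pictured roughly as follows. In this hedged sketch, each stage simply upsamples the previous spectrogram and hands it to a placeholder `refine` function; in a real system that refinement would itself be a conditional diffusion model, and the stage sizes here are made-up examples.

```python
import torch
import torch.nn.functional as F

def refine(x: torch.Tensor) -> torch.Tensor:
    # Placeholder for diffusion-based refinement conditioned on the
    # low-resolution input (and the original text prompt).
    return x

def super_resolve(coarse: torch.Tensor,
                  stages=((64, 256), (128, 1024))) -> torch.Tensor:
    x = coarse                                 # e.g. [1, 1, 16, 64]
    for freq_bins, frames in stages:
        x = F.interpolate(x, size=(freq_bins, frames),
                          mode="bilinear", align_corners=False)
        x = refine(x)                          # one refinement per stage
    return x

coarse = torch.randn(1, 1, 16, 64)             # output of Step 1
fine = super_resolve(coarse)
print(coarse.shape, "->", fine.shape)          # [1,1,16,64] -> [1,1,128,1024]
```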

 

Step 3: Vocoder

 

This final step converts the spectrogram into an audio signal using a vocoder. However, most commercially available vocoders are designed for voice signals, which makes them unsuitable for the much wider range of sounds we want to generate. To address this limitation, Gaudio Lab developed its own vocoder, which has achieved world-class performance. Gaudio Lab also plans to release this vocoder as an open-source tool in the first half of 2023.
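Gaudio Lab's vocoder itself is not shown here, but the spectrogram-to-waveform step can be illustrated with the classical Griffin-Lim algorithm that librosa provides. It estimates the missing phase iteratively and sounds far worse than a good neural vocoder, but it demonstrates what a vocoder does: take a (mel) spectrogram in, return audio samples.

```python
import numpy as np
import librosa
import soundfile as sf

sr = 22050
t = np.linspace(0, 2.0, int(sr * 2.0), endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220 * t) * np.exp(-2 * t)   # toy "sound"

# Forward: waveform -> mel spectrogram (what the generator would output).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=128)

# Inverse: mel spectrogram -> waveform (the vocoder's job), via Griffin-Lim.
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                             hop_length=256, n_iter=32)

sf.write("reconstructed.wav", y_hat, sr)
print(len(y), len(y_hat))
```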

 

Gaudio Lab’s World-class Sound Generation AI

 

What makes developing sound-generation AI challenging?

 

While the process may appear straightforward at first glance, producing realistic sounds with AI requires addressing numerous challenges. In fact, AI is only a tool, and solving the actual problem demands the expertise of individuals in the field of audio.

 

Handling and managing large amounts of audio data is one of the biggest challenges in creating AI-generated sound. For instance, Gaudio Lab's training data is approximately 10 TB in size, corresponding to about 10,000 hours of audio. Collecting and managing data at this scale requires a high level of expertise, as does loading the audio efficiently during training to minimize I/O overhead. For comparison, ChatGPT's training data is known to be around 570 GB, and ImageNet, the dataset that drove much of the progress of deep learning in computer vision, is only about 150 GB.
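As a rough illustration of what "loading audio efficiently to minimize I/O overhead" can mean in practice, the sketch below decodes only a short random crop of each file instead of the whole recording and parallelizes decoding across DataLoader workers. The file paths and crop length are hypothetical; this is not Gaudio Lab's actual pipeline.

```python
import random
import torch
import torchaudio
from torch.utils.data import Dataset, DataLoader

class RandomCropAudioDataset(Dataset):
    def __init__(self, file_paths, crop_seconds=4.0):
        self.file_paths = file_paths
        self.crop_seconds = crop_seconds

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        path = self.file_paths[idx]
        info = torchaudio.info(path)                 # metadata only, cheap
        crop_frames = int(self.crop_seconds * info.sample_rate)
        max_offset = max(info.num_frames - crop_frames, 0)
        offset = random.randint(0, max_offset)
        # Decode just the crop, not the whole file.
        wav, sr = torchaudio.load(path, frame_offset=offset,
                                  num_frames=crop_frames)
        return wav.mean(dim=0), sr                   # downmix to mono

# Usage with a hypothetical file list:
# ds = RandomCropAudioDataset(["/data/foley/rain_001.wav", ...])
# loader = DataLoader(ds, batch_size=16, num_workers=8, pin_memory=True)
```

Using several `num_workers` keeps decoding off the training loop's critical path, which matters when the dataset is tens of terabytes.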

 

Evaluating AI models for audio is also difficult because it requires listening to the generated audio in its entirety, which is time-consuming and can be influenced by the listening environment. This makes it challenging to determine the quality of the generated audio objectively.

 

When it comes to sound generation AI, models developed by audio experts produce better results.

 

Having a team of audio engineering experts is undoubtedly an advantage for Gaudio Lab. Our expertise and knowledge of audio help ensure that our AI models generate high-quality, realistic sounds. Additionally, our experience in audio research at globally renowned companies allows Gaudio Lab to stay up to date with the latest audio technologies and trends. Our experts' participation in the listening evaluation process ensures that the generated audio meets high standards, making Gaudio Lab's sound generation AI a unique and valuable asset.

 

The Evolution of Sound Generation AI

Gaudio Lab's ultimate goal is to help build the metaverse by filling it with sound generated by our AI technology. While the current performance of our sound generation AI is impressive, we acknowledge that further development is needed before it can capture all the sounds in the world.

 

Participation in the DCASE 2023 Challenge

 

Participating in DCASE is a great opportunity for Gaudio Lab to showcase our sound generation AI and compete with other top audio research teams from around the world. The evaluation will likely involve objective metrics, such as signal-to-noise ratio and perceptual quality measures, along with subjective evaluations in which human listeners provide feedback on the generated sounds. The results will give us valuable insights for improving our AI models and enhancing the quality of the generated sounds. Please wish Gaudio Lab the best of luck in DCASE, and look forward to hearing positive updates about our progress.

We'll be releasing AI-generated sounds soon, so stay tuned! 

 
