Thanks Apple, Welcome Vision Pro! (ft. Spatial Computing & Spatial Audio)

2023.07.25 by Dewey Yoon



(Writer: Henney Oh)




At WWDC in June 2023, Apple grandly showcased its “One More Thing,” a device named Vision Pro, breaking new ground in the realm of Spatial Computing. In typical Apple fashion, the company opted to categorize it as a “Spatial Computing” device, eschewing labels like ‘VR HMD’ or ‘AR glasses.’


Gaudio Lab took its first steps into the VR market in 2014, identifying itself and its target market as The Spatial Audio Company for VR. As a result, one of the most frequently asked questions has been: “When do you anticipate the VR market will surge?” In response, I have always answered, cleverly or perhaps coyly, “The day Apple launches a VR device.” 😎


And finally, that day has arrived, precisely a decade later.
(Apple has announced that Vision Pro will debut in Spring 2024)


In the Apple keynote at this year’s WWDC, a major portion of the device introduction session was devoted to explaining the Spatial Audio incorporated into Vision Pro. Apple has consistently invested significant effort into audio – a feature often unnoticed but nonetheless influential to the user experience. It’s worth remembering that the device that catalyzed Apple’s rise was an audio device, the iPod.



[Image: Vision Pro’s Dual Driver Audio Pods (speakers) featuring built-in Spatial Audio]





Apple first debuted its Spatial Audio technology in the AirPods Pro back in 2020. At that time, I hypothesized: “This Spatial Audio represents a strategic primer for the VR/AR device Apple will release in the future.”


While the Spatial Audio experience on a smartphone (or a TV) – essentially viewing a 2D screen within a small window – can be deemed as nice-to-have (useful, but not essential), it becomes a must-have within a VR environment. For instance, in a virtual scenario, the sound of a dog barking behind me shouldn’t seem as if it’s coming from the front.


In a previous post (link), I explained that terms like VR Audio, Immersive Audio, 3D Audio, and Spatial Audio may vary, but their fundamental meaning is essentially the same. They all refer to the technologies that create and reproduce three-dimensional sound.


Is it merely conjecture to suggest that Apple, preparing to launch a Spatial Computing Device in 2024, had already termed their 3D audio technology ‘Spatial Audio’ when it was integrated into the AirPods five years prior?



Mono → Stereo → Spatial : Transformations in Sound Perception


To begin with, human beings perceive sounds in three dimensions in their immediate environment. We can discern whether the sound of a nearby colleague typing comes from our left or from a position behind and beneath us. This is made possible through the sophisticated interplay of our auditory system and the brain’s binaural hearing skills, employing just two sensors – our ears. As a result, the ideal scenario for all sound played back through speakers or headphones would be a three-dimensional reproduction.


However, due to limitations of devices such as speakers and headphones/earphones, and constraints of communication and storage technologies, we have been conditioned to store, transmit, and reproduce sounds in a 2D (stereo) or 1D (mono) format.


Think of a lecture hall where a speaker’s voice is broadcast through speakers mounted on the ceiling. Even when there’s a noticeable discrepancy between visual cues and the sound’s origin, we adapt without considering the situation strange. This holds true in large concert venues, where sound reaches thousands of audience members through wall-mounted speakers, not directly from the performers onstage. Our adaptability and capacity to learn have enabled us to become comfortable with this mode of sound delivery, and over time, we’ve even developed a learned preference for it.


An apt example can be found in the Atmos Mix, a type of spatial sound format. It is often criticized for sounding inferior to the traditional Stereo Mix. The market standard for an extended period, stereo has been used predominantly in studio recordings, leading us to become accustomed to it. However, reflecting on the past, there was significant resistance from both artists and users during the transition from mono to stereo. This suggests that a future where we become more familiar and comfortable with the Spatial Audio Mix is possible.




Apple’s Commitment to Perfecting Spatial Sound : The Vision Pro


Using Vision Pro can simulate an experience reminiscent of a remote meeting on Star Trek’s Holodeck, making it seem as though the other participant is physically present in the room with you. This could represent the pinnacle of a “Being There” or “Being Here” experience. To facilitate this, Spatial Audio is indispensable. Our brains need to perceive the sound as emanating from the individual in front of us, as if they were truly present, in order to trigger a place illusion. As we turn our heads, the apparent location of the sound must adjust accordingly. Spatial Audio for headphones, fundamentally rooted in *Binaural Rendering, provides exactly this function.


*To understand more about Binaural Rendering and its applications:

Spatial Computing devices such as Vision Pro, which encompass both VR and AR technologies, are fundamentally personal display systems designed to provide an immersive, individualized visual experience. As such, it is inherent in their design to use headphones for sound reproduction instead of speakers. Binaural rendering is the foundational technology that enables spatial audio through headphones. The term “binaural” originates from Latin, literally meaning “having two ears.” Humans, equipped with just two ears, can perceive the directionality of sounds – front, back, left, right, above, and below. This is made possible by the way sounds are diffracted and shadowed by our head, torso, and outer ears on their way into the ear canal. Binaural rendering simulates this natural mechanism and replicates it through headphones, enabling the positioning of sounds within a three-dimensional space.
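To make the idea of binaural cues a little more concrete, here is a minimal sketch of one of them, the interaural time difference (ITD), together with the head-tracking compensation mentioned earlier. It uses Woodworth's textbook spherical-head approximation. The head radius, the speed of sound, and the function names are assumptions made for illustration, and this is not how Gaudio Lab's or Apple's renderers are implemented.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s, room-temperature air (assumed)
HEAD_RADIUS = 0.0875     # m, a commonly quoted average head radius (assumed)


def relative_azimuth(source_az_deg, head_yaw_deg):
    """Source direction relative to the listener's nose, wrapped to [-180, 180).

    When the head turns by head_yaw_deg, a source fixed in the room must
    appear to move the opposite way relative to the head.
    """
    return (source_az_deg - head_yaw_deg + 180.0) % 360.0 - 180.0


def interaural_time_difference(azimuth_deg):
    """Woodworth's spherical-head approximation of the ITD, in seconds.

    Positive azimuth means the source is on the listener's right, so a
    positive result means the sound reaches the right ear first.
    """
    az = math.radians(azimuth_deg)
    # Fold rear directions onto the front: lateral angle in [-90, 90] degrees.
    lateral = math.asin(math.sin(az))
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (lateral + math.sin(lateral))


def itd_in_samples(azimuth_deg, sample_rate=48000):
    """Delay to apply to the far ear, rounded to whole samples."""
    return round(abs(interaural_time_difference(azimuth_deg)) * sample_rate)
```

Note that a source at 30 degrees (front right) and one at 150 degrees (rear right) produce the same ITD in this model. Timing alone cannot separate front from back, which is why the direction-dependent filtering by the head and outer ears matters, and why practical binaural renderers rely on measured or modeled head-related transfer functions rather than a delay alone.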


To convincingly simulate sounds as if they are emanating from within the physical environment, it’s necessary to understand and model the paths that these sounds would traverse to reach our ears in real-world scenarios (sound characteristics change as they interact with various objects, such as walls, sofas, and ceilings). Vision Pro reportedly incorporates Audio Ray Tracing technology to achieve this goal. Although this requires extensive computational power, it is a testament to the capabilities of Apple’s silicon (M2 & R1) and underlines Apple’s commitment to perfecting spatial audio.
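The keynote gives no detail on how Audio Ray Tracing is implemented, so the following is only a simplified illustration of the underlying idea of modeling sound paths. It uses the classic image-source method for a box-shaped room: each wall is treated as a mirror, and every first-order reflection is replaced by a mirrored copy of the source. The room dimensions, the per-bounce amplitude loss, and the function name are all hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s (assumed)


def first_order_paths(source, listener, room, bounce_loss=0.3):
    """Return sorted (delay_seconds, gain) pairs for the direct path and six wall bounces.

    room is (width, depth, height) of an axis-aligned box with one corner at
    the origin. bounce_loss is a hypothetical fraction of amplitude lost at
    each reflection; real materials are frequency dependent.
    """
    images = [(tuple(source), 1.0)]  # the direct sound, no reflection loss
    for axis in range(3):
        for wall in (0.0, room[axis]):
            image = list(source)
            image[axis] = 2.0 * wall - source[axis]  # mirror the source across the wall
            images.append((tuple(image), 1.0 - bounce_loss))

    paths = []
    for position, reflection_gain in images:
        distance = math.dist(position, listener)
        delay = distance / SPEED_OF_SOUND
        gain = reflection_gain / max(distance, 1e-6)  # 1/r spreading loss
        paths.append((delay, gain))
    return sorted(paths)
```

Each returned pair can be read as one tap of a very coarse room response: the direct sound arrives first and loudest, followed by quieter, later copies from the walls, floor, and ceiling. A full ray tracer would follow many more bounces and account for furniture and materials, which is where the computational cost comes from.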



[Image: Audio Ray Tracing - Screen capture from the Vision Pro Keynote at WWDC 2023]



Thanks Apple, Welcome Vision Pro!


Gaudio Lab has been a pioneer in the spatial audio field, launching a comprehensive suite of innovative tools as early as 2016 and 2017. The suite includes: Works (a tool that allows sound engineers to effortlessly edit and master spatial audio for VR 360 videos within existing sound creation environments like Pro Tools), Craft (a tool that enables the integration of spatial audio into VR content created using game engines such as Unity or Unreal), and Sol (a binaural rendering SDK that leverages head-tracking information on devices like HMDs and smartphones to provide real-time spatial audio experiences).



[Image: VR Audio = Gaudio Lab Keynote]



Post-2018, the VR market experienced a significant downturn, leading to the closure of many tech firms in this domain. However, the team at Gaudio Lab has navigated these turbulent times with resilience, pivoting our technology to suit existing markets and products while refining our techniques. Among our noteworthy accomplishments are:


  • BTRS (the successor to Works), which enables a fully immersive spatial audio experience using standard headphones in a smartphone or 2D screen live-streaming environment.
  • GSA (the successor to Sol), which provides a spatial audio experience even when listening to ordinary stereo signals on devices such as earbuds and headphones.





Gaudio Lab’s laboratory is overflowing with a variety of spatial audio products and cutting-edge technologies, living up to its reputation as ‘The Original Spatial Audio Company.’ These include spatial audio for in-car environments, stereo speakers, sound bars, and cinematic settings.


Furthermore, at the upcoming AES 2023 International Conference on Spatial and Immersive Audio (August 23-25, 2023, University of Huddersfield, UK), Gaudio Lab will be presenting a paper on our latest research, titled ‘Room Impulse Response Estimation in a Multiple Source Environment.’


This research centers on innovative AI technology that amplifies immersion by autonomously recognizing and extracting a space’s acoustic characteristics from existing ambient sounds, such as conversational voices. This bypasses the need for distinct equipment such as Apple’s Audio Ray Tracing.
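For readers unfamiliar with the term, a room impulse response (RIR) captures how a space responds to a single click: the direct sound followed by its reflections. Once an RIR is known, placing a dry sound "in" that room amounts to convolving the two. The sketch below shows only this final step in plain Python. The estimation of the RIR from ambient sound, which is the subject of the paper, is not shown, and the function name is an illustrative choice.

```python
def convolve(dry, rir):
    """Direct-form convolution of a dry signal with a room impulse response.

    Both arguments are lists of samples; the result has
    len(dry) + len(rir) - 1 samples.
    """
    out = [0.0] * (len(dry) + len(rir) - 1)
    for n, x in enumerate(dry):
        for k, h in enumerate(rir):
            out[n + k] += x * h  # each input sample excites a scaled copy of the RIR
    return out
```

For example, an RIR of `[1.0, 0.0, 0.5]` describes a direct sound followed two samples later by an echo at half amplitude, and convolving any signal with it adds exactly that echo. Real-time systems typically use FFT-based convolution for efficiency, since measured RIRs can be tens of thousands of samples long.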


The advent of spatial audio in 2D screen environments like smartphones, TVs, and cinemas is just the tip of the iceberg. The anticipation is indeed tangible as we await the dawn of the Spatial Computing era, where Gaudio Lab’s extensive spatial audio technologies can truly come into their own.


We’ve waited long enough. Thanks, Apple. Welcome, Vision Pro!




Audio Quality Evaluation of Spatial Audio Part 2: Evaluation Result - GAUDIO vs Apple

(Writer: James Seo)

Evaluation Result

The results obtained from the previously described evaluation process are as follows:

[Image 1: Result from Stereo Sample]

[Image 2: Result from 5.1 Channel Sample]

[Image 3: Total Result]

Each evaluation target, for both stereo and 5.1 channel formats, could achieve a maximum score of 280 points. If all participants rated one evaluation target as superior across all audio samples, that system would receive a full score of 280 points. Accordingly, GAUDIO’s GSA system achieved a score of 186 points for stereo audio samples and slightly more, 188 points, for multichannel samples. This data suggests that among the evaluated systems, the GSA received higher preference from the participants.

While GAUDIO’s score was promising, it is equally crucial to determine its statistical significance. Given the overall preference for spatially processed signals, we applied basic statistical methods to analyze the results of GSA and ASA. We isolated the trials that compared GSA and ASA and computed the difference in scores by subtracting ASA’s score from GSA’s. If all participants rated the GSA as superior across all audio samples, the average difference would be 1; if all of them rated ASA as superior, it would be -1. Since relying solely on the mean could be insufficient for determining statistical significance, we also calculated the 95% confidence interval. If this confidence interval includes zero, despite a difference in the mean score, we would conclude that there is no statistical difference between the two systems. Therefore, for GSA to be statistically superior, the mean should be greater than zero, and zero should not be included in the confidence interval.

[Image 4: Comparison of GSA-ASA result]

According to the graph, for both stereo and 5.1 channel signals, the mean is above zero, and the 95% confidence interval does not include zero.
This trend persists even when combining both sets of results. In other words, GSA did not merely earn more points: the sound rendered by GSA was assessed as superior to that rendered by ASA at a statistically significant level.

But how do the comparisons between GSA and the original, and between ASA and the original, fare? We used the same method to calculate the mean and the 95% confidence interval, leading to the following outcomes:

[Image 5: Comparison of GSA - Original]

[Image 6: Comparison of ASA - Original]

Upon reviewing the GSA-ORI results, we find that in all cases, the mean is greater than zero, and zero is not included within the confidence interval. As with the GSA-ASA results, this confirms that GSA was selected as the preferred sound at a statistically significant level. On the other hand, in the ASA-ORI results, no statistically significant difference was detected between the original and ASA-rendered versions of 5.1 channel audio samples. Although not of great significance, it is noteworthy that the mean score is below zero.

Combining all these findings, the sound rendered by GSA was most preferred across all evaluated audio sample formats. This preference for GSA holds true for the 5.1 channel format as well. However, for the downmixed original and ASA-rendered sound, there was no clear preference indicated by the results.

Conclusion

Sound, being both invisible and intangible, often presents a considerable challenge when attempting to clearly communicate its quality. This challenge becomes particularly pertinent when endeavoring to demonstrate the superior sound experience provided by a sophisticated system such as the GSA. Widespread acceptance and application of this technology across markets require us to effectively express the excellence of the system. However, as detailed in the main body of this text, implementing an objective evaluative method for sound is essentially unfeasible.
To circumvent this issue, and to objectively represent the performance of GSA for those unable to personally experience it, we undertook this experiment. The aim was to showcase how individual preferences manifest when their experiences with the GSA are compared against their prior experiences with either original sound or Apple’s spatial audio.

Furthermore, we hoped that the GSA achieving positive results in this experiment would boost the confidence and morale of Team Gaudio, who have been tirelessly dedicated to the research, development, and commercialization of this technology. Fortunately, the outcomes were indeed encouraging and brought us a sense of satisfaction. Although the complexities involved in describing the experimental procedures, the results, and the analysis might make this article somewhat intricate, we sincerely hope that this effort, along with our previous work on M2S Latency, will aid in deepening readers’ understanding of the GSA.
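The significance check described in the evaluation results above (a mean above zero, with a 95% confidence interval that excludes zero) can be sketched as follows. The article does not state how the interval was computed, so this sketch assumes a normal approximation with z = 1.96. The function names are illustrative, and the actual listening-test data is not reproduced here.

```python
import math
import statistics


def mean_and_ci95(differences):
    """Mean of paired score differences and a normal-approximation 95% CI.

    Each entry is one trial's (GSA score - ASA score), e.g. +1 or -1.
    """
    n = len(differences)
    mean = statistics.fmean(differences)
    sem = statistics.stdev(differences) / math.sqrt(n)  # standard error of the mean
    half_width = 1.96 * sem  # z-value for 95%; a t-value would be slightly wider for small n
    return mean, (mean - half_width, mean + half_width)


def is_significantly_preferred(differences):
    """True when the mean is positive and the 95% CI lies entirely above zero."""
    mean, (low, _high) = mean_and_ci95(differences)
    return mean > 0 and low > 0
```

With made-up data, 30 trials favoring one system and 10 favoring the other give a mean of 0.5 with an interval well clear of zero, whereas a 21-to-19 split gives a slightly positive mean whose interval straddles zero and therefore would not count as a significant preference.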

ICML Paper Preview: A demand-driven perspective on Generative Audio AI 

(Writer: Rio Oh)

Greetings. I am Rio, a researcher at Gaudio Lab’s AI Research Team, dedicated to the development of the sound generation model, FALL-E. I have a keen interest in generative models, and my recent work involves the exploration of how generative model strategies can be applied to diverse tasks.

Our team has been reflecting on the essential areas that require enhancements for real-world, industrial applications. We aim to present these insights at the upcoming ICML workshop, and I’m pleased to provide a sneak preview of our findings!

Introduction

Through our blog, we’ve previously highlighted Gaudio Lab’s achievements in the DCASE Challenge.

We’re thrilled to share another significant milestone from Gaudio Lab: the acceptance of our paper for presentation at the ICML workshop (Hooray🥰).

ICML, along with NeurIPS, is recognized as one of the world’s leading AI conferences, attracting considerable global attention. Over the final two days of the conference period, thematic workshops are organized. Only papers that have passed a stringent double-blind peer review process are selected for presentation at these sessions.

Generative Audio AI: Pioneering the Field with Gaudio Lab

Comparatively, audio generation (excluding speech) is still in its infancy when contrasted with the text and image sectors. Looking beyond the familiar domain of text into images, a range of commercial and non-commercial services are already utilizing Diffusion models such as DALL-E and are easily accessible to the public.

However, audio lags behind with no public services yet launched, primarily due to technological maturity and computational resource constraints. (While sharing of demos and models for disseminating experimental results from papers is gradually increasing, the availability of services that the general public can utilize is virtually non-existent.)

In this challenging environment, Gaudio Lab aspires to create AI products that transcend mere demos, potentially revolutionizing existing paradigms. We’ve embarked on a process to evaluate and address the prevailing challenges and limitations. This important endeavor, aimed at tuning into the industry’s voice, seeks to bring research-stage audio AI products into mainstream visibility.

Through this, Gaudio Lab intends to stay connected with the broader industry context and its processes (while maintaining a strong focus on research), striving to refine our future research directions with greater precision.

We are excited to present our insights and learnings at the 2023 Challenges in Deployable Generative AI workshop! (Scheduled for: Fri 28 Jul, 9 a.m. HST & Sat 29 Jul, 4 a.m. KST)

[Image: workshop poster]

Hold on, what exactly is Gaudio Lab’s FALL-E mentioned in the "Motivations" section?

FALL-E is an innovative technology from Gaudio Lab, utilizing AI-based Text-to-Sound Generation to create sounds in response to text or image inputs. It can generate not only actual real-world sounds, such as a cat’s cry or thunder, but also limitless virtual sounds, for instance, the sound of a tiger smoking – an imaginative representation of an unreal scenario.

This ability significantly broadens the horizons for content creation through sound. The produced sounds can serve as sound effects and background noises during the development of content and virtual environments. Accordingly, FALL-E is projected to be an essential sound technology in all environments offering immersive experiences.

Allow me to explain more about FALL-E! Did the name give you a hint? It’s also an allusion to Foley sound.

Foley is a technique used in film post-production to recreate sound effects. For instance, creating the sound of horse hooves by alternately hitting two bowls on the ground. The term originated in the 1930s, named after Jack Foley.

Foley sound creation is vital in content production. However, reusing recorded sounds and associated economic inefficiencies remain challenges, resulting in a continued reliance on manual work.

As a result, addressing this issue with a generative model presents a promising approach – an approach Gaudio Lab is focused on.

So, what were the challenges we encountered during this research?

In preparing this paper, Gaudio Lab conducted a survey involving professionals from the film sound industry, and the responses were incorporated into our research. To provide a brief overview of our findings, the most significant hurdles were: 1) the need for superior sound quality, and 2) the ability to exercise detailed control. [Link to the full paper]

How exactly did Gaudio Lab overcome these challenges during FALL-E’s development?

One significant hurdle was the scarcity of clean, high-quality data. This issue was compounded by the large quantity of data needed by generative models. Gaudio Lab’s solution was to use both clean and relatively noisier data in tandem, incorporating specific conditions into the model.

Generative models draw from more than just the samples they aim to create; they also use a variety of supplementary information, such as text, categories, and videos, as learning inputs. We added another layer of context to this process by tagging each dataset based on its source.

This method enabled the model to decide whether to generate clean or noisy sounds during the production process. Indeed, during our participation in the DCASE Challenge, our model earned praise for its ability to produce a wide array of sounds while maintaining high audio quality.

The competition involved selecting the top four contenders per track using an objective evaluation metric (FAD), which was followed by a listening evaluation.
As evidenced by the results, Gaudio Lab scored highly in categories assessing both clean sound quality and the ability to generate a diverse range of sounds.

These impressive results were achieved even though we only participated in certain categories of the competition with our FALL-E model, which is capable of generating all types of sounds.

[Image: DCASE 2023 Challenge Task 7 Results]

You can see more details here.

Although FALL-E could be considered the best-performing model in terms of sound quality among those currently available, we do not intend to stop here. Gaudio Lab is continuously exploring ways to develop models that can generate even better sounds.

In truth, the birth of FALL-E was not without its trials and tribulations.

When Gaudio Lab first conceived the idea of FALL-E and initiated the research in 2021, there were scarcely any scholarly papers addressing text-based AI Foley synthesis models. Furthermore, research into video-based sound effects was somewhat limited, and the performance of such models didn’t seem particularly promising. (It is worth mentioning that the landscape has significantly changed since then with a plethora of relevant research now available.)

At times, we found ourselves pondering the direction of our research. Yet, it was Keunwoo’s (Link) effective leadership that helped harness the team’s collective energy. He leveraged the accumulated knowledge and experiences we had gathered over time, leading us to press forward despite initial doubts and concerns. Looking back, I can’t help but think that this continuous process of defining and adjusting our direction was a crucial step in discovering the ‘right path.’

As a team, we enthusiastically threw ourselves into the project, pausing occasionally to fine-tune our trajectory. We remained flexible, adjusting our course in response to new challenges, and ultimately reaching our intended goal.
This, I believe, is an apt representation of Gaudio Lab’s AI Research team’s approach to work.

Before we knew it, we were not only organizing the DCASE but also participating in it. We were thrilled to achieve excellent results, even though our participation was rather casual. Ultimately, I find myself in Hawaii. Initially, the decision to visit Hawaii was taken with the idea of expanding our horizons, even if our paper was not accepted. However, our paper’s acceptance at the ICML conference has made this trip all the more meaningful. I look forward to fully immersing myself in this exciting and fruitful journey before returning to Korea.

And on that note, I conclude my updates from Hawaii🏝!
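As a footnote to the data strategy described earlier in this post, tagging each dataset by its source so the model can be steered toward clean or noisy output, here is a toy sketch of what such a conditioning record could look like. It is purely illustrative: the dataset names, the tag values, and the `Condition` type are hypothetical, and nothing here reflects FALL-E's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """Side information handed to a generative model alongside the audio."""
    text: str        # caption or prompt describing the sound
    source_tag: str  # coarse label derived from where the clip came from


# Hypothetical mapping from dataset name to a quality-related source tag.
DATASET_TAGS = {"studio_library": "clean", "web_crawl": "noisy"}


def training_condition(caption, dataset_name):
    """During training, every clip carries the tag of the dataset it came from."""
    return Condition(text=caption, source_tag=DATASET_TAGS[dataset_name])


def inference_condition(prompt, want_clean=True):
    """At generation time, the tag is chosen freely to request clean or noisy output."""
    return Condition(text=prompt, source_tag="clean" if want_clean else "noisy")
```

The point of the design is that noisy data still teaches the model what a wide variety of sounds are like, while the tag lets the user ask for studio-like output at generation time.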