
Audio Quality Evaluation of Spatial Audio Part 2: Evaluation Result - GAUDIO vs Apple

2023.07.19 by Luke Bae


 

(Writer: James Seo)

 

Evaluation Result

 

The results obtained from the previously described evaluation process are as follows:

 



[Image 1: Result from Stereo Sample]

 

[Image 2: Result from 5.1 Channel Sample]

 

[Image 3: Total Result]

 

For each format, stereo and 5.1 channel, every evaluation target could earn a maximum score of 280 points: if all participants had rated one evaluation target as superior across all audio samples, that system would have received the full 280 points. Against this maximum, GAUDIO’s GSA system scored 186 points on the stereo samples and slightly more, 188 points, on the multichannel samples. This suggests that, among the evaluated systems, GSA received the highest preference from the participants.
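As a rough sketch of the scoring arithmetic (the trial breakdown below is our reconstruction from the evaluation design in Part 1, where each system appeared in two of the three pairings per excerpt):

```python
# Hypothetical reconstruction of the 280-point maximum described above:
# 20 participants x 7 excerpts per format x 2 pairings involving a given
# system (e.g., for GSA: "GSA vs. ASA" and "GSA vs. Original"), with the
# preferred system earning +1 per trial.
participants = 20
excerpts_per_format = 7
pairings_per_system = 2

max_score = participants * excerpts_per_format * pairings_per_system
print(max_score)  # 280

# GSA's reported totals, expressed as a share of the maximum:
for fmt, score in [("stereo", 186), ("5.1 channel", 188)]:
    print(f"{fmt}: {score}/{max_score} = {score / max_score:.1%}")
```

A score of 186/280 corresponds to GSA being preferred in roughly two-thirds of its trials.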

 

While GAUDIO’s score was promising, it is equally important to determine whether the difference is statistically significant. Given the overall preference for spatially processed signals, we applied basic statistical methods to the GSA and ASA results. We isolated the trials that directly compared GSA and ASA and computed, for each trial, the score difference obtained by subtracting ASA’s score from GSA’s. If every participant rated GSA as superior across all audio samples, the mean difference would be 1; if every participant rated ASA as superior, it would be −1. Since the mean alone is insufficient to establish statistical significance, we also calculated the 95% confidence interval. If this interval includes zero, then despite a difference in mean score, we would conclude that there is no statistically significant difference between the two systems. Therefore, for GSA to be statistically superior, the mean must be greater than zero and zero must lie outside the confidence interval.
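The analysis described above can be sketched in a few lines of Python. The trial counts below are illustrative only, not the actual evaluation data; the confidence interval uses the common normal (z-based) approximation, which the article does not specify:

```python
# Minimal sketch of the analysis described above (illustrative data only).
# Each GSA-vs-ASA trial is coded as the difference GSA_score - ASA_score:
# +1 if GSA was preferred, -1 if ASA was preferred.
import math

def mean_and_ci95(diffs):
    """Return the sample mean and a normal-approximation 95% CI."""
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance with Bessel's correction.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    half_width = 1.96 * math.sqrt(var / n)  # z-based 95% half-width
    return mean, (mean - half_width, mean + half_width)

# Example: 140 trials in which GSA was preferred 93 times.
diffs = [1] * 93 + [-1] * 47
mean, (lo, hi) = mean_and_ci95(diffs)

# GSA is statistically preferred only if the whole interval sits above zero.
significant = lo > 0
print(f"mean={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f}), significant={significant}")
```

With these illustrative counts the interval lies entirely above zero, which is exactly the condition the article uses to declare a statistically significant preference.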

 

[Image 4: Comparison of GSA-ASA results]

 

According to the graph, for both stereo and 5.1 channel signals the mean is above zero and the 95% confidence interval does not include zero. This holds even when both sets of results are combined. In other words, GSA did not merely earn more points: at a statistically significant level, the sound rendered by GSA was judged superior to that rendered by ASA.

How, then, do the comparisons of GSA against the original, and of ASA against the original, fare? Using the same method to calculate the mean and the 95% confidence interval, we obtained the following outcomes:

 

[Image 5: Comparison of GSA - Original]

 

[Image 6: Comparison of ASA - Original]

 

Reviewing the GSA-ORI results, we find that in all cases the mean is greater than zero and zero lies outside the confidence interval. Together with the GSA-ASA results, this confirms that GSA was the statistically significant preference. In the ASA-ORI results, by contrast, no statistically significant difference was detected between the original and the ASA-rendered versions of the 5.1 channel samples; although the difference is not significant, it is noteworthy that the mean is below zero. Taken together, these findings show that the sound rendered by GSA was the most preferred across all evaluated audio sample formats, including 5.1 channel. Between the downmixed original and the ASA-rendered sound, however, the results indicated no clear preference.

 

Conclusion

 

Sound, being both invisible and intangible, often presents a considerable challenge when one attempts to communicate its quality clearly. The challenge becomes particularly acute when trying to demonstrate the superior listening experience provided by a sophisticated system such as GSA. Widespread acceptance and adoption of this technology across markets requires us to express the system’s excellence effectively. Yet, as detailed in the body of this text, a fully objective evaluation method for this kind of sound is essentially infeasible. To work around this, and to represent GSA’s performance as objectively as possible for those unable to experience it in person, we undertook this experiment. The aim was to show how individual preferences manifest when listeners compare GSA against the original sound or against Apple’s spatial audio.

 

Furthermore, we hoped that the GSA achieving positive results in this experiment would boost the confidence and morale of Team Gaudio, who have been tirelessly dedicated to the research, development, and commercialization of this technology. Fortunately, the outcomes were indeed encouraging and brought us a sense of satisfaction. Although the complexities involved in describing the experimental procedures, the results, and the analysis might make this article somewhat intricate, we sincerely hope that this effort, along with our previous work on M2S Latency, will aid in deepening readers’ understanding of the GSA.
