
Android Spatial Audio support, so… what is it?

2023.03.15 by Dewey Yoon


 

In April 2022, rumors began circulating about Android Spatial Audio support, and in January 2023 it was officially introduced through a Google Pixel update. Not every Android smartphone can experience it yet: as of this article's publication date (March 15, 2023), only some Pixel smartphones paired with Pixel Buds Pro support Android Spatial Audio. However, the number of supported devices is expected to grow over time.

If you’re considering implementing Android Spatial Audio, here’s a quick overview of what’s changed and what you’ll need to take into consideration.

 

 
Figure 1. Spatial Audio supported when Google Pixel smartphones and Pixel Buds Pro are connected.

 

Spatial Audio, a giant concert hall in your tiny earbuds

Before introducing Android Spatial Audio, let us briefly explain spatial audio technology. Typically, to experience spatial sound, you need a speaker environment of 5.1 channels or more. In addition, you require a dedicated space to place each speaker, and there are many cumbersome tasks, such as setting up an amplifier. As a result, it is challenging to enjoy spatial sound while commuting or exercising outdoors.

 

Spatial Audio technology overcomes these physical limitations and reproduces immersive sound, as if you were in a real space, even through earphones or headphones. It's like having a 7.1.4-channel listening room right in your ears. Apple Spatial Audio on AirPods Pro and features such as 360 Audio on some flagship Samsung Galaxy smartphones already let you experience spatial audio technology.

 


 
Figure 2. Apple Spatial Audio and Samsung Galaxy’s 360 Audio.

 

To understand what has changed with Android Spatial Audio support, it helps to know how spatial audio has been implemented up to this point.

 

The spatial audio currently available on the market produces sound through earbuds or headphones supported by smartphone manufacturers, with the processing implemented and run directly on the smartphone. The component that turns stereo audio, such as an MP3 file, into spatial audio is typically called a renderer, and smartphone manufacturers implement it through in-house development or third-party licensing. Looked at another way, this means there is no standardized spatial audio method shared across manufacturers, resulting in fragmentation.

 

In addition, head tracking must be implemented so that the sound field reflects the movement of the listener's head. Imagine briefly turning your head to the right in a real concert hall: if the voice of the singer in front of you followed your head movement, it would feel unrealistic. Without head-movement information reflected in the sound space, the sound stays locked to the head as it moves, which significantly undermines immersion and makes it impossible to feel present in that space.

 

Head tracking also plays an important role in externalization, the effect of making sound appear to come from outside the head. Psychoacoustics tells us that people localize sounds front-to-back and left-to-right more accurately with only two ears thanks to the small, unconscious movements of the head while listening. By making use of head tracking, a spatial audio implementation can reproduce this cue, maximizing the sense that the sound comes from outside the head (as if it were coming from loudspeakers).

 

Figure 3. Spatial Audio without head-tracking (left) and Spatial Audio with head-tracking applied (right).

 

To support head-tracking-based spatial audio, a sensor capable of recognizing head movements is required, and the IMU (Inertial Measurement Unit) sensor embedded in TWS (true wireless stereo) earbuds is the main component that fulfills this requirement.

 

The IMU sensor combines a gyroscope, an accelerometer, and in some cases a magnetometer, capturing 6-axis or 9-axis head-movement information and transmitting it to the smartphone over a Bluetooth channel.
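Purely as an illustration (this is a hypothetical structure, not any manufacturer's actual report format), a single 6-axis report could be modeled roughly like this:

```kotlin
// Illustrative only: a hypothetical container for a single 6-axis IMU report
// sent by the earbuds; a 9-axis variant would add magnetometer readings.
data class ImuSample(
    val timestampNanos: Long,                               // sample time on the earbud
    val gyroX: Float, val gyroY: Float, val gyroZ: Float,   // angular velocity (rad/s)
    val accelX: Float, val accelY: Float, val accelZ: Float // linear acceleration (m/s²)
)
```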

 

In the current spatial audio structure (shown below), the Bluetooth link is traversed twice, creating a time delay between the actual head movement and the sound that reflects it, known as Motion-to-Sound (M2S) latency, which will be covered in detail in a follow-up article (soon to be published). Minimizing this delay is essential to natural-sounding spatial audio. (Android also recommends keeping this delay below 150 ms.)

 

Figure 4. Spatial Audio Operation Structure

 

Android Spatial Audio, so what has changed?

The main reason for introducing Android Spatial Audio is to standardize the different methods of spatial audio implementation used by different manufacturers. However, while the basic structure is standardized, the actual renderer, which is the key to realizing spatial audio, must be implemented directly by the manufacturers. When smartphone manufacturers implement the renderer, they can easily integrate it into a block called “Spatializer.” This forms the core of Android Spatial Audio.

 

So, has the Android OS structure changed with the introduction of this new 'Spatializer'? No, it has not. All audio functions on Android are handled through collaboration among services such as AudioService[1], AudioPolicyService[2], and AudioFlingerService[3].

 

The spatial audio feature has been designed and developed to be executed within this existing audio framework, reducing the development burden. Traditionally, manufacturers customized their audio policies by adding them to AudioPolicyService. Similarly, the spatial audio feature has been designed to be added to the Spatializer within AudioPolicyService, without significantly affecting the existing implementation.

 

From a user experience standpoint, spatial audio is controlled from the app and output to the earbuds via Bluetooth. The interface with the app is handled by the Spatializer Helper within AudioService, while rendering is handled by the SpatializerThread within AudioFlingerService. This confirms that the existing Android structure has been inherited without significant changes.
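For reference, the app-facing side of this path is exposed through the public android.media.Spatializer API obtained from AudioManager. The minimal sketch below, assuming an Android 13 (API 33) device, only checks whether the platform spatializer is available and enabled; it is not tied to Gaudio Lab's or any manufacturer's specific implementation.

```kotlin
import android.content.Context
import android.media.AudioManager
import android.os.Build
import android.util.Log

// Minimal sketch (assumes Android 13 / API 33): query the platform Spatializer state.
fun logSpatializerState(context: Context) {
    if (Build.VERSION.SDK_INT < Build.VERSION_CODES.TIRAMISU) return

    val audioManager = context.getSystemService(Context.AUDIO_SERVICE) as AudioManager
    val spatializer = audioManager.spatializer

    // isAvailable: the current device and output route can spatialize audio.
    // isEnabled: the user has spatial audio switched on in system settings.
    Log.d("SpatialAudio", "available=${spatializer.isAvailable}, enabled=${spatializer.isEnabled}")
}
```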

 

As an app developer, you may be wondering how to apply spatial audio to your app. ExoPlayer, the most widely used player library on Android, makes this easy without requiring an understanding of the framework internals: since version 2.18, ExoPlayer automatically selects multi-channel tracks and configures spatial audio behavior.
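As a rough sketch of that simplicity (ExoPlayer 2.18+ with default settings; the content URL is a placeholder), ordinary playback code needs nothing spatial-audio-specific:

```kotlin
import android.content.Context
import com.google.android.exoplayer2.ExoPlayer
import com.google.android.exoplayer2.MediaItem

// Sketch: with ExoPlayer 2.18+ defaults, multi-channel track selection and
// spatialization behavior are handled for you; the app just plays the content.
// The URL below is a placeholder, not a real asset.
fun playMultiChannelContent(context: Context): ExoPlayer {
    val player = ExoPlayer.Builder(context).build()
    player.setMediaItem(MediaItem.fromUri("https://example.com/5_1_channel_content.mp4"))
    player.prepare()
    player.play()
    return player
}
```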

 

In addition, the aforementioned head-tracking implementation requires IMU sensor information. To update the spatial audio renderer based on this sensor data, a Head Tracking HID sensor class has been added to the Sensor Service framework, providing a standardized channel between the Sensor Service and the Audio Service. Furthermore, it is recommended that this IMU sensor information strictly follow the HID (Human Interface Device) protocol.
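As a hedged illustration, Android 13 (API 33) exposes a Sensor.TYPE_HEAD_TRACKER constant in the public SDK; the sketch below merely probes whether such a sensor is visible to an app. In practice the head tracker typically appears as a dynamic sensor only while compatible earbuds are connected, and it is consumed mainly by the audio framework itself rather than by apps.

```kotlin
import android.content.Context
import android.hardware.Sensor
import android.hardware.SensorManager
import android.os.Build

// Sketch (assumes Android 13 / API 33): probe whether a head tracker sensor is
// visible. It may be registered as a dynamic sensor only while compatible
// earbuds are connected.
fun hasHeadTrackerSensor(context: Context): Boolean {
    if (Build.VERSION.SDK_INT < Build.VERSION_CODES.TIRAMISU) return false

    val sensorManager = context.getSystemService(Context.SENSOR_SERVICE) as SensorManager
    val builtIn = sensorManager.getDefaultSensor(Sensor.TYPE_HEAD_TRACKER) != null
    val dynamic = sensorManager.getDynamicSensorList(Sensor.TYPE_HEAD_TRACKER).isNotEmpty()
    return builtIn || dynamic
}
```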

 

The HID protocol was designated by the USB Implementers Forum and was originally defined for USB communication between host devices and peripherals such as keyboards and mice. As Bluetooth devices became widespread, an HID profile for them was defined as well, expanding the protocol's reach. It is this protocol that smartphones and earbuds use to exchange IMU sensor information.

 

[1] Serves as the interface between your app and the audio framework.
[2] Receives and processes audio control requests, such as volume changes, and can apply each manufacturer's audio system policies to them. The service then asks the AudioFlingerService to apply the resulting audio policy to the current audio input and output processing.
[3] Responsible for controlling audio inputs and outputs. To do so, it must control audio hardware from different manufacturers, with different drivers, in a unified way; this is achieved through a Hardware Abstraction Layer (HAL), which serves as the interface to the hardware. The audio inputs and outputs controlled here follow the audio policy received from the AudioPolicyService.

 

Figure 5. Changes in the Android stack with the addition of Android Spatial Audio
 
Source: Google Android Source (https://source.android.com/docs/core/audio/spatial)

 

Android Spatial Audio is here, but a lot of work still needs to be done.

Thanks to the straightforward way Android Spatial Audio lets a spatializer be plugged in, Android device manufacturers can now focus on how to implement spatial audio effectively. However, there are certain issues manufacturers need to consider due to some support limitations. Let's take a closer look at what needs to be considered.

 

• Manufacturers must implement the spatializer, or renderer, themselves. They also need to design their products around the time delay that head-tracking support introduces. As the double pass over Bluetooth described earlier suggests, achieving the recommended delay of 150 milliseconds or less in Android Spatial Audio is quite challenging.

 

• Processing is only possible on high-performance devices running Android 13, which means it cannot run on devices without Android, such as the earbuds themselves. (It is also possible to implement the renderer directly on earbuds without using the Android Spatial Audio stack; we will introduce this approach at a later opportunity.)

 

• For manufacturers that need a consistent spatial audio experience across devices such as smartphones, tablets, TVs, and laptops, the burden increases significantly, since every device in the chain matters, from the source device to the earbuds or headphones. Even with high-quality earbuds or headphones, a poorly implemented spatializer on the smartphone they are connected to can leave users with unintended sound.

 

• At present, spatial audio is only supported for 5.1-channel audio; stereo audio is not spatialized. Since there is far more stereo content than 5.1-channel content, users currently have few opportunities to actually benefit from this feature. (An app can ask the platform whether a given format will be spatialized, as shown in the sketch after this list.)
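As a minimal sketch (assuming Android 13 / API 33, with placeholder attribute and format values), an app could check whether content in a given format would actually be spatialized on the current device and output route:

```kotlin
import android.content.Context
import android.media.AudioAttributes
import android.media.AudioFormat
import android.media.AudioManager
import android.os.Build

// Sketch (assumes Android 13 / API 33): ask the platform whether 5.1-channel PCM
// media would be spatialized on the current device and output route. Passing a
// stereo channel mask here would typically return false today.
fun canSpatialize51(context: Context): Boolean {
    if (Build.VERSION.SDK_INT < Build.VERSION_CODES.TIRAMISU) return false

    val audioManager = context.getSystemService(Context.AUDIO_SERVICE) as AudioManager
    val attributes = AudioAttributes.Builder()
        .setUsage(AudioAttributes.USAGE_MEDIA)
        .build()
    val format = AudioFormat.Builder()
        .setEncoding(AudioFormat.ENCODING_PCM_16BIT)
        .setSampleRate(48_000)
        .setChannelMask(AudioFormat.CHANNEL_OUT_5POINT1)
        .build()
    return audioManager.spatializer.canBeSpatialized(attributes, format)
}
```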

 

If you are a manufacturer considering implementing Android Spatial Audio, just ask Gaudio Lab. Our solution not only provides an optimized library for the high-quality renderer (spatializer) that manufacturers would otherwise have to implement themselves, but also incorporates a range of know-how for minimizing time delay.

Gaudio Lab won a CES 2023 Innovation Award for its spatial audio technology. For more detailed information on GSA, click here (and listen to the GSA demo 🥳).
