Reinventing music with VR: Personal and Interactive audio in full 3D space

2017.05.24 by Henney Oh



Early critics thought jazz totally ruined the music that came before it. And that rock and roll did after that. Don’t even get started on rap and hip hop. New musical styles frequently face fierce criticism when they first hit the scene, but they bravely fight on to drive music, style, and culture forward in the face of those who oppose them. New technologies also tend to receive a comparably warm welcome. Well, keep your torches and pitchforks at the ready, because you may not like what’s coming next.


3D audio and virtual reality can totally change the way that people experience music. VR headset users have the freedom to look wherever they want. This actually isn’t anything new: you’ve always had the freedom to look wherever you want. The only thing that’s changed with VR is that now instruments can truly be placed all around you. In traditional stereophonic setups or multi-channel loudspeaker mapping scenarios, sounds are located only toward the front, at the stage or screen, to correlate with a person’s natural viewing angle. Even if the speaker configuration is increased to 5.1 or 7.1, the speakers at the back only create a general sense of ambiance, since there’s no corresponding visual source behind the audience. We briefly touched on these ideas in this article featured on AR/VR Magazine.


Placing sound sources in full 3D space not only increases variety, but it can also increase localization accuracy by freeing sources from a restrictive physical speaker configuration. However, just because sounds can be placed anywhere does not mean that they should be placed everywhere. This reinvented music can be compelling, but it won’t be without some challenges of its own.


Positioning sound sources in full spherical space

It’s one thing to talk about it on such a high level, but what would this new experience actually feel like? Picture yourself at the central point of a fully three-dimensional space, while performers and instruments surround you in every direction imaginable. In this vision, you are more immersed in the experience than ever before, and it overcomes the limitations of conventional music creation and reproduction, which is at best panned 180° in front of you.


Listening to this kind of music in spherical space is no longer a fantasy thanks to VR, especially if the audio comes from an object-oriented mix. In an object-oriented mix, sound sources (instruments, vocals, ambient sound or any combination) can be placed anywhere in a 3D space, with azimuth, elevation and distance information for each object. Each of these sound points is then rendered and projected through “virtual loudspeakers.” There can be as many virtual speakers as there are objects, as long as the right format is used. You may have thought you were listening to “surround sound” in the past, but sound objects were never actually surrounding you. The promise of full three-dimensional sound is only truly fulfilled in VR.
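To make the idea of an object carrying azimuth, elevation and distance more concrete, here is a minimal Python sketch. The `SoundObject` class, its field names, the axis convention (x forward, y left, z up) and the sample positions are all assumptions made for illustration; they do not come from any particular object-audio format.

```python
import math
from dataclasses import dataclass


@dataclass
class SoundObject:
    """Hypothetical sound object: one source plus its position metadata."""
    name: str
    azimuth_deg: float    # assumed: 0 = straight ahead, positive = listener's left
    elevation_deg: float  # assumed: 0 = ear level, +90 = directly overhead
    distance_m: float     # distance from the listener in meters

    def to_cartesian(self):
        """Convert (azimuth, elevation, distance) to x/y/z around the listener."""
        az = math.radians(self.azimuth_deg)
        el = math.radians(self.elevation_deg)
        x = self.distance_m * math.cos(el) * math.cos(az)  # forward
        y = self.distance_m * math.cos(el) * math.sin(az)  # left
        z = self.distance_m * math.sin(el)                 # up
        return (x, y, z)


# An illustrative mix: positions are made up for the example.
mix = [
    SoundObject("vocals", azimuth_deg=0.0, elevation_deg=0.0, distance_m=2.0),
    SoundObject("guitar", azimuth_deg=90.0, elevation_deg=0.0, distance_m=3.0),
    SoundObject("choir", azimuth_deg=180.0, elevation_deg=45.0, distance_m=4.0),
]

for obj in mix:
    x, y, z = obj.to_cartesian()
    print(f"{obj.name}: x={x:.2f} y={y:.2f} z={z:.2f}")
```

A renderer would consume positions like these per object, which is why nothing in the mix is tied to a fixed speaker layout.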




An expanded 3D canvas requires more than just matching image and audio

Before all else, sound must be spatialized accurately. The audio has to match the image as closely as if the sound source were attached to the visual object moving about the scene. The process of attaching that positional data to sound sources is key, and since a VR headset user’s head orientation should always be accounted for, the need for a good renderer is paramount. These post-production and consumption tools for this new medium are vital building blocks in creating the experience, but creators must wield them correctly if they want any chance of creating a transformative experience.


The industry is ready to graduate from simply matching the audio with the visuals. The next level of our VR education will be attempting to place sounds in every position imaginable. If we look again to history as our template, instrument placement in orchestras has been fine-tuned over hundreds of years. We will likely need a similar approach for our virtual canvas, which is expanded 360 degrees horizontally and 360 degrees vertically beyond the physical dimensions of a stage in real life. These new 720 degree settings are the Wild West of the musical frontier, and we’ll need a new set of operating guidelines to make the most of them.


The new guidelines should help artists use the creative tools correctly by addressing their use with the added dimension of space to consider. In VR, a song is not just about framing the timing of the elements, but about spatially framing them as well. Composers used to only need to care about each instrument’s pitch, loudness, and timing. They still have to care about those, but now they also have to take each instrument’s “virtual location” into consideration. Sheet music will be ineffective since three-dimensional data must be attached to each sound. Where to place a sound is totally up to them, which presents an unprecedented challenge along with unprecedented opportunities for inspiration. Instruments can be placed near each other to balance the harmony or totally separate to strategically convey a certain intention.


Many questions have been raised about instrument location, but they can be asked again regarding the listener. If it is an interactive VR piece where six degrees of freedom are available, the listener’s position can change. This leaves the creator with plenty of power and a difficult dilemma. Should they give the audience the same unrestricted freedom it has become accustomed to in VR, at the risk of the audience missing the artist’s intent? Or should they use cues and restrictions to deliberately designate the listener’s position? It’s clear that these ‘spatial’ aspects are factors to consider during the creation stage, well before the music ever reaches the audience’s ears.


Exploiting human auditory perception

You might not know it, but you’re already wired to take advantage of some elements that are pretty unique to the VR music model, and the shift from a channel-oriented approach to an object-oriented one further opens the door for a more personal and interactive experience. In one example of this, a fan who loves drums could potentially hear the drums louder than any other instrument in the song. This sounds a little farfetched, but there have already been steps to explore this phenomenon. In 2010, the Moving Picture Experts Group expounded on a signal compression method that details how a listener can manipulate sound to hear one audio signal over another on the device. This is now referred to as MPEG Spatial Audio Object Coding (SAOC).


Listeners in VR can control their sound experience depending on personal preference even without SAOC simply by looking at a specific item, as long as the mix is delivered to the device in an object-based format. We could conceivably take this idea one step further and make looking at the object fully interactive: if a VR headset user looks at one object long enough, it would indicate that the user wants to hear it up close, and the loudness of the object would increase as a result. Watching an Imagine Dragons music video, I might like to turn 40° to the right to hear more of the guitar, but you might prefer 120° to the right so you can hear the vocals louder.
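The gaze-driven loudness idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 30° focus cone, the 1.5 second dwell time and the 6 dB maximum boost are hypothetical values chosen for the example, not parameters of any real renderer.

```python
import math


def angle_between_deg(az1, el1, az2, el2):
    """Great-circle angle in degrees between two (azimuth, elevation) directions."""
    a1, e1, a2, e2 = map(math.radians, (az1, el1, az2, el2))
    cos_angle = (math.sin(e1) * math.sin(e2)
                 + math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
    # Clamp to guard against tiny floating-point overshoot before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def gaze_gain_db(angle_deg, dwell_s,
                 focus_width_deg=30.0, dwell_needed_s=1.5, max_boost_db=6.0):
    """Hypothetical rule: boost an object once the user has looked at it long enough.

    The boost falls off linearly from max_boost_db (looking straight at the
    object) to 0 dB at the edge of the focus cone.
    """
    if angle_deg > focus_width_deg or dwell_s < dwell_needed_s:
        return 0.0
    closeness = 1.0 - angle_deg / focus_width_deg
    return max_boost_db * closeness


def db_to_linear(db):
    """Convert a gain in decibels to a linear amplitude multiplier."""
    return 10.0 ** (db / 20.0)


# Example: the guitar sits 40 degrees to the right (negative azimuth assumed
# to mean "right"), and the user has been looking straight at it for 2 seconds.
angle = angle_between_deg(-40.0, 0.0, -40.0, 0.0)
boost = gaze_gain_db(angle, dwell_s=2.0)
print(f"boost: {boost:.1f} dB, multiplier: {db_to_linear(boost):.2f}")
```

Because the boost is computed per object on the device, two listeners looking in different directions would hear different balances of the same mix.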


VR is giving this powerful option of controlling which sound you want to hear to the user — making audio more personal and truly interactive. But do these ideas of auditory individualism have any foundation outside of VR? How can you hear one sound more than another just by looking at it?


Psychoacoustic principles can help answer some of these questions, and one principle that will be prominent in VR environments is Binaural Masking Level Difference. BMLD explains that you can hear a quieter sound over a louder one as long as they are spatially separated. Let’s say there is a loud bell and a softly chirping bird. If they are located in the same position, you can’t hear the bird, a concept called spatial masking. However, if they are located in different positions, that same quiet bird is spatially unmasked and becomes audible.


Meanwhile, another psychoacoustic theory commonly known as the “cocktail party effect” will also be influential. If you’ve ever found yourself in a loud and crowded room, but able to carry on a conversation with someone very interesting to you, you have been the beneficiary of this phenomenon. If you can determine the direction and thus the location of a sound object, you can single out its sound to hear it more clearly than the rest. Using the same objects from the illustration above, you would be able to hear the bird over the bell if you know its direction and pay more attention to it, as long as the bird and bell are spaced apart.


We have seen proof of both the BMLD and cocktail party effects thanks to our recent 360 music video experiment with Jambinai. In the scene, there are several musical instruments, or objects, placed in separate locations. If you pay attention to one instrument (object), the others blend together into something more like background sound. Now that you are focused on one object, a different perceptual loudness is delivered to you, and you can clearly hear that specific object’s sound. If Jambinai were downmixed into a standard mono mix, you would hear well-balanced sounds for each instrument, but they would lose a significant amount of detail. Since the information for each instrument is preserved in an object-oriented mix, a listener can single out each instrument and hear its details. Amazingly, this leads to a different musical experience for each individual, even though the musical piece is exactly the same.


However, this level of customization could be a point of contention because creators, from composers and players to sound engineers and producers, have mastered their song in a specific way. So some critics will ask, “Why should the composition be subject to corruption by the listener?” That’s a fascinating topic for another time. What needs to be understood for now is that it has become easier for a listener to focus on one sound object over another, which is a huge jump from the stereo-oriented era.


Many enthusiasts contend that the balcony, not the middle or the front, is the sweet spot to listen to an orchestra. If you are creating VR content that lets users explore the scene on their own, then finding that perfectly harmonized spot may not be the ultimate goal. We are at the frontier of a new wave of music — an unblazed trail waiting for those VR audio engineers who are brave enough to pave the way forward.

How To: Producing immersive audio for VR

The key to creating realistic audio for VR is to synchronise sounds according to the user’s head orientation and view in real time. This helps replicate the actual human hearing mechanism, which makes the listening experience more realistic.

Producing truly immersive sound requires several steps. First, you must capture the audio signals, then mix the signals and finally render the sound for the listener.

Capturing

Ambisonics is a technique that employs a spherical microphone to capture a sound field in all directions, including above and below the listener. This requires placing a soundfield microphone (also known as an Ambisonics or 360 microphone) somewhere near the position from which you intend the audience to listen. Keep in mind that these microphones will record a full sphere of sound at the position of the microphone, so be strategic with where you place them. It’s also important that your mic is not spotted in the scene, so we encourage placing the microphone directly below the 360 camera.

In addition to capturing audio from a soundfield microphone, content creators also need to acquire sounds from each individual object as a mono source. This enables you to attach higher fidelity sounds to objects as they move through the scene for added control and flexibility. With this object-based audio technique, you can control the sound attributed to each object in the scene and adjust those sounds depending on the user’s view.

Mixing

Combining objects, Ambisonics and channels (like traditional 2.0 if needed) and balancing them plays an important role in mixing and mastering 3D audio. If you captured the objects and the Ambisonics together, be aware that the Ambisonics signal already contains the objects. You may need an additional process to remove or balance those object tracks to ensure they aren’t counted twice.

With VR and 360 video content, you not only need to consider the actor’s mouth movements but also carefully place the sound according to the position of the actor on the 360 screen, which requires a new and more dedicated sound mastering tool. Specifically, it’s now important to use a tool that lets you edit as you watch, so that while watching the visuals, you can match the sounds accordingly in both space and time.

Rendering

Historically, content creators relied on DAWs for everything from mixing to mastering into a target layout, so the output of a DAW was a pre-rendered sound bed. However, with VR, sound rendering must take place on the listener’s end, which in this case is the actual VR hardware, most frequently a head-mounted display (HMD). All of the possible scenarios have to be processed on the HMD, which can require a huge amount of additional processing power. As such, the key is to minimise latency and the amount of computation power needed for rendering while still maintaining high quality.

A benefit of the renderer being on the listener’s end is the possibility of unprecedented levels of personalisation. Keep in mind that with a conventional pre-rendered bed, you can’t vary its rendering according to each user. However, personalisation is still a long way out, as measuring an individual’s personalised HRTF is still an expensive and time-consuming process.
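As a rough illustration of why rendering has to follow the listener’s head, the Python sketch below encodes a mono object into first-order Ambisonics and then counter-rotates the sound field by the head’s yaw. It assumes the traditional B-format convention (W scaled by 1/√2, azimuth measured counter-clockwise from straight ahead) and handles yaw only; the function names are hypothetical, and a real renderer would also handle pitch and roll and then decode the result binaurally with HRTFs.

```python
import math


def encode_first_order(sample, azimuth_deg, elevation_deg):
    """Encode one mono sample into first-order B-format (W, X, Y, Z).

    Assumed convention: W carries the omnidirectional part scaled by
    1/sqrt(2); X points forward, Y to the left, Z up.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)
    x = sample * math.cos(az) * math.cos(el)
    y = sample * math.sin(az) * math.cos(el)
    z = sample * math.sin(el)
    return (w, x, y, z)


def compensate_head_yaw(bformat, head_yaw_deg):
    """Rotate the sound field opposite to the head's yaw.

    If the listener turns left by head_yaw_deg, every source must appear
    to move right by the same angle so that it stays fixed in the scene.
    W and Z are unaffected by a rotation about the vertical axis.
    """
    w, x, y, z = bformat
    t = math.radians(head_yaw_deg)
    x_rot = x * math.cos(t) + y * math.sin(t)
    y_rot = -x * math.sin(t) + y * math.cos(t)
    return (w, x_rot, y_rot, z)


# A source directly to the listener's left (azimuth 90 degrees).
frame = encode_first_order(1.0, azimuth_deg=90.0, elevation_deg=0.0)

# The listener turns 90 degrees to the left to face it.
rotated = compensate_head_yaw(frame, head_yaw_deg=90.0)
print("before:", [round(v, 3) for v in frame])
print("after: ", [round(v, 3) for v in rotated])
```

After the rotation the energy moves from the Y (left) channel to the X (front) channel, meaning the source is now heard straight ahead. This step has to run for every audio block as the head moves, which is part of why latency and processing cost matter on the HMD.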

CES at a Glance

We started off the new year in full swing exhibiting at CES, the largest consumer tech show held in Las Vegas. This year’s show brought nearly 200,000 attendees from across the globe to the city’s famous Strip, along with the industry’s biggest innovators, influencers and trendsetters that set the stage for this year’s emerging technology. From corrective eye lenses for dogs to a self-driving suitcase, the number of entertaining and out-of-the-box gears and gadgets was as endless as the traffic lines of Ubers and buses shuttling people to and from the show’s nine locations. Amid all that, G’Audio spotted numerous exciting trends in the spatial audio field to keep an eye on this year.

Here’s a list of trends and emerging tech in 2018:

Soundbars

Soundbars seem to be taking the spatial audio industry by storm thanks to their touted ability to produce 3D sound without a multi-speaker setup such as 5.1, 7.1, or even 9.1 surround sound. German audio company Sennheiser showcased its soundbar prototype for home theater 3D audio. The Ambeo 3D Soundbar includes 13 speakers, nine across the front and two angled on the top. The prototype doesn’t have a release date just yet, but is expected to hit the market by the end of the year. Fraunhofer’s upHear spatial audio microphone processing technology can now play back on soundbars, enhancing the immersive audio experience (please note this is solely an algorithm and the company does not produce hardware for soundbars). Other audio companies that showed soundbar technology include Qualcomm, Creative Labs and SotonAudio Labs.

Holographic Displays

Holograms used as visual displays made multiple appearances this year, making us feel like we had stepped into a futuristic 3018 instead of the present 2018. British startup Kino-mo had holographic ads floating above Eureka Park, the show’s startup sanctuary (where we were also exhibiting in the AV section). Other holographic tech companies included Merge VR, whose Holo Cube lets users interact with holograms, and Hologruf, which showcased 3D holographic displays.

Augmented Reality

AR has seen a spike in interest in the past few months, with industry bigwigs such as Microsoft, Amazon, Dell, and even StubHub investing in the technology’s promise of popularity among consumers. The G’Audio team checked out some interesting AR booths at Eureka Park that included both hardware and software, such as messenger apps.

Virtual Reality

Although VR has experienced many ups and downs this past year, CES proved the industry is still advancing, offering better quality content, new technology and more affordable hardware.

Here are some exciting VR tech updates:

Tobii Eye Tracking is pretty self-explanatory. The company’s technology was integrated into the HTC Vive and can immensely improve reaction times in VR.

NextVR, a fellow SoCal company, has improved its livestreaming video solution that’s compatible with most HMDs and is working towards incorporating 6DoF into the new resolution.

Contact CI Haptic Gloves simulate the sensation of touching by recreating the way in which human hand muscles move.

What’s New in the World of VR Headsets

There were many new HMDs showcased throughout the week. The biggest advancement for HMDs in VR was HTC Vive’s announcement of the HTC Vive Pro and Vive Wireless Adapter, which are more user-friendly, offering higher resolution and lighter weight. Kopin displayed its Elf reference design, a super-slim reference VR headset that includes 2K OLED. Huawei announced its VR headset, VR 2, which will support an IMAX virtual giant screen experience and will debut in China this January. Pimax released its ultra-wide 8K-resolution HMD, and iQIYI Intelligence did the same with its QIYU VR II, which also supports up to 8K resolution and is able to recognize VR content in various code formats. Additionally, Facebook teamed up with Xiaomi and Qualcomm for the new Oculus Go headset.

Going Beyond Gaming and Cinematic VR

VR is a multifaceted medium that goes beyond the entertainment and gaming sectors, and can be used for a multitude of purposes, including education, healthcare and even training. Many companies working in these areas exhibited at CES this year. In particular, G’Audio enjoyed visiting Looxid Lab, which provides sensors to detect brain waves, and eye tracking to analyze the emotions of users as well as the direction in which the user is looking in the VR content. Additionally, an analytics service is offered to advertisers or research companies for marketing purposes. We’ve also seen an increase in VR as a tool for training employees, as seen at a UK startup that specializes in creating industrial training programs developed in Unity.

G’Audio at CES

We announced some exciting news of our own at CES. We’ve launched our livestreaming audio renderer! The Sol Livestreaming audio solution for 360-degree video transcends the limitations of current livestreaming formats. Now, users can fully experience the sound of live entertainment and sports events without physically being present, as if they had the best seat in the house. Specifically designed to serve up Ambisonics audio signals, Sol Livestreaming provides an accurate sense of the entire soundscape. Furthermore, we’ve adapted our GA5 format for livestreaming and squeezed B-format Ambisonics into the popular AAC codec. Utilizing this ubiquitous codec allows Sol Livestreaming and its renderer to be easily adopted across multiple platforms, bringing truly immersive audio to content creators and consumers alike.

Overall, CES was an action-packed week full of new and exciting technology, but most importantly, we saw numerous booths in the spatial audio realm, showing the industry is continuing to grow and advance. We can’t wait to see how this year will unfold!