Metadata Makes Object-Based Audio Mixing Possible

FKT Magazin 4/2021

With the advent of immersive audio mixing using formats like Dolby Atmos and DTS:X (the successor to DTS-HD), professionals now have the ability to create interactive, personalized, scalable and immersive content by representing it as a set of individual assets together with metadata describing their relationships and associations. This is called object-based audio mixing, and it’s adding a new dimension to multichannel mixes for television and film. Some say it helps create a multi-dimensional sound experience in which sound moves around the viewer as it would in real life.

Object-based media allows the content of programs to change according to the requirements of each individual audience member. The ‘objects’ refer to the different assets that are used to make up a piece of content. These could be large objects – the audio and video used for a scene in a drama – or small objects, like an individual frame of video, a caption, or a sound effect. By breaking down a piece of media into separate objects, attaching meaning to them, and describing how they can be rearranged, a program can be changed to reflect the context of an individual consumer.

Formats like Dolby Atmos will often use a combination of linear tracks, called bed tracks in the Dolby Atmos workflow, and individual sounds, called objects, where each object has matching metadata that tells the system where in the sound field to position that individual sound. This enables you to represent the content in a way that suits the playback environment, whether it is a huge cinema or someone’s front room. If the target is just a stereo or binaural mix, the system can also use the metadata to deliver a mix for that delivery environment. The Dolby system takes the metadata and, in conjunction with its knowledge of the speakers attached to it, plays back the full program, bed tracks and objects, so that both the mixer’s intent and the space in which it is being played back are respected.
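The way position metadata drives playback can be sketched in a few lines of Python. This is an illustrative toy, not Dolby’s renderer: the `AudioObject` class, its `azimuth` field and the functions `stereo_gains` and `render_stereo` are hypothetical names, and simple constant-power panning stands in for whatever a real renderer does for a given speaker layout.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AudioObject:
    """A mono sound plus the metadata that says where it belongs."""
    name: str
    azimuth: float  # assumed convention: -1.0 = hard left, +1.0 = hard right


def stereo_gains(obj: AudioObject) -> Tuple[float, float]:
    """Constant-power pan: map the object's azimuth to (left, right) gains."""
    angle = (obj.azimuth + 1.0) * math.pi / 4.0  # ranges from 0 to pi/2
    return math.cos(angle), math.sin(angle)


def render_stereo(bed: List[Tuple[float, float]],
                  objects: List[Tuple[AudioObject, List[float]]]):
    """Mix a stereo bed with panned objects.

    bed: (left, right) sample pairs of the channel-based bed.
    objects: (AudioObject, mono samples) pairs, same length as the bed.
    """
    out = [list(pair) for pair in bed]
    for obj, samples in objects:
        left, right = stereo_gains(obj)
        for i, sample in enumerate(samples):
            out[i][0] += left * sample
            out[i][1] += right * sample
    return [tuple(pair) for pair in out]
```

The same objects and metadata could be handed to a different rendering function for a 5.1 or binaural target, which is the point of keeping the position as data rather than baking it into the channels.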

However, object-based audio is not just about Dolby Atmos and DTS:X. It is possible to use object audio to deliver content to the end user where they can adjust the balance between content elements. MPEG-H Audio also offers interactive and immersive sound, employing audio objects, height channels, and Higher-Order Ambisonics, for other types of distribution, including OTT services, digital radio, music streaming, VR, AR, and web content. Fraunhofer, Dolby and others are now offering personalized audio delivery systems, based around standards such as MPEG-H Audio and AC-4, that enable the end user to choose what they want to hear or not hear. For example, in tennis, maybe you don’t want to hear the shrieks from a player? Thanks to embedded metadata triggers, consumers will have the option to turn that down.
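A minimal sketch of that kind of personalization follows. It assumes the mixer authors a permitted gain range for each object and the receiver clamps the listener’s request to it; the `ObjectControl` class, its field names and the example values are hypothetical, not taken from any particular standard.

```python
from dataclasses import dataclass


@dataclass
class ObjectControl:
    """Interactivity metadata for one object (hypothetical field names)."""
    label: str
    default_gain_db: float
    min_gain_db: float  # lowest level the mixer lets the listener choose
    max_gain_db: float  # highest level the mixer lets the listener choose


def apply_user_gain(control: ObjectControl, requested_db: float) -> float:
    """Clamp the listener's request to the range authored by the mixer."""
    return max(control.min_gain_db, min(control.max_gain_db, requested_db))


def db_to_linear(gain_db: float) -> float:
    """Convert a gain in decibels to a linear amplitude multiplier."""
    return 10.0 ** (gain_db / 20.0)


# Example: the listener asks for near-silence on the player microphones,
# but the (assumed) authored range only allows a 12 dB reduction.
player_mics = ObjectControl("player mics", 0.0, -12.0, 3.0)
chosen_db = apply_user_gain(player_mics, -40.0)
multiplier = db_to_linear(chosen_db)
```

Keeping the limits in the metadata lets the content creator decide how far a mix may be altered, while the final choice stays with the listener.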

Audiences want to watch (and listen to) content everywhere, and with mobile devices, they might start watching or listening to a program at home and then finish the rest on the bus. Leveraging metadata, object-based media allows the mixer to specify different audio mixes for different environments. If people are listening on the move, with object-based audio the mixer can make sure that the sound is just right for them, no matter where they are.

Audio becomes an object when it is accompanied by metadata that describes its existence, position and function. An audio object can, therefore, be the sound of a bee flying over your head, the crowd noise, or commentary on a sporting event in any language. All of this remains fully adjustable at the consumer’s end to suit their specific listening environment, needs and liking, regardless of the device.

In the UK, the BBC has been experimenting with object-based audio, work that has led to a new ITU recommendation (ITU-R BS.2125, “A serial representation of the Audio Definition Model”), published in February 2019. It outlines a specification for metadata that can be used to describe object-based audio, scene-based audio and channel-based audio.
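To give a feel for what such descriptive metadata looks like, the Python sketch below builds a small XML fragment loosely modeled on Audio Definition Model element names. It is a simplification under stated assumptions, not a conforming ADM or serial ADM document: it flattens the real hierarchy by nesting the position block directly under the object, and the `describe_object` helper, the ID value and the coordinates are made up for illustration.

```python
import xml.etree.ElementTree as ET


def describe_object(object_id: str, name: str, azimuth: float,
                    elevation: float, distance: float) -> ET.Element:
    """Build a simplified, ADM-inspired description of one audio object.

    Assumption: real ADM metadata spreads this information over several
    linked elements; here it is collapsed into one for readability.
    """
    obj = ET.Element("audioObject", {"audioObjectID": object_id,
                                     "audioObjectName": name})
    block = ET.SubElement(obj, "audioBlockFormat")
    for coordinate, value in (("azimuth", azimuth),
                              ("elevation", elevation),
                              ("distance", distance)):
        position = ET.SubElement(block, "position", {"coordinate": coordinate})
        position.text = f"{value:.1f}"
    return obj


# Example: the bee flying overhead, front-left and above the listener.
bee = describe_object("AO_1001", "Bee", azimuth=30.0, elevation=45.0, distance=1.0)
xml_text = ET.tostring(bee, encoding="unicode")
```

Because the description is plain data, the same fragment can travel alongside the audio and be interpreted by any renderer that understands the schema.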

Another important element in delivering object-based audio to the consumer has been the development of the MPEG-H Audio standard. MPEG-H Audio is already on air in Korea and the US (ATSC 3.0), Europe (DVB UHD), and China. MPEG-H was developed by Germany’s Fraunhofer IIS research institute and is an audio system devised for delivering format-agnostic object-based audio.

Fraunhofer IIS has demonstrated an end-to-end, production-to-consumer system that includes MPEG-H monitoring units for real-time monitoring and content authoring, post-production tools, MPEG-H Audio real-time broadcast encoders, and decoders in professional and consumer receivers. With MPEG-H it is possible to offer immersive sound that makes a scene more realistic, as well as to use audio objects to enable interactivity. This means viewers can personalize a program’s audio mix, for example, by switching between different languages, enhancing hard-to-understand dialogue, or adjusting the volume of the commentator in sports broadcasts. Along with Dolby’s new AC-4 format, which natively supports the Dolby Atmos immersive audio technology, MPEG-H is expected to have a significant impact on broadcast delivery services over the next two years.

Several production companies have developed metadata tools for automatic mixing that are both channel- and object-based. These are focused on live sports, where a machine learning engine can automatically create a mix of the on-pitch sounds without any additional equipment, services or human input. This frees sound supervisors to concentrate on creating better mixes. In addition, audio equipment vendors are now developing compatible products and beginning to see interest from their customers.

At the end of the day, object-based audio offers the consumer a lot more control while also providing content providers with the technology to deliver one stream of object-based content and then use the metadata to render the most appropriate version for the hardware the consumer is using to play back the content. There are still many issues to work out, like the challenge of deciding what are objects and what remain beds in a Dolby Atmos or DTS:X mix, but with time and experimentation, the promise of true personalization for the consumer, using object-based mixing and metadata, will be welcomed by all.