Surround Sound Explained: Part 5

Metadata, Upmixing, Downmixing & The Centre Speaker
By Hugh Robjohns
Published December 2001

There's a lot more to 5.1 surround sound production than stereo with extra channels — new, bewildering terms and concepts abound. This month, we explain Metadata, Upmixing, Downmixing, and look at what that centre speaker actually does...

In Part 4, I described the basics of 5.1 surround systems, including how the loudspeakers are supposed to be configured and aligned, the idea of bass management, and the various primary commercial data-reduction formats employed in DVD and cinema surround systems. This month, we will be delving a little deeper into the subject to look at Dolby's Metadata facility, which conveys important information to the end-user's surround playback equipment. This extra data informs the decoder about how to reproduce the audio, set the replay level, adjust the dynamic-range compression, and balance the downmixing of the 5.1 information to stereo (or mono). Consequently, setting the correct metadata for a given piece of audio is as important as getting the mix right in the first place!

We will also take a few steps closer towards producing material in 5.1 surround by considering the pros and cons of using the centre channel (in contrast to a 'phantom' centre), the implications for multiple A-D and D-A converters, and the ideas behind 'upmixing' stereo material to form a 5.1-compatible version.

Introducing Metadata

Dolby Digital is rapidly becoming the de facto multi-channel audio format for the consumer, despite the alternatives from DTS and MPEG. One reason for this is the way that every consumer who receives a Dolby Digital data stream should be able to enjoy the best audio reproduction possible, irrespective of the number of channels in their playback system, or the environment in which they are listening. This is thanks to the use of Metadata.

Metadata — data about the data — is information encoded with the audio and used during replay to optimise the audio presentation for the environment in which it is being heard. In the Dolby Digital system, the metadata is defined by whoever has produced the material to govern three main aspects, some of which may also be influenced by the end-user. These three elements are:

  • To set a consistent replay level (based on the level of dialogue, hence 'dialnorm') between different media and types of programmes;
  • To determine how the dynamic range should be reduced under less-than-optimal listening conditions;
  • To control downmixing — the reduction of 5.1 channels to stereo or mono.

The first element, dialnorm, is critical to the operation of the rest of the system and cannot be affected by the end-user. The other two elements are used to convey the producer's preferences for dynamic-range reduction and downmixing. If either (or both) of these modes are selected by the end-user, the producer's preferences determine the end result, hopefully with optimal results.

Metadata is produced by the audio originator during post-production, but since Dolby Digital is a consumer medium used for DVDs, and not a production format, the metadata starts life within a different domain. In Dolby's grand scheme, the post-production environment employs an audio encoding format called 'Dolby E'.

Although still a data-reduced audio format, Dolby E uses a far more benign data-compression method and a much higher data rate than Dolby Digital. It is designed to encode up to eight audio channels into a form which can be recorded as a single (stereo) AES-EBU data stream on conventional digital video or audio recorders. Since it uses considerably less data reduction than AC3 (the perceptual coding algorithm used in Dolby Digital), it also withstands many more encode/decode cycles without audible degradation, which is clearly important for post-production.

Once the audio producers have specified the various metadata settings, Dolby E in the post-production environment has to be transcoded to Dolby Digital for mastering to DVD. Some of the original metadata configures this transcoding, while the rest is passed on to control the end-user's AC3 decoder.

Metadata & Dialnorm

One of the most frustrating aspects of all current media is that of wildly varying replay volumes, particularly between programmes on television and radio. I'm sure everyone is familiar with the experience of being blasted out of the room when the adverts come on during a high-quality television drama or feature film. The drama or film will probably have been mixed with a fairly wide dynamic range, so the volume has to be raised to reproduce the dialogue at a sensible level. The advertisements, on the other hand, are recorded with a small dynamic range to ensure audibility under all circumstances. The end result is that the viewer has to adjust their replay volume continually, a job that the broadcaster should ideally be doing. The same is equally true, although less obvious, when switching between auditioning full orchestral works and pop music.

Dolby's 'Dialnorm' system allows the producer to identify the average level of the dialogue relative to the 0dBFS digital peak level, with -31dBFS being the standard. On replay, the decoder automatically sets the replay volume to establish this reference dialogue level for all programmes, irrespective of their type or dynamic range. Hey presto — no more reaching for the remote control for the adverts or when changing channels. Of course, this relies on the producers using the dialnorm facility appropriately, but its correct use is also essential for the proper functioning of the dynamic range control facilities embodied within Dolby Digital, so abusing the system has other detrimental repercussions.

Dolby Metadata

Dolby's metadata carries two sorts of data referred to as 'Informational' and 'Control'. The table below lists the full set of available parameters which may be configured in the DP569 Dolby Digital or DP571 Dolby E encoders, or via Dolby's DP570 Multichannel Audio Tool.

METADATA PARAMETER | INFORMATIONAL | CONTROL
UNIVERSAL SET
Dialogue Level 
Channel Mode 
LFE Channel 
Bitstream Mode 
Line Mode Compression 
RF Mode Compression 
RF Overmodulation Protection 
Centre Downmix Level 
Surround Downmix Level 
Dolby Surround Mode 
Audio Production Information 
Mix Level 
Room Type 
Copyright Bit 
Original Bitstream 
DC Filter 
Low-pass Filter 
LFE Low-pass Filter 
Surround 3dB Attenuation 
Surround Phase Shift 
EXTENDED SET
A/D Converter Type 
Preferred Stereo Downmix 
Lt/Rt Centre Downmix Level 
Lt/Rt Surround Downmix Level 
Lo/Ro Centre Downmix Level 
Lo/Ro Surround Downmix Level 
Dolby Surround EX Mode 

Useful information guides on Dolby Digital, Metadata and many other aspects of surround sound can be found at www.dolby.com

Metadata & Dynamic Range Control

Setting a 'normalised' dialogue level is a fine idea, but different programme material is recorded with widely varying dynamic ranges. Consequently, even though the dialogue may be at a comfortable level, the quiet elements of a feature film may be lost in a noisy environment, while the explosions may rattle the windows! However, Dolby Digital also has built-in facilities to reduce the dynamic range to suit the replay environment, whilst maintaining the reference dialogue level. So the listener no longer needs to continually adjust the volume control — in theory at least.

Although the dialnorm facility is notionally for 'dialogue normalisation', it is equally relevant for music-only programmes, where it sets the replay level necessary to match the perceived volume of other material. The aim is to avoid the consumer having to adjust the volume between different material as far as possible.

The Dialnorm scale spans -1 to -31dBFS in 1dB increments. The -31 setting represents no level shift, while -1 is the maximum level change. Dolby Digital standardises the time-averaged loudness of dialogue to -31dBFS by applying a fixed level-shift to all 5.1 channels determined by the dialnorm setting. If the dialnorm setting is -31, for example, no level change is required, as the programme already conforms to the required average level. On the other hand, if the programme has a more compressed dynamic range, it may be given a dialnorm value of -21, to reflect the fact that the average dialogue level is closer to the peak level. In this case, the decoder will apply 10dB of attenuation to bring the average level of the programme dialogue back down to the standard level.
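The level-shift arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the sums in the text: the function name and the constant are my own labels, not anything defined by Dolby.

```python
# Illustrative only: names are hypothetical, the arithmetic follows the text.
REFERENCE_DIALOGUE_LEVEL_DBFS = -31  # level the decoder normalises dialogue to


def dialnorm_attenuation_db(dialnorm: int) -> int:
    """Return the attenuation (in dB) implied by a dialnorm value."""
    if not -31 <= dialnorm <= -1:
        raise ValueError("dialnorm must lie between -31 and -1 dBFS")
    # A programme whose dialogue already sits at -31dBFS needs no shift;
    # every dB above that is removed by the decoder.
    return dialnorm - REFERENCE_DIALOGUE_LEVEL_DBFS


for value in (-31, -21, -1):
    shift = dialnorm_attenuation_db(value)
    print(f"dialnorm {value:>3} dBFS: attenuate all channels by {shift} dB")
```

The same shift is applied to every channel, so the internal balance of the mix is untouched; only the overall replay level moves.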

As already mentioned, the listening environment can have widely varying background noise levels depending on the time of day, current activity, and so on. Compressing the programme material to always remain clearly audible even in worst-case situations (as most popular music FM radio stations do) is hardly appropriate for a high-quality medium, and so Dolby have incorporated a user-selectable dynamic range control (DRC) to optimise the replay dynamics for the particular listening environment. In a quiet listening room with a decent monitoring system and no neighbours to annoy, the full dynamic range of the original material can be utilised. Alternatively, in a noisy environment, or late at night when you don't want to disturb sleeping neighbours, a greatly restricted dynamic range may be more appropriate. The user can choose how to audition the material, but the producer determines through the metadata exactly how the DRC works: how much the loud bits are pulled down and the quiet bits lifted up.

Figure 1: A diagrammatic representation of Dolby's Dynamic Range Control (DRC) process.

There are six predefined 'profiles' which may be applied, the most appropriate mode for the particular material being determined by the content producer through the metadata. These profiles are called: Film Light, Film Standard, Music Light, Music Standard, Speech, and None. The various dynamic curves ascribed to these profiles are all referenced to a 'null band' where no compression is applied and which is centred on the dialnorm value (see Figure 1, and also the 'Dynamic Range Control Profiles' box at the end of this article).

These DRC modes also exist in two forms: Line Mode and RF Mode. The former is used in equipment with built-in decoders and discrete multi-channel outputs, whereas the latter is used in equipment such as set-top boxes, which always provide a downmixed output (stereo or mono) that is typically passed on to a TV through the normal RF aerial connection.

In systems which use the Line Mode DRC, the metadata conveys information about the optimal low-level lift and high-level compression, maintaining the dialogue at the nominal -31dBFS level, but reducing the dynamic range about that point. Some equipment will also allow user-determined scaling of this DRC metadata to provide intermediate dynamic range settings.

The RF Mode has to maintain comparable sound levels with typical off-air TV broadcasts, and so an 11dB boost in overall level is applied, bringing the standardised dialogue level to a nominal -20dBFS. The full amount of dynamic range reduction is also applied, usually with some protective peak limiting too. This maximum level of compression is also often used in computers for DVD playback over small loudspeakers.

Real & Phantom Centre — What's The Difference?

It is worth pointing out that just because there is a centre channel in 5.1 surround, you don't have to use it! That may sound a bit odd, but there is reason in what sounds like madness.

The 'phantom' centre image.

We are all familiar with the idea that any instrument required to occupy a central position in a stereo sound image has to be reproduced equally by both left and right speakers, producing a 'phantom image'. This approach has been used for over 70 years and works tolerably well, but has a few drawbacks. For a start, the phantom image pulls towards the nearer loudspeaker if the listener moves away from the centre line between the two speakers. Secondly, since there are two sets of drivers energising the room from different places, the sound of a phantom image, particularly at the bottom end, is radically different to the sound that emanates from a single central speaker. This is the reason why mono checking should always be performed on a single loudspeaker rather than a phantom image produced by a stereo pair.

In contrast, the dedicated centre speaker of a 5.1 system is a physical sound source — sounds reproduced by it stay in a fixed spatial position regardless of the location of the listener, thus giving more stable imaging over a wider listening area. Secondly, a signal routed to only the centre speaker will sound exactly the same as it would if routed to only the left or right speakers — unlike a phantom centre image, where the quality changes quite dramatically as a sound is panned from left, through centre and on to right.

So, on the face of it, the dedicated centre channel seems to offer several advantages... but there is one serious problem, and that is that the end-user may not have set it up correctly! In any domestic situation, it is highly unlikely that all five speakers will be in the correct ITU recommended positions relative to the listener (as explained last month), and the centre and rear channels will probably be operating with different sensitivities to the left and right speakers.

How can this happen? Well, most 5.1 consumer systems provide facilities to 'fine-tune' the balance of the loudspeakers relative to one another, and most inexperienced listeners crank up the rear channels (just to emphasise the surround aspect of their new system), while a good few also turn up the centre channel to make dialogue clearer in feature films! Consequently, a balance which sounds great on a properly calibrated monitoring system can be all over the place in a domestic listening environment.

Although imprecise levels in the rear channels are rarely disastrous, the centre-channel level is critical, as this is where the main vocals are usually placed. If its level has been maladjusted, the balance between vocals and the other instruments (the hardest but most important aspect to perfect when mixing) will be incorrect, to the obvious detriment of the material. To overcome this uncertainty, many music balancers are returning to the use of phantom-centre images reproduced by the left and right channels, which always have a fixed level relationship to one another. With this technique, the balance between the vocals and other instruments is 'locked in' and can't be messed up by the end-user!

This approach also opens the door to some creative effects — the phantom image tends to be perceived as being in front of any sound emanating from the physical centre speaker (see diagram above), and so interesting layering effects can be achieved very simply by routing some sounds to the centre speaker and others to both left and right to make a phantom centre.

Another important aspect of using phantom-centre images is that of power handling and room coupling at low frequencies. Placing the kick drum and bass guitar in the centre channel alone will sound different from using a phantom centre — although whether this is an advantage or not depends on personal preferences. The centre-channel image will also not have the same dynamic headroom capability — a clear case of two speakers being better than one — although the use of the LFE (subwoofer) channel can counteract this problem to a degree.

Depending on the mixer being used, it is usually fairly simple to contrive a method to route individual signals either to a dedicated centre speaker channel, or to both left and right channels simultaneously to produce a phantom centre. Switching between the two routings quickly reveals their different sound characters and, within the context of the rest of the mix, the different spatial depth effects also become apparent — something which is well worth trying for yourself.

Metadata & Downmixing

Downmixing allows end-users to replay a version of the 5.1 source material even when a full 'home-theatre' 5.1 monitoring system is unavailable (see Figure 2). Equipment designed to accept Dolby Digital material, but which provides only mono or stereo outputs (eg. portable DVD players, set-top boxes and so forth), incorporates facilities to downmix the original 5.1 channels to the one or two output channels as standard. To ensure that this downmixing provides the best possible compromise, the material is checked for stereo and mono compatibility during mixing and post-production, and the 'instructions' for optimal downmixing are incorporated into the metadata.

Figure 2: Simplified downmix processing.

The Dolby Digital decoder can provide two versions of stereo downmixing, which are known as Lt/Rt and Lo/Ro, or 'surround' and 'stereo' respectively. The first is a Dolby Pro Logic-compatible version with matrixed centre and mono surround channels within the left and right information (see Part 2 of this series for a refresher on Dolby Pro Logic). The second is a straight stereo version, although the original centre and surround information may be recombined with the left and right channels in some way. If a mono output is provided, it is derived from the sum of Lo/Ro. When downmixing, the metadata determines the most appropriate levels to incorporate the surround and centre channels with respect to the left and right front channels.

By default, the Dolby Digital encoder introduces a 90-degree phase-shift between the two surround channels, so that when they are downmixed and the left surround is added to the left front, and the right surround to the right front, they automatically produce a Pro Logic-compatible matrixed surround channel. This phase-shift, although not required in a discrete 5.1 system, is usually inaudible with most material. However, it can become apparent with some music-only recordings, so the phase-shift can be disabled during the encoding if necessary. This is flagged in the metadata, and it is then impossible for the decoder to produce an Lt/Rt downmix. Clearly, it is therefore important to check the Lt/Rt and Lo/Ro downmixes of the material at the mixing stage, and flag the best compromise in the metadata.

The downmix metadata is encoded in two parts — Universal and Extended Bitstream Information (BSI). Only the latest decoders are able to utilise the Extended parameters, which allow even more precision in balancing the downmix components. Provision is made in the Universal BSI to set the level at which the centre channel is combined with the left and right channels. The default is -3dB, but -4.5 or -6dB can be specified as alternatives. Similar facilities are available for combining the surround channels with options of -3dB (default), -6dB, or off altogether.
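To make the role of these level settings concrete, here is a minimal Python sketch of a straight Lo/Ro downmix using the Universal defaults quoted above. The function names are hypothetical, the LFE channel is simply left out, and no overall scaling is applied to guard against overload, so treat it as an illustration of the principle rather than a description of what any real decoder does internally.

```python
def db_to_gain(level_db):
    """Convert a level in dB to a linear gain; None stands for 'off'."""
    return 0.0 if level_db is None else 10 ** (level_db / 20)


def downmix_lo_ro(left, right, centre, left_sur, right_sur,
                  centre_db=-3.0, surround_db=-3.0):
    """Fold one 5-channel sample frame down to a Lo/Ro stereo pair.

    centre_db and surround_db play the part of the metadata downmix levels
    (Universal options in the text: -3, -4.5 or -6dB for the centre, and
    -3dB, -6dB or off for the surrounds).
    """
    c = db_to_gain(centre_db)
    s = db_to_gain(surround_db)
    lo = left + c * centre + s * left_sur
    ro = right + c * centre + s * right_sur
    return lo, ro


def downmix_mono(lo, ro):
    """Mono is derived from the sum of the Lo/Ro pair."""
    return lo + ro


# A centre-only signal appears equally in both downmix channels.
lo, ro = downmix_lo_ro(0.0, 0.0, 1.0, 0.0, 0.0)
print(f"Lo={lo:.3f} Ro={ro:.3f} mono={downmix_mono(lo, ro):.3f}")
```

Changing `centre_db` from -3 to -6 in this sketch shows why the producer's choice matters: a vocal carried in the centre channel sits noticeably lower in the stereo fold-down.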

The Extended BSI provides a broader range of downmix parameters. Included in these is provision for the programme producer to flag a preference for Pro Logic Lt/Rt or straight stereo Lo/Ro downmixing, and the centre and surround channel mix levels can be set far more precisely. The options here range from +3 to -6dB in 1.5dB increments, plus an 'off' mode — and different levels can be set independently for the Lt/Rt and Lo/Ro downmixes too. There is even a flag to indicate when the 5.1 surround channels are encoded with the Dolby EX format to provide a rear centre channel (for more on Dolby EX, see Part 4 of this series).
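As a quick reference, the Extended mix-level options described above can be listed alongside their linear gain factors. This short Python snippet is just arithmetic on the figures in the text; the variable names are my own.

```python
# Extended downmix levels as described: +3 to -6dB in 1.5dB steps, plus 'off'.
EXTENDED_LEVELS_DB = [3.0 - 1.5 * step for step in range(7)]

for level_db in EXTENDED_LEVELS_DB:
    gain = 10 ** (level_db / 20)  # dB to linear amplitude factor
    print(f"{level_db:+5.1f} dB  ->  x{gain:.3f}")
print("  off     ->  x0.000")
```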

Within this metadata, Dolby have also incorporated many flags to indicate which channels are active (2/0, 3/1, 3/2 and so on, as explained last month), and also what the content of the audio data stream is. For example, the audio data may represent the complete main mix (CM), or a music and effects mix (ME) which, in a DVD film, would be used in conjunction with a separate dialogue channel (D) — the latter possibly being available in different languages. These may have to be combined appropriately before output and downmixing. There may also be flags for narrative soundtracks for the visually impaired (VI), and increased intelligibility tracks for the hearing impaired (HI). Numerous other options can be provided, including additional commentary channels and the dreaded karaoke mode!

Multi-channel Converters & 5.1 Surround

An issue which is not, perhaps, immediately apparent when working in 5.1 surround is that of converter latency. Any A-D or D-A converter takes a finite time to convert audio between the analogue and digital domains and, in general terms, the higher the oversampling rate of the converter, the longer that conversion time will be — around one millisecond is not atypical. When working in stereo, this latency only becomes significant when using analogue inserts from a digital console — but is rarely a problem even then.

However, expanding an existing recording or mixing system from a stereo arrangement to 5.1 creates some traps in addition to the obvious expenses of additional speakers! Clearly, additional A-D converters will be required to digitise the extra output channels (centre, sub, left and right surround) above and beyond the original Left-Right pair.

The problem is that unless the converters are all the same model from the same manufacturer, they will inevitably exhibit different processing latencies, which will cause small delays between the channels (or pairs of channels if multiple stereo converters are employed). While this problem may remain inaudible when auditioning discrete channels throughout the recording and mixing, the downmixed combination of channels will almost certainly exhibit undesirable comb-filtering or phasing effects. It is therefore essential to use perfectly matched A-D converters when digitising the left, right, centre, subwoofer, left surround and right surround channels. Most converter manufacturers are now offering six- or eight-channel converters specifically to address this issue.
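To get a feel for the size of the problem, the Python sketch below works out where the comb-filter notches fall when two channels carrying the same signal at equal level are summed with a small timing offset between them. That is the worst case and an assumption of mine for illustration; real programme material will usually be only partly correlated between channels, and the function names are hypothetical.

```python
import math


def comb_notch_frequencies(delay_s, max_freq_hz=20_000.0):
    """Frequencies (Hz) cancelled when a signal is summed with a delayed copy.

    Cancellation occurs wherever the delay equals an odd number of
    half-periods, i.e. at f = (2k + 1) / (2 * delay).
    """
    notches = []
    k = 0
    while True:
        freq = (2 * k + 1) / (2 * delay_s)
        if freq > max_freq_hz:
            break
        notches.append(freq)
        k += 1
    return notches


def summed_level_db(freq_hz, delay_s):
    """Level of the sum relative to one channel alone, for equal-level copies."""
    magnitude = abs(2 * math.cos(math.pi * freq_hz * delay_s))
    return 20 * math.log10(magnitude) if magnitude > 0 else float("-inf")


for delay_ms in (0.5, 1.0):
    notches = comb_notch_frequencies(delay_ms / 1000)
    print(f"{delay_ms} ms offset: first notch near {notches[0]:.0f} Hz, "
          f"{len(notches)} notches below 20 kHz")
```

Even an offset of a fraction of a millisecond puts the first notch squarely in the audible band, which is why matched converters matter as soon as the channels are combined in a downmix.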

Of course, if you are mixing for 5.1 with a digital console, the onboard A-D converters handle only individual source channels. These are then routed to the various output channels entirely within the digital domain, thus avoiding any inter-channel timing disparities. Ideally, the D-A converters will be matched to each other too, although this is less critical, as timing discrepancies here will only affect imaging precision in the particular monitoring environment, rather than affecting the 5.1 master and its downmixing compatibility.

Upmixing

Having just described Dolby's downmixing provisions, it seems appropriate to discuss the concept of 'upmixing' here, even though upmixing is not part of Dolby's metadata technology. Upmixing is the process of taking conventional stereo material and reformatting it for release as 5.1 surround material — much in the same way that mono material was often 'repurposed' for stereo release (or 'stereoised' in American terms) in the 1950s and '60s.

There are lots of ways of expanding stereo material to 5.1, but most techniques are based on those developed for converting mono to stereo. The most common approach seems to be to allocate the original stereo channels to the left and right channels of the 5.1 array, and then derive centre and surround channels. The derivation of centre and surround channels may remain unchanged throughout the track (the usual technique for classical music), or may change — perhaps to emphasise specific sections, such as the choruses in popular music. Of course, by retaining the original left and right channels, the end-user can easily get back to the original stereo if they don't like the upmixed surround version.

Figure 3: A diagrammatic representation of a typical upmix procedure.

Techniques vary widely, but the centre channel is typically a mono sum of left and right, possibly with a little equalisation and maybe a touch of extra delay. The surrounds are usually derived from left minus right (and vice versa), again with some delay and EQ, and possibly some extra reverb (see Figure 3). Much hushed talk surrounds the production of upmixed 5.1 mixes from stereo, but as you can see, it's not really rocket science, merely a way of arranging material from a completed two-channel stereo mix such that it plays back over all six channels of a 5.1 playback system. There are also several products on the market now — such as TC Electronic's System 6000 — which incorporate dedicated algorithms specifically to convert stereo material to 5.1. The results depend on the source material, but often the upmixed version is very credible and pleasant to listen to.
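The basic signal flow just described is simple enough to sketch in Python. The version below keeps the original left and right channels, derives the centre from their sum and the surrounds from their difference, and delays the surrounds slightly. The gains and the delay length are arbitrary values chosen for illustration, the EQ and reverb stages are omitted, and none of this represents any particular commercial upmixing algorithm.

```python
def delay(samples, n):
    """Delay a list of samples by n samples, keeping the length unchanged."""
    if n <= 0:
        return list(samples)
    return ([0.0] * n + list(samples))[:len(samples)]


def upmix_stereo(left, right, surround_delay=480,
                 centre_gain=0.5, surround_gain=0.5):
    """Derive a basic 5.1 layout from a stereo pair (illustrative values only).

    surround_delay is in samples; 480 samples is 10ms at 48kHz.
    """
    centre = [centre_gain * (l + r) for l, r in zip(left, right)]
    left_sur = delay([surround_gain * (l - r) for l, r in zip(left, right)],
                     surround_delay)
    right_sur = delay([surround_gain * (r - l) for l, r in zip(left, right)],
                      surround_delay)
    return {
        "L": list(left),      # original stereo channels pass through untouched
        "R": list(right),
        "C": centre,          # mono sum of left and right
        "Ls": left_sur,       # left minus right, delayed
        "Rs": right_sur,      # right minus left, delayed
        "LFE": [0.0] * len(left),  # left silent in this sketch
    }


# A sound panned hard left shows up in the surrounds; a central one would not.
channels = upmix_stereo([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], surround_delay=1)
for name, data in channels.items():
    print(name, data)
```

Because the difference signal is zero for anything panned dead centre, lead vocals and other central sources stay out of the surrounds, while wide stereo ambience is what ends up behind the listener.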

While better results may be obtained by going back to the original multitrack tapes and remixing directly to 5.1, few recordings were made with a multi-channel release in mind. Most were recorded specifically for stereo and might have been done rather differently for a multi-channel release had that option been available at the time. Consequently, remixing for 5.1 is inherently compromised to a degree. Mike Oldfield's Tubular Bells is a good example of the opposite situation, since the multitrack recording was originally intended for quadraphonic release and the stereo version was the compromise! It has recently been re-released in its original quad format on SACD.

Another reason for upmixing is that the record companies often can't or won't remix the multitracks, either because the original tapes are damaged or lost, or because they don't feel that the necessary time and money can be justified for a full multitrack 5.1 remix. DVD-Audio and SACD multi-channel music releases are still in their commercial infancy, and no one wants to commit vast post-production budgets without guaranteed returns. We will take another look at upmixing in more specific detail in later parts of this series.

In Part 6...

Next month, Paul White takes over the sweet spot to discuss how you can use or adapt your existing equipment so that you can work in surround. We'll also be talking to some of the engineers working on surround projects in big-name studios to see how they go about mixing for 5.1.

Surround Sound Explained: Part 1 Foundations

Surround Sound Explained: Part 2 Dolby Pro Logic

Surround Sound Explained: Part 3 Ambisonics

Surround Sound Explained: Part 4 5.1 Surround

Surround Sound Explained: Part 5 Metadata, Upmixing, Downmixing & The Centre Speaker

Surround Sound Explained: Part 6 Setting Up A Surround Recording System

Surround Sound Explained: Part 7 Mixing In Surround

Surround Sound Explained: Part 8 Surround Production

Surround Sound Explained: Part 9 Surround In Your DAW

Dynamic Range Control Profiles

The six DRC profiles are as follows:

NONE

No DRC profile is selected, but dialnorm is still applied, and the full dynamic range of the material can be enjoyed, assuming a suitably capable monitoring system.

FILM LIGHT

  • Max Boost: 6dB (below -53dB).
  • Boost Range: -53 to -41dB (2:1 ratio).
  • Null Band Width: 20dB (-41 to -21dB).
  • Early Cut Range: -21 to -11dB (2:1 ratio).
  • Cut Range: -11 to +4dB (20:1 ratio).

FILM STANDARD

  • Max Boost: 6dB (below -43dB).
  • Boost Range: -43 to -31dB (2:1 ratio).
  • Null Band Width: 5dB (-31 to -26dB).
  • Early Cut Range: -26 to -16dB (2:1 ratio).
  • Cut Range: -16 to +4dB (20:1 ratio).

MUSIC LIGHT

  • Max Boost: 12dB (below -65dB).
  • Boost Range: -65 to -41dB (2:1 ratio).
  • Null Band Width: 20dB (-41 to -21dB).
  • Cut Range: -21 to +9dB (2:1 ratio).

There is no early cut range in this profile.

MUSIC STANDARD

  • Max Boost: 12dB (below -55dB).
  • Boost Range: -55 to -31dB (2:1 ratio).
  • Null Band Width: 5dB (-31 to -26dB).
  • Early Cut Range: -26 to -16dB (2:1 ratio).
  • Cut Range: -16 to +4dB (20:1 ratio).

SPEECH

  • Max Boost: 15dB (below -50dB).
  • Boost Range: -50 to -31dB (5:1 ratio).
  • Null Band Width: 5dB (-31 to -26dB).
  • Early Cut Range: -26 to -16dB (2:1 ratio).
  • Cut Range: -16 to +4dB (20:1 ratio).
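
One way to read the figures in these profiles is as a piecewise static gain curve: a fixed boost at very low levels, a gentle slope up to the null band, no change within it, and progressively steeper cuts above it. The Python sketch below applies that reading to the Film Standard figures. The interpretation of the ratios (an N:1 ratio meaning the output moves 1dB for every N dB of input) and all of the names are my own assumptions; the real decoder works from gain words calculated by the encoder, which this does not attempt to model, and the +4dB upper limit of the cut range is not enforced here.

```python
# Film Standard figures from the list above; the dictionary layout is hypothetical.
FILM_STANDARD = {
    "max_boost_db": 6.0,
    "boost_ratio": 2.0,
    "null_band": (-31.0, -26.0),
    "early_cut_end": -16.0,
    "early_cut_ratio": 2.0,
    "cut_ratio": 20.0,
}


def drc_gain_db(level_db, profile):
    """Static gain (dB) for a given input level, read from a profile."""
    null_low, null_high = profile["null_band"]
    if level_db < null_low:
        # Boost region: the slope follows the ratio until the maximum boost.
        boost = (null_low - level_db) * (1 - 1 / profile["boost_ratio"])
        return min(boost, profile["max_boost_db"])
    if level_db <= null_high:
        return 0.0  # null band: no gain change
    early_end = profile["early_cut_end"]
    early_slope = 1 - 1 / profile["early_cut_ratio"]
    if level_db <= early_end:
        return -(level_db - null_high) * early_slope
    # Above the early cut range the steeper ratio takes over.
    early_cut = (early_end - null_high) * early_slope
    return -early_cut - (level_db - early_end) * (1 - 1 / profile["cut_ratio"])


for level in (-50, -37, -31, -21, -16, -6):
    print(f"{level:>4} dB in: {drc_gain_db(level, FILM_STANDARD):+.2f} dB gain")
```

The other profiles can be tried by substituting their figures; Music Light, which has no early cut range, would need the early-cut stage to be skipped.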