The Moving Picture Experts Group

USAC Separates Speech from Audio for Improved Compression

by Philip Merrill - October 2017

About 10 years ago MPEG Audio experts took on the challenge of enhancing MPEG-4 Advanced Audio Coding (AAC) with a speech-encoding path, enabling source material to be encoded in the manner best suited to its instantaneous characteristics, with consequent bit savings. Video soundtracks and audiobooks frequently combine speech with music and sound effects, and the greater efficiency provided by these new speech tools targets cellular networks, where wireless bandwidth is at a premium. From the responses to the Call for Proposals, a joint submission from Fraunhofer IIS and VoiceAge was selected in 2008; it incorporated efficient speech-encoding techniques taken from the Adaptive Multi-Rate Wide Band Plus (AMR-WB+) codec developed for cellular service. Technical collaboration proceeded through 2011, integrating enhancements from Dolby, NTT Docomo, Panasonic, Philips, Samsung and Sony. Unified Speech and Audio Coding (USAC) became an international standard in 2012 within MPEG-D and provides a significant improvement over MPEG-4 High-Efficiency AAC version 2 (HE-AAC v2) at low bit rates.

The decoding model in USAC consists of three paths between the de-multiplexing of the transport signal and the decoded frames of content. The first is essentially the AAC path, the second is AAC with some tools replaced by more efficient tools from speech coding, and the third is a speech coder similar to that in AMR-WB+. The optimal coding path is dynamically selected on a frame-by-frame basis in response to the instantaneous characteristics of the signal (e.g., music-like or speech-like), and the paths communicate side-information as relevant. To achieve seamless switching from the AAC path to the speech-coding path, a new tool called Forward Aliasing Cancellation was developed for USAC. In addition, USAC provides tools that compress higher frequencies and stereo signals using highly efficient parametric representations.

Speech signals in speech-encoding systems such as AMR-WB are broken down into three sets of information: a short-term prediction, a long-term prediction, and the excitation patterns that cannot be predicted. The coefficients of the short-term linear predictive coding (LPC) filter model the human vocal tract. The long-term predictor (LTP) models the periodic fluctuations in vocalization and generates an adaptive codebook to represent the predictable component of the excitation signal. A separate innovation codebook represents the unpredictable component of the excitation signal, using Algebraic Code Excited Linear Prediction (ACELP). These three perform well together for speech at low bit rates but are not designed for all-purpose audio compression and produce poor results when encoding music.

For higher frequencies, USAC took Spectral Band Replication (SBR) from work completed for MPEG-4 in 2003. SBR efficiently codes the upper frequency band of the source material based on a half-bandwidth, downsampled version of the source. The replicated band has its envelope adjusted and can be further modified by adding tones or noise. Predictive Vector Coding (PVC) represents the difference between the predicted and desired envelopes using a 7-bit representation.
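The band-replication step can be illustrated with a deliberately simplified model. Here the "spectrum" is just a list of per-bin magnitudes and the envelope is a short list of transmitted gains; real SBR operates on a complex filterbank representation and also injects tones and noise, which this sketch omits.

```python
# Conceptual SBR-style reconstruction (hypothetical data layout):
# the decoder copies the low-band magnitudes into the high band,
# then scales each segment by a transmitted envelope gain so the
# replicated band matches the original signal's spectral envelope.

def replicate_high_band(low_band, envelope_gains):
    """Build high-band magnitudes from the low band plus envelope data."""
    n = len(low_band)
    seg = max(n // len(envelope_gains), 1)   # bins covered per gain
    high_band = []
    for i in range(n):
        gain = envelope_gains[min(i // seg, len(envelope_gains) - 1)]
        high_band.append(low_band[i] * gain)
    return high_band
```

The efficiency win is that the envelope gains cost a handful of bits per frame, whereas coding the high-frequency bins directly would cost far more.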

Parametric Stereo (PS) overcomes discrete stereo's reliance on two transmitted channels (whether coded as left/right or mid/side) and instead transmits a mono downmix plus parametric side-information that is used to reconstruct a stereo signal retaining the important perceptual cues humans use to perceive spatialization. This efficiency is most desirable at low bit rates and is an example of USAC's flexible compression tools, since when a higher bit rate is available, two channels of sampled data are generally preferable. Unified Stereo Coding in USAC can produce a fuller reconstruction by coding and transmitting residual waveform signals.
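A stripped-down version of the parametric-stereo idea looks like this. Real PS works per time/frequency tile and also transmits phase and correlation cues; this sketch keeps only a single channel-level-difference parameter, and the function names are invented for the example.

```python
# Toy parametric stereo: send one mono signal plus one level-ratio
# parameter instead of two full channels, then split the mono energy
# back into left/right at the decoder.

def encode_ps(left, right, eps=1e-12):
    """Downmix to mono and measure the channel level difference."""
    mono = [(l + r) / 2.0 for l, r in zip(left, right)]
    e_left = sum(s * s for s in left)
    e_right = sum(s * s for s in right)
    cld = (e_left + eps) / (e_right + eps)   # energy ratio, left/right
    return mono, cld

def decode_ps(mono, cld):
    """Reconstruct a stereo pair whose energy ratio matches cld."""
    g_left = (2.0 * cld / (1.0 + cld)) ** 0.5
    g_right = (2.0 / (1.0 + cld)) ** 0.5     # g_left² + g_right² = 2
    return [g_left * m for m in mono], [g_right * m for m in mono]
```

Since both output channels are scaled copies of one downmix, this cannot recover everything a discrete stereo signal carries; that residual is exactly what USAC's Unified Stereo Coding can optionally transmit for a fuller reconstruction.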

Listening tests in 2011 confirmed the effectiveness of USAC using 24 test signals: eight speech only, eight music only, and eight combining music with speech. The results confirmed markedly improved performance for USAC as compared to either HE-AAC v2 or AMR-WB+ at low bit rates. At higher bit rates, the performance curves of USAC and HE-AAC v2 (operating in AAC mode) converge, indicating that USAC's performance is fully equivalent to (or even slightly better than) that of MPEG-4 AAC when enough bits are available.

Although our broadband-enabled world provides more bandwidth than ever before, mobile still has throughput limits, particularly in service areas where only 2G service may be available. The popularity of video consumes a great deal of bandwidth, and the desire for more immersive audio (with additional audio channels), multi-lingual programs, or richer programs with additional descriptive commentary puts ever-increasing pressure on limited-bandwidth transmission channels.

USAC was created to meet the needs of broadcast and streaming media providers and benefits from the gains in compression efficiency that can be achieved when sufficient time, manifested as broadcast latency, is available to optimize encoding. More significantly, USAC incorporates tools that exploit how humans perceive music and tools that model how humans create speech, and hence delivers unprecedented compression for content that is any mix of music and talk. This is one of many areas where MPEG experts have challenged themselves to ensure that standardized solutions are available for future needs.

Special thanks to Schuyler Quackenbush, who contributed to this report.