INTERNATIONAL ORGANISATION FOR STANDARDISATION
ORGANISATION INTERNATIONALE DE NORMALISATION

ISO/IEC JTC1/SC29/WG11
CODING OF MOVING PICTURES AND AUDIO

MPEG/ N7610

October 2005, Nice, France

Title

MPEG-4 Terminal Architecture White Paper

Status

Input Document

Authors

Jean Le Feuvre, Cyril Concolato

Introduction

Complex, interactive multimedia is a very broad field. The need to support numerous coding formats for natural audio and video, together with fundamental differences in user interfaces (2D internet portals, 3D gaming), in delivery networks (cable/satellite broadcasts, broadband internet or mobile networks) and in terminals (PCs, set-top boxes, PDAs, mobile phones), has increased the segmentation of the multimedia market. This is where MPEG-4 comes into the picture. The MPEG-4 standard (ISO/IEC 14496), officially entitled “Coding of audio-visual objects”, is a unified multimedia framework featuring a wide range of coding techniques for natural audio and video, 2D and 3D graphics, and complex synthetic objects such as 3D meshes, avatars or computer-generated music. The unique strength of the standard, however, lies in its underlying system architecture, which is powerful yet flexible enough to cover all needs of modern multimedia, from simple audio-video content to complex 3D worlds.

MPEG-4 Terminal Architecture

MPEG-4 data is carried by elementary streams, which are logical transport channels; a given stream carries only one type of data (scene data, visual data, etc.). An audio-visual object is composed of one or several of these streams, which allows scalable representations and alternate codings (bitrate, resolution, language…), and the object can be enhanced with timed metadata (MPEG-7) and protection information. An object is described by an ObjectDescriptor, which provides simple metadata related to the object (ObjectContentInformation), such as content creation information or chapter time layout. This descriptor also contains all information related to stream setup, including synchronization information and initialization data for decoders. MPEG-4 objects are not static: they can be modified during the course of the presentation in order to fine-tune the media data to network and client capabilities.
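The relationship between an object, its descriptor and its elementary streams can be sketched as a small data model. This is only an illustration under stated assumptions: the class and field names (`StreamType`, `ElementaryStream`, `ObjectDescriptor`, `select`, `bitrate_kbps`, and so on) are hypothetical and do not reproduce the binary descriptor syntax of ISO/IEC 14496-1, and the selection policy is one possible way a terminal might choose among alternate codings.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class StreamType(Enum):
    """Kinds of data an elementary stream may carry (illustrative subset)."""
    SCENE = "scene"
    VISUAL = "visual"
    AUDIO = "audio"
    METADATA = "metadata"      # e.g. timed MPEG-7 metadata
    PROTECTION = "protection"


@dataclass
class ElementaryStream:
    """One logical channel; it carries exactly one type of data."""
    es_id: int
    stream_type: StreamType
    bitrate_kbps: int
    decoder_config: bytes = b""       # initialization data for the decoder
    language: Optional[str] = None    # None means language-neutral


@dataclass
class ObjectDescriptor:
    """Hypothetical stand-in for the descriptor of one audio-visual object."""
    od_id: int
    streams: List[ElementaryStream] = field(default_factory=list)
    content_info: dict = field(default_factory=dict)  # plays the role of ObjectContentInformation

    def select(self, stream_type: StreamType, max_kbps: int,
               language: Optional[str] = None) -> Optional[ElementaryStream]:
        """Pick the highest-bitrate alternative that fits the bitrate budget."""
        candidates = [
            s for s in self.streams
            if s.stream_type is stream_type
            and s.bitrate_kbps <= max_kbps
            and (language is None or s.language in (None, language))
        ]
        return max(candidates, key=lambda s: s.bitrate_kbps, default=None)
```

A terminal on a constrained link could call `select(StreamType.AUDIO, max_kbps=100, language="en")` to obtain the best English audio alternative within its budget, and receive `None` when no alternative fits.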

The scene description (BIFS, LASeR) is then used to place each object on the display, potentially with various effects applied to it, and to determine how user interactions (pointing devices, keyboards, remote controls) modify the content. As with all other objects, scene description data is carried in elementary streams, allowing dynamic control of the presentation from the server side.

The MPEG-4 terminal can be divided into four logical blocks:

DMIF (Delivery Multimedia Integration Framework), where all communications and data transfers between the data source and the terminal are abstracted through a logical API called the DAI (DMIF Application Interface), regardless of the network type (broadcast or interactive), allowing easy integration of existing protocols inside the terminal. This layer is in charge of extracting the logical elementary streams from the physical input stream whenever needed, through demultiplexing (MPEG-2, M4Mux) or de-aggregation (MPEG-4 over RTP). This layer also provides the initial setup information for the presentation (usually the scene elementary stream information). It should be noted that most audio-video formats can be wrapped through DMIF, for example by using a default scene description. This enables support for much existing content, such as internet movie files (MPEG-1, AVI, RealMedia, QuickTime) or internet radios (IceCast and ShoutCast), at no cost for the terminal.

The SL (Synchronization Layer), where data packets are received along with their timing, packetization and other information. Data packets are re-assembled into MPEG-4 Access Units, the base media unit understood by the decoders, and stream synchronization is performed against the associated stream timeline. Note that several timelines may exist in a presentation, allowing media navigation (seeking, fast forward) from inside the content. The SL information can be configured in the elementary stream’s description, based on media or network constraints. This information can be transported physically (“bits on the wire”) or logically, that is, extracted from other information carried by the underlying protocol (as, for example, in the transport of MPEG-4 streams over RTP). In a typical implementation, network QoS (Quality of Service) and decoding buffer (DB) management are performed at this stage.

The Compression Layer, where all media decoding is performed. This may imply a decryption process between the decoding buffer and the decoder, as well as an encryption or watermarking process after the decoder, performed by MPEG-4 IPMP modules. One part of the terminal resource usage is estimated at this level for QoS management purposes. Note that this layer is the only place in the MPEG-4 terminal that is aware of the coding format, which allows new coding standards to be added very efficiently. For example, changing the MPEG-4 BIFS representation to the new LASeR representation does not impact the terminal synchronization architecture in any way, although these two representations are completely different.

The Composition Layer, which performs the final layout of the objects. This stage is usually divided into two entities: the visual compositor, in charge of drawing 2D and 3D objects on screen, and the audio compositor, responsible for positioning and mixing audio objects. The second main part of terminal resource usage is estimated here. The presentation engine also manages user interactions, such as pointing device and keyboard inputs.
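The flow of data through the four blocks above can be sketched as follows. This is a minimal illustration under stated assumptions, not the normative SL packet syntax or the DAI: the names `SLPacket`, `AccessUnit`, `reassemble` and `run_terminal` are hypothetical, the packets are assumed to have already been demultiplexed by the delivery layer, decoders are plain callables keyed by stream identifier, and composition is reduced to ordering decoded units by composition time.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


@dataclass
class SLPacket:
    """Simplified packet as it might leave the delivery layer (hypothetical)."""
    es_id: int
    au_start: bool   # first packet of an access unit
    au_end: bool     # last packet of an access unit
    cts: float       # composition time in seconds, read from the starting packet
    payload: bytes


@dataclass
class AccessUnit:
    es_id: int
    cts: float
    data: bytes


def reassemble(packets: Iterable[SLPacket]) -> Iterator[AccessUnit]:
    """Synchronization Layer role: rebuild access units for each stream."""
    pending: Dict[int, AccessUnit] = {}
    for p in packets:
        if p.au_start:
            pending[p.es_id] = AccessUnit(p.es_id, p.cts, b"")
        au = pending.get(p.es_id)
        if au is None:
            continue  # fragment whose start was never seen: drop it
        au.data += p.payload
        if p.au_end:
            yield pending.pop(p.es_id)


def run_terminal(packets: Iterable[SLPacket],
                 decoders: Dict[int, Callable[[bytes], str]]
                 ) -> List[Tuple[float, int, str]]:
    """Decode every access unit, then order the results for composition."""
    decoded = [
        # Compression Layer role: only the decoder knows the coding format.
        (au.cts, au.es_id, decoders[au.es_id](au.data))
        for au in reassemble(packets)
    ]
    # Composition Layer role, reduced here to ordering by composition time.
    return sorted(decoded, key=lambda item: (item[0], item[1]))
```

In this sketch, supporting a new coding format only means registering another callable in `decoders`; `reassemble` never inspects the payload, which mirrors the point that only the Compression Layer is aware of the coding format.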

Advantages and Applications

The modularity of the network and synchronization layers enables hybrid delivery scenarios, such as a mix of broadcast and on-demand streaming with possible back channels, which is typically suited to interactive television. The reliable synchronization of streams regardless of their origin, along with stream-based quality of service management, will help content providers guarantee their users a pleasant experience.

The modularity of the coding tools, expressed as the well-known MPEG profiles and levels, allows for easy customization of the terminal for a dedicated market segment (for example, hardware players for music and cover art). This reduces deployment costs while ensuring compatibility with full-featured MPEG-4 terminals, a key feature of modern multimedia, especially in the domain of media home gateways.

The dynamic and incremental construction of the scene description, a unique feature of MPEG-4, enhances the media experience by relieving the terminal of many of the operations needed by script-based scene description languages. Coupled with the MPEG-4 timing model, this guarantees similar behavior of the content on all terminals.
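The idea of building a scene incrementally from timed updates, rather than from scripts executed on the terminal, can be illustrated with a short sketch. The command model here is an assumption made for illustration: `SceneCommand`, `scene_at` and the three action names are hypothetical and only loosely inspired by scene update commands; they do not reproduce the BIFS or LASeR command syntax.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass
class SceneCommand:
    """Hypothetical timed scene update carried in the scene stream."""
    time: float                  # seconds on the scene stream timeline
    action: str                  # "insert", "replace" or "delete"
    node_id: str
    fields: Optional[Dict[str, Any]] = None


def scene_at(commands: Iterable[SceneCommand], t: float) -> Dict[str, Dict[str, Any]]:
    """Return the scene obtained by applying every command timed at or before t."""
    scene: Dict[str, Dict[str, Any]] = {}
    for cmd in sorted(commands, key=lambda c: c.time):
        if cmd.time > t:
            break
        if cmd.action == "insert":
            scene[cmd.node_id] = dict(cmd.fields or {})
        elif cmd.action == "replace":
            if cmd.node_id in scene:
                scene[cmd.node_id].update(cmd.fields or {})
        elif cmd.action == "delete":
            scene.pop(cmd.node_id, None)
        else:
            raise ValueError(f"unknown action: {cmd.action}")
    return scene
```

Because the resulting scene depends only on the commands and on the time `t`, two terminals that apply the same stream against the same timeline end up with the same scene, which is the behavior the paragraph above describes.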

Typical applications include interactive broadcast TV services (betting, shopping, sport statistics…), entertainment content (enhanced DVD, simple edutainment), corporate training, and 2D and 3D multimedia portals. The addition of ECMAScript and MPEG-J technologies opens up an almost limitless field of applications for the MPEG-4 standard.

References

[1] ISO/IEC 14496-1, Coding of audio-visual objects, Part 1: Systems.

[2] ISO/IEC 14496-6, Coding of audio-visual objects, Part 6: Delivery Multimedia Integration Framework (DMIF).

[3] ISO/IEC 14496-11, Coding of audio-visual objects, Part 11: Scene description and application engine (BIFS, XMT, MPEG-J).

[4] ISO/IEC 14496-20, Coding of audio-visual objects, Part 20: Lightweight Application Scene Representation (LASeR).

[5] ISO/IEC 14496-13, Coding of audio-visual objects, Part 13: Intellectual Property Management and Protection (IPMP) extensions.

[6] ISO/IEC 16262, ECMAScript language specification.