INTERNATIONAL ORGANIZATION FOR STANDARDIZATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC 1/SC 29/WG 11
CODING OF MOVING PICTURES AND AUDIO
ISO/IEC JTC 1/SC 29/WG 11 N7504
July 2005, Poznan
Title | White paper on MPEG-4 System (ISO/IEC 14496-1)
Source | Systems
Status | Proposal
Editors | LIM, Young-Kwon (net&tv Inc.)
ISO/IEC 14496-1, also known as MPEG-4 Systems, provides the technologies to represent and deliver, interactively and synchronously, audio-visual content composed of various objects, including audio, visual, and 2D/3D vector graphics objects. It specifies a terminal model for time and buffer management; a framework for the identification, description, and logical dependencies of elementary streams; a packet structure carrying synchronization information; and a multiplexed representation of individual elementary streams in a single stream (M4Mux).
Because MPEG-4 Systems must work in a “push” scenario, in which the sender transmits data to the terminal without receiving any request for it, the standard defines a conceptual model of a compliant terminal. The sender uses this model to predict the terminal's behavior in terms of buffer management and the synchronization of data arriving from different elementary streams. Two conceptual interfaces are defined in the model to describe the decoder behavior: the DAI (DMIF Application Interface), between the demultiplexer and the decoding buffer, and the ESI (Elementary Stream Interface), between the decoding buffer and the decoder. The DAI delivers a series of packets, called SL (sync layer) packets, to the decoding buffer of each elementary stream. An SL packet carries one complete access unit, or a fragment of one, as its payload, and its header carries the timing information of the payload for decoding and composition. Each access unit remains in the decoding buffer until its decoding time arrives; decoding it produces a composition unit, which remains in the composition memory until its composition time arrives. Using this conceptual model, the sender can guarantee that a stream will not break the receiving terminal by causing overflow or underflow of the decoding buffer or the composition memory.

Figure 1. System Decoder Model
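The buffer constraint that the system decoder model imposes on the sender can be illustrated with a minimal Python sketch. It is not taken from the standard: the names `AccessUnit` and `check_decoding_buffer` are hypothetical, each access unit is assumed to enter the decoding buffer instantaneously at its arrival time and leave it instantaneously at its decoding time, and a unit that has not fully arrived before its decoding time is counted as an underflow.

```python
from dataclasses import dataclass


@dataclass
class AccessUnit:
    arrival_time: float   # time the whole unit is in the decoding buffer (seconds)
    decoding_time: float  # time the unit is removed and decoded (seconds)
    size: int             # size of the unit in bytes


def check_decoding_buffer(units, buffer_size):
    """Return a list of (time, problem) tuples; an empty list means the
    schedule never overflows or underflows the (hypothetical) buffer."""
    problems = []
    events = []  # (time, kind, size); kind 0 = removal, kind 1 = arrival
    for au in units:
        # Simplifying assumption: arriving at or after the decoding time is too late.
        if au.arrival_time >= au.decoding_time:
            problems.append((au.decoding_time, "underflow"))
            continue
        events.append((au.arrival_time, 1, au.size))
        events.append((au.decoding_time, 0, au.size))

    occupancy = 0
    # Sorting puts removals before arrivals that share the same instant.
    for time, kind, size in sorted(events):
        if kind == 1:
            occupancy += size
            if occupancy > buffer_size:
                problems.append((time, "overflow"))
        else:
            occupancy -= size
    return problems


# Three 400-byte units, each arriving 100 ms before its decoding time.
schedule = [
    AccessUnit(0.00, 0.10, 400),
    AccessUnit(0.04, 0.14, 400),
    AccessUnit(0.08, 0.18, 400),
]
```

A sender could run such a check against the buffer size it intends to signal: with the schedule above, a 1000-byte buffer is exceeded when the third unit arrives, while a 1200-byte buffer is sufficient.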
To integrate various audio-visual objects into one presentation, MPEG-4 provides scene description information for the spatio-temporal composition of objects, together with an object description framework that associates elementary streams with the audio-visual objects used in the scene. Each object in the scene that requires an association with an elementary stream contains the unique identifier of an object descriptor, called the object descriptor identifier, and the object descriptor in turn contains the unique identifier of an elementary stream, called the elementary stream identifier. An object descriptor is a collection of descriptors about elementary streams. A single object descriptor can contain more than one elementary stream descriptor for the same object, either to support scalability of the audio-visual object or to allow one of several alternative elementary streams to be chosen according to user preference, such as language. Elementary stream descriptors include information about the encoding format, configuration information for the decoding process and the sync layer packetization, quality of service requirements for the transmission of the stream, and intellectual property identification.
Because the scene description and the object descriptors are themselves carried as independent elementary streams, the initial object descriptor, a special variant of the object descriptor, is designed to provide the information needed to access an MPEG-4 presentation. As shown in Figure 2, the initial object descriptor conveys two descriptors: one for the elementary stream carrying the scene description and one for the elementary stream carrying the series of object descriptors, which in turn carry the pointers to the individual elementary streams.
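The two-level indirection from scene object to object descriptor to elementary stream can be sketched in Python as follows. The class and field names (`ESDescriptor`, `ObjectDescriptor`, `resolve_stream`, `language`) are illustrative assumptions rather than the syntax defined in the standard, and falling back to the first descriptor when no language matches is a choice made for this sketch only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ESDescriptor:
    es_id: int                      # elementary stream identifier
    language: Optional[str] = None  # hypothetical selection attribute


@dataclass
class ObjectDescriptor:
    od_id: int                         # object descriptor identifier
    es_descriptors: List[ESDescriptor]


def resolve_stream(od_table: Dict[int, ObjectDescriptor],
                   od_id: int,
                   preferred_language: Optional[str] = None) -> int:
    """Map the object descriptor identifier held by a scene object to the
    identifier of the elementary stream that should be opened."""
    od = od_table[od_id]
    for esd in od.es_descriptors:
        if preferred_language is not None and esd.language == preferred_language:
            return esd.es_id
    # Assumed fallback: use the first stream listed in the descriptor.
    return od.es_descriptors[0].es_id


# One audio object offered in two languages.
od_table = {
    10: ObjectDescriptor(10, [ESDescriptor(101, "en"), ESDescriptor(102, "fr")]),
}
```

Here a scene object that refers to object descriptor 10 resolves to stream 102 for a French-speaking user and to stream 101 otherwise.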

Figure 2. MPEG-4 contents access process
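The order in which a terminal discovers streams during this access process can be sketched as below. This is a simplified illustration under stated assumptions: real scene description and object descriptor streams are binary encoded, whereas here they are assumed to be already decoded into plain Python objects, and the function name `access_presentation` and the dictionary keys are hypothetical.

```python
def access_presentation(iod, streams):
    """Return the elementary stream identifiers to open, in discovery order.

    iod     -- dict with the identifiers of the scene description stream
               and the object descriptor stream (hypothetical layout)
    streams -- dict mapping a stream identifier to its decoded content
    """
    # Step 1: the initial object descriptor points to the two entry streams.
    opened = [iod["scene_es_id"], iod["od_es_id"]]

    # Step 2: the scene lists the object descriptor identifiers it uses,
    # and the object descriptor stream maps each of them to stream identifiers.
    scene_od_ids = streams[iod["scene_es_id"]]
    od_table = streams[iod["od_es_id"]]

    # Step 3: follow each pointer to the media streams themselves.
    for od_id in scene_od_ids:
        for es_id in od_table[od_id]:
            if es_id not in opened:
                opened.append(es_id)
    return opened


iod = {"scene_es_id": 1, "od_es_id": 2}
streams = {
    1: [10, 11],                     # scene objects refer to descriptors 10 and 11
    2: {10: [101], 11: [102, 103]},  # descriptors point to media streams
}
```

With this data the terminal first opens streams 1 and 2, and only then learns about media streams 101, 102 and 103.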
Synchronization between multiple objects is achieved with two well-known concepts: clock references and time stamps. A clock reference conveys the time base to the receiving terminal by specifying the value the terminal clock is expected to have when the first byte of the packet carrying the reference arrives. As explained above, two time stamps can be carried per access unit: a decoding time and a composition time. Because each object has different synchronization requirements and the structure of the SL packet can be configured differently for each elementary stream, each elementary stream can define its own timing resolution and time stamp length.
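The effect of a per-stream resolution and time stamp length can be shown with a short Python sketch. It assumes, for illustration only, that the resolution is expressed in ticks per second and that a time stamp is a plain unsigned counter of the configured bit length that wraps around when it overflows; the function names are hypothetical.

```python
def timestamp_to_seconds(ticks, resolution):
    """Convert a time stamp value to seconds for a stream whose
    resolution is given in ticks per second (assumed unit)."""
    return ticks / resolution


def wrap_period_seconds(length_bits, resolution):
    """Time after which a time stamp of the given bit length repeats,
    assuming it wraps modulo 2**length_bits."""
    return (1 << length_bits) / resolution


# Two streams with different resolutions can still refer to the same instant.
video_time = timestamp_to_seconds(45000, 90000)  # stream with 90000 ticks/s
audio_time = timestamp_to_seconds(22050, 44100)  # stream with 44100 ticks/s
```

Both values above correspond to 0.5 seconds on the shared time base. A coarser resolution or a longer field lengthens the period before time stamps repeat: a 16-bit time stamp at 1000 ticks per second repeats after about 65.5 seconds.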
[1] ISO/IEC 14496-1, Information technology – coding of audio-visual objects – Part 1: Systems