Olivier Avaro
Alexandros Eleftheriadis
Carsten Herpel
Ganesh Rajan
Liam Ward
Abstract
This paper gives an overview of Part 1 of ISO/IEC 14496 (MPEG-4 Systems). It first presents the objectives of the MPEG-4 activity. In the MPEG-1 and MPEG-2 standards, "Systems" referred only to overall architecture, multiplexing, and synchronization. In MPEG-4, in addition to these issues, the Systems part encompasses scene description, interactivity, content description, and programmability. The description of the MPEG-4 specification follows, starting from the general architecture up to the description of the individual MPEG-4 Systems tools. Finally, a conclusion describes the future extensions of the specification, as well as a comparison between the solutions provided by MPEG-4 Systems and some alternative technologies.
Keywords
APIs, Architecture, Audio-visual, Buffer Management, Composition, Content Description, Interactivity, MPEG-4 Systems, Multiplex, Programmability, Scene Description, Specification, Synchronization, Tools.
Table of contents
1. Introduction
2. Objectives
2.1 Requirements
2.2 Traditional MPEG Systems Requirements
2.3 MPEG-4 Specific Systems Requirements
2.4 What is MPEG-4 Systems?
3. Architecture
4. Tools
4.1 Stream Management: The Object Description Framework
4.2 Presentation Engine: BIFS
4.3 Timing and Synchronization: The Systems Decoder Model (SDM) and the Sync Layer
4.4 The Transport of MPEG-4 Content
5. Conclusion
5.1 Extensions of the Specification
5.2 MPEG-4 Systems and Competing Technologies
5.2.1 Transport
5.2.2 Streaming Framework
5.2.3 Scene Description Representation
5.3 The key features of MPEG-4 Systems
6. Acknowledgments
7. References
The concept of "Systems" in MPEG has evolved dramatically since the development of the MPEG-1 and MPEG-2 standards. In the past, "Systems" referred only to overall architecture, multiplexing, and synchronization. In MPEG-4, in addition to these issues, the Systems part encompasses scene description, interactivity, content description, and programmability. The combination of the exciting new ways of creating compelling interactive audio-visual content offered by MPEG-4 Systems, and the efficient representation tools provided by the Visual and Audio parts, promises to be the foundation of a new way of thinking about audio-visual information.
This paper gives an overview of MPEG-4 Systems. It is structured around the objectives, architecture, and the tools of MPEG-4 Systems as follows:
Of course, MPEG-4 is not the only initiative that attempts to provide solutions in the area described above. Several companies, industry consortia, and even other standardization bodies have developed technologies that, to some extent, also aim to address objectives similar to those of MPEG-4 Systems. In concluding this look at MPEG-4 Systems, this paper provides an overview of some of these alternative technologies and makes a comparison with the solutions provided by MPEG-4 Systems.
To understand the rationale behind the activity, a good starting point is one of the most fundamental MPEG-4 documents, viz., the MPEG-4 Requirements [1]. This document gives an extensive list of the objectives that needed to be satisfied by the MPEG-4 specifications. The goal of specifying a standard way for the description and coding of audio-visual objects was the primary motivation behind the development of the tools in the MPEG-4 Systems.
MPEG-4 Systems requirements may be categorized into two groups:
To round out this discussion on the MPEG-4 objectives, section 2.4 finally provides an answer to the question "What is MPEG-4 Systems?" by summarizing the objectives of the MPEG-4 Systems activity and describing the charter of the MPEG-4 Systems sub-group during its four years of existence.
The work of MPEG traditionally addressed the representation of audio-visual information. In the past, this included only natural audio and video material. As we will indicate in subsequent sections, the types of media included within the scope of the MPEG-4 standards have been significantly extended. Regardless of the type of the media, each one has spatial and/or temporal attributes and needs to be identified and accessed by the application consuming the content. This results in a set of requirements for MPEG Systems on streaming, synchronization and stream management, further described below.
In the previous MPEG-1 and MPEG-2 standards, these requirements led to the definition of the following tools:
All these requirements are still relevant for MPEG-4. However, the existing tools needed to be extended and adapted for the MPEG-4 context. In some cases, these requirements led to the creation of new tools. More specifically:
The foundation of MPEG-4 is the coding of audio-visual objects. As per MPEG-4 terminology, an audio-visual object is the representation of a natural or synthetic object that has an audio and/or visual manifestation. Examples of audio-visual objects include a video sequence (perhaps with shape information), an audio track, an animated 3D face, speech synthesized from text, or a background consisting of a still image.
The advantages of coding audio-visual objects can be summarized as follows:
In order to be able to use these audio-visual objects in a presentation, additional information needs to be transmitted to the client terminals. The individual audio-visual objects are only a part of the presentation structure that an author wants delivered to the consumers. Indeed, for the presentation at the client terminals, the coding of audio-visual objects needs to be augmented by the following:
These considerations imply additional requirements for the overall architectural design, which are summarized below:
Besides the coding of audio-visual objects organized spatio-temporally, according to a scene description, one of the key concepts of MPEG-4 is the idea of interactivity, that is, that the content reacts upon the action of a user. This general idea is expressed in three specific requirements:
The main concepts that were described in this section are depicted in Figure 1. The mission, therefore, of the MPEG-4 Systems activity may be summarized by the following sentence: "Develop a coded, streamable representation for audio-visual objects and their associated time-variant data along with a description of how they are combined".

Figure 1: MPEG-4 Systems Principles
More precisely, in this sentence:
The overall architecture of an MPEG-4 terminal is depicted in Figure 2. Starting at the bottom of the figure, we first encounter the particular storage or transmission medium. This refers to the lower layers of the delivery infrastructure (network layer and below, as well as storage). The transport of the MPEG-4 data can occur on a variety of delivery systems. This includes MPEG-2 Transport Streams, UDP (User Datagram Protocol) over IP (Internet Protocol), ATM (Asynchronous Transfer Mode) AAL2 (ATM Adaptation Layer 2), MPEG-4 (MP4) files or the DAB (Digital Audio Broadcasting) multiplexer.

Figure 2: MPEG-4 Systems Architecture
Most of the currently available transport layer systems provide native means for multiplexing information. There are, however, a few instances where this is not the case, as in GSM (Global System for Mobile communications) data channels. In addition, the existing multiplexing mechanisms may not fit MPEG-4 needs in terms of low delay, or they may incur substantial overhead in handling the expected large number of streams associated with an MPEG-4 session. As a result, the FlexMux tool can optionally be used on top of the existing transport delivery layer.
Regardless of the transport layer used and the use (or not) of the FlexMux option, the delivery layer provides to the MPEG-4 terminal a number of elementary streams. Note that not all of the streams have to be downstream (server to the client); in other words, it is possible to define elementary streams for the purpose of conveying data back from the terminal to the transmitter or server.
In order to isolate the design of MPEG-4 from the specifics of the various delivery systems, the concept of the DMIF (Delivery Multimedia Integration Framework) Application Interface (DAI) [7] was defined. This interface defines the process of exchanging information between the terminal and the delivery layer in a conceptual way, using a number of primitives. It should be pointed out that this interface is non-normative; MPEG-4 terminal implementations do not need to expose such an interface.
The DAI defines procedures for initializing an MPEG-4 session and obtaining access to the various elementary streams that are contained in it. These streams can contain several different kinds of information: audio-visual object data, scene description information, control information in the form of object descriptors, as well as meta-information that describes the content or associates intellectual property rights to it.
Regardless of the type of data conveyed in each elementary stream, it is important that they use a common mechanism for conveying timing and framing information. The Sync Layer (SL) is defined for this purpose. It is a flexible and configurable packetization facility that allows the inclusion of timing, fragmentation, and continuity information on associated data packets. Such information is attached to data units that comprise complete presentation units, e.g., an entire video object plane (VOP) or an audio frame. These are called access units. An important feature of the SL is that it does not contain frame demarcation information; in other words, the SL header contains no packet length indication. This is because it is assumed that the delivery layer that processes SL packets will already make such information available. Its exclusion from the SL thus eliminates duplication.
The SL is the sole mechanism for implementing timing and synchronization in MPEG-4. The fact that it is highly configurable allows the use of several different models. At one end of the spectrum, traditional clock recovery methods using clock references and time stamps can be used. It is also possible to use a rate-based approach (rather than using explicit time stamps, the known rate of the access units implicitly determines their time stamps). At the other end of the spectrum, it is possible to operate without any clock information; data is processed as soon as it arrives. This would be suitable, for example, for a slide-show presentation. The primary mode of operation, and the one supported by the currently defined conformance points of the specification, involves the full complement of clock recovery and time stamps. Together with the definition of a System Decoder Model, this makes it possible both to synchronize the receiver's clock to the sender's and to manage the buffer resources at the receiver.
From the SL information we can recover a time base as well as a set of elementary streams. The streams are sent to their respective decoders that process the data and produce composition units (e.g., a decoded video object plane). In order for the receiver to know what type of information is contained in each stream, control information in the form of object descriptors is used. These descriptors associate sets of elementary streams to one audio or visual object, define a scene description stream, or even point to an object descriptor stream. These descriptors, in other words, are the means by which a terminal can identify the content being delivered to it. Unless a stream is described in at least one object descriptor, it is impossible for the terminal to make use of it.
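The rate-based approach mentioned above can be illustrated with a minimal sketch. It assumes a constant access unit rate and a known start time; the function and parameter names are hypothetical and are not taken from the specification.

```python
from fractions import Fraction

def implicit_time_stamps(start_time, units_per_second, count):
    """Return the implied time stamp, in seconds, of each of `count` access units.

    Assumption: access units are produced at a constant, known rate, so no
    explicit time stamp needs to be carried with the data.
    """
    period = Fraction(1, units_per_second)
    return [start_time + n * period for n in range(count)]

# Four access units at 25 units per second, starting at time zero.
stamps = implicit_time_stamps(Fraction(0), 25, 4)
print([float(t) for t in stamps])  # [0.0, 0.04, 0.08, 0.12]
```

Exact fractions are used here only to avoid accumulating rounding error over long streams; a real terminal would work in integer clock ticks.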
At least one of the streams must be the scene description information associated with the content. The scene description information defines the spatial and temporal position of the various objects, their dynamic behavior, as well as any interactivity features made available to the user. As mentioned above, the audio-visual object data is actually carried in its own elementary streams. The scene description contains pointers to object descriptors when it refers to a particular audio-visual object. We should stress that it is possible that an object (in particular synthetic objects like text and simple graphics) may be fully described by the scene description. As a result, it may not be possible to uniquely associate an audio-visual object with just one syntactic component of MPEG-4 Systems. As detailed in Section 4.2, the scene description is tree-structured and is heavily based on VRML (Virtual Reality Modeling Language [4]) structure.
A key feature of the scene description is that, since it is carried in its own elementary stream(s), it can contain full timing information. This implies that the scene can be dynamically updated over time, a feature that provides considerable power for content creators. In fact, the scene description tools provided by MPEG-4 also provide a special lightweight mechanism to modify parts of the scene description in order to effect animation. This is accomplished by coding, in a separate stream, only the parameters that need to be updated.
The systems compositor uses the scene description information, together with decoded audio-visual object data, in order to render the final scene that is presented to the user. It is important to note that the MPEG-4 Systems architecture does not define how information is to be rendered. In other words, the Systems part of the MPEG-4 standard does not detail mechanisms through which the values of the pixels to be displayed or audio samples to be played back can be uniquely determined. This is an unfortunate side-effect of providing synthetic content representation tools. Indeed, in the general case, it is not possible to define rendering without venturing into issues of terminal implementation. Although this makes compliance testing much more difficult (requiring subjective evaluation), it allows the inclusion of a very rich set of synthetic content representation tools. In some cases however, like in the Audio part of the MPEG-4 standard [6], composition can be and is fully defined.
The scene description tools provide mechanisms to capture user or system events. In particular, they allow the association of events to user operations on desired objects, that can in turn modify the behavior of the stream. Event processing is the core mechanism with which application functionality and differentiation can be provided. In order to provide flexibility in this respect, MPEG-4 allows the use of ECMAScript (also known as JavaScript) scripts within the scene description. Use of scripting tools is essential in order to access state information and implement sophisticated interactive applications.
It is important to point out that, in addition to the new functionalities that MPEG-4 makes available to content consumers, it provides tremendous advantages to content creators as well. The use of an object-based structure, where composition is performed at the receiver, considerably simplifies the content creation process. Starting from a set of coded audio-visual objects, it is very easy to define a scene description that combines these objects in a meaningful presentation. A similar approach is essentially used in HTML (Hyper Text Markup Language) and Web browsers, thus allowing even non-expert users to easily create their own content. The fact that the content's structure survives the process of coding and distribution also allows for its reuse. For example, content filtering and/or searching applications can be easily implemented using ancillary information carried in object descriptors (or its own elementary streams, as described in Section 4.1). Also, users themselves can easily extract individual objects, assuming that the intellectual property information allows them to do so.
In the following section the different components of this architecture are described in more detail.
The Object Description Framework provides the glue between the scene description and the streaming resources (the elementary streams) of an MPEG-4 presentation, as indicated in Figure 1. Unique identifiers are used in the scene description to point to the object descriptor, the core element of the object description framework. The object descriptor is a container structure that encapsulates all of the setup and association information for a set of elementary streams. A set of sub-descriptors, contained in the object descriptor, describe the individual elementary streams, including the configuration information for the stream decoder as well as the flexible sync layer syntax for this stream. Each object descriptor, in turn, groups a set of streams that are seen as a single entity from the perspective of the scene description.
Object descriptors are transported in dedicated elementary streams, called object descriptor streams, that make it possible to associate timing information to a set of object descriptors. With the appropriate wrapper structures, called OD commands (Object Descriptor Commands), around each object descriptor, it is possible to update and remove each object descriptor in a dynamic and timely manner. The existence or the absence of descriptors determines the availability (or the lack thereof) of the associated elementary streams to the MPEG-4 terminal.
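The relationship between object descriptors, their elementary stream sub-descriptors, and the update and remove commands can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names, field names, and the dictionary-based table are hypothetical, and the normative syntax is the one given in the specification [2].

```python
from dataclasses import dataclass, field

@dataclass
class ESDescriptor:
    """Hypothetical sub-descriptor for one elementary stream."""
    es_id: int
    stream_type: str             # e.g. "visual", "audio", "scene", "od"
    decoder_config: bytes = b""  # opaque decoder setup data

@dataclass
class ObjectDescriptor:
    """Hypothetical container grouping the streams of one object."""
    od_id: int
    es_descriptors: list = field(default_factory=list)

class ObjectDescriptorTable:
    """Terminal-side view of the descriptors currently in force."""

    def __init__(self):
        self._by_id = {}

    def update(self, od):
        # An update command adds a descriptor or replaces an existing one.
        self._by_id[od.od_id] = od

    def remove(self, od_id):
        # A remove command makes the associated streams unavailable.
        self._by_id.pop(od_id, None)

    def streams_for(self, od_id):
        # The scene description refers to an object only by its identifier.
        od = self._by_id.get(od_id)
        return [] if od is None else [es.es_id for es in od.es_descriptors]

table = ObjectDescriptorTable()
table.update(ObjectDescriptor(10, [ESDescriptor(101, "visual"), ESDescriptor(102, "visual")]))
print(table.streams_for(10))  # [101, 102]
table.remove(10)
print(table.streams_for(10))  # []
```

Because the scene refers to identifier 10 rather than to streams 101 and 102 directly, the streams behind an object can change without any change to the scene description.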
The initial object descriptor, a derivative of the object descriptor, is a key element necessary for accessing the MPEG-4 content. It conveys content complexity information in addition to the regular elements of an object descriptor. As depicted in Figure 3, the initial object descriptor usually contains at least two elementary stream descriptors. One of the descriptors must point to a scene description stream while the others may point to an object descriptor stream. This object descriptor stream transports the object descriptors for the elementary streams that are referred to by some of the components in the scene description. Initial object descriptors may themselves be transported in object descriptor streams since they allow content to be hierarchically nested, but may as well be conveyed by other means, serving as starting pointers to MPEG-4 content.

Figure 3: The initial object descriptor and the linking of elementary streams to the scene description
In addition to providing essential information about the relation between the scene description and the elementary streams, the object description framework provides mechanisms to describe hierarchical relations between streams, reflecting scalable encoding of the content and means to indicate multiple alternate representations of content. Furthermore, textual descriptors about content items, called object content information (OCI), and descriptors for the intellectual property rights management and protection (IPMP) have been defined. The latter allow conditional access or other content control mechanisms to be associated to a particular content item. These mechanisms may be different on a stream-by-stream basis and possibly even a multiplicity of such mechanisms could co-exist.
A single MPEG-4 presentation, or program, may consist of a large number of elementary streams with a multiplicity of data types. The object description framework has been separated from the scene description to account for this fact and the related consequence that service providers may possibly wish to relocate streams in a simple way. Such relocation may require changes in the object descriptors; however, it will not affect the scene description. Therefore object descriptors improve content manageability.
The reader is referred to the MPEG-4 Systems specification [2] for the syntax and semantics of the various components of this framework and their usage within the context of MPEG-4.
MPEG-4 specifies a BInary Format for Scenes (BIFS) that is used to describe scene composition information: the spatial and temporal locations of objects in scenes, along with their attributes and behaviors. Elements of the scene and the relationships between them form the scene graph that must be coded for transmission. The fundamental scene graph elements are the "nodes" that describe audio-visual primitives and their attributes, along with the structure of the scene graph itself. BIFS draws heavily on this and other concepts employed by VRML [4].
Designed as a file format for describing 3D models and scenes ("worlds" in VRML terminology), VRML lacks some important features that are required for the types of multimedia applications targeted by MPEG-4. In particular, support for natural video and audio is basic (e.g., streaming of audio or video objects is not supported) and the timing model is loosely specified, implying that synchronization in a scene consisting of multiple media types cannot be guaranteed. Furthermore, VRML worlds are often very large (e.g., there is neither compression nor streaming of animations). Animations lasting around 30 seconds typically consume several megabytes of disk space. The strength of VRML is its scene graph description capabilities and this strength has been the basis upon which MPEG-4 scene description has been built.
BIFS includes support for almost all of the nodes in the VRML specifications. In fact, BIFS is essentially a superset of VRML, although there are some exceptions. BIFS does not yet support the PROTO and EXTERNPROTO nodes, nor does it support the use of the Java language in the Script nodes (BIFS only supports ECMAScript). BIFS does, however, expand significantly on VRML's capabilities in ways that allow a much broader range of applications to be supported. Note that a fundamental difference between the two is that BIFS is a binary format, whereas VRML is a textual format. So, although it is possible to design scenes that are compatible with both BIFS and VRML, transcoding between the representation formats is required.
Here, we highlight the functionalities that BIFS adds to the basic VRML set. Readers unfamiliar with VRML might find it useful to first acquire some background knowledge from [4].
Consider a simple coding scheme with the following tags:
<begin> - beginning of record
<end> - end of record
<break> - end of element in record
<string> - text string follows
<number> - number follows
We wish to use this scheme to code a record consisting of first name, last name and phone number, for example:
First name: Jim
Last name: Brown
Phone: 777 1234
With no knowledge of the context we would need to code this as:
<begin><string>Jim<break><string>Brown<break><number>7771234<break><end>
If the context is known, i.e. we know that the structure of the record is "string, string, number", we do not have to spend bits specifying the type of each element:
<begin>Jim<break>Brown<break>7771234<end>
Figure 4: Simple example of the use of context in efficient coding
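The idea in Figure 4 can be made concrete with a short sketch. It uses the tag names from the list above (with <begin> as the record opener) and represents the coded record as a text string for readability; an actual binary format would spend bits rather than characters. The function names and the schema tuple are assumptions made for this illustration.

```python
def encode_without_context(record):
    """Code each element with an explicit type tag, as a context-free decoder needs."""
    parts = ["<begin>"]
    for value in record:
        tag = "<number>" if isinstance(value, int) else "<string>"
        parts.append(f"{tag}{value}<break>")
    parts.append("<end>")
    return "".join(parts)

def encode_with_context(record):
    """Omit the type tags: both sides already know the record structure."""
    return "<begin>" + "<break>".join(str(v) for v in record) + "<end>"

def decode_with_context(coded, schema):
    """Recover the typed values using the shared schema (the context)."""
    body = coded[len("<begin>"):-len("<end>")]
    fields = body.split("<break>")
    return [kind(text) for kind, text in zip(schema, fields)]

record = ["Jim", "Brown", 7771234]
long_form = encode_without_context(record)
short_form = encode_with_context(record)
print(long_form)   # <begin><string>Jim<break><string>Brown<break><number>7771234<break><end>
print(short_form)  # <begin>Jim<break>Brown<break>7771234<end>
print(decode_with_context(short_form, (str, str, int)))  # ['Jim', 'Brown', 7771234]
```

The shorter form loses nothing, because the schema supplies the type information that the tags carried in the longer form.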
The MPEG-4 SDM is conceived as an adaptation of its MPEG-2 predecessor. The System Target Decoder in MPEG-2 is a model that precisely describes the temporal and buffer constraints under which a set of elementary streams may be packetized and multiplexed. Due to the generic approach taken towards stream delivery, which includes stream multiplexing, MPEG-4 chose not to define multiplexing constraints in the SDM. Instead, the SDM assumes the concurrent delivery of an arbitrary number of already demultiplexed elementary streams to the decoding buffers of their respective decoders. A constant end-to-end delay is assumed between the encoder output and the input to the decoding buffer on the receiver side. This leaves the task of handling the delivery jitter (including multiplexing) to the delivery layer.
Timing of streams is expressed in terms of decoding and composition time of individual access units within the stream. Access units are the smallest sets of data to which individual presentation time stamps can be assigned (e.g., a video object plane). The decoding time stamp indicates the point in time at which an access unit is removed from the decoding buffer, instantaneously decoded, and moved to the composition memory. The composition time stamp allows the separation of decoding and composition times, to be used for example in the case of bi-directional prediction in visual streams. This idealized model allows the encoding side to monitor the space available in the decoding side's buffers, thus helping it, for example, to schedule ahead-of-time delivery of data. Of course, resource management for the memory holding decoded data would be desirable as well. However, it was acknowledged that this issue is strongly linked with memory use for the composition process itself, which is considered outside the scope of MPEG-4. Therefore, management of composition buffers is not part of the model.
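The way an encoder can track the decoding buffer under this idealized model may be sketched as follows. The sketch assumes that each access unit arrives in the buffer all at once and is removed instantaneously at its decoding time; the function name, the tuple layout, and the millisecond time unit are choices made for this illustration only.

```python
def buffer_occupancy(access_units, buffer_size):
    """Trace decoding buffer fullness for one elementary stream.

    access_units: list of (arrival_time, decoding_time, size_in_bytes).
    Returns a list of (time, occupancy) pairs, one per event.
    Raises OverflowError if the buffer would exceed buffer_size.
    """
    events = []
    for arrival, decoding_time, size in access_units:
        if arrival > decoding_time:
            raise ValueError("access unit arrives after its decoding time")
        events.append((arrival, 0, size))         # kind 0: data enters the buffer
        events.append((decoding_time, 1, -size))  # kind 1: instantaneous removal
    # At equal times arrivals are applied before removals, which is the
    # conservative choice when checking for overflow.
    events.sort()
    occupancy = 0
    trace = []
    for time, _kind, delta in events:
        occupancy += delta
        if occupancy > buffer_size:
            raise OverflowError(f"decoding buffer overflow at t={time}")
        trace.append((time, occupancy))
    return trace

# Times in milliseconds, sizes in bytes.
units = [(0, 200, 3000), (100, 240, 1000), (200, 280, 1000)]
print(buffer_occupancy(units, buffer_size=5000))
# [(0, 3000), (100, 4000), (200, 5000), (200, 2000), (240, 1000), (280, 0)]
```

An encoder running such a check before sending can delay or reorder delivery so that the receiver's buffer, whose size it knows, never overflows.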
Time stamps are readings of an object time base (OTB) that is valid for an individual stream or a set of elementary streams. At least all the streams belonging to one audio-visual object have to follow the same OTB. Since the OTB in general is not a universal clock, object clock reference time stamps can be conveyed periodically with an elementary stream to make it known to the receiver. This is, in fact, done on the wrapper layer around elementary streams, called the sync layer (SL). The sync layer provides the syntactic elements to encode the partitioning of elementary streams into access units and to attach both decoding and composition time stamps as well as object clock references to a stream. The resulting stream is called an SL-packetized stream. This syntax provides a uniform shell around elementary streams, providing the information that needs to be shared between the compression layer and the delivery layer in order to guarantee timely delivery of each access unit of an elementary stream.
Different from its predecessor in MPEG-2, the packetized elementary stream (PES), the sync layer does not constitute a self-contained stream, but rather a packet-based interface to the delivery layer. This takes into account the properties of typical delivery layers like IP, H.223 or MPEG-2 itself, into which SL-packetized streams are supposed to be mapped. There is no need to encode either unique start codes or the length of an SL packet within the packet, since synchronization and length encoding are already provided by the mentioned delivery layer protocols.
Furthermore, MPEG-4 has to operate both at very low and rather high bitrates. This has led to a flexible design of the sync layer elements, making it possible to encode time stamps of configurable size and resolution, as required in a specific content or application scenario. The flexibility is made possible by means of a descriptor that is conveyed as part of the elementary stream descriptor that summarizes the properties of each (SL-packetized) elementary stream.
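The trade-off behind configurable time stamps can be shown with a small sketch. The parameter names `resolution` (clock ticks per second) and `length_bits` (number of bits in the coded field) are assumptions for this example; the actual configuration elements are defined in the specification [2].

```python
def encode_time_stamp(seconds, resolution, length_bits):
    """Quantize a time to clock ticks and keep only the low length_bits bits."""
    ticks = round(seconds * resolution)
    return ticks % (1 << length_bits)

def wrap_period(resolution, length_bits):
    """Seconds after which the coded time stamp value repeats."""
    return (1 << length_bits) / resolution

# A 16-bit field at millisecond resolution.
print(encode_time_stamp(1.5, 1000, 16))  # 1500
print(encode_time_stamp(70, 1000, 16))   # 4464 (the field has wrapped once)
print(wrap_period(1000, 16))             # 65.536

# A low bitrate stream can use a coarser clock and a shorter field.
print(wrap_period(10, 8))                # 25.6
```

A shorter field costs fewer bits per access unit but wraps sooner and resolves time more coarsely, which is why the choice is left to the content or application scenario.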
Delivery of MPEG-4 content is a task that is supposed to be dealt with outside the MPEG-4 Systems specification. All access to delivery layer functionality is conceptually done only through a semantic interface, called the DMIF Application Interface (DAI). It is specified in Part 6 of MPEG-4, Delivery Multimedia Integration Framework (DMIF) [7]. In practical terms this means that the specification of control and data mapping to underlying transport protocols or storage architectures is to be done jointly with the respective organization that manages the specification of the particular delivery layer. For example, for the case of MPEG-4 transport over IP, development work is done jointly with the Internet Engineering Task Force (IETF).
An analysis of existing delivery layer properties showed that there might be a need for an additional layer of multiplexing, in order to map the occasionally bursty and low bitrate MPEG-4 streams to a delivery layer protocol that exhibits fixed packet size or too much packet overhead. Furthermore, the provision of a large number of delivery channels may have a substantial burden in terms of management and cost. Therefore, a very simple multiplex packet syntax has been defined, called the FlexMux. It allows multiplexing a number of SL-packetized streams into a self-contained FlexMux stream with rather low overhead. It is proposed as an option to designers of the delivery layer mappings but is not used for the definition of MPEG-4 conformance points.
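The principle of such a lightweight multiplex can be sketched as follows. The layout used here, a one-byte channel index followed by a one-byte payload length, is an assumption chosen for illustration and should not be read as the normative FlexMux syntax, which is given in the specification [2].

```python
def mux_pack(packets):
    """Interleave (channel_index, payload) pairs into one byte stream.

    Assumed layout per packet: 1 byte index, 1 byte length, then the payload.
    """
    out = bytearray()
    for index, payload in packets:
        if not 0 <= index <= 255:
            raise ValueError("channel index must fit in one byte")
        if len(payload) > 255:
            raise ValueError("payload must fit the one-byte length field")
        out += bytes([index, len(payload)]) + payload
    return bytes(out)

def mux_unpack(stream):
    """Split a multiplexed byte stream back into (channel_index, payload) pairs."""
    packets = []
    pos = 0
    while pos < len(stream):
        if pos + 2 > len(stream):
            raise ValueError("truncated packet header")
        index, length = stream[pos], stream[pos + 1]
        pos += 2
        payload = stream[pos:pos + length]
        if len(payload) != length:
            raise ValueError("truncated packet payload")
        packets.append((index, payload))
        pos += length
    return packets

sl_packets = [(1, b"audio-0"), (2, b"video-0"), (1, b"audio-1")]
muxed = mux_pack(sl_packets)
print(len(muxed) - sum(len(p) for _, p in sl_packets))  # 6 bytes of overhead
print(mux_unpack(muxed) == sl_packets)                  # True
```

Because the multiplex packet carries its own length, the resulting stream is self-contained, and the enclosed SL packets do not need to repeat that information.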
The technologies considered for standardization in MPEG-4 were not all identically mature. Therefore, the MPEG-4 project in general and MPEG-4 Systems in particular, was organized in two phases: Version 1 and Version 2. The tools described above already contain the majority of the functionality of MPEG-4 Systems and allow the development of compelling multimedia applications. These are provided by the current MPEG-4 standard, so called MPEG-4 "Version 1". Extension of the standard in the form of amendments, the so called "Version 2", completes the Version 1 toolbox with new tools and new functionalities. Version 2 tools are not intended to replace any of the Version 1 tools. On the contrary, Version 2 is a completely backward compatible extension of Version 1.
Version 2 will provide for the following additional BIFS functionalities:
In Version 1 of MPEG-4, there is no normative support for the structure of upstream data or its semantics. Version 2 standardizes both the mechanisms with which the transmission of such data is triggered at the terminal, as well as its formats as it is transmitted back to the sender. The inclusion of a normative specification for backchannel information in Version 2 closes the loop between the terminal and its server, vastly expanding the types of applications that can be implemented on an MPEG-4 infrastructure.
The BIFS Scene Description framework, described in the previous section, offers a parametric methodology for scene structure representation in addition to efficiently coding it for transmission over the wire. Version 2 of the MPEG-4 standard also offers a programmatic environment, in addition to this parametric capability. Version 2 defines a set of Java™ language APIs (MPEG-J) through which access to an underlying MPEG-4 engine can be provided to Java applets (called MPEG-lets). This tool forms the basis for very sophisticated applications, opening up completely new ways for audio-visual content creators to augment the use of their content.
Finally, MPEG-4 Systems completes the toolbox for transport and storage of MPEG-4 content by providing:
This section completes the description of MPEG-4 Systems by attempting a fair comparison between the tools provided by MPEG-4 Systems and those that can be found, or will be found in the near future, in applications in the marketplace.
Technical issues aside, the mere fact of being proprietary is a significant disadvantage in the content industry when open standard alternatives exist. With the separation of content production, delivery, and consumption stages in the multimedia pipeline, the MPEG-4 standard will enable different companies to separately develop authoring tools, servers, or players, thus opening up the market to independent product offerings. This competition is then very likely to allow a fast proliferation of content and tools that will inter-operate.
As stated in Section 4.4, MPEG-4 Systems does not specify or standardize a transport protocol. In fact, it is designed to be transport-agnostic. However, in order to be able to utilize the existing transport infrastructures (e.g. MPEG-2 transport or IP networks), MPEG-4 defines an abstraction of the delivery layer with specific mappings of MPEG-4 content on existing transport mechanisms [7]. However, there are two exceptions in terms of the abstraction of the delivery layer: the MPEG-4 File Format and the FlexMux tool.
There are several available file formats for storing, streaming, and authoring multimedia content. Among the ones most used presently are Microsoft's ASF (Advanced Streaming Format), Apple's QuickTime, as well as the RealNetworks file format (RMFF). The ASF and QuickTime formats were proposed to MPEG-4 in response to a call for proposals on file format technology. QuickTime has been selected as the starting point for the collaborative development of the MPEG-4 file format (referred to as MP4) [3]. The RMFF format has several similarities with QuickTime (in terms of object tagging using four-character strings and the way indexing information is provided). MP4 inherits from QuickTime key technical features such as the ability to stream content from multiple sources (local or through a network) with interactivity and known rendering quality. In addition to these key features, the MP4 format adds support for MPEG-4 specific features. In particular, MP4 inherits from MPEG-4 all the new and compelling audio, video and systems multimedia content.
The MPEG-4 proposal for lightweight multiplexing (FlexMux) addresses some MPEG-4 specific needs, as described in Section 4.4. The same kinds of requirements have recently been raised within the Internet Engineering Task Force (IETF) with regard to the delivery of multimedia content over IP networks. With the increasing number of streams in a single multimedia program, possibly low network bandwidths, and unpredictable temporal network behavior, the overhead incurred by the use of RTP streams and their management in the receiving terminals is becoming considerable. The IETF is therefore currently investigating a generic multiplexing solution with requirements similar to those of the MPEG-4 FlexMux. We expect that, with close collaboration between IETF and MPEG-4, a consistent solution will be developed for the transport of MPEG-4 content over IP networks.
With the specifications of the Object Description framework and the Sync Layer, MPEG-4 Systems provides a consistent and efficient framework for the description of content and the means for its synchronized presentation at client terminals. At this juncture, this framework, with its flexibility and dynamics, has no equivalent in the standards arena. A parallel could be drawn with the combination of RTP and SDP (Session Description Protocol); however, such a solution is Internet-specific and cannot be applied directly to other systems such as digital cable or DVDs (Digital Video Discs).
Within the context of Web applications, a number of competitors to MPEG-4 BIFS base their syntax architecture on XML (Extensible Markup Language) [9], while MPEG-4 bases its syntax architecture on VRML, using a binary form described in SDL (the MPEG-4 Syntactic Description Language). The main difference between the two is that XML is a general-purpose, text-based description language for tagged data, while VRML with SDL provides a binary format for a scene description language.
The competition is first at the level of the semantics, i.e., the functionality provided by the representation. At the time the MPEG-4 standard was published, several specifications were providing semantics with an XML-compliant syntax to solve multimedia representation in specific domains. For example, the W3C (World Wide Web Consortium) HTML-NG (HTML Next Generation) activity was redesigning HTML to be XML-compliant [10]. The W3C SMIL (Synchronized Multimedia Integration Language) working group had produced a specification for 2D multimedia scene description [11]. The ATSC/DASE (Advanced Television Systems Committee/Digital-TV Application Software Environments) BHTML (Broadcast HTML) specifications were aiming to provide broadcast extensions to HTML-NG. The Web3D Consortium's X3D (Extensible 3D) requirements activity was investigating the use of XML for 3D scene description, whereas W3C SVG (Scalable Vector Graphics) was standardizing scalable vector graphics, also in an XML-compliant way [12].
MPEG-4 is built on a true 3D scene description, including the event model, as provided by VRML. None of the XML-based specifications currently available reaches the sophistication of MPEG-4 in terms of composition capabilities and interactivity features. Furthermore, incorporation of the temporal component in terms of streamed media, including scene descriptions, is a non-trivial matter. MPEG-4 has successfully addressed this issue, as well as the overall timing and synchronization issues, whereas alternative approaches are lacking in this respect.
A second level of competition can be seen in the coded representation of the scene structure (text-based versus binary representation). An advantage of a text-based approach is ease of authoring; documents can be easily generated using a text editor. Such textual representations for MPEG-4 content, based on extensions of VRML or on XML, were under consideration at the time this paper was published. However, for delivery and streaming over a finite-bandwidth medium, a compressed representation of multimedia information is without a doubt the best approach from the point of view of bandwidth efficiency. This is the problem primarily addressed and solved by MPEG-4. None of the XML-based approaches has satisfactorily addressed this issue so far.
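The bandwidth argument can be illustrated with a toy comparison. The sketch below is not the BIFS encoding and uses no MPEG-4 syntax: it simply packs a hypothetical node identifier and a 3D translation into fixed-width binary fields and compares the result with a VRML-like textual rendering of the same values. The node layout, field widths, and variable names are all assumptions made for illustration.

```python
import struct

# Hypothetical scene fragment: a node id plus a 3D translation.
node_id = 7
translation = (1.5, -2.0, 0.25)

# Text form: readable and easy to author in a text editor.
text_form = "Transform { id %d translation %.2f %.2f %.2f }" % (node_id, *translation)

# Binary form: one unsigned byte for the id, three 32-bit big-endian floats.
binary_form = struct.pack(">B3f", node_id, *translation)

text_size = len(text_form.encode("ascii"))
binary_size = len(binary_form)
print(text_size, binary_size)

# The binary form round-trips the same values.
decoded = struct.unpack(">B3f", binary_form)
```

Even this naive fixed-width packing is several times smaller than the text form. A coder designed for scene descriptions can go further, at the cost of no longer being editable by hand.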
Indeed, in addition to XML semantics that leverage the functionality developed by MPEG-4, a complete competitive solution would also need to define a binary mapping as well as media streaming and synchronization mechanisms. As of June 1999, there was no evidence of how this could happen in the short or medium term. All the potential alternative frameworks are at the stage of research and specification development, while MPEG-4 is at the stage of standard verification and deployment. There is no evidence that any of these frameworks will be able to leverage efficiently all of the advantages specific to MPEG-4, including compression, streaming, and synchronization. Finally, the fragmented nature of the development of XML-based specifications by different bodies and industries certainly hinders integrated solutions. This may cause a distorted vision of the integrated, targeted system as well as duplication of functionality.
In summary, the key features of MPEG-4 Systems can be stated as follows:
The MPEG-4 Systems tools can be used separately in some applications. MPEG-4 Systems also guarantees, however, that they will work together in an integrated way, as well as with the other tools specified within the MPEG-4 standards.
The MPEG-4 Systems specification reflects the results of teamwork within a worldwide project in which many people invested enormous time and energy. The authors would like to thank them all, and hope that the experience and results achieved at least matched the level of their expectations.
Among the numerous contributors to the MPEG-4 Systems project, three individuals stand out for their critical contributions to the development of MPEG-4: Cliff Reader, who originally articulated the vision of MPEG-4 in an eloquent way and led it through its early formative stages; Phil Chou, who was the first to propose a consistent and complete architecture that could support such a vision; and Zvi Lifshitz, who, by leading the IM1 software implementation project, made the vision manifest itself in the form of a real-time player.
Finally, the authors would like to thank: Ananda Allys (France Telecom CNET), for providing some of the pictures used in this paper; the European project MoMuSys for its support of Olivier Avaro and Liam Ward; the National Science Foundation and the industrial sponsors of Columbia's ADVENT Project for their support of Alexandros Eleftheriadis; the German project MINT for its support of Carsten Herpel; and the General Instrument Corporation for its support of Ganesh Rajan.