The Moving Picture Experts Group

Streaming Text Format


MPEG Streaming Text 


MPEG doc#: N7515

Date: July 2005

Author: Jan van der Meer



This Specification was developed in response to the need for a generic method for coding text at very low bitrates as one of the multimedia components within audiovisual presentations. It allows, for example, subtitles and Karaoke song texts to be coded and transported as separate text streams, for presentation jointly with the other components of an audiovisual presentation, at bitrates that are sufficiently low for use in mobile services over IP.

Coding of text streams

MPEG-4 part 17 defines Text Streams that are capable of carrying 3GPP Timed Text, as specified in 3GPP TS 26.245. To transport the text streams, a flexible framing structure is specified that can be conveniently adapted to the various transport layers, such as RTP for transport over IP, and MPEG-2 Systems for use in media such as broadcast and optical discs.

3GPP TS 26.245 defines timed text data as consisting of text samples and sample descriptions. Each text sample consists of one text string, optionally followed by one or more text modifiers. The text string represents the characters that form the text to be displayed, while the text modifiers carry the changes that are to be applied to the text string while it is displayed within a text box, such as text color changes synchronized with a song for a Karaoke application.

A sample description provides global information about a text sample, for example the font(s) to be used, the positioning of the text within the text box, and the background color of that text box. Multiple sample descriptions are allowed: each sample description (SD) is assigned an index, and each text sample is associated with the index of the applicable sample description. While a sample description will typically apply to multiple text samples, exactly one sample description applies to each text sample.
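The relationship between text samples and sample descriptions can be sketched as a small data model. This is only an illustration in Python: the class names, field names and index values are hypothetical and are not taken from 3GPP TS 26.245 or from this Specification.

```python
from dataclasses import dataclass, field


@dataclass
class SampleDescription:
    """Global presentation settings shared by many text samples (hypothetical fields)."""
    font: str
    background_rgba: tuple


@dataclass
class TextSample:
    """One text string plus optional modifiers, bound to exactly one sample description."""
    text: str
    sd_index: int                                   # index of the applicable sample description
    modifiers: list = field(default_factory=list)   # e.g. timed color changes


def resolve(sample, descriptions):
    """Return the single sample description that applies to `sample`."""
    return descriptions[sample.sd_index]


# Two sample descriptions, each identified by an index (values chosen for illustration).
descriptions = {
    1: SampleDescription(font="Serif", background_rgba=(0, 0, 0, 255)),
    2: SampleDescription(font="Sans", background_rgba=(0, 0, 128, 255)),
}

# Several samples may share one description; each sample refers to exactly one.
samples = [
    TextSample("Hello", sd_index=1),
    TextSample("world", sd_index=1),
    TextSample("La la la", sd_index=2, modifiers=["highlight characters 0-2"]),
]
```

Storing the index in the sample rather than a copy of the description keeps each sample small, which matters at the bitrates targeted here.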

Typically, a 3GPP text access unit is small, around 100 to 200 bytes, and often much smaller than the packets that carry the text data across a transport network. It is therefore expected that transport systems will often aggregate multiple 3GPP text access units into one transport packet. On the other hand, 3GPP text access units can also be large, for example when scrolling horizontal text at the bottom of the screen, in which case fragmentation of 3GPP text access units may be required prior to transport. In conclusion, transport of 3GPP text access units often requires aggregation and sometimes fragmentation. To aggregate and fragment 3GPP text access units conveniently and in a transport-independent manner, this Specification defines a flexible framing structure consisting of so-called Timed Text Units (TTUs).
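The aggregation and fragmentation behaviour described above can be illustrated with a minimal greedy packer. This is a sketch under simplifying assumptions, not the framing defined by this Specification: the function name `packetize` and the 1400-byte payload limit are hypothetical, and TTU and transport header overhead is ignored.

```python
def packetize(units, max_payload):
    """Group byte strings into packets of at most `max_payload` bytes.

    Small units are aggregated into a shared packet; a unit larger than
    `max_payload` is fragmented across several packets of its own.
    Each packet is returned as a list of byte strings.
    """
    packets = []
    current, used = [], 0
    for unit in units:
        if len(unit) > max_payload:
            # Flush any pending aggregate, then fragment the large unit.
            if current:
                packets.append(current)
                current, used = [], 0
            for offset in range(0, len(unit), max_payload):
                packets.append([unit[offset:offset + max_payload]])
        elif used + len(unit) > max_payload:
            # The unit does not fit in the current packet: start a new one.
            packets.append(current)
            current, used = [unit], len(unit)
        else:
            current.append(unit)
            used += len(unit)
    if current:
        packets.append(current)
    return packets


# Two small access units and one large one (sizes chosen for illustration).
units = [b"a" * 150, b"b" * 120, b"c" * 3000]
packets = packetize(units, max_payload=1400)
```

With these inputs the two small units share one packet and the 3000-byte unit is split across three further packets.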

Five different types of TTUs are defined: one for carriage of a complete 3GPP text access unit, three for carriage of text sample fragments, and one for carriage of a complete sample description. Three further types are reserved for future use. Because sample descriptions are small, there is no support for carriage of sample description fragments.
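As a rough illustration of how the five TTU types divide the work, the following Python sketch enumerates them and selects the applicable types for a given payload. The member names, the `kinds_for` helper and its parameters are hypothetical; the sketch deliberately assigns no numeric type codes, since those are defined by the Specification and not reproduced here.

```python
from enum import Enum, auto


class TTUKind(Enum):
    """The five defined TTU types (names are placeholders, not normative)."""
    WHOLE_ACCESS_UNIT = auto()    # a complete 3GPP text access unit
    SAMPLE_FRAGMENT_A = auto()    # three types carry text sample fragments
    SAMPLE_FRAGMENT_B = auto()
    SAMPLE_FRAGMENT_C = auto()
    SAMPLE_DESCRIPTION = auto()   # a complete sample description


FRAGMENT_KINDS = {
    TTUKind.SAMPLE_FRAGMENT_A,
    TTUKind.SAMPLE_FRAGMENT_B,
    TTUKind.SAMPLE_FRAGMENT_C,
}
RESERVED_KIND_COUNT = 3  # reserved for future use


def kinds_for(payload, fits_in_packet):
    """Return the TTU kinds that could carry `payload`.

    `payload` is "access_unit" or "sample_description" (hypothetical labels).
    """
    if payload == "sample_description":
        if not fits_in_packet:
            # Sample descriptions are never fragmented.
            raise ValueError("sample description fragments are not supported")
        return {TTUKind.SAMPLE_DESCRIPTION}
    if payload == "access_unit":
        return {TTUKind.WHOLE_ACCESS_UNIT} if fits_in_packet else set(FRAGMENT_KINDS)
    raise ValueError(f"unknown payload: {payload!r}")
```

For example, `kinds_for("access_unit", fits_in_packet=False)` returns the three fragment kinds, while a sample description that does not fit raises an error instead of being split.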

The flexible framing structure provided by TTUs allows for easy and convenient adaptation to the various transport layers, while keeping TTUs aligned with the transport packets used. For each transport layer the most suitable TTU structure can be chosen. For example, by using TTUs, small text samples can be aggregated into one transport packet, but TTUs can also be used to fragment text samples across multiple transport packets, while providing a reasonable level of error resilience in case of packet loss or non-recoverable packet errors. If so desired, the text data within a text access unit can be re-partitioned into TTUs for most effective adaptation to other transport systems. See Figure 1. TTUs are defined for 3GPP text streams only, but comparable structures may be defined for other text streams in future versions of this Specification.

Figure 1 — Carriage of text samples and sample descriptions in 3GPP text access units and the use of TTUs for creating a 3GPP text stream

For transport over IP, RFC 3640 has been defined to transport general MPEG-4 content, and it is also suitable for transporting MPEG-4 text streams. In addition, the IETF has defined RFC xxxx, which directly packetizes 3GPP Timed Text data without first constructing an MPEG-4 text stream; see Figure 2.

Figure 2 — ISO/IEC 14496-17 and transport of 3GPP Timed Text

To ensure maximum interoperability, the authors of RFC xxxx and of MPEG-4 part 17 have closely collaborated. As a result, both specifications use exactly the same structure for transport of 3GPP timed text data, which allows the RTP packet payloads of RFC 3640 and RFC xxxx to be fully identical. This is achieved, however, only when the same packetization rules are used for RFC 3640 as are defined for RFC xxxx.

Target applications

Target applications are found in particular in areas with severe transmission bandwidth constraints, such as mobile services over IP. However, services over broadband IP, over broadcast channels and over optical media may also benefit from the low bandwidth at which ISO/IEC 14496-17 can provide subtitles and Karaoke song texts to customers.


MPEG-4 part 17: ISO/IEC 14496-17:2005, Information Technology – Coding of audio-visual objects – Part 17: Streaming Text Format
3GPP Timed Text: 3GPP TS 26.245:2003, Timed text format (Release 6)
MPEG-2 Systems: ITU-T Rec. H.222.0 | ISO/IEC 13818-1, Information Technology – Generic coding of moving pictures and associated audio information – Part 1: Systems
RFC 3640: RTP payload for transport of generic MPEG-4 content
RFC xxxx: RTP payload for 3GPP Timed Text