Welcome to the ISO/IEC JTC 1/SC 29/WG 11 web site

also known as MPEG, the Moving Picture Experts Group.

The MPEG acronym is also used to indicate a suite of

ISO/IEC digital media standards developed by this JTC 1 Working Group.



Audio

Part number: 4
Activity status: Closed

MPEG-7 Audio

MPEG doc#: N7708
Date: October 2005
Author:

Low level descriptions

1. Introduction

The MPEG-7 Audio standard contains description tools for describing audio content. The extraction of the low level descriptors is normative and is based on the audio signal itself. With the help of low level descriptors it is possible to search and filter audio content with regard to, e.g., spectrum, harmony, timbre and melody.

2. Low Level Tools

The low-level audio descriptors (LLDs) are useful in describing audio. There are seventeen temporal and spectral parameters, which can be divided into the following groups:

  • Basic: instantaneous waveform and power values
  • Basic spectral: log-frequency power spectrum and spectral features (e.g. spectral centroid, spectral spread, spectral flatness)
  • Signal parameters: fundamental frequency and harmonicity of signals
  • Temporal timbral: log attack time and temporal centroid
  • Spectral timbral: specialized spectral features in a linear frequency space…
  • Spectral basis representations: a number of features, used in conjunction, that project the spectrum into a low-dimensional space for sound recognition
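As an illustration of the basic spectral group, the sketch below computes a spectral centroid and spread from a power spectrum. This is a simplified Python example assuming NumPy; it is not the normative MPEG-7 AudioSpectrumCentroid/AudioSpectrumSpread extraction, which operates on a log-frequency power spectrum.

```python
import numpy as np

def spectral_centroid_and_spread(frame, sample_rate):
    """Illustrative (non-normative) spectral centroid and spread.

    The centroid is the power-weighted mean frequency of the frame;
    the spread is the power-weighted standard deviation around it.
    """
    spectrum = np.abs(np.fft.rfft(frame)) ** 2              # power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    total = spectrum.sum()
    if total == 0:
        return 0.0, 0.0
    centroid = (freqs * spectrum).sum() / total
    spread = np.sqrt(((freqs - centroid) ** 2 * spectrum).sum() / total)
    return centroid, spread

# A 1 kHz sine at 16 kHz (64 full cycles in 1024 samples, so no leakage)
# should have its centroid at 1000 Hz and near-zero spread.
sr = 16000
t = np.arange(1024) / sr
c, s = spectral_centroid_and_spread(np.sin(2 * np.pi * 1000 * t), sr)
```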

An LLD can be instantiated as a single value for an audio segment or as a series of values. MPEG-7 Audio therefore defines two different LLD types: AudioLLDScalarType is used for scalar values such as power or fundamental frequency, and AudioLLDVectorType is used for vector-valued features such as spectra. Any descriptor that inherits from one of these two types can be instantiated.

The samples can be further manipulated using ScalableSeries, which allow a downsampling of the data. ScalableSeries can store various kinds of summaries such as minimum, maximum, mean and variance.
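The summarization idea can be sketched as follows. The function name, the returned field names and the use of NumPy are illustrative assumptions, not the normative ScalableSeries syntax; the point is only that raw samples are grouped into blocks and replaced by summary statistics.

```python
import numpy as np

def scale_series(samples, ratio):
    """Sketch of ScalableSeries-style downsampling (assumed simplification):
    group consecutive samples into blocks of `ratio` values and keep
    summary statistics instead of the raw data."""
    samples = np.asarray(samples, dtype=float)
    n_blocks = len(samples) // ratio
    blocks = samples[:n_blocks * ratio].reshape(n_blocks, ratio)
    return {
        "min": blocks.min(axis=1),
        "max": blocks.max(axis=1),
        "mean": blocks.mean(axis=1),
        "variance": blocks.var(axis=1),
    }

# Eight samples downsampled by a ratio of 4 yield two summary blocks.
summary = scale_series([1, 2, 3, 4, 5, 6, 7, 8], ratio=4)
# summary["mean"] → [2.5, 6.5]
```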

All low level audio descriptors are based on either AudioLLDScalarType or AudioLLDVectorType. Within these types a default sampling period of 10 ms (the hopSize) has been defined. All descriptors should use this hopSize or an integer multiple of it. 10 ms was chosen to maximize compatibility with common audio sampling frequencies.
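The hopSize convention can be sketched as follows; the function name and the 16 kHz example are illustrative assumptions, not part of the standard.

```python
def frame_starts(num_samples, sample_rate, hop_multiple=1):
    """Frame start positions on the default MPEG-7 hopSize grid.

    The default sampling period is 10 ms; descriptors use this hopSize
    or an integer multiple of it (hop_multiple).
    """
    hop = int(sample_rate * 0.010) * hop_multiple  # 10 ms in samples
    return list(range(0, num_samples, hop))

# At 16 kHz a 10 ms hop is exactly 160 samples, which is why 10 ms
# divides evenly into common audio sampling frequencies.
starts = frame_starts(800, 16000)  # → [0, 160, 320, 480, 640]
```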

The next figure shows the class hierarchy for MPEG-7 low-level audio descriptors.

Figure 1: class hierarchy of MPEG-7 Audio Low Level Descriptors

3. Applications

It has often been said that MPEG-7 will make the web as searchable for multimedia content as it is for text today. This also applies to making large content archives accessible to the public (or to enabling people to identify content to buy). The same information used for content retrieval may also be used by agents for the selection and filtering of broadcast or "push" material. Additionally, the metadata may be used for more advanced access to the underlying data, enabling automatic or semi-automatic multimedia presentation or editing.

4. References

High level descriptions

MPEG doc#: N7709
Date: October 2005
Author:

1. Introduction to MPEG-7 High Level Tools

The MPEG-7 Audio standard also contains high level audio description tools. Into this category fall all descriptors with a higher semantic level. Their extraction is mostly non-normative; only the XML output, and sometimes the field of application, is normative.

One example of a high level descriptor is the Melody descriptor. Many different extraction algorithms and implementation proposals for MPEG-7 melodies can be found in the technical literature, but only the result is standardized. High level tools thus allow the decoding and encoding of metadata, while the extraction performance can vary depending on the underlying algorithm.

2. Overview of Existing Tools

  • General sound recognition and indexing tools

These tools are a collection of tools that can index or categorize music, sound effects or sounds in general, for example by similarity measures or genre recognition.

  • Spoken content description tools

This approach acknowledges that today's speech recognition systems are imperfect, so it can be insufficient to store only the textual transcription. The description scheme therefore consists of combined word and phone lattices for each speaker in an audio stream. By combining these lattices, retrieval may still be possible even when the original decoding was in error.
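The back-off idea behind combining the two lattices can be illustrated with a deliberately tiny sketch. The lattices are flattened to a word set and a phone string, and the lexicon, phone symbols and function name are all hypothetical; a real SpokenContent lattice also carries timing and probabilities.

```python
def lattice_contains(query, word_hyps, phone_hyps, pronunciations):
    """Toy illustration of combined word/phone lattice retrieval
    (assumed simplification of the MPEG-7 spoken content idea).

    word_hyps: set of word hypotheses the recognizer produced.
    phone_hyps: the recognized phone sequence, as a space-separated string.
    pronunciations: hypothetical lexicon mapping words to phone strings.
    """
    if query in word_hyps:          # direct hit in the word lattice
        return True
    # Back off: search for the query's phone sequence in the phone
    # lattice, which can succeed even when the word decoding failed.
    phones = pronunciations.get(query)
    return phones is not None and phones in phone_hyps

# "speech" was misrecognized as a word, but its phones survived.
word_hyps = {"recognize", "beach"}
phone_hyps = "r eh k ax g n ay z s p iy ch"
lex = {"speech": "s p iy ch"}
found = lattice_contains("speech", word_hyps, phone_hyps, lex)  # → True
```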

  • Musical instrument timbre description tools

These description tools have been proposed for describing perceptual features of instrument sounds. They relate to notions such as 'attack', 'brightness' or 'richness' of a sound.

  • Melody description tools

Melody description tools support the representation of monophonic melodies. They comprise two different approaches to representing melody lines. The first approach (Melody Contour Description Scheme) uses a 5-step contour in which the intervals between notes are quantized; notes are stored as numbers relative to the underlying scale. The second approach (Melody Sequence Description Scheme) stores the exact pitch interval between adjacent notes, so quantization is not necessary.
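The contour quantization can be sketched as below. The exact thresholds used here (unison maps to 0, intervals up to two semitones to ±1, larger intervals to ±2) are an illustrative assumption, not quoted from the standard, as is the function name.

```python
def contour_step(interval_semitones):
    """Quantize one melodic interval into a 5-step contour value
    in {-2, -1, 0, +1, +2} (illustrative thresholds)."""
    if interval_semitones == 0:
        return 0
    sign = 1 if interval_semitones > 0 else -1
    return sign * (1 if abs(interval_semitones) <= 2 else 2)

# Opening notes C-D-E-C as MIDI numbers: two small steps up, one
# larger leap down.
notes = [60, 62, 64, 60]
intervals = [b - a for a, b in zip(notes, notes[1:])]   # [2, 2, -4]
contour = [contour_step(i) for i in intervals]          # [1, 1, -2]
```

The Melody Sequence approach would instead keep the `intervals` list itself, trading compactness for exactness.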

3. Applications


Many applications using MPEG-7 technology have been developed recently.

One application used for audio identification is called AudioID. It uses the AudioSignatureType descriptor: the content provider extracts a fingerprint and transmits it to the client, who uses it for music retrieval. Such clients can be broadcast stations performing broadcast monitoring, or cellular phone providers performing music recognition via cell phone.
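The overall pattern can be illustrated with a toy fingerprint. This is not the normative AudioSignatureType extraction; it only shows the workflow the paragraph describes: reduce the signal to a compact, robust feature sequence on the provider side, then match against it on the client side. All names and parameters are illustrative assumptions.

```python
import numpy as np

def fingerprint(signal, frame=1024, n_bands=8):
    """Toy fingerprint: per frame, mark which spectral bands carry
    more than the frame's median band energy (NOT the normative
    MPEG-7 AudioSignature extraction)."""
    feats = []
    for start in range(0, len(signal) - frame + 1, frame):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame])) ** 2
        bands = np.array_split(spectrum, n_bands)
        energies = np.array([b.sum() for b in bands])
        feats.append(energies > np.median(energies))   # robust binary code
    return np.array(feats)

def match(query, reference):
    """Fraction of identical bits between two fingerprints."""
    n = min(len(query), len(reference))
    return np.mean(query[:n] == reference[:n])
```

A query fingerprint of a clip is compared against a database of reference fingerprints, and the best-scoring reference identifies the recording.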

4. References