MPEG doc#: N9769
Date: April 2008
1 Basic Structures
This group consists of six supporting tools for the MPEG-7 visual descriptors of elementary features such as color, texture and shape. The tools can be categorized into descriptor containers and basic supporting tools. The former category consists of four datatypes, namely GridLayout providing efficient representations of visual features on grids, VisualTimeSeries representing temporal arrays of several descriptions, MultipleView describing a 3D object using several pictures captured from different view angles, and GoFGoPFeature used to describe visual features representative of a series of video frames or a collection of pictures. The latter category contains two basic supporting tools, namely Spatial2DCoordinateSystem used to specify the 2D coordinate system and TemporalInterpolation indicating the interpolation method between two samples on a time axis.
2 Color Description Tools
Color is the most basic attribute of visual content. MPEG-7 Visual defines various color descriptors and supporting tools. There are five color descriptors to represent different aspects of color features: DominantColor for representative colors, ScalableColor for basic color distribution, ColorLayout for global spatial distribution of colors, ColorStructure for local spatial distribution of colors, and ColorTemperature describing the perceptual temperature feeling of an image. In addition, the GoFGoPColor descriptor is defined as an extension of ScalableColor to groups of frames or pictures. Finally, three supporting tools are defined: ColorSpace, ColorQuantization, and IlluminationInvariantColor, the latter providing the means to achieve illumination invariance with existing color descriptors. All the descriptors and tools are applicable to arbitrarily shaped regions.
The DominantColor descriptor characterizes an image or region by a small number of representative colors. These are selected by quantizing pixel colors into (up to eight) principal clusters. The description then consists of the fraction of the image or region represented by each color cluster and the variance of each one. A measure of overall spatial coherency of the clusters is also defined. This descriptor provides a very compact description of the representative colors in an image.
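The content of such a description can be illustrated with a minimal sketch. It assumes the cluster centers are already known and only shows how the per-cluster fraction and variance could be derived; the function name summarize_clusters and the use of RGB tuples are hypothetical choices for illustration, and the clustering procedure and the spatial coherency measure are not reproduced here.

```python
def summarize_clusters(pixels, centroids):
    """Assign each pixel to its nearest centroid and summarize every cluster.

    pixels and centroids are lists of (r, g, b) tuples. Both the name and the
    data layout are assumptions made for this sketch, not part of the standard.
    """
    groups = [[] for _ in centroids]
    for p in pixels:
        nearest = min(
            range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
        )
        groups[nearest].append(p)

    summary = []
    for members in groups:
        if not members:
            continue  # an empty cluster contributes nothing to the description
        n = len(members)
        channels = list(zip(*members))
        mean = tuple(sum(ch) / n for ch in channels)
        variance = tuple(
            sum((v - m) ** 2 for v in ch) / n for ch, m in zip(channels, mean)
        )
        summary.append(
            {"color": mean, "fraction": n / len(pixels), "variance": variance}
        )
    return summary


# Two dark pixels and two bright pixels, with two assumed cluster centers.
example = summarize_clusters(
    [(0, 0, 0), (10, 0, 0), (250, 250, 250), (250, 250, 250)],
    [(0, 0, 0), (255, 255, 255)],
)
```

Each entry in the result carries a representative color, the fraction of pixels it covers, and the spread of the cluster, which mirrors the three quantities named in the paragraph above.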
The ScalableColor descriptor is a color histogram in the HSV color space, which is encoded by a Haar transform. It has a binary representation that is scalable, in terms of bin numbers and bit representation accuracy, over a broad range of granularity. Retrieval accuracy can therefore be balanced against descriptor size. Inversion of the Haar transform is not necessary for performing descriptor comparisons, since similarity matching is also effective in the transform domain.
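The scalability property can be illustrated with a small sketch of a sum/difference Haar decomposition applied to a one-dimensional histogram. This is an assumption-laden simplification: the function names are hypothetical, the transform is unnormalized, and the exact bin layout, coefficient ordering and quantization used by the standard are not reproduced.

```python
def haar_1d(values):
    """Full 1D Haar decomposition using pairwise sums and differences."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    length = n
    while length > 1:
        half = length // 2
        sums = [out[2 * i] + out[2 * i + 1] for i in range(half)]
        diffs = [out[2 * i] - out[2 * i + 1] for i in range(half)]
        out[:length] = sums + diffs  # coarse part first, detail part after it
        length = half
    return out


def l1_distance(a, b):
    """Compare two descriptors directly in the transform domain."""
    return sum(abs(x - y) for x, y in zip(a, b))


fine = haar_1d([4, 2, 5, 5])   # 4-bin histogram
coarse = haar_1d([6, 10])      # same data merged into 2 bins
```

In this sketch the first two coefficients of the 4-bin transform equal the transform of the 2-bin histogram, so a shorter descriptor is simply a prefix of a longer one, and two descriptors can be compared with l1_distance without inverting the transform.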
The ColorLayout descriptor represents the spatial layout of color images in a very compact form. It is based on generating a tiny (8x8) thumbnail of an image, which is encoded via DCT and quantized. As well as enabling efficient visual matching, this offers a quick way to visualize the appearance of an image: inverting the DCT reconstructs an approximation of the thumbnail.
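The thumbnail-plus-DCT idea can be sketched for a single color channel as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the image is a list of rows whose height and width are multiples of 8, block averaging is used to build the thumbnail, and the color space conversion, coefficient scan order and quantization of the actual descriptor are omitted.

```python
import math


def thumbnail_8x8(channel):
    """Average a 2D list (height and width multiples of 8) into an 8x8 grid."""
    h, w = len(channel), len(channel[0])
    bh, bw = h // 8, w // 8
    thumb = []
    for by in range(8):
        row = []
        for bx in range(8):
            total = sum(
                channel[y][x]
                for y in range(by * bh, (by + 1) * bh)
                for x in range(bx * bw, (bx + 1) * bw)
            )
            row.append(total / (bh * bw))
        thumb.append(row)
    return thumb


def dct_2d(block):
    """Orthonormal 2D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def scale(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            acc = 0.0
            for y in range(n):
                for x in range(n):
                    acc += (
                        block[y][x]
                        * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                        * math.cos((2 * x + 1) * v * math.pi / (2 * n))
                    )
            out[u][v] = scale(u) * scale(v) * acc
    return out


# A flat 16x16 channel: all energy ends up in the DC coefficient.
flat = [[100] * 16 for _ in range(16)]
coeffs = dct_2d(thumbnail_8x8(flat))
```

Keeping only a few low-frequency coefficients of coeffs is what makes such a representation compact, since they capture the coarse spatial arrangement of the channel.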
The ColorStructure descriptor captures both color content and information about the spatial arrangement of this color content. Specifically, it is a histogram that counts the number of times a color is present in an 8x8 windowed neighborhood, as this window progresses over the image rows and columns. This enables it to distinguish, for example, between an image in which pixels of each color are distributed uniformly and an image in which the same colors occur in the same proportions, but are located in distinct blocks.
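The windowed counting described above can be sketched as follows. The sketch assumes the image is already quantized to integer color indices and slides the window one pixel at a time; the function name is hypothetical, and the color space, quantization and any subsampling used by the actual descriptor are left out.

```python
def color_structure_histogram(image, num_colors, window=8):
    """Count, for each color index, the windows in which it appears at least once."""
    h, w = len(image), len(image[0])
    hist = [0] * num_colors
    for top in range(h - window + 1):
        for left in range(w - window + 1):
            present = {
                image[y][x]
                for y in range(top, top + window)
                for x in range(left, left + window)
            }
            for color in present:
                hist[color] += 1
    return hist


# Two 16x16 images, each half color 0 and half color 1.
checkerboard = [[(x + y) % 2 for x in range(16)] for y in range(16)]
two_blocks = [[0 if x < 8 else 1 for x in range(16)] for y in range(16)]

scattered = color_structure_histogram(checkerboard, 2)  # [81, 81]
clustered = color_structure_histogram(two_blocks, 2)    # [72, 72]
```

An ordinary pixel-count histogram would be identical for both images (128 pixels of each color), whereas the windowed counts differ because the scattered colors appear in every window and the clustered colors do not.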
The ColorTemperature descriptor describes the perceptual temperature feeling of an image. It targets perception-based image browsing, which enables viewers to navigate and match images based on the temperature perception (i.e. hot, warm, moderate and cool) of an image. This descriptor is also useful when a user would like to change the illumination of a scene (i.e. still images or video) to suit the user’s preference.
The GoFGoPColor descriptor specifies a structure required for representing the color features of a collection of (similar) images or video frames by means of the ScalableColor descriptor. The collection of video frames can be a contiguous video segment or a non-contiguous collection of similar video frames.
IlluminationInvariantColor is a supporting tool in the color description tool group. It is a container and can extend four color descriptors – DominantColor, ScalableColor, ColorLayout and ColorStructure – to support illumination invariant similarity matching.
3 Texture Description Tools
Texture is a powerful low-level descriptor for image search and retrieval. MPEG-7 Visual defines three texture descriptors. HomogeneousTexture provides a quantitative description of homogeneous texture regions based on the local spatial-frequency statistics of the texture. TextureBrowsing specifies the perceptual characterization of a texture which is similar to a human characterization, in terms of regularity, coarseness and directionality. Finally, EdgeHistogram specifies the spatial distribution of five types of edges in local image regions.
The HomogeneousTexture descriptor is designed to characterize the properties of texture in an image (or region), based on the assumption that the texture is homogeneous – i.e., the visual properties of the texture are relatively constant over the region. The descriptive features are extracted from a bank of orientation- and scale-tuned Gabor filters.
The TextureBrowsing descriptor is useful for representing homogeneous texture for browsing type applications. This descriptor, combined with the HomogeneousTexture descriptor, provides a scalable solution to representing homogeneous texture regions in images.
The EdgeHistogram descriptor represents the spatial distribution of five types of edges (four directional edges and one non-directional). It consists of local histograms of these edge directions, which may optionally be aggregated into global or semi-global histograms.
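A block-level edge classification of this kind can be sketched as below. The 2x2 filter coefficients and the threshold of 11 are values commonly quoted in the literature on this descriptor; they are assumptions here, not taken from this text, and the function names are hypothetical.

```python
import math

# Assumed 2x2 filters, applied to the mean intensities of the four
# sub-blocks (a b / c d) of an image block.
EDGE_FILTERS = {
    "vertical": (1, -1, 1, -1),
    "horizontal": (1, 1, -1, -1),
    "diagonal_45": (math.sqrt(2), 0, 0, -math.sqrt(2)),
    "diagonal_135": (0, math.sqrt(2), -math.sqrt(2), 0),
    "non_directional": (2, -2, -2, 2),
}


def classify_block(a, b, c, d, threshold=11):
    """Return the strongest edge type for one block, or None if no edge is found."""
    strengths = {
        name: abs(f[0] * a + f[1] * b + f[2] * c + f[3] * d)
        for name, f in EDGE_FILTERS.items()
    }
    best = max(strengths, key=strengths.get)
    return best if strengths[best] >= threshold else None


def edge_histogram(blocks, threshold=11):
    """Count edge types over the blocks of one local image region."""
    counts = {name: 0 for name in EDGE_FILTERS}
    for a, b, c, d in blocks:
        label = classify_block(a, b, c, d, threshold)
        if label is not None:
            counts[label] += 1
    return counts


local = edge_histogram([(0, 100, 0, 100), (100, 100, 0, 0), (50, 50, 50, 50)])
```

Computing one such histogram per local region, and optionally summing them over larger areas, corresponds to the local, semi-global and global histograms mentioned above.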
4 Shape Description Tools
Shape features relate to the spatial arrangement of points (pixels) belonging to an object or region. Shape descriptors can be divided into two broad classes: 2-dimensional (2D) and 3-dimensional (3D). MPEG-7 Visual defines six shape descriptors. Three descriptors characterize 2D objects or regions: RegionShape captures the distribution of all pixels within a region. ContourShape characterizes the shape properties of a contour of an object. ShapeVariation describes the variation of shape in a collection of binary images of objects. In addition, three descriptors characterize 3D shapes: Shape3D provides an intrinsic characterization of 3D mesh models. Perceptual3DShape provides a part-based representation of a 3D object expressed as a graph. Finally, the MultipleView descriptor combined with a 2D descriptor may also be used. Such a representation is convenient when the 3D model of an object is not known or when support for queries by 2D views of the 3D object is required.
The RegionShape descriptor specifies the region-based shape of an object. The shape of an object may consist of either a single region or a set of regions, as well as some holes in the object. Since the region-based descriptor makes use of all pixels making up the shape, it can describe any complex shape. The region-based shape descriptor utilizes a set of ART (Angular Radial Transform) coefficients.
The ContourShape descriptor specifies a closed contour of a 2D object or region in an image or video sequence. The object contour-based shape descriptor is based on the Curvature Scale Space (CSS) representation of the contour. This representation of contour shape is very compact, with an average size of below 14 bytes.
The Shape3D descriptor provides an intrinsic shape description for 3D mesh models, by exploiting some local attributes of the 3D surface.
The Perceptual3DShape descriptor is a part-based representation of a 3D object expressed as a graph. Such a representation facilitates object description consistent with human perception. The Perceptual3DShape descriptor supports functionalities like “Query by sketch” and “Query by editing”, which would make a content-based retrieval system more interactive and efficient in querying and retrieving similar 3D objects.
5 Motion Description Tools
Motion in a sequence of 2D images can be induced by camera motion, object motion, or both. MPEG-7 Visual defines four descriptors to characterize various aspects of motion: CameraMotion specifies a set of basic camera operations such as panning and tilting. Motion of a key point (pixel) from a moving object or region can be characterized by the MotionTrajectory descriptor. The ParametricMotion descriptor characterizes an evolution of an arbitrarily shaped region over time in terms of 2D geometric transformations. Finally, MotionActivity captures the pace and motion in the sequence as perceived by the viewer. All motion descriptors with the exception of CameraMotion can be applied to arbitrarily shaped regions.
The CameraMotion descriptor characterizes 3D camera motion parameters. It is based on 3D camera motion parameter information, which can be automatically extracted or generated by capture devices. The descriptor supports the following well-known basic camera operations: fixed, panning, tracking, tilting, booming, zooming, dollying, and rolling.
The motion trajectory of an object is a simple, high-level feature, defined as the localization, in time and space, of one representative point of this object. The MotionTrajectory descriptor is useful for content-based retrieval in object-oriented visual databases.
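A trajectory of this kind can be pictured as a short list of timed key points, with positions in between obtained by interpolation. The sketch below assumes linear interpolation and a hypothetical (time, x, y) tuple layout; other interpolation methods are possible and the actual syntax of the descriptor is not reproduced.

```python
def position_at(keypoints, t):
    """Linearly interpolate the representative point at time t.

    keypoints is a list of (time, x, y) tuples with strictly increasing times.
    Times outside the covered interval are clamped to the first or last point.
    """
    if not keypoints:
        raise ValueError("at least one key point is required")
    if t <= keypoints[0][0]:
        return keypoints[0][1], keypoints[0][2]
    for (t0, x0, y0), (t1, x1, y1) in zip(keypoints, keypoints[1:]):
        if t0 <= t <= t1:
            w = (t - t0) / (t1 - t0)
            return x0 + w * (x1 - x0), y0 + w * (y1 - y0)
    return keypoints[-1][1], keypoints[-1][2]


trajectory = [(0, 0, 0), (10, 100, 50), (20, 100, 0)]
midpoint = position_at(trajectory, 5)   # (50.0, 25.0)
```

Storing only a few key points and an interpolation rule keeps the description small while still allowing the position to be queried at any time.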
In the ParametricMotion descriptor, a parametric model is associated with arbitrary (foreground or background) objects, defined as regions (groups of pixels) in the image over a specified time interval. Such an approach leads to a very efficient description of several types of motion, including simple translation, rotation and zoom, or more complex motions such as combinations of these elementary motions.
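How a handful of parameters can describe translation, rotation, zoom and their combinations can be sketched with a six-parameter 2D affine model. The tuple layout and function names are assumptions made for illustration; the set of models and the parameter encoding used by the descriptor are not reproduced here.

```python
import math


def translation(dx, dy):
    return (1.0, 0.0, 0.0, 1.0, dx, dy)


def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return (c, -s, s, c, 0.0, 0.0)


def zoom(factor):
    return (factor, 0.0, 0.0, factor, 0.0, 0.0)


def apply(model, point):
    """Map (x, y) to (a*x + b*y + tx, c*x + d*y + ty)."""
    a, b, c, d, tx, ty = model
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def compose(second, first):
    """Return the single model equivalent to applying first, then second."""
    a1, b1, c1, d1, tx1, ty1 = first
    a2, b2, c2, d2, tx2, ty2 = second
    return (
        a2 * a1 + b2 * c1,
        a2 * b1 + b2 * d1,
        c2 * a1 + d2 * c1,
        c2 * b1 + d2 * d1,
        a2 * tx1 + b2 * ty1 + tx2,
        c2 * tx1 + d2 * ty1 + ty2,
    )


# Zoom by 2, then shift 10 pixels to the right: still only six numbers.
zoom_then_shift = compose(translation(10, 0), zoom(2))
moved = apply(zoom_then_shift, (1, 1))   # (12.0, 2.0)
```

Because a composition of affine models is again an affine model, the motion of a whole region over an interval can be stored as six numbers regardless of how many pixels the region contains.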
The MotionActivity descriptor captures the intuitive notion of ‘intensity of action’ or ‘pace of action’ in a video segment. It is useful for applications such as video re-purposing, surveillance, fast browsing, dynamic video summarization and content-based querying.
6 Localization Description Tools
The localization description tools can be used to indicate arbitrarily shaped regions of interest in the spatial (RegionLocator) and spatio-temporal (SpatioTemporalLocator) domains.
7 Face Identity Description Tools
The face is one of the most important semantic elements in visual content. MPEG-7 Visual specifies two description tools to describe facial identity: FaceRecognition and AdvancedFaceRecognition. The difference between the two is that AdvancedFaceRecognition provides additional robustness against variations in pose and illumination conditions.
8 Image Signature Tools
Date: January 2011
Authors: Miroslaw Bober and Stavros Paschalakis
In recent years people have been generating and distributing an ever increasing amount of image data. A recent survey of prominent web sites shows that Flickr has over 2 billion images, Photobucket has over 4 billion and Facebook has 1.7 billion. There are hundreds of billions of images on the Internet, and even users’ personal databases can have tens or hundreds of thousands of images. At the same time, there are few tools which one can use to efficiently identify or search for a specific image, possibly in an edited or modified form, either on the Internet or in one’s own personal collection. The MPEG-7 Image Signature Tools address this problem by providing an interoperable solution for image identification.
In contrast to previous MPEG-7 visual descriptors which were designed to provide access to similar content, the Image Signature is a content-based descriptor designed specifically for image identification, i.e. designed for the fast and robust identification of the same or modified image in web-scale or personal databases. Such a type of descriptor is also commonly known as a fingerprint and has a strong advantage over watermarking techniques in that it does not require any modification of the content and can be used readily with all existing content.
The applications for the MPEG-7 Image Signature are numerous. These include media usage monitoring, e.g. tracking and recording statistics such as distribution and frequency of content usage; web-page linking, e.g. using images to imply links between web-pages, as is currently done for text; rights management and monetization, e.g. detection of possible copyright infringement or content monetization online (for content owners) or identification of the copyright owner (for content consumers); and personal image collection management and de-duplication.
The Image Signature is the result of extensive collaborative effort within MPEG-7, aiming at the delivery of an optimised interoperable image identification solution. In order to achieve fast searching and high robustness, the Image Signature combines two complementary approaches in image representation: a global signature, where the signature is extracted from the entire image, and a local approach, where a set of local signatures are extracted at salient points in the image. In terms of content identification performance, the MPEG-7 evaluation process tested the robustness of the Image Signature to a wide range of common modifications, such as text/logo overlay, rotation, cropping, colour changes, etc., and achieved an overall success rate of ~99.29% at a false alarm rate of less than 0.05 parts-per-million for the global signature, and ~98.04% at a false alarm rate of less than 10 parts-per-million for the complete signature. These success rates are achieved with search speeds in the order of 80 million and 100,000 matches per second for the global and complete signatures respectively. The Image Signature is also extremely compact, at only 1024 bits per image for the global signature and up to 7424 bits for the complete signature.
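The practical meaning of the size and speed figures above can be checked with some simple arithmetic. The sketch below assumes an exhaustive linear scan at the quoted matching rates and treats those rates as order-of-magnitude values; the function names are hypothetical.

```python
GLOBAL_SIGNATURE_BITS = 1024          # per image, from the text
COMPLETE_SIGNATURE_MAX_BITS = 7424    # per image, upper bound from the text
GLOBAL_MATCHES_PER_SECOND = 80_000_000
COMPLETE_MATCHES_PER_SECOND = 100_000


def storage_bytes(num_images, bits_per_image):
    """Total signature storage for a collection, in bytes."""
    return num_images * bits_per_image // 8


def scan_seconds(num_images, matches_per_second):
    """Time to compare one query against every image (assumed linear scan)."""
    return num_images / matches_per_second


personal = 100_000          # a large personal collection
web_scale = 2_000_000_000   # the order of the Flickr figure quoted earlier

personal_storage = storage_bytes(personal, GLOBAL_SIGNATURE_BITS)    # 12.8 MB
web_storage = storage_bytes(web_scale, GLOBAL_SIGNATURE_BITS)        # 256 GB
web_scan = scan_seconds(web_scale, GLOBAL_MATCHES_PER_SECOND)        # 25 s
personal_full = scan_seconds(personal, COMPLETE_MATCHES_PER_SECOND)  # 1 s
```

Under these assumptions the global signature of a 100,000-image collection fits in about 12.8 MB, and a full pass over two billion global signatures takes on the order of 25 seconds.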
The MPEG-7 Image Signature Tools comprise four amendments to the MPEG-7 standard, specifying the extraction, decoding and syntax of the Image Signature, providing a reference software implementation, specifying the conformance conditions and dataset, and describing the Image Signature matching procedure that was used in the MPEG-7 evaluation process. These resources will ease the development of systems that comply with the standard, and it is anticipated that the Image Signature Tools will find wide adoption in image identification applications.
 ISO/IEC 15938-3:2002/AMD 3:2009, Information Technology – Multimedia content description interface – Part 3: Visual, Amendment 3: Image signature tools
 ISO/IEC 15938-6:2003/AMD 3:2010, Information Technology – Multimedia content description interface – Part 6: Reference software, Amendment 3: Reference software for image signature tools
 ISO/IEC 15938-7:2003/AMD 5:2010, Information Technology – Multimedia content description interface – Part 7: Conformance testing, Amendment 5: Conformance testing for image signature tools
 ISO/IEC 15938-8:2002/AMD 5:2010, Information Technology – Multimedia content description interface – Part 8: Extraction and matching of image signature tools
9 Video Signature Tools
MPEG doc#: N11824
Date: January 2011
Authors: Stavros Paschalakis and Miroslaw Bober
The amount of video content generated and consumed by users has been increasing at a spectacular pace in recent years. In 2010, according to figures provided by the company itself, people were uploading hundreds of thousands of videos daily on YouTube, at a rate of 24 hours of content every minute, and watching 2 billion videos a day. Despite the vast amount of video data in existence, there are few tools which one can use to efficiently identify or search for a specific piece of video content, possibly in an edited or modified form, either on the Internet or in one’s own personal collection. The recently standardised MPEG-7 Video Signature Tools address this problem by providing an interoperable solution for video content identification.
Unlike previous MPEG-7 visual descriptors which were designed to provide access to similar content, the Video Signature is a content-based descriptor designed specifically for content identification, i.e. designed for the fast and robust identification of the same or modified video content in web-scale or personal databases. Descriptors such as the Video Signature are also commonly known as fingerprints and have a strong advantage over watermarking techniques in that they do not require any modification of the content and can be used readily with all existing content.
The applications for the MPEG-7 Video Signature are numerous, including media usage monitoring, e.g. tracking and recording statistics such as distribution and frequency of content usage; web-page linking, e.g. using video content to imply links between web-pages, as is currently done for text; rights management and monetization, e.g. detection of possible copyright infringement or content monetization online (for content owners) or identification of the copyright owner (for content consumers); and personal video collection management and de-duplication.
The newly standardised Video Signature is the result of extensive collaborative effort within the MPEG-7 group of experts, with the common aim of delivering an optimised interoperable video content identification solution. Key technical aspects of the MPEG-7 Video Signature include a combined dense (video-frame-level) and sparse (video-segment-level) description approach, allowing flexible multi-stage matching schemes, and a custom descriptor compression scheme, to facilitate efficient storage and transmission of the Video Signature metadata. In terms of content identification performance, the MPEG-7 evaluation process tested the robustness of the Video Signature to a wide range of common modifications, such as text/logo overlay, camera capture (camcording), compression at low bitrates, resolution reduction, frame rate changes, etc., and achieved an overall success rate of ~95.49% at a false alarm rate of less than 5 parts-per-million. The Video Signature has also been designed to allow very high extraction and matching speeds and has very low storage and transmission requirements, at only ~2MB per hour of video content.
The MPEG-7 Video Signature Tools comprise four amendments to the MPEG-7 standard, specifying the extraction, decoding and syntax of the Video Signature, providing a reference software implementation, specifying the conformance conditions and dataset, and describing the Video Signature matching procedure that was used in the MPEG-7 evaluation process. These resources will ease the development of systems that comply with the standard, and it is anticipated that the Video Signature Tools will find wide adoption in video content identification applications.
 ISO/IEC 15938-3:2002/AMD 4:2010, Information Technology – Multimedia content description interface – Part 3: Visual, Amendment 4: Video signature tools
 ISO/IEC 15938-6:2003/FPDAM 4, Information Technology – Multimedia content description interface – Part 6: Reference software, Amendment 4: Reference software for video signature tools
 ISO/IEC 15938-7:2003/FPDAM 6, Information Technology – Multimedia content description interface – Part 7: Conformance testing, Amendment 6: Conformance testing for video signature tools
 ISO/IEC 15938-8:2002/DAM 6, Information Technology – Multimedia content description interface – Part 8: Extraction and matching of video signature tools