INTERNATIONAL ORGANISATION FOR STANDARDIZATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC1/SC29/WG11
CODING OF MOVING PICTURES AND AUDIO
MPEG2008/N9769
April 2008, Archamps, FR
Source: MPEG Video Sub-Group
Status: Approved
Title: Overview of MPEG-7 Visual Description Tools
Overview of MPEG-7 Visual Description Tools
This group consists of six supporting tools for the MPEG-7 visual descriptors of elementary features such as color, texture and shape. The tools fall into two categories: descriptor containers and basic supporting tools. The first category consists of four datatypes: GridLayout, which provides efficient representations of visual features on grids; VisualTimeSeries, which represents temporal arrays of several descriptions; MultipleView, which describes a 3D object using several pictures captured from different view angles; and GoFGoPFeature, which describes certain visual features representative of a series of video frames or a collection of pictures. The second category contains two basic supporting tools: Spatial2DCoordinateSystem, which specifies the 2D coordinate system, and TemporalInterpolation, which indicates the interpolation method between two samples on a time axis.
Color is the most basic attribute of visual content. MPEG-7 Visual defines various color descriptors and supporting tools. Five color descriptors represent different aspects of color: DominantColor for representative colors, ScalableColor for basic color distribution, ColorLayout for global spatial distribution of colors, ColorStructure for local spatial distribution of colors, and ColorTemperature for the perceptual temperature feeling of an image. In addition, the GoFGoPColor descriptor is defined as an extension of ScalableColor to groups of frames or pictures. Finally, three supporting tools are defined: ColorSpace, ColorQuantization, and IlluminationInvariantColor, the last of which provides the means to achieve illumination invariance with existing color descriptors. All the descriptors and tools are applicable to arbitrarily shaped regions.
The DominantColor descriptor characterizes an image or region by a small number of representative colors. These are selected by quantizing pixel colors into (up to eight) principal clusters. The description then consists of the fraction of the image or region represented by each color cluster and the variance of each one. A measure of overall spatial coherency of the clusters is also defined. This descriptor provides a very compact description of the representative colors in an image.
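The clustering idea behind DominantColor can be illustrated with a minimal sketch. It assumes plain k-means on RGB tuples, which is a stand-in for illustration and not the extraction procedure of the standard; the function name `dominant_colors` and its parameters are hypothetical, and the spatial coherency measure is omitted.

```python
import random

def dominant_colors(pixels, max_clusters=8, iterations=10, seed=0):
    """Cluster RGB pixels; return (mean color, fraction, per-channel variance)
    tuples sorted by decreasing fraction. Illustrative k-means only."""
    rng = random.Random(seed)
    distinct = sorted(set(pixels))
    k = min(max_clusters, len(distinct))  # never more clusters than colors
    centroids = [tuple(float(c) for c in p) for p in rng.sample(distinct, k)]

    def assign():
        # Put every pixel into the group of its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in pixels:
            best = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[i])))
            groups[best].append(p)
        return groups

    for _ in range(iterations):
        for i, group in enumerate(assign()):
            if group:
                centroids[i] = tuple(sum(ch) / len(group) for ch in zip(*group))

    description = []
    for group in assign():
        if not group:
            continue
        mean = tuple(sum(ch) / len(group) for ch in zip(*group))
        variance = tuple(
            sum((p[c] - mean[c]) ** 2 for p in group) / len(group)
            for c in range(3))
        description.append((mean, len(group) / len(pixels), variance))
    return sorted(description, key=lambda d: d[1], reverse=True)

# Toy region: six red pixels and two blue pixels.
region = [(255, 0, 0)] * 6 + [(0, 0, 255)] * 2
summary = dominant_colors(region)
```

For the toy region, the description holds two entries: red covering three quarters of the region and blue covering one quarter, each with zero variance.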
The ScalableColor descriptor is a color histogram in the HSV color space, which is encoded by a Haar transform. It has a binary representation that is scalable, in terms of bin numbers and bit representation accuracy, over a broad range of granularity. Retrieval accuracy can therefore be balanced against descriptor size. Inversion of the Haar transform is not necessary for performing descriptor comparisons, since similarity matching is also effective in the transform domain.
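A rough sketch of the two steps behind ScalableColor, an HSV histogram followed by a Haar transform, is given below. The bin counts, the unnormalized sum/difference form of the transform, and the L1 matching function are assumptions chosen for brevity; the quantization and coefficient coding of the standard are not reproduced.

```python
import colorsys

def hsv_histogram(pixels, h_bins=4, s_bins=2, v_bins=2):
    """Count 8-bit RGB pixels into a uniform HSV histogram (16 bins by default)."""
    hist = [0] * (h_bins * s_bins * v_bins)
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        hi = min(int(h * h_bins), h_bins - 1)
        si = min(int(s * s_bins), s_bins - 1)
        vi = min(int(v * v_bins), v_bins - 1)
        hist[(hi * s_bins + si) * v_bins + vi] += 1
    return hist

def haar_transform(values):
    """Unnormalized 1D Haar transform: pairwise sums and differences, applied
    recursively to the sums. The input length must be a power of two."""
    coeffs = list(values)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        sums = [coeffs[2 * i] + coeffs[2 * i + 1] for i in range(half)]
        diffs = [coeffs[2 * i] - coeffs[2 * i + 1] for i in range(half)]
        coeffs[:n] = sums + diffs
        n = half
    return coeffs

def l1_distance(a, b):
    """Compare two descriptions directly in the transform domain."""
    return sum(abs(x - y) for x, y in zip(a, b))

red_image = [(255, 0, 0)] * 10
blue_image = [(0, 0, 255)] * 10
red_coeffs = haar_transform(hsv_histogram(red_image))
blue_coeffs = haar_transform(hsv_histogram(blue_image))
distance = l1_distance(red_coeffs, blue_coeffs)
```

Because the coarse sums come first in the coefficient list, keeping only a prefix of the coefficients corresponds to a histogram with fewer bins, which is one way to picture the scalability described above.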
The ColorLayout descriptor represents the spatial layout of color images in a very compact form. It is based on generating a tiny (8x8) thumbnail of an image, which is encoded via DCT and quantized. Besides supporting efficient visual matching, it offers a quick way to visualize the appearance of an image: inverting the DCT reconstructs an approximation of the thumbnail.
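The thumbnail-plus-DCT idea of ColorLayout can be sketched for a single channel as follows. Block averaging for the thumbnail and an orthonormal DCT-II are assumptions made for illustration; color space conversion, coefficient selection and quantization are left out, and all function names are hypothetical.

```python
import math

def thumbnail_8x8(image):
    """Average a grayscale image (list of rows) down to an 8x8 grid."""
    h, w = len(image), len(image[0])
    sums = [[0.0] * 8 for _ in range(8)]
    counts = [[0] * 8 for _ in range(8)]
    for y in range(h):
        for x in range(w):
            by, bx = y * 8 // h, x * 8 // w
            sums[by][bx] += image[y][x]
            counts[by][bx] += 1
    return [[sums[i][j] / counts[i][j] if counts[i][j] else 0.0
             for j in range(8)] for i in range(8)]

def _basis(i, k, n):
    """Orthonormal DCT-II basis value for sample i and frequency k."""
    alpha = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
    return alpha * math.cos((2 * i + 1) * k * math.pi / (2 * n))

def dct_2d(block):
    """Forward 2D DCT of a square block."""
    n = len(block)
    return [[sum(block[y][x] * _basis(y, u, n) * _basis(x, v, n)
                 for y in range(n) for x in range(n))
             for v in range(n)] for u in range(n)]

def idct_2d(coeffs):
    """Inverse 2D DCT, used here to visualize the stored thumbnail."""
    n = len(coeffs)
    return [[sum(coeffs[u][v] * _basis(y, u, n) * _basis(x, v, n)
                 for u in range(n) for v in range(n))
             for x in range(n)] for y in range(n)]

# 16x16 horizontal gradient as a toy image.
image = [[x * 16 for x in range(16)] for _ in range(16)]
thumb = thumbnail_8x8(image)
coeffs = dct_2d(thumb)
restored = idct_2d(coeffs)
```

In this sketch `restored` matches `thumb` up to rounding error because no coefficients are discarded; dropping or coarsely quantizing the high-frequency entries of `coeffs` would trade reconstruction quality for a smaller description.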
The ColorStructure descriptor captures both color content and information about the spatial arrangement of this color content. Specifically, it is a histogram that counts the number of times a color is present in an 8x8 windowed neighborhood, as this window progresses over the image rows and columns. This enables it to distinguish, for example, between an image in which pixels of each color are distributed uniformly and an image in which the same colors occur in the same proportions, but are located in distinct blocks.
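The windowed counting can be sketched as below, assuming the image has already been quantized to integer color indices and that the window advances one pixel at a time; the subsampling and bin quantization rules of the standard are not modeled, and the function name is hypothetical.

```python
def color_structure_histogram(image, num_colors, window=8):
    """For each color index, count the window positions in which that color
    appears at least once. `image` is a list of rows of color indices."""
    h, w = len(image), len(image[0])
    hist = [0] * num_colors
    for top in range(h - window + 1):
        for left in range(w - window + 1):
            present = {image[y][x]
                       for y in range(top, top + window)
                       for x in range(left, left + window)}
            for color in present:
                hist[color] += 1
    return hist

# Two 16x16 images with identical color proportions (128 pixels each).
checkerboard = [[(x + y) % 2 for x in range(16)] for y in range(16)]
two_blocks = [[0 if x < 8 else 1 for x in range(16)] for _ in range(16)]

scattered = color_structure_histogram(checkerboard, num_colors=2)
clustered = color_structure_histogram(two_blocks, num_colors=2)
```

An ordinary histogram gives the same result for both toy images, whereas the windowed counts differ: every one of the 81 window positions sees both colors in the checkerboard, while in the two-block image each color is missing from the windows that lie entirely inside the other block.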
The ColorTemperature descriptor describes the perceptual temperature feeling of an image. It targets perception-based image browsing, which enables viewers to navigate and match images based on their perceived temperature (hot, warm, moderate or cool). The descriptor is also useful when a user would like to change the illumination of a scene (in still images or video) to suit the user’s preference.
The GoFGoPColor descriptor specifies a structure required for representing the color features of a collection of (similar) images or video frames by means of the ScalableColor descriptor. The collection of video frames can be a contiguous video segment or a non-contiguous collection of similar video frames.
IlluminationInvariantColor is a supporting tool in the color description tool group. It is a container and can extend four color descriptors – DominantColor, ScalableColor, ColorLayout and ColorStructure – to support illumination invariant similarity matching.
Texture is a powerful low-level descriptor for image search and retrieval. MPEG-7 Visual defines three texture descriptors. HomogeneousTexture provides a quantitative description of homogeneous texture regions based on the local spatial-frequency statistics of the texture. TextureBrowsing specifies the perceptual characterization of a texture which is similar to a human characterization, in terms of regularity, coarseness and directionality. Finally, EdgeHistogram specifies the spatial distribution of five types of edges in local image regions.
The HomogeneousTexture descriptor is designed to characterize the properties of texture in an image (or region), based on the assumption that the texture is homogeneous – i.e., the visual properties of the texture are relatively constant over the region. The descriptive features are extracted from a bank of orientation- and scale-tuned Gabor filters.
The TextureBrowsing descriptor is useful for representing homogeneous texture in browsing-type applications. Combined with the HomogeneousTexture descriptor, it provides a scalable solution for representing homogeneous texture regions in images.
The EdgeHistogram descriptor represents the spatial distribution of five types of edges (four directional edges and one non-directional). It consists of local histograms of these edge directions, which may optionally be aggregated into global or semi-global histograms.
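A simplified sketch of local edge histograms follows. The 2x2 filter coefficients are the ones commonly cited for this descriptor and are treated here as an assumption, as are the threshold value, the use of single pixels in place of averaged sub-blocks, and the 4x4 grid of local histograms.

```python
import math

# Filters applied to (top_left, top_right, bottom_left, bottom_right).
EDGE_FILTERS = {
    "vertical": (1, -1, 1, -1),
    "horizontal": (1, 1, -1, -1),
    "diagonal_45": (math.sqrt(2), 0, 0, -math.sqrt(2)),
    "diagonal_135": (0, math.sqrt(2), -math.sqrt(2), 0),
    "non_directional": (2, -2, -2, 2),
}
EDGE_NAMES = list(EDGE_FILTERS)

def classify_block(block, threshold=11):
    """Return the strongest edge type of a 2x2 block, or None if no filter
    response reaches the (hypothetical) threshold."""
    strengths = {name: abs(sum(a * f for a, f in zip(block, filt)))
                 for name, filt in EDGE_FILTERS.items()}
    best = max(strengths, key=strengths.get)
    return best if strengths[best] >= threshold else None

def edge_histogram(image, grid=4, threshold=11):
    """Build grid x grid local histograms with one bin per edge type."""
    h, w = len(image), len(image[0])
    hist = [[[0] * len(EDGE_NAMES) for _ in range(grid)] for _ in range(grid)]
    for y in range(0, h - 1, 2):
        for x in range(0, w - 1, 2):
            block = (image[y][x], image[y][x + 1],
                     image[y + 1][x], image[y + 1][x + 1])
            kind = classify_block(block, threshold)
            if kind is not None:
                hist[y * grid // h][x * grid // w][EDGE_NAMES.index(kind)] += 1
    return hist

# 8x8 image of vertical stripes: every 2x2 block contains a vertical edge.
stripes = [[0, 255] * 4 for _ in range(8)]
local = edge_histogram(stripes)
```

Summing the local histograms over all cells, or over rows and columns of cells, would give the global and semi-global histograms mentioned above.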
Shape features relate to the spatial arrangement of points (pixels) belonging to an object or region. Shape descriptors can be divided into two broad classes: 2-dimensional (2D) and 3-dimensional (3D). MPEG-7 Visual defines six shape descriptors. Three descriptors characterize 2D objects or regions: RegionShape captures the distribution of all pixels within a region. ContourShape characterizes the shape properties of a contour of an object. ShapeVariation describes the variation of shape in a collection of binary images of objects. In addition, three descriptors characterize 3D shapes: Shape3D provides an intrinsic characterization of 3D mesh models. Perceptual3DShape provides a part-based representation of a 3D object expressed as a graph. Finally, the MultipleView descriptor combined with a 2D descriptor may also be used. Such a representation is convenient when the 3D model of an object is not known or when support for queries by 2D views of the 3D object is required.
The RegionShape descriptor specifies the region-based shape of an object. The shape of an object may consist of either a single region or a set of regions, and may also contain holes. Since the region-based descriptor makes use of all pixels making up the shape, it can describe any complex shape. The region-based shape descriptor utilizes a set of ART (Angular Radial Transform) coefficients.
The ContourShape descriptor specifies a closed contour of a 2D object or region in an image or video sequence. The object contour-based shape descriptor is based on the Curvature Scale Space (CSS) representation of the contour. This representation of contour shape is very compact, with an average size of below 14 bytes.
The Shape3D descriptor provides an intrinsic shape description for 3D mesh models, by exploiting some local attributes of the 3D surface.
The Perceptual3DShape descriptor is a part-based representation of a 3D object expressed as a graph. Such a representation facilitates object description consistent with human perception. The Perceptual3DShape descriptor supports functionalities like “Query by sketch” and “Query by editing”, which would make a content-based retrieval system more interactive and efficient in querying and retrieving similar 3D objects.
Motion in a sequence of 2D images can be induced by camera motion, object motion, or both. MPEG-7 Visual defines four descriptors to characterize various aspects of motion: CameraMotion specifies a set of basic camera operations such as panning and tilting. Motion of a key point (pixel) from a moving object or region can be characterized by the MotionTrajectory descriptor. The ParametricMotion descriptor characterizes an evolution of an arbitrarily shaped region over time in terms of 2D geometric transformations. Finally, MotionActivity captures the pace and motion in the sequence as perceived by the viewer. All motion descriptors with the exception of CameraMotion can be applied to arbitrarily shaped regions.
The CameraMotion descriptor characterizes 3D camera motion parameters. It is based on 3D camera motion parameter information, which can be automatically extracted or generated by capture devices. The descriptor supports the following well-known basic camera operations: fixed, panning, tracking, tilting, booming, zooming, dollying, and rolling.
The motion trajectory of an object is a simple, high-level feature, defined as the localization, in time and space, of one representative point of this object. The MotionTrajectory descriptor is useful for content-based retrieval in object-oriented visual databases.
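A trajectory of one representative point can be pictured as a list of timed key points with interpolation in between. The sketch below assumes linear interpolation and a hypothetical `(time, x, y)` tuple layout; it does not reflect the actual syntax of the descriptor or of TemporalInterpolation.

```python
from bisect import bisect_right

def position_at(keypoints, t):
    """Return the (x, y) position at time t for key points sorted by time,
    interpolating linearly and clamping outside the covered interval."""
    times = [k[0] for k in keypoints]
    if t <= times[0]:
        return keypoints[0][1:]
    if t >= times[-1]:
        return keypoints[-1][1:]
    i = bisect_right(times, t)  # first key point strictly after t
    (t0, x0, y0), (t1, x1, y1) = keypoints[i - 1], keypoints[i]
    a = (t - t0) / (t1 - t0)
    return (x0 + a * (x1 - x0), y0 + a * (y1 - y0))

# Object moves right and down, then straight down.
trajectory = [(0, 0, 0), (10, 100, 50), (20, 100, 150)]
midway = position_at(trajectory, 5)    # (50.0, 25.0)
later = position_at(trajectory, 15)    # (100.0, 100.0)
```

Storing only a few key points and an interpolation rule keeps the description small while still allowing queries such as where an object was at a given time.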
In the ParametricMotion descriptor, the parametric model is associated with arbitrary (foreground or background) objects, defined as regions (groups of pixels) in the image over a specified time interval. This approach leads to a very efficient description of several types of motion, including simple translation, rotation and zoom, as well as more complex motions formed by combining these elementary motions.
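How a handful of parameters can cover translation, rotation and zoom is illustrated below with a six-parameter affine displacement model. The choice of the affine model, the parameter ordering, and the helper names are assumptions for this sketch and do not reproduce the models or syntax defined for ParametricMotion.

```python
import math

def displacement(params, x, y):
    """Affine displacement of point (x, y): dx = a1 + a2*x + a3*y,
    dy = a4 + a5*x + a6*y."""
    a1, a2, a3, a4, a5, a6 = params
    return (a1 + a2 * x + a3 * y, a4 + a5 * x + a6 * y)

def move(params, x, y):
    """Position of (x, y) after applying the motion model."""
    dx, dy = displacement(params, x, y)
    return (x + dx, y + dy)

def translation(tx, ty):
    return (tx, 0.0, 0.0, ty, 0.0, 0.0)

def zoom(scale):
    # Scaling about the origin: x' = scale * x, y' = scale * y.
    return (0.0, scale - 1.0, 0.0, 0.0, 0.0, scale - 1.0)

def rotation(theta):
    # Rotation about the origin by theta radians.
    c, s = math.cos(theta), math.sin(theta)
    return (0.0, c - 1.0, -s, 0.0, s, c - 1.0)

shifted = move(translation(3, -2), 10, 20)      # (13.0, 18.0)
enlarged = move(zoom(2.0), 10, 20)              # (20.0, 40.0)
turned = move(rotation(math.pi / 2), 1, 0)      # approximately (0, 1)
# A single parameter set can also mix elementary motions, here zoom plus shift.
combined = move((3.0, 1.0, 0.0, -2.0, 0.0, 1.0), 10, 20)
```

Six numbers describe the motion of every pixel in the region, which is why such a model is far more compact than a per-pixel motion field.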
The MotionActivity descriptor captures the intuitive notion of ‘intensity of action’ or ‘pace of action’ in a video segment. It is useful for applications such as video re-purposing, surveillance, fast browsing, dynamic video summarization, and content-based querying.
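One way to turn 'intensity of action' into a number is to measure the spread of motion vector magnitudes and map it to a small set of levels. The sketch below assumes that approach for the MotionActivity descriptor; the five-level scale and the threshold values are hypothetical placeholders, not values taken from the standard.

```python
import math

def motion_intensity(vectors, thresholds=(1.0, 4.0, 10.0, 20.0)):
    """Return (standard deviation of motion vector magnitudes, level 1..5).
    The thresholds are illustrative assumptions."""
    magnitudes = [math.hypot(dx, dy) for dx, dy in vectors]
    mean = sum(magnitudes) / len(magnitudes)
    std = math.sqrt(sum((m - mean) ** 2 for m in magnitudes) / len(magnitudes))
    level = 1 + sum(std >= t for t in thresholds)
    return std, level

static_shot = [(0, 0)] * 16
action_shot = [(0, 0), (30, 40)] * 8
calm = motion_intensity(static_shot)    # (0.0, 1)
busy = motion_intensity(action_shot)    # (25.0, 5)
```

A coarse level of this kind is cheap to compare, which suits the browsing and summarization uses listed above.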
The localization description tools can be used to indicate arbitrarily shaped regions of interest in the spatial (RegionLocator) and spatio-temporal (SpatioTemporalLocator) domains.
Faces are among the most important semantic elements of visual content. MPEG-7 Visual specifies two description tools for facial identity: FaceRecognition and AdvancedFaceRecognition. The difference between them is that AdvancedFaceRecognition is additionally robust against variations in pose and illumination conditions.