Video Surveillance Application Format (VSAF)

MPEG doc#: N11010

Date: October 2009
Author: Gero Bäse

What it does:

The Video Surveillance AF specifies a format for storing and exchanging surveillance data. It covers video compression, file format definition and metadata, using components of MPEG-4 AVC, the MPEG-4 AVC file format and MPEG-7, as well as UUID metadata constructs.

What it is for:

A data storage and exchange format is defined that is tailored to the needs of the surveillance industry and provides sufficient flexibility for the different applications in that area.

Several levels of metadata representation are provided, supporting both fast automatic processing and human readability. Video data in formats other than AVC can also be referenced.

Description of Technology

Introduction

The Video Surveillance Application Format (VSAF) is intended to provide an initial level of interoperability for video-based surveillance systems. Systems complying with the VSAF specification can exchange and reuse video data together with metadata that give the time and place of recording as well as more detailed descriptions of the content.

Motivation

The implementation of ‘next generation’ video surveillance systems is hindered by significant interoperability deficiencies. Command and control centers must operate across very large sites that are surveyed by a diverse range of equipment, increasingly supplied by different vendors. The introduction of automatic analysis methods has widened the domain in which interoperability is an important issue for system integrators.

Overview of technology

The VSAF definition extends the AVC file format, which in turn extends the ISO Base Media File Format.
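Because VSAF builds on the ISO Base Media File Format, a file is a sequence of boxes, each starting with a 32-bit big-endian size and a four-character type. The following is a minimal sketch of walking that box structure, not a VSAF parser: the helper names `iter_boxes` and `make_box` are hypothetical, and the sample payload bytes are arbitrary placeholders rather than real VSAF content.

```python
import struct
from typing import Iterator, Optional, Tuple


def iter_boxes(data: bytes, start: int = 0,
               end: Optional[int] = None) -> Iterator[Tuple[str, bytes]]:
    """Yield (type, payload) for each box found in data[start:end]."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        # Basic header: 32-bit big-endian size followed by a 4-character type.
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit "largesize" follows the type field.
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:
            # A size of 0 means the box runs to the end of its container.
            size = end - pos
        if size < header or pos + size > end:
            raise ValueError(f"malformed box at offset {pos}")
        yield box_type.decode("latin-1"), data[pos + header:pos + size]
        pos += size


def make_box(box_type: bytes, payload: bytes) -> bytes:
    """Build a box with a 32-bit size field (illustration only)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Placeholder data: a file-type box followed by a movie box holding one track box.
sample = make_box(b"ftyp", b"demo") + make_box(b"moov", make_box(b"trak", b""))
top_level = [box_type for box_type, _ in iter_boxes(sample)]
nested = [box_type for box_type, _ in iter_boxes(dict(iter_boxes(sample))["moov"])]
```

Container boxes such as the movie box hold further boxes in their payload, so the same function can be applied recursively to reach track-level data.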

Each VSAF file can be considered a fragment of a continuous stream of data. Each fragment is given a unique identification using a Universally Unique IDentifier (UUID). Each fragment also stores, as ‘file-level metadata’, the UUID of its predecessor and of its successor. A Uniform Resource Identifier (URI) can be included as well, to describe the access mechanism or location. This fragment identification, together with camera identification information, is provided in binary form to allow very fast machine access and processing.
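The predecessor and successor UUIDs described above effectively form a doubly linked list of fragments. The sketch below shows how a reader might restore recording order from an unordered set of fragments. The `FragmentId` class, its field names and the example URIs are assumptions made for illustration; they do not reproduce the binary layout defined by the specification.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FragmentId:
    """Hypothetical in-memory view of a fragment's file-level identification."""
    own: uuid.UUID
    predecessor: Optional[uuid.UUID]
    successor: Optional[uuid.UUID]
    uri: str


def order_fragments(fragments: List[FragmentId]) -> List[FragmentId]:
    """Return the fragments in stream order by following successor links."""
    by_id = {f.own: f for f in fragments}
    # The head is the fragment whose predecessor is absent from this set.
    heads = [f for f in fragments if f.predecessor not in by_id]
    if len(heads) != 1:
        raise ValueError("expected exactly one fragment without a known predecessor")
    ordered: List[FragmentId] = []
    seen = set()
    current: Optional[FragmentId] = heads[0]
    while current is not None:
        if current.own in seen:
            raise ValueError("successor links form a cycle")
        seen.add(current.own)
        ordered.append(current)
        current = by_id.get(current.successor)
    return ordered


# Three consecutive fragments, supplied out of order.
a, b, c = uuid.uuid4(), uuid.uuid4(), uuid.uuid4()
fragments = [
    FragmentId(c, b, None, "file:///rec/2.mp4"),
    FragmentId(a, None, b, "file:///rec/0.mp4"),
    FragmentId(b, a, c, "file:///rec/1.mp4"),
]
ordered_uris = [f.uri for f in order_fragments(fragments)]
```

A gap in the recording shows up here as more than one head, which is why the sketch rejects that case instead of guessing an order.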

Each VSAF fragment comprises one or more video tracks but requires at least one MPEG-4 AVC track. Movie and Track boxes can be used to provide for further segmentation of the video data. Movie fragments can contain a subset of samples of all tracks including initialization data for presentation. This segmentation provides efficient ‘trick play’ functionality, e.g. fast forward up to the currently recorded frame.

Accompanying the video data, different types of metadata can be added. These metadata are represented in XML format following the MPEG-7 specification. To improve the performance of XML metadata transmission, the BiM (Binary MPEG) format can be used.

Metadata fall into two categories: technical metadata, which describe the equipment and its settings, and observation metadata, which describe scene activity, i.e. the video content.

The metadata are split across two levels: a single file-level document and multiple track-level documents. The file-level metadata encompass the scope covered by the track-level metadata documents.

Every video sample is given an individual time stamp, which is stored in binary form as timed metadata in an extended AVC file format. The time stamp has nanosecond accuracy and is the time the compilation was created. Since the internal camera clocks may drift, a facility to accommodate a time offset is required. This is provided by the offset field within the technical metadata.
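As an illustration of why integer nanoseconds and a separate offset are convenient, the sketch below applies a clock offset to a raw time stamp and formats the result. It assumes, purely for the example, that time stamps count nanoseconds since the Unix epoch and that the offset is added to the raw value; the actual encoding and sign convention are defined by the specification and are not reproduced here. The function names are hypothetical.

```python
from datetime import datetime, timezone

NS_PER_SECOND = 1_000_000_000


def corrected_ns(raw_ns: int, offset_ns: int) -> int:
    """Apply a clock offset (assumed convention: corrected = raw + offset)."""
    return raw_ns + offset_ns


def format_ns(ts_ns: int) -> str:
    """Format nanoseconds since the Unix epoch as UTC text with 9 fractional digits."""
    # Split with integer arithmetic so that no precision is lost to floats.
    seconds, nanos = divmod(ts_ns, NS_PER_SECOND)
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return f"{base:%Y-%m-%dT%H:%M:%S}.{nanos:09d}Z"


# 1254355200 s is 2009-10-01T00:00:00Z; the camera clock is assumed to lag by 250 ns.
raw = 1_254_355_200 * NS_PER_SECOND + 999_999_900
stamp = format_ns(corrected_ns(raw, 250))
```

Keeping the offset separate from the stored time stamps means a drift correction can be revised later without rewriting the per-sample values.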

A video sample can be decomposed into regions, each denoted by a box or polygon. Alternatively, a grid layout with an arbitrary number of cells can be applied. Observation metadata, e.g. the dominant colour, can be added for each cell.
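The grid decomposition can be sketched as follows. This is a simplified illustration: the function names are hypothetical, the frame is a plain nested list of RGB tuples, and "dominant colour" is approximated as the most frequent exact pixel value, which is much cruder than the MPEG-7 dominant colour descriptor.

```python
from collections import Counter
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Cell = Tuple[int, int, int, int]  # (x0, y0, x1, y1), upper bounds exclusive


def grid_cells(width: int, height: int, cols: int, rows: int) -> List[Cell]:
    """Split a width x height frame into cols x rows cells, in row-major order."""
    cells = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            cells.append((x0, y0, x1, y1))
    return cells


def dominant_colour(frame: List[List[Pixel]], cell: Cell) -> Pixel:
    """Return the most frequent pixel value inside the cell (a simplification)."""
    x0, y0, x1, y1 = cell
    counts = Counter(frame[y][x] for y in range(y0, y1) for x in range(x0, x1))
    return counts.most_common(1)[0][0]


RED, GREEN, BLUE = (255, 0, 0), (0, 255, 0), (0, 0, 255)
# A 4x2 frame: mostly red on the left, blue on the right.
frame = [
    [RED, GREEN, BLUE, BLUE],
    [RED, RED, BLUE, BLUE],
]
cells = grid_cells(4, 2, cols=2, rows=1)
colours = [dominant_colour(frame, cell) for cell in cells]
```

Each resulting colour would then be attached to its cell as one item of observation metadata.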

Different users of the Video Surveillance AF (e.g. operators, police forces) will have their own sets of terms to describe scenarios, objects, scene components, etc. Therefore, user-defined classification schemas (types or classes of objects, attributes, relations or events) can be referenced. In addition, object reference descriptions can be used to link observations in different tracks.
