The Moving Picture Experts Group

Fault Management Framework

Standard: 
Part number: 6
Activity status: Closed
Technologies: 

MPEG Multimedia Middleware (M3W) Fault Management

 

MPEG doc#: N8692
Date: October 2006

Authors: Johan Muskens & Jean Gelissen (Philips)

 

1.      Introduction

MPEG, a working group of ISO/IEC, has produced many important international standards (for example MPEG-1, MPEG-2, MPEG-4, MPEG-7, and MPEG-21). MPEG now considers it important to standardize an Application Programming Interface (API) for Multimedia Middleware (M3W), based on a defined set of requirements [1]. This API provides a uniform view of a multimedia middleware platform that can be realized by a number of different vendors.

The M3W specification supports software systems that can be upgraded and extended during their lifetime, possibly at runtime. Software can therefore be developed by multiple parties, and the software configuration can be modified while a device is already owned and used by a consumer. Software from multiple parties and runtime upgrading or extension are major threats to the dependability and reliability of the device.

To give device vendors a means to control the dependability and correct functioning of their devices in such a context, ISO/IEC 23004 Part 6 [2] (Fault Management) specifies an optional framework for fault management. The goal is a system that remains dependable and reliable in the presence of faults. These faults can be introduced by upgrades and extensions outside the control of the device vendor. They can also exist because it is impossible to test all traces and configurations of the complex software systems being built today.

2.      Fault Management Framework

The Fault Management framework approaches dependable software systems by adding fault tolerance mechanisms to a sub-system or service instance using wrappers. A wrapper intercepts all method invocations on the interfaces of the wrapped sub-system or service instance. It also intercepts all calls that the wrapped sub-system or service instance makes on its required interfaces. In other words, the wrapper can observe the externally visible behaviour of the wrapped sub-system or service instance.
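The interception on both sides of a wrapped service can be illustrated with a minimal Python sketch. It is not the M3W API: `CallLog`, `InterceptingProxy`, `Storage`, `Decoder`, and `wrap_decoder` are hypothetical names chosen for illustration, and the wrapper here only records what it observes.

```python
class CallLog:
    """Records every call the wrapper observes (hypothetical helper)."""

    def __init__(self):
        self.events = []

    def record(self, direction, name, args):
        self.events.append((direction, name, args))


class InterceptingProxy:
    """Forwards attribute access to a target and logs each method call."""

    def __init__(self, target, direction, log):
        self._target = target
        self._direction = direction
        self._log = log

    def __getattr__(self, name):
        # Only reached for names not found on the proxy itself.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def intercepted(*args, **kwargs):
            self._log.record(self._direction, name, args)
            return attr(*args, **kwargs)

        return intercepted


class Storage:
    """Stand-in for an interface that the service requires."""

    def read(self, key):
        return f"data:{key}"


class Decoder:
    """Stand-in for the wrapped service; it calls out to Storage."""

    def __init__(self, storage):
        self.storage = storage

    def decode(self, key):
        return self.storage.read(key).upper()


def wrap_decoder(log):
    # Calls made BY the service go through the "required" proxy ...
    storage = InterceptingProxy(Storage(), "required", log)
    # ... and calls made ON the service go through the "provided" proxy.
    return InterceptingProxy(Decoder(storage), "provided", log)
```

Because the service only ever sees proxies, the log captures its externally visible behaviour without any knowledge of how `Decoder` is implemented.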

Dependability is increased by adding a number of fault tolerance mechanisms to the wrapper, which can insert logic before, after, or around the interactions with the wrapped entity. These mechanisms are usually implemented through standard but customizable building blocks that perform either error detection or error recovery. An illustrative, and by no means complete, set of building blocks is given next.

  • Error detection building blocks
      • Timeout (watchdog) can be used to detect that the service is taking too much time to execute (even though it would finish eventually), that it has entered an infinite loop, or that it is blocked (on a resource, interlocked with other services, etc.).
      • Hop counter can be used to specify the maximum call depth. This mechanism can detect that the system is running out of stack space.
      • Buffer overflow / memory invariant checks can perform a memory CRC at the start and end of buffers (provided these buffers are passed as arguments to the interface function) before calling the untrusted service. The same check is performed when the service returns.
      • Pre- and postcondition mechanisms can check method parameters before and after a method call, respectively. A condition can be specified per parameter. If a condition is not met, the mechanism may override the parameter with a default value. Alternatively, the violation may be treated as an error to be corrected by one of the recovery building blocks.
  • Error recovery building blocks
      • Retry is a mechanism for hiding temporary errors. It encapsulates a method with functionality that is transparent to the client. The mechanism tries the method it encapsulates up to a specified maximum number of times before giving up.
      • Service reactivation instantiates a fresh instance of a service. The wrapper stops using the current instance and rebinds to the new one.
      • Check-pointing and backward recovery
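Three of these building blocks (hop counter, precondition with default override, and retry) can be sketched as Python decorators. This is an illustration under stated assumptions, not the mechanism defined by the standard: the names `FaultDetected`, `hop_counter`, `precondition`, `retry`, `descend`, and `set_volume` are hypothetical, the hop counter is not thread-safe, and `retry` assumes `max_attempts` is at least 1.

```python
import functools


class FaultDetected(Exception):
    """Raised by a detection block when it observes an error (hypothetical)."""


def hop_counter(max_depth):
    """Detection: reject calls nested deeper than max_depth."""
    def decorate(func):
        depth = 0  # shared by all calls; not thread-safe in this sketch

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal depth
            if depth >= max_depth:
                raise FaultDetected(f"call depth exceeds {max_depth}")
            depth += 1
            try:
                return func(*args, **kwargs)
            finally:
                depth -= 1
        return wrapper
    return decorate


def precondition(check, default):
    """Detection: override a first argument that fails check with default."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(value, *args, **kwargs):
            if not check(value):
                value = default
            return func(value, *args, **kwargs)
        return wrapper
    return decorate


def retry(max_attempts):
    """Recovery: invoke the method up to max_attempts times, then give up."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as error:
                    last_error = error
            raise last_error  # every attempt failed
        return wrapper
    return decorate


@hop_counter(max_depth=3)
def descend(n):
    # Recursion goes through the decorated name, so each level is counted.
    return 0 if n == 0 else 1 + descend(n - 1)


@precondition(check=lambda volume: 0 <= volume <= 100, default=50)
def set_volume(volume):
    return volume
```

With these definitions, `set_volume(250)` falls back to the default of 50, and `descend(3)` raises `FaultDetected` because it would need a fourth nested call. Timeout, memory invariant checks, service reactivation, and check-pointing are left out of the sketch.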

Instantiation of services can be intercepted. Instead of creating only an instance of the requested service, an additional wrapper can be instantiated as well. The wrapper is specific to the sub-system or service, but its implementation can be generated completely from an interface description and a fault management description language. No knowledge of the implementation of the wrapped sub-system or service is therefore needed (black box). The situation of a wrapped service is depicted below.
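In code, intercepted instantiation and wrapper generation might look like the following minimal Python sketch. It rests on loud assumptions: a plain list of method names stands in for the interface description, a dictionary of decorators stands in for the fault management description, and `generate_wrapper` and `ServiceRuntime` are hypothetical names rather than M3W interfaces.

```python
def generate_wrapper(interface, fault_description):
    """Build a wrapper class from method names and per-method mechanisms.

    interface: iterable of method names (stand-in for an interface
        description).
    fault_description: dict mapping a method name to a list of decorators
        (stand-in for a fault management description).
    """
    def make_method(name):
        def method(self, *args, **kwargs):
            call = getattr(self._instance, name)
            # Layer each configured mechanism around the real call.
            for mechanism in fault_description.get(name, []):
                call = mechanism(call)
            return call(*args, **kwargs)
        method.__name__ = name
        return method

    def __init__(self, instance):
        self._instance = instance

    members = {name: make_method(name) for name in interface}
    members["__init__"] = __init__
    return type("GeneratedWrapper", (), members)


class ServiceRuntime:
    """Hypothetical runtime that hands out wrapped service instances."""

    def __init__(self):
        self._factories = {}
        self._wrappers = {}

    def register(self, service_id, factory, wrapper_class=None):
        self._factories[service_id] = factory
        if wrapper_class is not None:
            self._wrappers[service_id] = wrapper_class

    def create_instance(self, service_id):
        instance = self._factories[service_id]()
        wrapper_class = self._wrappers.get(service_id)
        # Intercepted instantiation: the client receives the wrapper
        # in place of the bare instance when one is registered.
        return wrapper_class(instance) if wrapper_class else instance
```

The generator never inspects the service implementation; it only needs the method names and the mechanisms to attach, which mirrors the black-box property described above.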

A benefit of the uniform way of adding fault tolerance mechanisms specified in ISO/IEC 23004 Part 6 (Fault Management) is that a standard escalation mechanism is provided. Some errors cannot be dealt with locally but require a broader scope. The fault management framework specified by M3W enables wrappers to escalate to a fault manager that has a broader scope and can coordinate error handling across a number of wrappers. This makes it possible to handle errors at different levels of abstraction, for example first at the service level, then at the sub-system level, and finally at the system level.
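The escalation path from service to sub-system to system can be sketched as a chain of fault managers. The classes `FaultManager` and `EscalatingWrapper`, the `report` method, and the idea of a per-scope `can_handle` predicate are assumptions made for this illustration; the standard's actual interfaces are not reproduced here.

```python
class FaultManager:
    """Handles errors within one scope and escalates the rest (hypothetical)."""

    def __init__(self, scope, can_handle, parent=None):
        self.scope = scope
        self._can_handle = can_handle
        self._parent = parent

    def report(self, error):
        """Return the name of the scope that handled the error."""
        if self._can_handle(error):
            return self.scope
        if self._parent is None:
            raise error  # nobody with a broader scope is left
        return self._parent.report(error)


class EscalatingWrapper:
    """Wrapper that escalates every error to its fault manager."""

    def __init__(self, instance, manager):
        self._instance = instance
        self._manager = manager
        self.handled_by = []

    def invoke(self, method_name, *args):
        try:
            return getattr(self._instance, method_name)(*args)
        except Exception as error:
            # No local recovery in this sketch: escalate immediately.
            self.handled_by.append(self._manager.report(error))
            return None


# Three scopes, each delegating to the next broader one.
system = FaultManager("system", lambda error: True)
subsystem = FaultManager(
    "sub-system", lambda error: isinstance(error, TimeoutError), parent=system
)
service = FaultManager(
    "service", lambda error: isinstance(error, ValueError), parent=subsystem
)
```

A wrapper bound to `service` resolves a `ValueError` at the service level, passes a `TimeoutError` up to the sub-system manager, and leaves anything else to the system manager. A real fault manager would also coordinate recovery across several wrappers, which this sketch omits.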

3.      Summary

The Application Programming Interface (API) for Multimedia Middleware specified by M3W provides a uniform view of the multimedia middleware in a device. A realization of such a middleware can be provided by a number of different vendors. The M3W specification includes a fault management framework, whose goal is a system that remains dependable in the presence of faults. The framework is complementary to existing techniques that aim to remove faults from software systems, because in practice it is impossible to remove all faults:

  • Due to upgrading and extension out of control of the device vendor
  • Due to the cost of testing all possible traces of complex systems

The fault management framework provides a flexible and uniform way of adding fault tolerance mechanisms. System developers can, for example, make trade-offs between:

  • More fault tolerance mechanisms or more testing
  • More fault tolerance mechanisms or no runtime upgrading and extension

Since M3W provides a context that supports runtime upgrading and extension, the same applies to the fault management wrappers. As a consequence, it is always possible to add fault tolerance mechanisms when they are needed, or to remove them when they are observed to be superfluous.

4.      References

[1]   The Multimedia Middleware (M3W) requirements are in the annex to the Multimedia Middleware (M3W) Requirements Document Version 2.0 (ISO/IEC JTC1/SC29/WG11 N6981). A Call for Proposals derived from these requirements was issued at the 70th MPEG Meeting in Palma, Mallorca.

[2]   ISO/IEC 23004, Information Technology — Multimedia Middleware, The M3W standard.