The Moving Picture Experts Group

Global Video As Ground Plan For Genomics Data Transmission

by Philip Merrill

Due to the size of base-sequence genetics data that is now routinely collected, technologies developed to support video compression have a lively application to the very different data of genomics. Distinctively, MPEG's use of Huffman and arithmetic coding over more than a generation has led to tools like CABAC - Context-adaptive binary arithmetic coding. These digital entropy technologies have been conventionally used by MPEG as a final stage when compressing video files. The times are changing and genomics changes the role of time because genomics data represents static spatial organizations of molecules. Compressing digital video data at this entropy coding step has relied on mathematical tools equally ready to be applied to massive genomics data now actively collected and stored.

One old rule of thumb is that zipping a file losslessly reduced its file-size by 50 percent. While clearly an oversimplification, that reliable one-size-fits-all approach is very different from entropy encoding. The data itself controls how it can be crunched. For example, if we think of the ones as meaningful and the zeroes as padding, we want to get some really representative ones tightly packed in a smaller file with instructions such that it can be unpacked to reproduce the original arrangement, losslessly. This general technology was included in MPEG's first work and can be used outside the video context to condense the description of patterns formed by the nucleotide bases, Like video, amino acid chains can be digitally represented in ways that are entirely content-dependent, basing the compression of the digital version on what is inherently there in the sequence, especially repetitive patterns. The shortest descriptions in entropy coding are the most efficient and so are assigned to patterns that occur most frequently in the data. It is noteworthy that entropy coding itself was not the barrier to designing the large-scale plan because so many other interoperable pieces were required, hence the utility of MPEG’s standards library and MPEG’s practice combining the ingredients into novel integrations.

The six pieces of MPEG-G have developed since 2016 in collaboration with ISO TC 276, Working Group 5 for Biotechnology, Martin Golobiewski convenor. The first governs file format as well as general architecture for the larger ground plan. The second pertains specifically to the genomic compression itself, as surveyed above. The third part sets up the API structure so that queries and responses can be exchanged online (or within an in-house network). The fourth and fifth part have just progressed to Draft International Standards at October's past 128th meeting and provide reference software as well as a conformance standard, together enabling testbeds and fuller deployments to commence. Sixth is for annotation within the architecture and a call for proposals was made this past July at the 127th meeting, expected to be evaluated at the upcoming 129th meeting. As we explore the architecure's dimensions in more depth, it is satisfying to see that the integration of these video and genomics technologies has been able to proceed so smoothly.

To fly over the ways MPEG-G is expected to be used, first imagine the essential transmission capabilities of streaming and encryption, when desired, that we have all learned to expect from quality video. At the places where biological data is generated, privacy protections must be fundamental while at the same time it is necessary to be able to aggregate records and studies. However these are packaged, the privacy protections must remain intact without creating inefficiencies. In clinical locations access to data can be selectively screened, making it easy to see what is appropriate and blocking protected data from access. At locations where digital data is stored, interoperability with legacy systems must at least provide pathways to compatibility. This includes outputs from new generations of high-throughput genomic sequencers that bring us back to the need for MPEG-G in the first place. So many powerful sequencers are providing more and more detailed data than ever before and existing methods for compressing that data have not been nearly as efficient as what MPEG-G provides. The ability to perform necessary conversions has been part of the structural design for this new genomic data, as is the interesting presentation of results in side-by-side stacked data formats. These address the way pieces of the genetic puzzle picture get formed together, after sequencing, or else other motivations for such close comparisons between shorter strings of nucleotides.

The ability to comment on all this and have the comments handled both efficiently and also with privacy protections opens up an extensible metadata direction still in its early stages. Whether handled with lossy approaches or losslessly, part of the point of managing all this data is so that doctors and scientists can look at it and leave behind annotational comments. For those familiar with the efforts of bibliography, this interconnecting conversation between users and data needs to be able to keep new annotational information separate to some degree. For one example, there should be no need to make a redundant copy of the genomic data. Also, it has been determined that compressed files must be easily combined or concatenated and also that linkages and annotations can be made appropriately while the data remains compressed, for example at its primary storage location. Another dimension is presented by incremental updates which are always the big challenge of large, living databases. In some general cases not specifically related to MPEG-G, the addition of updates in realtime can compromise overall database performance or increase errors. Practically then one of MPEG-G's significant goals continues to be the ability for such incremental updates to be made easily and harmlessly, once again avoiding unnecessary decompression of the genomic data itself. These are not awkwardly large packages, hard to send, that must be unpacked and then repacked before sending out. This is because the metadata allows the management of what is needed to be more granular than that. If someone has the genomic data, why access it? If someone has comments to add, appropriately, making such additions must be streamlined to be as seamless as possible. Thereby producing more useful time for doctors and scientists to pursue the medical side rather than the computer side of genomics.

Similar to MPEG's original few standards, specifying decoder conformance is key. This allows a diverse host of solutions for optimization of encoding. On the digital technology side, this also incentivizes the creation of gear and boxes of electronics that can handle the conforming data. We have all seen it work for video. The implications of success, transforming society's use of our genetic information and those of other life forms around us, are likely to improve our core abilities to understand and heal life. It is easily argued that is already happening in top biology labs at this moment.

November 2019