The evolving role of metadata in video workflows

Interview with Karsten Schragmann, Head of Product Management at Vidispine – An Arvato Systems Brand

Standards & Services

FKT: Is metadata more important than ever in media production and distribution?

Karsten Schragmann: It is metadata that drives the ROI on our media assets.
First, the more knowledge we have about an asset, the greater the value we can get from it. A really simple example would be basic descriptive metadata such as title and description. This enables us to find the assets we own (e.g., in an archive) and use and reuse them.

Second, having structured metadata enables us to automate processes. Again, a really simple example would be knowing the codec and resolution of a file; based on this metadata we can make run-time decisions in automated workflows about how to handle that asset. As a more advanced example, if the metadata identifies “highlights” in an asset, we can automate the creation of an EDL to send to the editor, or even automate the whole highlights creation process.
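The kind of run-time decision described here can be sketched in a few lines. This is a minimal illustration, not Vidispine code; the house-format names and field keys are hypothetical.

```python
# Hypothetical house formats an organization might standardize on.
HOUSE_CODECS = {"xdcam_hd422", "avc_intra_100"}

def plan_workflow_steps(asset_metadata: dict) -> list[str]:
    """Decide which processing steps an asset needs, based on its
    technical metadata (codec, resolution)."""
    steps = []
    if asset_metadata.get("codec") not in HOUSE_CODECS:
        steps.append("transcode_to_house_format")
    if asset_metadata.get("height", 0) < 1080:
        steps.append("upscale")
    steps.append("qc_check")  # every asset is QC-checked
    return steps

print(plan_workflow_steps({"codec": "prores_422", "height": 720}))
# → ['transcode_to_house_format', 'upscale', 'qc_check']
```

An orchestration engine would consume such a step list to build the actual workflow; the point is that the branching is driven entirely by structured metadata, not by human inspection.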

So, metadata enables us to drive down the cost of producing and distributing assets, while at the same time increasing the attainable value of those same assets.

FKT: How is metadata evolving?

Karsten Schragmann: Modern television production not only demands well-defined metadata models; because media production is increasingly data-driven, the need for content-based, automated metadata generation is growing as well.

There are three areas where metadata has evolved significantly as a result.

First, standardization. While there’s still a huge amount of variability between departments, facilities, and geographies, the metadata landscape today is a million miles from the “Wild West” of 15-20 years ago. Much of the standardization in the field, though, has come about indirectly through developments and standardization efforts in other parts of the media lifecycle. For example, the consolidation in file formats used in media workflows — primarily now to MXF, further constrained to the application specifications defined by the DPP/AMWA or the ARD-ZDF MXF profiles — has resulted in a standard taxonomy for structural metadata. Similarly, as the need grew to transmit certain metadata, such as captioning, alongside video and audio, there have been further efforts to standardize this and other ancillary data.

Driven by the need to assure quality when exchanging files and the increased utilization of automated media analysis tools, another area that saw significant developments in standardization was QC. Again, this has been driven by organizations such as the DPP, IRT, and the EBU QC project.

It’s 15 years now since the UK start-up Vqual was the first to hit the broadcast market with a file-based video and audio analyzer, Cerify. The biggest strength, but also the biggest weakness, of Cerify and similar tools at the time was that they could (and, if configured incorrectly, often did) produce data about every frame, macroblock, and pixel in a file. Making use of, or even making sense of, this much data often presented as much of a problem as automated QC solved. But as media analysis tools have evolved, that data has become increasingly useful. Artificial intelligence and machine learning have brought further innovation and even more data from video and audio analysis, though these have once more presented challenges in managing and, crucially, capitalizing on that data.

Finally, it’s in the way that metadata is used today. Run-time decisions are made during automated workflows based on existing metadata, or on metadata generated or updated during those workflows. These decisions can be quite simple, for instance dynamically adding a transcode step to an ingest workflow if the metadata shows that the media is not in a “house format”. There are more advanced uses too, for example “fast tracking” processing if a contextual analysis of a transcription (perhaps from a separate speech-to-text analysis) matches trending keywords.
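The “fast track” decision mentioned here could look something like the following. This is a hedged sketch: the trending-keyword set, the threshold, and the tokenization are all illustrative assumptions, and a real system would draw the transcript from a speech-to-text service.

```python
# Assumed set of currently trending keywords (in practice this would be
# refreshed continuously from an external trends feed).
TRENDING = {"election", "final", "transfer"}

def should_fast_track(transcript: str, threshold: int = 2) -> bool:
    """Fast-track an asset when its transcript contains at least
    `threshold` distinct trending keywords."""
    words = {w.strip(".,!?:").lower() for w in transcript.split()}
    return len(words & TRENDING) >= threshold

print(should_fast_track("Breaking: election result final tonight"))  # → True
print(should_fast_track("Weather forecast for the weekend"))         # → False
```

A workflow engine would branch on this boolean, routing the asset into an expedited publishing path instead of the standard queue.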

FKT: Which metadata types and models should be distinguished?

Karsten Schragmann: There are three main types of video content metadata that one needs to be aware of to make assets more easily discoverable:

Descriptive Video Metadata
Descriptive video content metadata includes any information that describes an asset and is used for its later identification and discovery. Descriptive video metadata examples include:

  • Unique identifiers (such as EIDR, a concept similar to ISBN, but for digital objects)
  • Physical/Technical attributes (such as file dimensions, color codes, or file types)
  • Bibliographic/Added attributes (such as descriptions, title, and relevant keywords)

Descriptive video metadata is the most well-known type of metadata and is often described as the most robust one too because there are simply so many ways to describe an asset.

Structural Video Metadata
Structural video metadata is the data that indicates how a specific asset is organized, in much the same way that the pages of a book are organized to form chapters. Structural video content metadata also indicates whether the specific asset is part of a single collection or multiple collections, making it easier to navigate and present the information in an electronic source. Structural video metadata examples include sections, video chapters, indexes, and tables of contents.

Structural video metadata is, beyond basic organization, the key to documenting the relationship between two assets.
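One illustrative (non-normative) way to model structural metadata is a small data structure that records an asset's internal organization (chapters) and its membership in collections. The class and field names here are assumptions for the sketch, not any standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class Chapter:
    """A chapter marker inside an asset's timeline."""
    title: str
    start_s: float  # start time in seconds
    end_s: float    # end time in seconds

@dataclass
class Asset:
    """An asset with structural metadata: chapters and collection membership."""
    asset_id: str
    chapters: list[Chapter] = field(default_factory=list)
    collections: set[str] = field(default_factory=set)

doc = Asset(
    "ASSET-001",
    chapters=[Chapter("Intro", 0.0, 42.0), Chapter("Interview", 42.0, 600.0)],
    collections={"evening-news", "politics"},
)

# Membership in multiple collections documents relationships between assets.
print(sorted(doc.collections))  # → ['evening-news', 'politics']
```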

Administrative Video Metadata
Administrative video metadata concerns the technical source of a digital resource and how it can be managed. It is the metadata that relates to rights and intellectual property by providing data and information about the owner, as well as where and how it’s allowed to be used.

NISO (National Information Standards Organization) divides administrative metadata into three sub-categories:

  • Technical Metadata - Necessary information for decoding and rendering files
  • Preservation Metadata - Necessary information for long-term management and archiving of digital assets
  • Rights Metadata - Information regarding intellectual property and usage rights.

An administrative video metadata example would be a Creative Commons license.
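The rights sub-category can be sketched as a simple territory check against administrative metadata. The field names and values below are illustrative assumptions, not a real rights schema.

```python
# Hypothetical administrative metadata record for one asset.
ADMIN_METADATA = {
    "owner": "Example Broadcaster",
    "license": "CC-BY-4.0",  # e.g. a Creative Commons license
    "allowed_territories": {"DE", "AT", "CH"},
}

def usage_allowed(metadata: dict, territory: str) -> bool:
    """Check whether the rights metadata permits use in a territory."""
    return territory in metadata["allowed_territories"]

print(usage_allowed(ADMIN_METADATA, "DE"))  # → True
print(usage_allowed(ADMIN_METADATA, "US"))  # → False
```

In a real MAM, such checks would run automatically before distribution, turning rights metadata into an enforceable workflow gate rather than a note in a database.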

FKT: There are sometimes problems when, for example, a video contribution is not consistently annotated in different systems and databases. Is standardization a way out here?

Karsten Schragmann: I have already mentioned a few standardization initiatives that have dramatically improved the situation. The AMWA/DPP application specifications are a great example here, where the schema and taxonomy of the metadata are specified for file-based content delivery. I also touched on the challenges experienced with automated QC when the first solutions came to market. Initiatives such as the EBU QC project helped a great deal here, though it’s also arguable that there was a de facto standardization that came about through market pressure and vendor consolidation.

With the explosion in AI services and a broad range of vendors entering the space, we face a similar but much less constrained scenario today. A joint ETC-SMPTE task force was set up in the second half of 2020 to look at areas where collaboration and standardization may be beneficial with regard to AI and media. Undoubtedly, metadata created by those processes will be among the areas identified by the task force when it produces its engineering report. However, at that point we will still be a long way from any standards or recommendations being produced, so it’s possible, perhaps even likely, that vendor consolidation and market pressure will once more be the main drivers toward a de facto standard.

In the meantime, systems that provide a single point of entry and unified results from multiple AI services will provide a bridge, enabling users to create best-of-breed solutions.

FKT: Automated, AI-based metadata generation techniques are now available to help with the enormous amounts of video and audio data. What are the current options?

Karsten Schragmann: AI-based metadata generation allows the machine to find information inside the video and audio frame itself, in the same way that a human operator can interpret the same content. This, of course, opens up important new possibilities depending on what type of workflow you are managing. A channel distributor can use AI-based metadata generation to automatically find new types of information in a huge amount of media content that could not be processed manually before – and thus use or present those insights to the viewer as a program, highlights, suggested shows, or even as autogenerated trailers. AI-based services can carry this new information as metadata and give your MAM system new and much more granular methods of managing your media files. This is very important in the process of optimizing the performance and capabilities of your evolving media supply chain.

Revenue, and how we can improve it, is of course a driver for the advancement and adoption of cognitive services, as it is for most other technology. And once you get familiar with the idea of challenging your common view of what machines can do, the subject of revenue through technology gets even more interesting.

FKT: What challenges are there to overcome in AI-based metadata creation?

Karsten Schragmann: One of the biggest challenges introduced with AI-derived metadata is around “quality” – or, more precisely, around confidence levels. As AI-based analyzers become commonplace in media workflows, we have moved from a position where we had a relatively small amount of metadata that we trusted, to one where we have huge amounts of metadata with varying degrees of confidence in its accuracy. Confidence, and confidence thresholds, now play an important role in our workflows – potentially with different thresholds in different workflows or different parts of the organization. Managing these confidence thresholds is key to the usefulness of the metadata.
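Per-workflow confidence thresholds can be sketched like this: the same AI-derived labels pass a lenient archive-tagging threshold but not a strict automated-publish one. The labels, workflow names, and threshold values are illustrative assumptions.

```python
# Hypothetical AI-derived labels with confidence scores.
labels = [
    {"label": "goal", "confidence": 0.93},
    {"label": "crowd", "confidence": 0.71},
    {"label": "penalty", "confidence": 0.48},
]

# Different workflows tolerate different levels of uncertainty.
THRESHOLDS = {"archive_tagging": 0.5, "automated_publish": 0.9}

def usable_labels(detections: list[dict], workflow: str) -> list[str]:
    """Keep only labels whose confidence meets the workflow's threshold."""
    threshold = THRESHOLDS[workflow]
    return [d["label"] for d in detections if d["confidence"] >= threshold]

print(usable_labels(labels, "archive_tagging"))    # → ['goal', 'crowd']
print(usable_labels(labels, "automated_publish"))  # → ['goal']
```

The same raw analysis thus yields different effective metadata depending on where in the organization it is consumed, which is exactly why threshold management matters.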

A second challenge with AI-based metadata creation is the sheer volume it produces. When we architected our plans for integrating cognitive services a few years ago, our colleague Ralf Jansen stated an ambition to “know every detail about every frame.” But there is only value in understanding, and/or documenting, a detail if it adds value to the content or saves cost in the production process. We also need to be able to access that data.

As a result, there needs to be a unified service approach within the MAM architecture. There is a growing number of cognitive service providers, and a MAM system needs to be able not only to make room for additional layers of temporal metadata, but also to conform cognitive metadata from many different providers into a common structure. This is important because there will be different trained models for different purposes, and you will want to use and combine cognitive services from different providers to improve your media supply chain’s capabilities and performance.

In VidiNet Cognitive Services, we have defined a standard structure for the cognitive metadata from different providers. Because of this, customers don’t have to worry about how to model and integrate the different metadata results coming back from different providers.
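The conforming step described here can be illustrated with a small normalizer that maps provider-specific payloads onto one common temporal-metadata shape. To be clear, this is not the actual VidiNet schema; the provider names, field names, and payloads are all invented for the sketch.

```python
def normalize(provider: str, raw: dict) -> dict:
    """Conform a provider-specific analysis result into a common structure
    (label, start/end in milliseconds, confidence)."""
    if provider == "provider_a":  # hypothetical provider payload shapes
        return {"label": raw["tag"], "start_ms": raw["from"],
                "end_ms": raw["to"], "confidence": raw["score"]}
    if provider == "provider_b":
        return {"label": raw["name"],
                "start_ms": int(raw["start_s"] * 1000),
                "end_ms": int(raw["end_s"] * 1000),
                "confidence": raw["conf"]}
    raise ValueError(f"unknown provider: {provider}")

# Two providers report the same detection in different shapes...
a = normalize("provider_a", {"tag": "car", "from": 0, "to": 400, "score": 0.88})
b = normalize("provider_b", {"name": "car", "start_s": 0.0, "end_s": 0.4, "conf": 0.88})

# ...and normalization makes them directly comparable.
print(a == b)  # → True
```

Once every provider's output lands in the same structure, downstream search, display, and workflow logic can stay provider-agnostic.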

FKT: How will the quality of metadata and long-term benefits progress?

Karsten Schragmann: Customers in the future will recognize the importance of service-agnostic basic metadata extraction that offers various types of cognitive recognition models while unifying all this metadata in a single MAM system and media supply chain to drive business intelligence. Especially in the field of computer vision, there should be a simple way of training your own current and regional concepts, with just a few examples of training data, integrated directly in the MAM.

And, once all the necessary time-accurate information is available, we can build value-added services on top that span a wide range: content intelligence, such as searching and monetizing content; content recommendation based on genealogy patterns; real-time assistance systems based on rights ownership; recommendations while editing for target program slots, based on rating predictions; content compliance checks that automatically highlight cuts of content; domain-specific archive tagging packages; similarity search with respect to owned licenses; and much, much more.

Customers will start to find their own applications and use cases for cognitive services. At the same time, there is already a growing number of cognitive service providers out there, with unique or overlapping functionality.

Interview: Martin Braun