DAM and Flexible Metadata Using Semantics

This is the last in a series of blog posts about finding a flexible solution for managing digital asset metadata. As I mentioned at the beginning of the series, digital asset metadata is a challenge because of its variety. A video asset requires different metadata than an image, and assets sold through one aggregator will require different metadata than assets distributed through another. Business models are also constantly changing, so the metadata you collect today will likely need to change in the near future.

Several new types of databases have emerged that provide the flexible solutions that businesses require. While document-oriented databases have commanded the most attention, semantic “triple stores” have been gaining adherents. In this type of database, all information can be conceptualized as a type of graph, with each fact connected to other facts to create a complex web. I should mention that there are other graph-oriented databases that are not based on semantic technology, but I will just focus on triple stores in this discussion.

Semantic Web Concepts

So what exactly is a triple store? At their heart, these databases are based on an information model called the Resource Description Framework (RDF). According to this model, every piece of information consists of three parts: a subject, a predicate and an object.

Consider the statement that an asset’s filename is “flower.png”. In RDF, the asset is considered to be the subject, “flower.png” is the object, and “filename” is the predicate that joins the subject to the object. These three parts are called a “triple” and can be drawn as a simple graph:

Figure 1. RDF triple expressing the fact that an asset’s filename is “flower.png”

Note that in RDF, the concept of an individual asset is represented as a node with the unique identifier “urn:asset:1”. This identifier was chosen for demonstration purposes and has no significance in and of itself. Any value could have been used as long as it was unique and conformed to the naming convention used for Internationalized Resource Identifiers (IRIs). Indeed, best practices call for these identifiers to take the form of web addresses, but I thought that would cause even more confusion for those just being introduced to RDF.
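As a rough illustration of the subject-predicate-object model, the triple in Figure 1 can be written as a three-part record. The sketch below uses plain Python rather than a real triple store; the `Triple` class and its field names are assumptions made for this example only.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str  # named "obj" only to avoid shadowing Python's builtin "object"


# The fact from the text: asset urn:asset:1 has the filename "flower.png".
fact = Triple(subject="urn:asset:1", predicate="filename", obj="flower.png")

# Conceptually, a triple store is a (very large) set of such statements.
store = {fact}

print(fact.subject, fact.predicate, fact.obj)
```

Each additional fact about the asset would simply be another triple with the same subject.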

All the examples previously presented in this series can be expressed as RDF triples and depicted in a graph. RDF purists will note that the following graph conflates instances with types in order to make the diagram easier to read:

Figure 2. Graph showing facts about digital assets and their creators

In many ways, triple stores solve the drawbacks of both relational and document databases. They do not impose rigid schemas like relational databases and they do not require that all information be stored inside a single document, as in document databases.
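To make the schema point concrete, here is a minimal sketch that models a store as a plain Python set of tuples. The second asset and the predicate names `widthPixels`, `durationSeconds` and `license` are hypothetical, invented for this example; a real triple store would use IRIs and a proper vocabulary.

```python
# Hypothetical identifiers and predicate names, chosen only for illustration.
store = {
    ("urn:asset:1", "filename", "flower.png"),
    ("urn:asset:1", "widthPixels", "800"),
    ("urn:asset:2", "filename", "garden.mp4"),
    ("urn:asset:2", "durationSeconds", "42"),
}

# A new requirement arrives: record a license for one asset. No schema
# migration is needed; the new fact is simply added to the set.
store.add(("urn:asset:2", "license", "royalty-free"))


def describe(subject, triples):
    """Collect the predicate/object pairs recorded for one subject.

    Assumes at most one object per predicate, which holds for this data.
    """
    return {p: o for s, p, o in triples if s == subject}


print(describe("urn:asset:1", store))
print(describe("urn:asset:2", store))
```

The image and the video carry different predicates, and neither has to fit into a shared table definition or a single document.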

This is not to say that triple stores do not pose problems. Indeed, the inherent flexibility of triple stores can become a challenge. Consider the case where you write a query to find all the digital assets created by John Doe (note that in this and other SPARQL queries, prefixes have been omitted to improve readability):

Without getting too deeply into the syntax of SPARQL, the RDF query language, this is asking for assets that are connected to John Doe via the predicate “Photographer”. The approach seems reasonable, but it is flawed because it relies on a predicate with a very specific semantic meaning. What happens if John Doe has also been shooting videos and the predicate joining those assets to John is “FilmMaker”? All those videos will be invisible to the query as written.
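The problem can be reproduced in a few lines. This is a sketch in plain Python, not SPARQL, and not the original query: the sample data, the `urn:person:...` identifiers and the helper `subjects_with` are all assumptions made for the example.

```python
# Hypothetical data: John Doe shot one photo and one video.
store = {
    ("urn:asset:1", "Photographer", "urn:person:john-doe"),
    ("urn:asset:2", "FilmMaker", "urn:person:john-doe"),
    ("urn:asset:3", "Photographer", "urn:person:jane-roe"),
}


def subjects_with(predicate, obj, triples):
    """Return subjects linked to obj by exactly this predicate."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)


# Roughly the pattern "find assets whose Photographer is John Doe".
found = subjects_with("Photographer", "urn:person:john-doe", store)
print(found)  # ['urn:asset:1'] -- the video urn:asset:2 is missing
```

Because the match is on the exact predicate, the asset joined to John Doe by “FilmMaker” never appears in the result.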

What is needed is a predicate vocabulary that is general enough to capture the range of John Doe’s relationships, but still specific enough to express the true semantic meaning. In this case, one solution would be to define a more general predicate called “CreatedBy”:

One of the strengths of RDF and semantic databases is the ability to infer information from relationships. For example, with semantics, you can assert that “Photographer” and “FilmMaker” are sub-properties (that is, more specific forms) of the “CreatedBy” relationship. You do that simply by adding two new triples to your database:

With these facts in the triple store, you can now write a SPARQL query that treats FilmMaker and Photographer relationships as forms of “CreatedBy”. The following query would, for example, return all assets connected to John Doe through any “CreatedBy”, “FilmMaker” or “Photographer” relationship:
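The inference step can be sketched in plain Python as well. This is an illustration under stated assumptions, not the original SPARQL: it assumes the two vocabulary triples use a predicate equivalent to the RDFS property `rdfs:subPropertyOf` (written here as `subPropertyOf`, with prefixes omitted), and the sample data and helper functions are hypothetical.

```python
# Hypothetical data. "subPropertyOf" stands in for rdfs:subPropertyOf.
store = {
    ("urn:asset:1", "Photographer", "urn:person:john-doe"),
    ("urn:asset:2", "FilmMaker", "urn:person:john-doe"),
    ("urn:asset:3", "CreatedBy", "urn:person:john-doe"),
    ("urn:asset:4", "Photographer", "urn:person:jane-roe"),
    # The two vocabulary triples described in the text.
    ("Photographer", "subPropertyOf", "CreatedBy"),
    ("FilmMaker", "subPropertyOf", "CreatedBy"),
}


def with_subproperties(predicate, triples):
    """Return predicate plus everything declared, directly or
    transitively, as a sub-property of it."""
    found = {predicate}
    frontier = [predicate]
    while frontier:
        parent = frontier.pop()
        for s, p, o in triples:
            if p == "subPropertyOf" and o == parent and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def subjects_inferred(predicate, obj, triples):
    """Return subjects linked to obj by predicate or any sub-property."""
    wanted = with_subproperties(predicate, triples)
    return sorted(s for s, p, o in triples if p in wanted and o == obj)


created = subjects_inferred("CreatedBy", "urn:person:john-doe", store)
print(created)  # ['urn:asset:1', 'urn:asset:2', 'urn:asset:3']
```

In an actual triple store this expansion is not hand-written. It would typically come from the store's RDFS inference support or from a SPARQL 1.1 property path over `rdfs:subPropertyOf`, depending on the product.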

The example illustrates the fact that triple stores contain a wide range of information that is constantly expanding. In order to make sense of the complex graphs present in the database, data architects must design well-defined and consistent vocabularies of predicates and types that balance generality with specificity. Architects must also monitor the triples present in the database and carefully consider how the vocabulary needs to be updated to handle evolving information requirements.

Unfortunately, the number of practitioners who know how to develop flexible RDF vocabularies and SPARQL queries is still quite small. Nevertheless, an increasing number of companies are implementing triple stores and we appear to be reaching a tipping point where the technology will soon come into widespread use.

Recommendations on Implementing a Flexible Metadata Model

Flexible metadata models can be achieved using relational databases, document databases, and graph databases. Choosing which implementation to use is not cut-and-dried. The choice depends greatly on the capabilities of each specific organization and its metadata requirements. The following table offers recommendations on how to choose a database implementation, but these should only be taken as rough guidelines.

Table 1. Guidelines for implementing a flexible metadata model

Conclusion

The main premise of this series is that digital asset metadata cannot be represented by a single, unchanging metadata model and schema. Data architects need to embrace flexible models that allow metadata to vary widely across asset types and that accommodate constant change. While such flexible models can be implemented using traditional relational databases, they are generally easier to achieve with document and semantic databases.

Before choosing an implementation strategy, organizations need to look closely at their specific metadata requirements as well as their internal capabilities. Document and semantic-based approaches offer tremendous benefits, but require familiarity with new modeling and programming paradigms. The rapid evolution of these technologies also creates problems for Information Technology management, with the ever-present danger that today’s cutting-edge system may quickly become obsolete.

Investing in new technologies and new skills is always expensive and risky. It is also necessary in order to meet the needs of today’s digital communication channels. If organizations want to derive maximum value from their digital assets, then they need to invest in their digital infrastructure.

Note: This is the fourth in a series of blog posts discussing the need for flexible data models when managing digital asset metadata. The series is based on Demian Hess’ article “Managing digital asset metadata”, Journal of Digital Media Management, Vol. 3, No. 2 (November 2014).

Demian Hess is Avalon Consulting, LLC's Director of Digital Asset Management and Publishing Systems. Demian has worked in online publishing since 2000, specializing in XML transformations and content management solutions. He has worked at Elsevier, SAGE Publications, Inc., and PubMed Central. After studying American Civilization and Computer Science at Brown University, he went on to complete a Master's in English at Oregon State University, as well as a Master's in Information Systems at Drexel University.
