Data Storage: Metadata Everywhere (And Not a Drop to Drink)

Posted on March 20, 2014 By Henry Newman

This article's catchy title comes from The Rime of the Ancient Mariner by Samuel Taylor Coleridge, but a co-worker suggested I might want to title it "Metadata everywhere causing everyone to drink" instead.

The word "metadata" is used to describe lots of different things. As a storage guy, my introduction to metadata came from discussing file systems and their metadata requirements. Ask a librarian, and metadata is something totally different. Ask a database person, and again you will get a different answer. Look at a JPEG header and you find yet another kind of metadata, about the file rather than the file system. "Metadata" might be one of the most overused and confusing terms around today. It might as well be a pronoun like “that,” which can mean just about anything at any time.

I want to talk about a few types of metadata. I'll share some of my thoughts on what needs to happen in the future and what I think will happen and why.

High-level storage systems metadata

"High-level storage systems metadata" is a term I just made up. I use the words "high-level" because I view low-level storage systems metadata as information about how the controller views the underlying storage under its control. Low-level metadata covers the disks in RAID groups, LUN information, data on virtualization of the drives, replication and lots of other stuff.

To me "high-level" implies that users are going to access information via an interface, file system, protocol or REST. We used to just call it file system metadata. That term had specific meanings, connotations and implications for about 25 years. For example, file system metadata had to support POSIX file system standards. It included things like user ID (UID), group ID (GID), access time (atime), inode change time (ctime, which is often mistaken for a creation time) and other fields that are part of standard POSIX. All of this information was stored in the inode, whether the file system was local or on a NAS device. The specific minimal set was defined by the POSIX standards, and NFS added things like access control lists (ACLs). In addition, some vendors added extensions using POSIX extended attributes for things like hierarchical storage management. The inode also stored the location of the file's data within the file system on disk. Different vendors implemented things a bit differently, but all of this was in the inode or in an extended attribute that was viewed as part of the inode.
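To make the inode fields concrete, here is a minimal Python sketch that reads them through the standard library's os.stat call. The helper name posix_metadata is my own invention for illustration, and note that on Unix-like systems st_ctime is the inode change time rather than a creation time.

```python
import os
import stat
import tempfile


def posix_metadata(path):
    """Return the classic inode-style fields for `path` as a plain dict.

    `posix_metadata` is a hypothetical helper name, not a standard API.
    """
    st = os.stat(path)
    return {
        "inode": st.st_ino,                    # inode number
        "mode": stat.filemode(st.st_mode),     # e.g. "-rw-------"
        "uid": st.st_uid,                      # owning user ID
        "gid": st.st_gid,                      # owning group ID
        "size": st.st_size,                    # length in bytes
        "atime": st.st_atime,                  # last access time
        "mtime": st.st_mtime,                  # last data modification time
        "ctime": st.st_ctime,                  # inode change time on Unix
    }


if __name__ == "__main__":
    # Create a throwaway file so the example does not depend on any real path.
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"hello")
        tmp.flush()
        for name, value in posix_metadata(tmp.name).items():
            print(f"{name:>6}: {value}")
```

Everything in that dict is fixed by the interface. There is no slot for anything else, which is why vendors reached for extended attributes.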

Well, that was then; this is now. With new interfaces such as REST, POSIX-compliant inodes and the like are just one of the metadata components that users might see.

The problem with POSIX inodes and the standard was that they were not very extensible. Of course, extensibility was theoretically built in with POSIX extended attributes. But as we all know, theory and reality are often different. Issues arose because each vendor might use extended attributes, and often did, but the attributes were never agreed upon between vendors and file systems. About a decade ago, a group of people in the US government and HPC tried to extend POSIX and were rejected by The Open Group, which controls POSIX. As a result, extended attributes were defined only for a specific file system. POSIX, therefore, is in my opinion pretty limited for adding modern metadata information.

REST interfaces, on the other hand, are easy to extend compared to file systems. Many REST implementations are built on databases, so adding a new bit of information is pretty easy. The mapping of objects to their storage locations is often similar to what is done in standard file systems today, but developers have learned a great deal over the last 25 to 30 years. There are now lots of new things users can provide per object, like checksums and other fields. All of this gives REST a big advantage, but from what I have seen, there is not a great deal of coordination to ensure that everyone gets the same per-object metadata and that you can easily move from one object store to another.
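As a rough sketch of why per-object metadata is easier to extend, consider a toy in-memory object store in Python. Everything here (ToyObjectStore, put, head, get, and the modality and patient_id fields) is hypothetical and does not correspond to any real product's API. The point is only that arbitrary fields can ride along with each object, and that a checksum can be stored and checked on read.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes
    sha256: str                                   # checksum recorded at write time
    metadata: dict = field(default_factory=dict)  # free-form, user-supplied fields


class ToyObjectStore:
    """Hypothetical in-memory store; a stand-in for a database-backed service."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, **metadata):
        """Store `data` under `key` with any extra metadata fields."""
        digest = hashlib.sha256(data).hexdigest()
        self._objects[key] = StoredObject(data, digest, dict(metadata))
        return digest

    def head(self, key):
        """Return metadata only, loosely analogous to an HTTP HEAD request."""
        obj = self._objects[key]
        return {"sha256": obj.sha256, "size": len(obj.data), **obj.metadata}

    def get(self, key):
        """Return the data after re-verifying the stored checksum."""
        obj = self._objects[key]
        if hashlib.sha256(obj.data).hexdigest() != obj.sha256:
            raise IOError(f"checksum mismatch for {key!r}")
        return obj.data


if __name__ == "__main__":
    store = ToyObjectStore()
    store.put("scan-001", b"pixel data", modality="CT", patient_id="anon-42")
    print(store.head("scan-001"))
```

Nothing stops a second store from calling the same field "scanner_type" instead of "modality", which is exactly the coordination problem described above.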

Per file/object metadata

Lots of file types (soon they might be named "object types") have their own metadata that describes the file or object. For example, a JPEG has lots of information describing the file. For medical images, there are DICOM files, whose format is controlled by the National Electrical Manufacturers Association (NEMA), many of whose members are vendors that make scanning equipment for MR, CT and X-ray. In weather, there is GRIB (Gridded Binary or General Regularly-distributed Information in Binary form), which has been agreed upon by the World Meteorological Organization. Another example would be oil and gas exploration companies, which also have standard formats, and yet another would be the electric power grid community.
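To show what metadata inside the file looks like, here is a minimal Python sketch that walks the marker segments at the start of a JPEG and pulls the image dimensions out of the start-of-frame (SOF) segment. The function name jpeg_frame_info is my own. The sketch assumes a well-formed header and ignores richer metadata such as EXIF, so treat it as an illustration rather than a robust parser.

```python
import struct

# SOF markers carry the frame header: 0xC0-0xCF except DHT (0xC4),
# JPG (0xC8) and DAC (0xCC), which reuse that range for other purposes.
_SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}
_SOS = 0xDA  # start of scan: compressed image data follows


def jpeg_frame_info(data: bytes) -> dict:
    """Return precision, width, height and component count from a JPEG header."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:          # fill byte before the real marker
            pos += 1
            continue
        if marker == _SOS:          # past the header, stop looking
            break
        # Segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in _SOF_MARKERS:
            payload = data[pos + 4:pos + 10]
            if len(payload) < 6:
                raise ValueError("truncated SOF segment")
            precision, height, width, components = struct.unpack(">BHHB", payload)
            return {
                "precision": precision,
                "width": width,
                "height": height,
                "components": components,
            }
        pos += 2 + length           # skip marker bytes plus the segment
    raise ValueError("no SOF segment found")
```

Called on the bytes of a real file, for example jpeg_frame_info(open(path, "rb").read()), it returns a small dict. None of these values live in the inode. They travel with the file itself, whatever file system or object store holds it.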