Metadata vs Data: a wholly artificial distinction

Computer scientists are fond of talking about metadata. There often seems to be an assumption that drawing a distinction between metadata and data is useful and perhaps even necessary.

At an architectural level, I think that’s entirely wrong. Any storage architecture that maintains a distinction between metadata and data has real problems that will limit its flexibility and usefulness. Note that I’m not saying that an application shouldn’t maintain a distinction between metadata and data, or that applications shouldn’t present things to users in those terms, or that it’s not useful to think in terms of metadata and data. I’m also not claiming that every storage architecture needs to be flexible – there are obviously times where that appears unnecessary (though in many cases you may end up wanting more flexibility).

I’ll simply argue that if you aim to build a storage architecture with real flexibility, maintaining a distinction between data and metadata runs directly counter to your goal. Below I’ll outline some reasons why.

But first, consider the natural world. If you talk to a regular person — meaning someone who’s not a computer scientist, a librarian, an archivist etc. — and ask them if they know what metadata is, you’ll probably draw a blank. Why is that? It’s because the distinction between data and metadata is entirely artificial. It does not exist in the real world, and it’s clear that regular people can get by just fine without it. Fluidinfo draws its inspiration from the way we work with information in the natural world, and maintains no such distinction.

It’s interesting to speculate on the origins of the metadata vs data distinction. I’d love to know its full history. I suspect that it arose from early architectural constraints, from the relative design and programming ease of maintaining a set of constant-size chunks of information about files apart from the dynamic and variable-size memory required by the contents of files. I suspect it probably also has to do with architectural limitations and the slowness of early machines.

Here then are the main reasons why the distinction is harmful.

Two access methods: When metadata and data are stored separately, the way to get at those two different things is likely to be different. Consider inodes in a UNIX filesystem versus the disk blocks containing file data. They are stored differently and cannot be accessed in a uniform way. This causes internal complexity for the storage architecture.
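To make the two access paths concrete, here is a minimal Python sketch. It assumes an ordinary POSIX-style filesystem; the function name `read_both_ways` and the sample file are made up for illustration. The file's size comes back through one call (`os.stat`, which reports the inode-level fields), while its contents come back through a different one (`open` and `read`).

```python
import os
import tempfile


def read_both_ways(path):
    """Return (size reported by the metadata, bytes read from the data)."""
    # Metadata: os.stat() asks the filesystem for the file's status fields.
    size_from_metadata = os.stat(path).st_size
    # Data: a separate API (open/read) is needed to fetch the contents.
    with open(path, "rb") as f:
        content = f.read()
    return size_from_metadata, content


# Hypothetical sample file, created only for this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    example = os.path.join(tmp, "accounts.txt")
    with open(example, "wb") as f:
        f.write(b"automobiles\n")
    size, content = read_both_ways(example)

print(size, content)
```

Both values describe the same file, yet neither call can return the other's answer.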

Two permissions systems: There are likely to be two permissions systems governing changes to metadata and data. This is another source of internal complexity for the architecture.

Search across the two is complex or impossible: Why has it traditionally been so hard to find, for example, a file with “accounts” in its name and “automobiles” in the contents? Because this is a simultaneous search across file metadata and file content. The division between metadata (the name) and the data (the content) made such searches extremely difficult. Even with modern systems it’s awkward. Consider the UNIX find command which searches based on file metadata and the grep command which searches file contents. Combining the two is not easy. It’s at least possible in some systems these days, but that’s because those systems pull all the information together and build a separate index on it – i.e., they allow it by removing the division between metadata and data.
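The "accounts" and "automobiles" search can be sketched in a few lines of Python. This is only an illustration of the split described above, not how any particular search tool is implemented; `find_name_and_content` and the sample file names are invented for the example. Note that the two predicates go through two different access paths: one tests the name held in the directory listing, the other has to open and read the file.

```python
import os
import tempfile


def find_name_and_content(root, name_part, content_part):
    """Yield paths whose file name contains name_part and whose text contains content_part."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            # Predicate 1 runs against metadata: the name in the directory entry.
            if name_part not in filename:
                continue
            path = os.path.join(dirpath, filename)
            # Predicate 2 needs a different access path: open the file and read it.
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    text = f.read()
            except OSError:
                continue  # unreadable file: skip it
            if content_part in text:
                yield path


# Hypothetical sample files, created only for this demonstration.
with tempfile.TemporaryDirectory() as root:
    samples = {
        "accounts-2009.txt": "leases on automobiles",
        "accounts-2008.txt": "office rent",
        "notes.txt": "automobiles to sell",
    }
    for filename, text in samples.items():
        with open(os.path.join(root, filename), "w", encoding="utf-8") as f:
            f.write(text)
    matches = sorted(
        os.path.basename(p)
        for p in find_name_and_content(root, "accounts", "automobiles")
    )

print(matches)
```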

A central piece of content: Systems, especially document or file systems, usually maintain a distinction between the content and the metadata about the content. But the real world doesn’t work that way. You may possess information about something without having the thing. There may be no pieces of content, or there may be many.

Who decides?: If a system maintains a distinction between metadata and data, who decides which is which? Almost inevitably, it’s a programmer, a system architect, or a product manager who makes those decisions. There’s an implicit assertion that they know more about your information than you do. They decide what should be in the metadata. While there are systems that let users create metadata, they are usually limited in scope – someone has decided in advance how much metadata a regular user should be allowed to create, what kind of metadata it can be, how it will be used, how users will be allowed to search on it, etc. The intentions are good, but the whole thing smacks of parental control, of hand-holding, of “trust us, we know better than you do”.

Time dependency at creation: Systems maintaining the distinction also introduce an unnatural time dependency. Until the content (i.e., the data) is available, there’s nowhere to put the metadata. E.g., a file object has to be created before it can have metadata, a web page has to come into existence before you can tag it. But the real world doesn’t work that way. E.g., you can have an opinion about someone you’ve never met, or someone who’s dead or fictional. You can have a summary of a call agenda before the call happens, or notes about a meeting before the minutes of the meeting are prepared.

Time dependency at deletion: The awkward time dependency bites when the content is deleted too. The metadata necessarily vanishes because the architecture doesn’t allow it to persist: there’s literally nowhere to put it. Once again, the real world doesn’t work that way. E.g., you’re sent a large image file of someone’s pet cat – you take a look and, to show you care, make a mental note of its name and breed, but you delete the image because you don’t want to store it. Or suppose you give away or lose your copy of Moby Dick – you don’t therefore immediately forget the book’s title, its plot, the author, the name of the main character, an idea of how long it is, the book’s first line, etc. The “content” is gone, but the metadata remains. You may have never owned the book, you may think you have a copy but do not, you may have two copies – in the natural world it just doesn’t matter, and nor should it in a storage architecture. Interestingly, Amazon are currently being sued because they threw away someone’s metadata in the process of removing a copy of Orwell’s 1984 from a Kindle. You can bet the metadata was removed automatically when the content was removed.
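Both time dependencies disappear if the "content" is just one more piece of information attached to a thing. The toy store below is a sketch under that assumption; the `TagStore` class, its method names, and the tag names are all hypothetical and are not Fluidinfo's API. Information about the book can be stored before any text exists, and it survives when the text is removed.

```python
class TagStore:
    """Toy uniform store: an object is just an id with a dict of tags."""

    def __init__(self):
        self._objects = {}

    def tag(self, object_id, name, value):
        # Tagging creates the object on demand; nothing has to exist first.
        self._objects.setdefault(object_id, {})[name] = value

    def untag(self, object_id, name):
        # Removing one tag never touches the others.
        self._objects.get(object_id, {}).pop(name, None)

    def tags(self, object_id):
        # Return a copy so callers cannot mutate the store by accident.
        return dict(self._objects.get(object_id, {}))


store = TagStore()
# Information about the book arrives before any copy of its text does.
store.tag("moby-dick", "author", "Herman Melville")
store.tag("moby-dick", "first-line", "Call me Ishmael.")
# Later the text shows up, stored as just another tag...
store.tag("moby-dick", "text", "<full text of the novel>")
# ...and later still it is given away. Everything else stays.
store.untag("moby-dick", "text")
remaining = store.tags("moby-dick")

print(remaining)
```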

OK, enough examples for now.

Fluidinfo has none of the problems listed above. It has absolutely no distinction between metadata and data. It has a single permissions system that mediates access to all information. When a tag (perhaps used or presented as the “content” by an application) is removed from an object, all the other tags remain. There is no distinction between important system information and the information stored by any regular user or application – they’re all on an equal footing, and that includes future applications and users. No-one gets to set the rules about what’s more important and what’s not: there’s simply no distinction. You can search on anything, using a single query language – the system uses the query language to find things it needs, just like any other application. The single permissions system mediates who can do what – equally and uniformly.
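As a rough illustration of what "search on anything, using a single query language" buys you, here is a small Python sketch. It is not Fluidinfo's query language; the `query` function, the object ids, and the tag names are assumptions made for the example. The point is only that a name-like tag and a content-like tag are matched by exactly the same mechanism.

```python
def query(objects, conditions):
    """Return ids of objects whose tags satisfy every condition.

    `conditions` maps a tag name to a predicate on that tag's value.
    No tag is special: 'name' and 'content' are matched the same way.
    """
    return [
        object_id
        for object_id, tags in objects.items()
        if all(name in tags and test(tags[name]) for name, test in conditions.items())
    ]


# Hypothetical objects; each is just an id with a dict of tags.
objects = {
    "obj-1": {"name": "accounts-2009", "content": "leases on automobiles"},
    "obj-2": {"name": "accounts-2008", "content": "office rent"},
    "obj-3": {"name": "holiday", "rating": 5},
}

# One query covers what would otherwise be a metadata search plus a content search.
hits = query(
    objects,
    {
        "name": lambda value: "accounts" in value,
        "content": lambda value: "automobiles" in value,
    },
)
# The same function works for a tag that some later user or application invented.
rated = query(objects, {"rating": lambda value: value >= 4})

print(hits, rated)
```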

I used to argue that everything should just be considered data. But I think David Weinberger puts it better in Everything is Miscellaneous where he says it’s all metadata. Call it what you will, it’s clear (to me at least) that at a fundamental level there should be no distinction.

BTW, if you’re into self-reference, you might also be interested to know that Fluidinfo uses itself to implement its permissions system. Permissions are just more information, after all. Fluidinfo stores that information for tags, namespaces, and users onto the regular Fluidinfo objects that are about those things. There truly is no metadata / data distinction. It’s a little like Lisp: once you have the core system in place, you can (and should) use it to implement the wider system.
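The self-reference can be sketched as follows. This is a guess at the general shape, not Fluidinfo's actual layout: the class `SelfDescribingStore`, the `"tag:"` id prefix, the `writers` tag, and the users `alice` and `bob` are all hypothetical. What it shows is that the permission for a tag lives as ordinary information on the object that is about that tag, so checking a permission is just another lookup in the same store.

```python
class SelfDescribingStore:
    """Toy store that keeps its own permissions as ordinary tags."""

    def __init__(self):
        self.objects = {}  # object id -> {tag name: value}

    def grant(self, tag_name, user):
        # The permission is a tag on the object that is *about* `tag_name`.
        about = self.objects.setdefault("tag:" + tag_name, {})
        about.setdefault("writers", set()).add(user)

    def may_write(self, user, tag_name):
        # Checking a permission is an ordinary lookup in the same store.
        about = self.objects.get("tag:" + tag_name, {})
        return user in about.get("writers", set())

    def tag(self, user, object_id, tag_name, value):
        if not self.may_write(user, tag_name):
            raise PermissionError(f"{user} may not write {tag_name}")
        self.objects.setdefault(object_id, {})[tag_name] = value


store = SelfDescribingStore()
store.grant("rating", "alice")
store.tag("alice", "moby-dick", "rating", 5)

try:
    store.tag("bob", "moby-dick", "rating", 1)
    denied = False
except PermissionError:
    denied = True

print(denied, store.objects)
```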

In this entry the author mistakenly assumes the real world does not use metadata. In my view he could not be more wrong. On another blog similar to this one I joked that we might as well get rid of street names or street numbers, since we all use navigation devices and can do without them …… right?
A street name is metadata for a specific location where numerous houses form a logical unit, followed by the second tag: the house number, which is also metadata. Just because the real world does not know what we mean by “Metadata” does not mean the real world can live without it.

Metadata is important for bundling and limiting collections. If I want a listing of incoming invoices I don’t want correspondence about any invoice, I just want the invoices. The word “invoice” is the tag I will use to store and retrieve them. If I want to narrow my search results down I will add a second tag to my query, let’s say the name of the company which sent me that invoice in the first place. This is also a tag belonging to “metadata”.

Whenever I want to migrate my existing data to another hosting application I definitely need to be able to supply metadata pointing to the right files, otherwise the end result, in the new application, will just be a scrapheap of information useless to anybody since there is no way to tell what’s there.

I guess I wasn’t clear enough in the second paragraph. I’m not saying that the world doesn’t use metadata (which I guess we can agree is just information about other information) – of course it does, otherwise the world would be a very different place and survival itself wouldn’t be possible (if you consider making summaries and drawing generalizations to be a form of metadata). I’m not saying that it’s not useful to think in those terms, etc. I agree with your points entirely.

What I am saying is that in programming a storage architecture, having a low-level and fundamental distinction between two types of data leads to many problems. It’s better (IMO) to have a completely uniform storage architecture. Applications can use it in any way they like, including to store what to them (and to their users) is metadata. In fact that’s one of the major initial goals of FluidDB – to be a metadata engine for everything. So that’s how important we think metadata is! The way to support metadata on anything is to have an underlying architecture that’s flexible enough to allow that to happen – without someone setting the thing up with an a priori determination of what’s meta- and what’s not. True support for metadata is too important for that – to do it properly you need the architecture to be neutral.

I hope that’s clearer & sorry for any confusion!

Terry

leonvanoosterom

Terry,

You certainly made your point much clearer, thank you.

I see this happening in many, many discussions where IT people explain something to archivists. If both worlds agreed that they speak different languages, life would be much easier and more projects would turn into success stories.

Nice & academic. Not realistic in the real world, however. Metadata is essential for practical reasons. I do think that a layer of abstraction such as a file system is nice. Sure, you could argue that the “truth is in the file” and why bother to add complexity. But really, are you traveling with your car mechanic, or do you rely on the aggregated information the dashboard presents you? Certainly this discussion has different facets – I look at metadata that lives outside of a database. Particularly file-based metadata: it’s essential for the survival of modern ECM and DAM systems.

Thanks for commenting. Sorry this appeared so academic – I can assure you it's not though (for me), as we've built FluidDB on the above principles and it's a released product.

But more importantly, what I meant to convey seems to have not been clear. I agree 100% that metadata is essential in practice. Do the comments I made just above help to make that any clearer? FluidDB is designed (among other things) to support arbitrary metadata – because metadata is so vital. It's just that, to do that properly at an architectural level, I think it's important to have no distinction. But at higher levels it's essential, as you say.

I think the whole metadata/data deal is about who gets to say what. Data is created by somebody, metadata is added by somebody else. Librarians add metadata to books. The OS adds data to files. But of course it's all just data; and since you can also tag metadata you will end up with an infinite level hierarchy that will bust the Universe such as we know it.

A street name is metadata for a specific location where numerous houses form a logical unit, followed by the second tag: the house number, which is also metadata. Just because the real world does not know what we mean by “Metadata” does not mean the real world can live without it.

I agree. It's all just information. I'm not saying we can live without it; that would (at least in my mind) be like saying that we can live without information. I'm just saying – as in the title – that the *distinction* between the two is artificial. Normal humans don't need or want or understand such a distinction. Computational systems are very often built on that distinction, though. I think that's a mistake.

Thanks a lot for taking the trouble to comment. I hope the above makes it a little clearer.

Metadata vs data is a good thing to discuss, because there has to be an assumption that drawing a distinction between metadata and data is useful and perhaps even necessary, and I think that assumption is what these systems have been built on.