One of the principal points of contention within the cloud storage space is what to do with metadata (literally: “data about data”). There are several schools of thought regarding how metadata should be presented, protected and optimized, but for the sake of this post, I’ll just tackle two of them. These two approaches are: (a) metadata appended directly to the objects that it represents and (b) metadata separated from the object and stored in a database or other type of referenceable system object. Let’s unpack some basic advantages/disadvantages of each approach.

Metadata wrapped directly around the object

Image by Matthew Bietz via Flickr

I’m going to refer to this first world view as the “bacon-wrapped scallops” worldview. Objects (the scallop) are wrapped in their respective metadata references (the bacon). The benefit here is that the metadata becomes an intrinsic part of the object that’s being referenced. Replication policies, etc. can be enacted directly on an object without much cause for concern. Additionally, any loss of metadata reference affects a single object only, not the whole system, thus containing chain failures across a filesystem.

Issues with this approach are fundamentally tied to portability, corruption and performance. Portability is a double-edged sword. It’s great when you want to make sure that the object can exist in multiple locations, but since the metadata is bound to an object, locality of reference can be skewed. In other words, a separate metadata database makes it simple to update records for replication and locality, whereas bound metadata requires updates to EACH replica object. This also ties into performance: the update process can take time across that many objects, as it’s a hunt operation within a matrix of (potentially) billions of files. Corruption, finally, can wreak havoc in a wrapped model because each object’s metadata can be corrupted and, when synced to replicas, can pass that corruption on.
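To make that update cost concrete, here’s a minimal sketch, assuming a hypothetical `WrappedObject` type whose metadata travels with its payload. The class, the function name and the replica count are all made up for illustration and don’t reflect any particular platform’s API. The point is simply that one metadata change turns into one write per replica.

```python
from dataclasses import dataclass, field


@dataclass
class WrappedObject:
    """Hypothetical object whose metadata is stored alongside its payload."""
    uid: str
    payload: bytes
    metadata: dict = field(default_factory=dict)


def update_wrapped(replicas, key, value):
    """Apply one metadata change to every replica; return the number of writes."""
    writes = 0
    for replica in replicas:
        replica.metadata[key] = value
        writes += 1
    return writes


# Three replicas of the same object (the replica count is an assumption).
replicas = [
    WrappedObject("obj-1", b"scallop", {"retention": "30d"})
    for _ in range(3)
]

writes = update_wrapped(replicas, "retention", "90d")
print(writes)  # one write per replica
```

With billions of objects and several replicas each, that per-replica write is where the hunt operation described above starts to hurt.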

If wrapping metadata around the object is the “bacon-wrapped scallops” approach, then the concept of keeping metadata separate from the object is, well, “Scallops with Creamed Corn, Asparagus, Pearl Onions, and Coffee Cocoa Chile Butter” (don’t laugh, you can find this recipe here). Obviously, this is a more complicated recipe than just “bacon-wrapped scallops,” but it does prove a certain point: separating metadata can have the obvious benefit of “improving” the base characterization of the underlying object while allowing for methods of portability (exemplified by the recipe above) and for customization that exceed those of meta-wrapped objects. Advantages include the ability to provide UID reference points in a separate database that is subject to its own protection schemas (either by replication or otherwise), performance (by design, meta references need only be updated in the database, not in the object(s) present), and corruption prevention (object corruption occurs independently of meta and vice versa).
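Here’s a rough sketch of the separated model, assuming a hypothetical `MetadataCatalog` class as a stand-in for the metadata database and a plain dictionary as a stand-in for the object store. None of these names come from a real product. It shows the two advantages above: a metadata change is a single record write keyed by UID, and corrupting a replica’s payload leaves the metadata record alone.

```python
class MetadataCatalog:
    """Hypothetical stand-in for a metadata database keyed by object UID."""

    def __init__(self):
        self._records = {}

    def put(self, uid, **fields):
        """Create or update the record for uid; return the number of writes."""
        self._records.setdefault(uid, {}).update(fields)
        return 1  # one record write, regardless of replica count

    def get(self, uid):
        """Return a copy of the record for uid (empty if unknown)."""
        return dict(self._records.get(uid, {}))


# Replicas hold payload only; node names are illustrative assumptions.
object_store = {
    "node-a": {"obj-1": b"scallop"},
    "node-b": {"obj-1": b"scallop"},
    "node-c": {"obj-1": b"scallop"},
}

catalog = MetadataCatalog()
catalog.put("obj-1", retention="30d", locations=["node-a", "node-b", "node-c"])

# Changing a policy touches the catalog once, not each replica.
catalog_writes = catalog.put("obj-1", retention="90d")

# Corrupting one replica's payload does not touch the catalog record.
object_store["node-b"]["obj-1"] = b"\x00corrupt"
record = catalog.get("obj-1")
print(catalog_writes, record["retention"])
```

The flip side is that the catalog itself now needs protecting, since losing it affects every object it references.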

Disadvantages are still present, as you now have to design for one or more separate metadata databases to equalize performance across nodes (as noted in my master/slave node topology here). Additionally, a protection mechanism for these databases could have performance implications (especially in synchronous, off-box schemas) as a vast number of records (potentially, again, billions) would have to be appended. Another potential double-edged sword is the need to have some level of metadata database replication in place to other nodes within the cloud storage platform in order to enforce meta-driven policies. Since the reference model is based on object UID in the database, these databases have to be persistent throughout the entire platform, potentially driving up storage capacity utilization.
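For a feel of the capacity side, here’s a back-of-the-envelope sketch. The record size and node count are assumptions picked purely for illustration, not measurements from any platform, and the function name is hypothetical. It just multiplies out what a full copy of the metadata database on every node would cost.

```python
def catalog_capacity_bytes(num_objects, bytes_per_record, num_nodes):
    """Total bytes consumed if every node keeps a full copy of the catalog."""
    return num_objects * bytes_per_record * num_nodes


# Assumed inputs: one billion objects, 512-byte records, four nodes.
total = catalog_capacity_bytes(
    num_objects=1_000_000_000,
    bytes_per_record=512,
    num_nodes=4,
)
print(total / 10**12, "TB")
```

Under those assumptions the metadata alone comes to roughly 2 TB across the platform, before any indexes or protection copies are counted.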

Hopefully, this makes sense and I’d appreciate any feedback that you might have!
