Database Administrators Stack Exchange is a question and answer site for database professionals who wish to improve their database skills and learn from others in the community.

I'm responsible for writing an app to juggle a huge amount of client and business intelligence information. This information involves records for various businesses and metrics like revenue, their web platform, etc. For reporting purposes, it has been decided that the data now must be time-stamped so that we may construct reports like "Foo Inc.'s revenue increased by 200% when they switched to digital television in 2010."

Or something to that effect. We don't really deal with clients that are that far behind the times.

As it stands, the data is stored in MongoDB and its schema (or lack thereof) exactly mirrors how it's displayed. A document has a name, maybe embedded documents for address information or operations-related stats, and so on. I somehow need to add the concept of time-stamping or versioning to this. The most obvious way to model it would be like so:

key: [{ value: "bar", time: "2012" }, { value: "baz", time: "2011" }]

Where every property is timestamped. Obviously, this breaks the schema and adds maintenance issues. I've also toyed with using conventions (naming property keys like "Foo (2012-09-31)") and with keeping copies of entire collections with a timestamp.

This is obviously a big feature to implement, so I'd like to get the approach right: what is the most effective and performance-friendly way to add timestamps to data in MongoDB?

Edit:
Yes, I'm looking at versioning, although in all likelihood new data sets will come in large batches, and not very often (for instance, every quarter). That is why I was considering using a separate collection for each set and timestamping each collection: the data is large but infrequent. I'd love to know if there's a better, more maintainable way.

2 Answers

Assuming you are using ObjectIDs for your _id field, you already have an insert-based timestamp; you can even extract it directly from the shell if you wish. Of course, you may want to leave this as-is to permanently store the original insertion timestamp, but adding another would be relatively trivial. Because most drivers can generate their own ObjectIDs, and the spec is published, you can construct your own with a little research, in the language of your choice.
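As a sketch of what the published spec makes possible (the function name here is illustrative, not part of any driver), the insertion time can be recovered from an ObjectID because its first four bytes hold seconds since the Unix epoch:

```python
from datetime import datetime, timezone

def objectid_timestamp(oid_hex: str) -> datetime:
    """Extract the embedded creation time from a 24-character ObjectID
    hex string. Per the BSON spec, the first 4 bytes (8 hex characters)
    are a big-endian count of seconds since the Unix epoch."""
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# Example: the leading bytes 0x507f1f77 decode to 2012-10-17 (UTC).
print(objectid_timestamp("507f1f77bcf86cd799439011"))
```

The mongo shell exposes the same information via `ObjectId.getTimestamp()`, so you only need to do this by hand when working outside a driver.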

Since ObjectIDs are the default _id value, they are also a good choice for indexing, are well supported in the drivers, etc.

Alternatively, in terms of implementing pure timestamps in MongoDB, there is also a BSON Timestamp data type.

Storing this (or the aforementioned ObjectID) in a standard field (say "ts" for example) across documents/values would add overhead to a field/doc as you mention, but you should be able to standardize it easily (field.ts would always have a timestamp value) as well as predict the overhead for each field with a timestamp (8 bytes for a timestamp, or 12 bytes for an ObjectID - for more info see bsonspec.org).

Given that you have two native, known datatypes that provide timestamp functionality, they would be my recommendation as a way forward here.

In terms of adding them to existing data, you can choose to "lazily" add them whenever the data is next touched or issue a batch update, depending on your needs - the benefit of a flexible schema.
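The lazy-versus-batch choice above can be sketched in Python on plain dicts standing in for documents (the field name "ts" and both function names are illustrative assumptions, not MongoDB API):

```python
from datetime import datetime, timezone

def touch(doc: dict) -> dict:
    """'Lazy' strategy: whenever a document is next read or written,
    backfill a 'ts' field if it does not have one yet."""
    doc.setdefault("ts", datetime.now(timezone.utc))
    return doc

def batch_backfill(collection: list) -> int:
    """Batch alternative: stamp every document missing 'ts' in one
    pass and return how many were updated."""
    updated = 0
    for doc in collection:
        if "ts" not in doc:
            doc["ts"] = datetime.now(timezone.utc)
            updated += 1
    return updated

docs = [{"name": "Foo Inc."},
        {"name": "Bar LLC", "ts": datetime(2011, 1, 1, tzinfo=timezone.utc)}]
print(batch_backfill(docs))  # only the document without a 'ts' is stamped
```

In MongoDB itself, the batch version would be a single multi-document update with a `$set` on documents where the field `$exists: false`.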

Your recommendations are sound, but the new schema design depends on how they want to query it, and whether they want to keep track of more than one version of the data. I'm still not sure I understand the use case for storing dates along with each field. How would you make use of that data? "find all records where field_a was updated after monday"? If you have field_a: {ts: ..., val: ...}, what if you want to store another field_a with an older date (which is what I think he wants to do)?
– Wes Freeman, Aug 14 '12 at 16:25

Ah, well, I wouldn't necessarily implement that myself from scratch, but I was focusing on the timestamping element, which is the subject of the question. There are options for versioning, such as the one mentioned here: stackoverflow.com/questions/9042427/…
– Adam C, Aug 14 '12 at 16:37

Yeah, I do need to version it (although I don't expect the data to change every few days; more likely a large batch update with many records every few months). I'll add that to my question, but that's why I was looking at completely separate collections with a timestamp at the collection level: I expect a lot of data, but it won't change often.
– Art, Aug 14 '12 at 17:57

Copy each record when it is modified, with a new timestamp for the whole record. You can add an "active" flag to your queries so that you can easily find the latest record, or put inactive records in a different collection (if you don't need to query them alongside the active ones). This is probably how I would do it, but it does carry more overhead than an in-document history.
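The copy-on-modify approach can be sketched in Python with dicts standing in for documents (the "active" and "ts" field names follow the answer; the helper itself is illustrative):

```python
from datetime import datetime, timezone

def modify(history: list, changes: dict) -> list:
    """Copy-on-modify versioning: deactivate the currently active record,
    then append a full copy with the changes applied, a fresh timestamp,
    and the active flag set."""
    latest = next(d for d in history if d["active"])
    latest["active"] = False
    new = {**latest, **changes,
           "active": True,
           "ts": datetime.now(timezone.utc)}
    return history + [new]

records = [{"name": "Foo Inc.", "revenue": 100, "active": True,
            "ts": datetime(2011, 1, 1, tzinfo=timezone.utc)}]
records = modify(records, {"revenue": 300})

# The latest state is the single record flagged active.
print([r["revenue"] for r in records if r["active"]])  # [300]
```

Queries for "current state" then just filter on the flag, while the full history stays queryable by timestamp.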

Alternatively, make an array of values and timestamps per field (this is basically what you mentioned in your original post; restating it to mention some pros and cons), like:

field_a: [{ value: "bar", ts: "2012" }, { value: "baz", ts: "2011" }]

This allows you to keep the history within the document. The risk here is that you'll create very large documents if you have many changes, but you can mitigate that by keeping only the last 50 changes or some limit, depending on what you need. This schema does make it slightly harder to query for particular values in fields, but depending on your queries it might not matter.
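The capped in-document history can be sketched as follows (the cap of 50 mirrors the example above; field and function names are illustrative, and in MongoDB itself a `$push` with the `$slice` modifier achieves the same bound):

```python
MAX_HISTORY = 50  # illustrative cap; tune to your document-size budget

def set_field(doc: dict, field: str, value, ts: str) -> None:
    """Record a new value for `field`, keeping a bounded in-document
    history with the newest entry first."""
    history = doc.setdefault(field, [])
    history.insert(0, {"value": value, "ts": ts})
    del history[MAX_HISTORY:]  # drop anything beyond the cap

doc = {}
set_field(doc, "revenue", 100, "2011")
set_field(doc, "revenue", 300, "2012")
print(doc["revenue"][0])  # newest value first: {'value': 300, 'ts': '2012'}
```

Keeping the newest entry first means "current value" reads stay cheap even as the history grows toward the cap.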

+1 I was thinking of something like that, too.
– BryceAtNetwork23, Aug 14 '12 at 17:53

My queries will involve very complex MapReduce and reporting algorithms, so keeping a list of values and timestamps will be a nightmare to maintain. I'm trying to avoid this unless there's no better way.
– Art, Aug 14 '12 at 17:55