It's a girl

One of the things that we are planning for Raven 3.0 is the introduction of additional options. In addition to RavenDB, we will also have RavenFS, a replicated file system with an eye toward very large files. But that isn’t what I want to talk about today. Today I would like to talk about something that is currently just in my head. I don’t even have a proper name for it yet.

Here is the deal: RavenDB is very good for data that you care about individually, such as orders, customers, etc. You track, modify and work with each document independently. If you are writing a lot of data that isn’t really relevant on its own, only as an aggregate, that is probably not a good use case for RavenDB.

Examples of such things include logs, click streams, event tracking, etc. The trivial example would be any reality show, where you have a lot of users sending messages to vote for a particular candidate, and you don’t really care about the individual data points, only the aggregate. Another example would be tracking how many items were sold in a particular period, broken down by region.

Sounds like what you are describing is a Raven event store (as in the store for event sourcing patterns such as CQRS/ES), which I think would be a great idea. Having a Raven event store that could project to a RavenDB database for the domain model / read side, using Raven's own publish/subscribe model for consistency, sounds really interesting.

This is really exciting! It's something I missed when using RavenDB, and I would use it right away to do analytics.
To do these aggregations or queries over large datasets, in the past I've been importing data into column databases or running Rhino-ETL jobs to aggregate it, which is very tedious.
I could actually see a use for drilling down to see which data points an aggregate is built from.

This is a great idea: RavenES (Event Store? RavenStream?), where you can write and read streams of data related to one Id (ContextId? StreamId? E.g. a log file, an Aggregate Root, GPS coordinates, etc.) and aggregate the values in map/reduce. Each item related to an Id has a Revision/Sequence, and it is a read-only, forward-only stream you can access. You could also access substreams (let's say log entries for a specific day, or an aggregate root's events up to a specific revision), but always in order.
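A minimal sketch of the stream shape described above, in Python purely for illustration. None of these names exist in RavenDB; `EventStream`, `append` and `read` are invented here to show the revisioned, read-only, forward-only access pattern:

```python
class EventStream:
    """Append-only stream of items for a single Id; reads are forward-only."""

    def __init__(self, stream_id):
        self.stream_id = stream_id
        self._items = []  # (revision, payload) pairs, revision is 1-based

    def append(self, payload):
        # Each new item gets the next Revision/Sequence number.
        revision = len(self._items) + 1
        self._items.append((revision, payload))
        return revision

    def read(self, from_revision=1, to_revision=None):
        """Yield items in revision order; a substream is just a revision range."""
        for revision, payload in self._items:
            if revision < from_revision:
                continue
            if to_revision is not None and revision > to_revision:
                break
            yield revision, payload


stream = EventStream("logs/2013-05-01")
for line in ["started", "warning", "stopped"]:
    stream.append(line)

# Replay the stream only up to a specific revision, always in order.
first_two = [payload for _, payload in stream.read(to_revision=2)]
```

A substream (a day's log entries, or an aggregate root's events up to a revision) is then just a `read` over a revision range, never out of order.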

What would be cool is if you could easily do an IEnumerable.Aggregate on a stream and it would run server side (for example, rebuilding an Aggregate Root from an event stream), or even better, run an aggregation and write the result to RavenDB as a document, something like CreateDocumentFromStream? For logs it would be building a stats document, for GPS locations maybe an itinerary, etc.
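The idea can be sketched with an ordinary fold. This Python sketch stands in for the hypothetical server-side Aggregate / CreateDocumentFromStream; the names and the stats-document shape are invented for illustration, not an actual API:

```python
from functools import reduce

# A stream of log entries (the kind of data you only care about in aggregate).
log_stream = [
    {"level": "INFO"},
    {"level": "ERROR"},
    {"level": "INFO"},
]

def fold(stats, entry):
    # Accumulator step: count entries per log level.
    stats[entry["level"]] = stats.get(entry["level"], 0) + 1
    return stats

# The equivalent of IEnumerable.Aggregate over the stream; the result would
# then be written back to RavenDB as a plain document (the "stats document").
stats_document = reduce(fold, log_stream, {})
```

The same fold with a different accumulator rebuilds an aggregate root from its events, or turns GPS points into an itinerary.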

I think this is a fantastic idea! This use case is exactly why we ended up not using RavenDB in our application. We need to log lots of information quickly, and then perform off-line ad-hoc queries against that data for statistics regarding production runs.

I like the idea, but inevitably people are going to be curious as to how they got a certain result. This means they'll want to dive into smaller subsections of the overall stream. The smallest subset would obviously be one document / item.

"Ad-Hoc" might be a little too liberal of a term. We have well defined "types" of statistics that we need extracted, but the time-date range is what can shift (i.e. I need a report for last month, last week, last shift, etc)

I guess the better question is: what will this product allow you to do?

Will it let you see the evolution toward the final result? You could do this if you had another mechanism for snapshots, based on a frequency set in the map/reduce definition. This would give the developer the ability to set up some form of historical context for their data.

E.g. on Monday we saw that we were up 20% from Tuesday. (Graphs.)

Will it let you see only the final result?

You could do this if you implemented it with snapshots, or without. Implementing it without snapshots would mean you would only ever know the final result.

Time is the context here, and you can either choose to say all results are in the present or embrace time in the architecture. You could do snapshots for the user, or let the user query and save snapshots into another system (RavenDB?) based on their own approach: scripting, the C# client, Ruby, etc.
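The snapshot idea from the comments above can be sketched in a few lines. This is an illustrative Python sketch, not anything RavenDB actually does; the snapshot frequency stands in for the "frequency set in the map/reduce definition":

```python
def aggregate_with_snapshots(votes, every):
    """Running total, capturing a snapshot after every `every` items.

    With snapshots you can show the evolution over time (graphs);
    without them, only the final total survives.
    """
    total = 0
    snapshots = []
    for i, vote in enumerate(votes, start=1):
        total += vote
        if i % every == 0:
            snapshots.append(total)
    return total, snapshots


# Six votes for a candidate, snapshotting the aggregate every two votes.
total, snapshots = aggregate_with_snapshots([1, 1, 1, 1, 1, 1], every=2)
```

The intermediate `snapshots` are exactly the historical context the "up 20% from Tuesday" graph needs; dropping them leaves only `total`.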

This system would be perfect for the MarkedUp team (markedup.com). Maybe you should reach out to them and get their thoughts.

This reminds me of my sensors sample: https://github.com/mj1856/RavenSensors. I'll echo the others by saying that time is of the essence. One thing Raven isn't good at is querying data over an arbitrary time range; you have to predetermine the granularity of the buckets. If you can improve on this in any way, it would be a big deal.
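The granularity problem the comment describes can be shown with a small Python sketch (illustrative only; `bucket_counts` is an invented helper, not part of RavenDB): once events are pre-aggregated into fixed buckets, an arbitrary query range can only be answered at bucket resolution.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def bucket_counts(events, granularity):
    """Pre-aggregate timestamps into fixed-size buckets.

    The granularity must be chosen up front, before the data is written,
    which is exactly the limitation being discussed.
    """
    counts = {}
    for e in events:
        # Truncate each timestamp down to the start of its bucket.
        offset = (e - EPOCH).total_seconds() % granularity.total_seconds()
        bucket = e - timedelta(seconds=offset)
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts

events = [datetime(2013, 5, 1, h, m) for h, m in
          [(9, 5), (9, 40), (10, 10), (11, 55), (12, 30)]]
hourly = bucket_counts(events, timedelta(hours=1))

# A query for 09:00-12:00 can be served from the hourly buckets, but a query
# for, say, 09:17-11:42 cannot: the range has to snap to bucket boundaries.
start, end = datetime(2013, 5, 1, 9), datetime(2013, 5, 1, 12)
in_range = sum(c for b, c in hourly.items() if start <= b < end)
```

Anything finer than the predetermined bucket size is lost at write time, which is why arbitrary-range queries are the hard case.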