Scalable Data Models

Most applications involve storing and retrieving some form of data. It can be posts on a blog, financial information, status updates of friends, or any other large number of diverse topics. Some data is fairly simple and straight forward. A blog for instance is not a very complicated structure to model. There’s a title, the post body, an author, a creation date, maybe even a few fancy things like tags or categories. Overall though it isn’t that complicated, and the interaction between that piece of data and other data isn’t overly complex.

However, some data is complicated, and the interaction between that data and other pieces of data can get quite complicated. That complication needs to be pondered quite heavily when creating your data model. You might think you know all of the data fields of each piece of functionality and write up a quick data model based on the assumptions of what you’re storing. However, in large projects where the usage is high, choosing the correct data model can often be more about reading the data than writing.

Because let’s face it, most data is read, not written. Every time you write some data into your data store, that same piece of data is probably read multiple times by different users or used multiple times in different reports. Writing data is often infrequent compared to how often it is read. In highly scalable designs, you should account for the unbalanced nature of your reads to your writes. This might involve spending a little more time to make your writes by computing additional metadata in advance so that each read will already have that metadata rather than computing it at read time.

One structure you can use for increased read times and reduced computation times is a summary table. A summary table is basically a pre-computed set of data based on your real data table that can be pulled from during read requests rather than deriving the results from your real data on each request. For example, perhaps you have a one to many relationship between Foo and Bar. A single Foo can contain zero to many Bars. Each Bar has a Baz property that can be set to either 1, 2 or 3. Now imagine you want to list the number of Foo items that have Bars with a Baz of 2, and you want to display those results like so:

To generate that table your program has to go through 3 for loops. Ok you can probably remove the last for loop if you put the print statement in the inner loop above it, however most of the time you’re going to be separating generating the data from displaying the data into different functions so the second iteration will be required. This is an O(n²) operation due to the nested loop.

Imagine if you have 200 Foo objects, each of which has on average 10 million Bar objects. Just storing the objects as tuples in your data store and compiling the report on each request is going to bog down your processing machine very quickly, especially if you receive loads ranging in the hundreds per second. However, if you pre-determine this information during your write, you can easily retrieve the result at any time with a very small processor cost.

Now instead of selecting a count(), joining on Bar, and grouping by Foo, you can merely select from FooBaz with a possible where clause.

This is a rather contrived example that you might not see as great of a speed increase as in the real world, however if you imagine a much more complex data model with many nested objects that would require multiple joins to generate the proper data; or even worse, multiple joins followed by application level processing followed by additional selects followed by more processing; you will see where pre-calculating results upon writes can be a large time saver.

Sometimes it isn’t possible to update a result simply upon a write however. One is forced to do a full set of computations in order to get a result based on all of the data available after the write has happened. This does not mean that you can’t use summary tables with pre-computed results however. You might need to run a separate task to update the summary table on a scheduled basis with a cron job, allowing reports requested to be slightly lagging real-time data. If this is a trade-off that you can afford, scheduled report generation can make impossible request-time level reports turn into possible time-delayed reports.