
I receive statistical data every 30 seconds that I want to store in my database so that I can analyze later. For example, every 30 seconds I could receive the number of oranges sold at a store in the last 30 seconds. Later, I want to retrieve this data from the database and use it to generate charts showing information like the number of oranges sold for a store over the last 24 hours, last x weeks, last x months, and last x years.

If I just dump everything into one table, it seems like it would grow very quickly, especially if you have lots of data sources (stores). My thought was that the data could be averaged so that it was less granular over time. That is, keep detailed records over the last couple of hours (entries in the DB for every 30 seconds), then perhaps averages of 15 minute time spans for the last few weeks, then keep averages of each day for the last few months, etc.

This way you have a large number of recent records, a good number of relatively older records, and a few old records. However all the data is still there, it's just summed up and averaged into one entry over days or months instead of 30 seconds.

Does this approach make sense? Is there a better approach? How would I organize this into a table? Would it be multiple tables? Is SQL (probably MySQL) a good fit, or would something else work better? Any thoughts on this would be greatly appreciated!

3 Answers

You definitely want to keep all the data you collect, as it can be extremely useful for longer-term detailed trending, even if it takes a while to trawl through it all. Also, when summarising the data into less granular tables, do not AVERAGE() the data you have collected - always SUM() and COUNT() the rows you are summarising. This allows you to summarise that data further up if needed, and you can calculate the averages at whatever level you desire.

Remember...

You cannot average averages...
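For instance, here is a minimal sketch (the table and column names are illustrative assumptions, not from the answer) that assumes an hour_data table storing total_sold = SUM() and row_count = COUNT() per store per hour. A correct daily average comes from the sums and counts, not from averaging the hourly averages:

    -- Daily average derived from hourly sums and counts.
    SELECT store_id,
           DATE(hour_start)                  AS day,
           SUM(total_sold) / SUM(row_count)  AS avg_sold_per_sample
    FROM   hour_data
    GROUP  BY store_id, DATE(hour_start);

    -- By contrast, AVG(total_sold / row_count) weights every hour equally
    -- and gives a different answer whenever hours hold different numbers
    -- of samples.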

In terms of the data structures, I would take the following approach (a sketch of possible table definitions follows the list):

detailed_data - a table to hold the most granular level of data you have

minute_data - data summarised at the minute level

hour_data - data summarised at the hour level

day_data - etc

week_data - etc

month_data - etc
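As a rough MySQL sketch of the two ends of that hierarchy (the column names are assumptions, not part of the answer):

    -- Most granular level: one row per store per 30-second sample.
    CREATE TABLE detailed_data (
        store_id    INT      NOT NULL,
        sampled_at  DATETIME NOT NULL,
        items_sold  INT      NOT NULL,
        PRIMARY KEY (store_id, sampled_at)
    );

    -- Hourly level: sums and counts rather than averages, so the day, week
    -- and month tables can be rolled up from it without averaging averages.
    CREATE TABLE hour_data (
        store_id    INT      NOT NULL,
        hour_start  DATETIME NOT NULL,
        total_sold  BIGINT   NOT NULL,
        row_count   INT      NOT NULL,
        PRIMARY KEY (store_id, hour_start)
    );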

What I would do in your situation is very dependent on how you receive your stat data, but I can see a couple of simple options for this.

Option 1 - Create a Stored Procedure to Store the data

This would be the preferred option, as you can create separate stored procedures for adding, updating and deleting data from the main table, and the stored procedure can then handle the updating and summarisation of all the other data tables.
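As a hedged illustration (not the answerer's actual code), an add_data procedure could insert the raw row and bump the hourly sums and counts in one call, using the tables sketched above; the other summary levels would be handled the same way:

    DELIMITER //
    CREATE PROCEDURE add_data (IN p_store_id   INT,
                               IN p_sampled_at DATETIME,
                               IN p_items_sold INT)
    BEGIN
        -- 1. Store the raw sample.
        INSERT INTO detailed_data (store_id, sampled_at, items_sold)
        VALUES (p_store_id, p_sampled_at, p_items_sold);

        -- 2. Roll it into the hourly summary (repeat for day/week/month).
        INSERT INTO hour_data (store_id, hour_start, total_sold, row_count)
        VALUES (p_store_id,
                DATE_FORMAT(p_sampled_at, '%Y-%m-%d %H:00:00'),
                p_items_sold,
                1)
        ON DUPLICATE KEY UPDATE
            total_sold = total_sold + p_items_sold,
            row_count  = row_count + 1;
    END //
    DELIMITER ;

A matching delete_data procedure would subtract the row's values from each summary table before removing it from detailed_data.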

Option 2 - Create triggers on the data tables

You can use triggers so that when data is added to the detailed_data table, it automatically summarises itself into the minute_data table, which then triggers an update to the hour_data table, and so on up the chain. The drawback is that when deleting or updating the stat data you may have to create some quite clever triggers - but it is doable.
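A sketch of the first trigger in such a chain, reusing the assumed detailed_data and hour_data definitions above, might look like this:

    DELIMITER //
    CREATE TRIGGER detailed_data_after_insert
    AFTER INSERT ON detailed_data
    FOR EACH ROW
    BEGIN
        -- Fold the new sample into the hourly summary row.
        INSERT INTO hour_data (store_id, hour_start, total_sold, row_count)
        VALUES (NEW.store_id,
                DATE_FORMAT(NEW.sampled_at, '%Y-%m-%d %H:00:00'),
                NEW.items_sold,
                1)
        ON DUPLICATE KEY UPDATE
            total_sold = total_sold + NEW.items_sold,
            row_count  = row_count + 1;
    END //
    DELIMITER ;

Corresponding AFTER UPDATE and AFTER DELETE triggers would have to subtract the old values again, which is where the cleverness comes in.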

+1 for not averaging averages - at least not indiscriminately! You can always reconstitute pre-averaged data into rates expressed in any common time denominator you may need. Once they are expressed in a common time denominator you can then average() to your heart's content.
– Joel Brown Oct 21 '11 at 22:29

Thanks for the answer. How would you recommend triggering a stored procedure if I went that route? With a cron job? The procedure would also be responsible for archiving data that was summarized, correct?
– Venesectrix Oct 21 '11 at 22:51


I personally wouldn't go down the trigger route. I would create a stored procedure called "add_data", for example, and within that one procedure add the data passed in to the main table and then increment the counters and sums in the summary tables based on the existing data for the relevant keys. Then a "delete_data" procedure could fetch the single row from the main table, pull out its values, subtract them from each of the summary tables, and finally delete the row from the main table. Hope that makes sense - it can be tricky to explain, even if it's easy to implement!
– Dave Rix Oct 23 '11 at 16:52

Okay that sounds pretty good. How would you call the stored procedures periodically?
– Venesectrix Oct 25 '11 at 18:42

I wouldn't, I'd actually call the procedure with the data you want to insert into the database, and let the procedure do all the insertions and summations in one go. I would NOT insert the data into the main data table and then run a procedure to archive the data separately.
– Dave Rix Oct 25 '11 at 20:27

I'm frankly always nervous about throwing away details. For that reason I'd try to find a way to keep handy (or at least archive) the most granular data. That way, if your summarization requirements change, you can always go back to the source data and recalculate appropriate summary statistics.

As to your approach of calculating averages over longer time spans as the data ages, this is a fair practice that would meet the needs of many organizations, up to a point.

While it may certainly be true that you don't care about per-30-second periods after a few weeks, let alone one year from now, at what point do you stop increasing the reporting period? Do you stop at days, or do you go to weeks? (Or months? Or quarters?)

Once you get past days, you run into issues with doing year-over-year comparisons, which many organizations rely upon. I have seen some data warehouses where multiple periods are precalculated. Whether you do this or just pick one "longest period", e.g. days, is a matter of how you trade off quick access against redundant data.

Regarding "How would I organize this into a table?" the best way would be to have starting and ending date/times (down to the second) for the period being represented, along with the average count over that period. For convenience, you might also include a partitioning attribute that describes the length of the period, e.g. 'm', 'h', 'd', 'w', 'M', 'q', 'y' or whatever makes sense for your summaries.

So one table for all summary levels, correct? Are there any advantages to maintaining separate tables for the different summary levels?
– Nick Chammas Oct 21 '11 at 0:03

I would keep one table for all summary levels. If you segregate these out then you are going to need to make schema and query changes every time your business requirements change to include (or drop) a new summary level. Since your summary records will have a starting and ending time and possibly a partitioning attribute, you won't gain anything by segregating records with different durations into separate tables.
– Joel Brown Oct 21 '11 at 17:50

Good advice, thanks for contributing. I'll be considering archiving the data instead of deleting it now.
– Venesectrix Oct 25 '11 at 18:43

Don't get too worked up about data volumes. You can mitigate a lot of performance issues by partitioning the table on a time key. Just load the data into the table and make sure all queries against it use the partition key so the optimiser can ignore partitions that it doesn't need. You will need to make an incremental loader, though.
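For example, a range-partitioned detail table in MySQL might be declared roughly like this (table and column names are assumptions; note that MySQL requires the partitioning column to be part of every unique key, which is why it appears in the primary key):

    CREATE TABLE sales_detail (
        store_id    INT      NOT NULL,
        sampled_at  DATETIME NOT NULL,
        items_sold  INT      NOT NULL,
        PRIMARY KEY (store_id, sampled_at)
    )
    PARTITION BY RANGE (TO_DAYS(sampled_at)) (
        PARTITION p2011_10 VALUES LESS THAN (TO_DAYS('2011-11-01')),
        PARTITION p2011_11 VALUES LESS THAN (TO_DAYS('2011-12-01')),
        PARTITION pmax     VALUES LESS THAN MAXVALUE
    );

    -- A query that filters on the partition key lets the optimiser prune
    -- the partitions it does not need.
    SELECT SUM(items_sold)
    FROM   sales_detail
    WHERE  sampled_at >= '2011-11-01'
      AND  sampled_at <  '2011-12-01';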

If you want to analyse trends over time then make up summary tables and refresh them on a periodic basis. This will be less complex than consolidating historic data, and gives you the option of going back and auditing the detail if necessary.
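A sketch of such a refresh, assuming a hypothetical day_summary table and the sales_detail table above (handling of re-runs and late-arriving data omitted):

    -- Load yesterday's figures; run this from cron or a MySQL scheduled event.
    INSERT INTO day_summary (store_id, sale_day, total_sold, sample_count)
    SELECT store_id,
           DATE(sampled_at),
           SUM(items_sold),
           COUNT(*)
    FROM   sales_detail
    WHERE  sampled_at >= CURRENT_DATE - INTERVAL 1 DAY
      AND  sampled_at <  CURRENT_DATE
    GROUP  BY store_id, DATE(sampled_at);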

Additionally, if your source systems are purged on a regular basis (quite common with POS systems due to the data volumes) then you might not have the option of reloading historic data. Warehousing the detailed transactional data also gives you a repository and frees the line-of-business systems from having to maintain historic data.

If you really need to, you can keep data for a finite length of time by clearing out older partitions and archiving them off somewhere else. The partitioned architecture makes this fairly easy to do.
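With the sketched sales_detail table, clearing out an old month is a single statement (the partition's rows could be exported first, for example with mysqldump, if they need to be archived):

    ALTER TABLE sales_detail DROP PARTITION p2011_10;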

An SQL database is a good fit for this because its query facilities are much better than those available in NoSQL databases and (more importantly) because it will play nicely with third-party reporting tools. If you want to do any significant analytical work, you might be better off with PostgreSQL than MySQL.