
Background:
I’ve created a web application that I would like to be able to scale reasonably well. I know I'm not Google or Twitter, but my app uses a fairly large amount of data for each user and thus has fairly high data requirements. I want to be ready to scale reasonably well without having to re-architect everything later.

I consider myself a software developer, not a database expert. That’s why I am posting here. Hopefully someone with a lot more database expertise can give me advice.

With a relatively large number of users, but nothing like Facebook numbers, I expect to have a DB that looks like this:

One "Big table":

250 million records

20 columns

Approximately 100 GB of data

Has an indexed bigint(20) foreign key

Has an indexed varchar(500) string_id column

Has an int(11) "value" column

4 other tables:

10 million records each

Approximately 2 - 4 GB of data each

each of these tables has 4 - 8 columns

one column is datetime date_created

one column is the varchar(500) string_id column

one or two columns from each of these tables will be selected in a join

One of these tables is used for storing averages -- its schema is bigint(20) id, varchar(20) string_id, datetime date_created, float average_value
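Putting the description above into concrete DDL, a minimal sketch of the big table and the averages table might look like the following. The table names, the `user_id` column name, and the index names are assumptions; only the column types come from the question:

```sql
-- Hypothetical sketch of the schema described above.
CREATE TABLE big_table (
    id         BIGINT(20) NOT NULL AUTO_INCREMENT,
    user_id    BIGINT(20) NOT NULL,       -- the indexed foreign key (name assumed)
    string_id  VARCHAR(500) NOT NULL,     -- the indexed string_id column
    value      INT(11) NOT NULL,
    -- ... roughly 16 more columns, per the question ...
    PRIMARY KEY (id),
    KEY idx_user (user_id),
    KEY idx_string (string_id)
) ENGINE=InnoDB;

-- The averages table, using the schema given in the question.
CREATE TABLE averages (
    id            BIGINT(20) NOT NULL AUTO_INCREMENT,
    string_id     VARCHAR(20) NOT NULL,
    date_created  DATETIME NOT NULL,
    average_value FLOAT NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;
```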

What I want to do -- two relatively expensive queries:

Calculate new average values:

Using a foreign key, select up to several million separate records from the big table.

Calculate a new average, grouping by the string_id column.

Insert results into the averages table.

As currently constructed, this query uses two joins.
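The core of that query might be sketched as follows. This is illustrative only: the real query reportedly involves two joins, and the `user_id` filter column is an assumed name. The sketch shows just the aggregation step:

```sql
-- Hypothetical sketch of query #1: select millions of rows by foreign key,
-- average per string_id, and insert into the averages table.
INSERT INTO averages (string_id, date_created, average_value)
SELECT   b.string_id, NOW(), AVG(b.value)
FROM     big_table b
WHERE    b.user_id = ?        -- may match up to several million rows
GROUP BY b.string_id;
```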

Create de-normalized, read-only records for serving users:

Use a foreign key to select anywhere from 1,000-40,000 records from the big table.

Join each of the other four tables on the string_id column, selecting only the newest record from each.

Insert the results into a de-normalized table.

These records are for use by the front-end to display information to users.

As currently constructed, this query uses four joins.
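A hedged sketch of what that four-join query could look like, with `table_a` standing in for one of the four side tables and the column list purely illustrative (the "newest record" condition is expressed here as a correlated `MAX(date_created)` subquery, which is one common way to write it):

```sql
-- Hypothetical sketch of query #2: join the newest matching row from each
-- side table and write the denormalized result for the front end.
INSERT INTO denormalized_view (string_id, value, a_col /* , ... */)
SELECT b.string_id, b.value, a.some_col /* , columns from the other tables */
FROM   big_table b
JOIN   table_a a
       ON  a.string_id = b.string_id
       AND a.date_created = (SELECT MAX(date_created)
                             FROM table_a
                             WHERE string_id = b.string_id)
-- ... three more joins of the same shape, one per remaining table ...
WHERE  b.user_id = ?;         -- selects 1,000 - 40,000 rows
```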

I plan to run each of these expensive queries on a batch back-end database that will push its results to a real-time front-end DB server which handles requests from users.
These queries will be run at regular intervals. I haven't decided how often. The average query could be done perhaps once per day. The de-normalize query will need to be more frequent -- perhaps every few minutes.

Each of these queries currently runs in a few seconds in MySQL on a very low-end machine with a dataset with 100K records in the “big table.” I am concerned about both my ability to scale and the costs of scaling.

Questions:

Does this approach seem sound? Is there anything obviously wrong with it from a big-picture perspective?

Is an RDBMS the right tool, or should I look at other "big data" solutions like something in the Hadoop family? My inclination is to use an RDBMS because the data is structured and fits nicely into the relational model. At a certain point, though, it is my understanding that I may no longer be able to use an RDBMS. Is that true? When would this switch be needed?

Will it work? Can these queries be run in a reasonable amount of time? I can wait perhaps hours for query #1, but query #2 should finish in minutes.

What should I consider from a hardware perspective? What are my RAM and CPU bottlenecks likely to be? I assume keeping indexes in RAM is important. Is there anything else I should consider?

At some point I will probably have to partition my data and use multiple servers. Does my use case seem like it is already in that category, or will I be able to scale a single machine vertically for a while? Will this work with 10x the data? 100x?

This question came from our site for professional and enthusiast programmers.

This one is tough to answer thoroughly. Maybe you are better off researching MySQL query performance characteristics in general so you know what you can expect. One thing you can always do, of course, is put 20 disks in the server so you can read at 3 GB/s or so. But I think you are after a thorough software-only answer.
– usr, Aug 28 '12 at 22:24

3 Answers

Have you tried piling more data and benchmarking it? 100K rows is inconsequential. Try 250M or 500M like you're expecting you'll need to handle and see where the bottlenecks are.

An RDBMS can do a lot of things if you pay careful attention to its limitations and try to work with the strengths of the system. It is exceptionally good at some things and terrible at others, so you will need to experiment to be sure it's the right fit.

For some batch processing jobs, you really cannot beat flat files, loading the data into RAM, smashing it around using a series of loops and temporary variables, and dumping out the results. MySQL will never, ever be able to match that sort of speed, but if tuned properly and used correctly it can come within an order of magnitude.

What you'll want to do is investigate how your data can be partitioned. Do you have one big set of data with too much in the way of cross-links to be able to split it up, or are there natural places to partition it? If you can partition it you won't have one table with a whole pile of rows, but potentially many significantly smaller ones. Smaller tables, with much smaller indexes, tend to perform better.
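As an illustration of that idea, MySQL's built-in table partitioning can split one big table into many smaller physical pieces transparently. This sketch is hedged: it assumes the foreign key is named `user_id` and that the primary key has been extended to include it, since MySQL requires the partitioning column to be part of every unique key on the table:

```sql
-- Illustrative only: range-partition the big table on the foreign key.
-- Assumes PRIMARY KEY (id, user_id), as MySQL partitioning requires the
-- partitioning column to appear in every unique key.
ALTER TABLE big_table
PARTITION BY RANGE (user_id) (
    PARTITION p0 VALUES LESS THAN (1000000),
    PARTITION p1 VALUES LESS THAN (2000000),
    PARTITION p2 VALUES LESS THAN MAXVALUE
);
```

Queries that filter on `user_id` then only touch the partition that can contain the matching rows, which keeps the scanned index small.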

From a hardware perspective, you'll need to test to see how your platform performs. Sometimes memory is essential. Other times it's disk I/O. It really depends on what you're doing with the data. You'll need to pay close attention to your CPU usage and look for high levels of IO wait to know where the problem lies.

Whenever possible, split your data across multiple systems. You can use MySQL Cluster if you're feeling brave, or simply spin up many independent instances of MySQL where each stores an arbitrary portion of the complete data set using some partitioning scheme that makes sense.

@tadman Thanks for your advice. I realize there is no substitute for actually trying it out. I have not benchmarked it with 250M rows because I first wanted to make sure that there was nothing obviously wrong about my approach. It sounds like there isn't. In addition, getting that much data, and doing it in a way that is somewhat realistic is a challenge I haven't figured out how to solve yet. I do have some potential ways to partition the data. I guess I will next just try ramping up my data and see how it does at different checkpoints -- 1M, 10M, 100M, etc.
– xnickmx, Aug 29 '12 at 15:54

Multiple indexes on the 'fact' table are usually a serious performance problem, especially for INSERTs. Are you having an issue there?

DATETIME is 8 bytes; TIMESTAMP is 4

Explicit FOREIGN KEY CONSTRAINTS are nice, but costly

JOINs may or may not be a performance problem; need to see the SELECT and CREATEs.

100GB is a nice size for a 'big' MySQL database; I suspect it could be made to work without Hadoop, etc. I deal with one such db now -- most UI pages respond in under a second even though the data is quite involved.

Will you be 'purging' data at some point? (This leads to the main use case for PARTITIONing.)

For serving up your front-end data, unless there are gobs and gobs of inserts all the time, you really can't beat using triggers to insert into materialized views that are kept in sync with the back end but optimized to serve the data. Of course, you need to keep joins, etc., to a minimum in these triggers. One strategy I've used is to queue these inserts/updates into an intermediate table and then send them along every minute or so. It's a lot easier to send one record than 4 GB of records. 4 GB of data takes a long time to stream even if you can find the records you are looking for quickly.
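The queue-then-denormalize strategy described above might be sketched like this. All names here are assumptions; the trigger only records *what* changed, and a periodic job drains the queue and refreshes the corresponding materialized rows:

```sql
-- Hypothetical sketch: a queue table plus a trigger that records changed
-- string_ids. A batch job run every minute or so drains the queue.
CREATE TABLE denorm_queue (
    string_id VARCHAR(500) NOT NULL,
    queued_at DATETIME NOT NULL,
    PRIMARY KEY (string_id)   -- de-duplicates repeated changes to one row
);

DELIMITER //
CREATE TRIGGER big_table_ai AFTER INSERT ON big_table
FOR EACH ROW
BEGIN
    -- Cheap bookkeeping only: no joins inside the trigger.
    INSERT INTO denorm_queue (string_id, queued_at)
    VALUES (NEW.string_id, NOW())
    ON DUPLICATE KEY UPDATE queued_at = NOW();
END//
DELIMITER ;
```

The batch job would then run the expensive four-join query only for the string_ids present in `denorm_queue`, rather than over the whole big table.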

I agree with tadman. The best is to profile it with the kind of data you are expecting on the kind of system you are wanting.

As I mentioned in my post, the views depend on a query that uses four joins across tables with tens of millions of records, so I don't really see how a materialized view is going to help.
– xnickmx, Aug 29 '12 at 16:33

Triggers may not be fast enough for this size database. How many INSERTs per second are happening?
– Rick James, Aug 29 '12 at 22:15


@xnickmx If there aren't too many inserts/updates, triggers make it easy and performant to keep denormalized data in sync. If it needs to go faster for inserts/updates, queue them with something like this: blog.shlomoid.com/2008/04/… or bake your own. This way you don't have to join against the existing 100-million-row tables to get the new data: when the trigger fires, you already know the new data right then, and can denormalize it as part of the transaction or queue it for denormalization later.
– wes.stueve, Aug 31 '12 at 16:02

@RickJames Agreed. You must take into account the volume of inserts for this kind of strategy and how fast they must be processed.
– wes.stueve, Aug 31 '12 at 16:09