
I'm currently working with a table that exceeds 12 million rows (about 3 GB when exported with mysqldump), and I'm curious how large a table can realistically get without a serious performance impact. The table is growing by about 100,000-200,000 rows a day.

Should I start thinking about sharding this data across multiple tables or MySQL instances now, before the data gets much larger? The server this is currently running on has 1 GB of RAM (though it will soon be moving to a machine with 3-4 GB).

Has anyone got any tips/recommended reading that would push me in the right direction, or is this something I don't need to worry about yet?

It's the core user data for the web app; this is where the majority of data saved in the app is kept. However, data isn't accessed very frequently after a week or so: it's a Twitter-based service, so content gets old fast.
– tarnfeld Dec 28 '11 at 0:22

2 Answers

A good RDBMS can grow to accommodate extremely large amounts of data. A 3 GB database is very manageable, and as long as you can get a server with enough RAM, most queries will very probably run pretty fast with little effort.

Even when you exceed RAM, indexes, caching, and partitioning let you keep performing well. Very often, applications mostly access a relatively small working set: for instance, 90% of queries might be limited to last month's data, while the other 10% run over older data. "Last month's" data tends to be fairly stable; it grows when you have more users, but otherwise it doesn't grow over time. This "working set" often fits in RAM, gets cached, and you still get great performance.
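To make the partitioning point concrete, here is a minimal sketch of date-range partitioning in MySQL; the statuses table and created_at column are placeholders, not from the question, and note that MySQL requires the partitioning column to be part of every unique key, including the primary key:

    -- Hypothetical status-update table, partitioned by month so the recent
    -- "working set" lives in small, hot partitions.
    CREATE TABLE statuses (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        user_id    INT UNSIGNED    NOT NULL,
        body       VARCHAR(200)    NOT NULL,
        created_at DATETIME        NOT NULL,
        PRIMARY KEY (id, created_at)   -- partition column must be part of the PK
    )
    PARTITION BY RANGE (TO_DAYS(created_at)) (
        PARTITION p2011_11 VALUES LESS THAN (TO_DAYS('2011-12-01')),
        PARTITION p2011_12 VALUES LESS THAN (TO_DAYS('2012-01-01')),
        PARTITION pmax     VALUES LESS THAN MAXVALUE
    );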

Even so, things can slow down again over time. With proper monitoring and analysis, you can locate the queries that are running slowly and take steps to fix them.

This is often simple:

Queries or incorrect code: we often write queries that retrieve information we don't need, or execute several queries when one would suffice and be faster (the typical case is an operation that needs to display n rows of a table and performs n queries when one would be enough; performing aggregates outside the database is also common). This is easily fixed by changing your code; a sketch follows after the next point.

Queries which do not run efficiently: EXPLAIN is your friend here. Often, creating indexes that the query can use is enough (roughly, you'll want to index the columns that appear in the WHERE clause). Sometimes tweaking the query itself will also yield good results.
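A rough sketch of both points (the statuses table and its columns are hypothetical, not from the question):

    -- The n+1 pattern: one round trip per row to display...
    --   SELECT * FROM statuses WHERE id = 1;
    --   SELECT * FROM statuses WHERE id = 2;
    --   ... (n queries)
    -- ...when a single query would do:
    SELECT * FROM statuses WHERE id IN (1, 2, 3, 4, 5);

    -- For a query that runs slowly, look at its execution plan:
    EXPLAIN SELECT * FROM statuses
    WHERE user_id = 42
    ORDER BY created_at DESC
    LIMIT 20;

    -- If EXPLAIN reports a full table scan (type: ALL), an index on the
    -- WHERE / ORDER BY columns usually fixes it:
    ALTER TABLE statuses ADD INDEX idx_user_created (user_id, created_at);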

Another approach which gives good results is throwing hardware at the problem:

Buying more RAM

Getting more and better disks (RAID10 setups get faster if you add more disks, SSDs often have significant benefits, etc.)

It is not common, but sometimes the CPU is your bottleneck; you can get faster processors and more cores/processors.

In some other cases, replication and sharding might be needed. Replication is complicated, but products such as Oracle RAC let you build monster clusters (at a price). Sharding is another option, but it's often one of the most complex to implement: even applications that shard easily require a lot of work to be sharded, and some applications are notoriously hard to shard.

Thanks for your in-depth answer. I need to do more research on MySQL RAM usage and how to set up the server to perform best with my 4 GB of RAM. The largest table (the 12M-row one) is for status updates, so users post those, they get seen for a while, and then they aren't accessed very often after that. Are there any tools that let me profile what MySQL is storing in RAM and what it isn't, so I can tweak things a little more? I'm also using Redis as a front end to MySQL to speed things up.
– tarnfeld Dec 28 '11 at 11:20

I'm also not running very many complex queries, since the bulk of the app revolves around this one table; most are simple SELECT * WHERE id = 'n' or UPDATE x WHERE field = 'b', plus simple inserts. I think managing my RAM a lot better is going to solve lots of problems.
– tarnfeld Dec 28 '11 at 11:22

You rarely have control over caching (IIRC, Oracle offers this with keep/recycle pools), and I don't think you'll need it: if you have proper indexes, caching will work just fine. If you only run queries with exact matches on two columns, an index on each column will likely yield excellent performance. In any case, I suggest you build a test server (a 4 GB box should be pretty cheap), create a load test (this should be easy given your usage patterns), try different indexing and configuration options, and measure the improvements against your load test.
– alex Dec 28 '11 at 17:59

I think that the minute your DB has to hit disk, your performance degrades, so you need to make sure your machine has more RAM than the size of your DB.
Good solutions for you are partitioning or sharding (for sharding, check out http://www.scalebase.com; they do transparent sharding, so it's effortless).
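If it helps, here is a quick sketch (assuming InnoDB tables) for comparing the size of your data against the memory MySQL can dedicate to caching it:

    -- Total data + index size per schema, to compare against available RAM
    SELECT table_schema,
           ROUND(SUM(data_length + index_length) / 1024 / 1024) AS total_mb
    FROM information_schema.tables
    GROUP BY table_schema;

    -- Size of the InnoDB buffer pool, and how many of its pages are in use
    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%';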