How To Build Scalable Database Architectures

I’ve found lately that munching on carrots with French dressing is more satisfying than broccoli. Maybe it’s the tang-and-crunch combination. In any case, I was crunching away yesterday while thinking about how to answer a question one of our newer start-up clients asked me.

No one has ever come out and formally asked me for a document that states “Best Practices to Scale Application X”. It is an unusual demand, since it’s something many of us at Pythian have implemented, but it’s been more of an ad hoc, iterative process — and rightly so, since architectures must be so organic, and so tailored to the application. What’s more, no one has ever brought us on board so early in the game that we have a hand in actually — gasp! — doing the design and data-model from the get-go. Woo hoo!

Now, a little background. I have built and maintained a few systems. Some of them even supported over 100k concurrent users. These databases didn’t run RAC either (although I do support two very high profile RAC environments now). So, having been in the trenches and knowing what it takes to make a DB move, I got to thinking about some of the basic fundamentals. There are always rules of thumb, right? This is what you need to know to start with building a scalable high-performance system based on stuff that I’ve seen. Obviously, this assumes a database-centric app. Let’s start with the first ten principles.

1. Bind your variables.

I was asked to audit a client application running on a five-node cluster. They had five nodes because the application was very CPU-intensive (which was good, because the reason to “go RAC” should be either to achieve redundancy, or to scale-out your CPU power).

Our audit uncovered huge amounts of waste. All the queries were unbound. Working on their app to bind all the variables (maybe two weeks of work), we were able to get us down to a single non-RAC database. How much was lazy coding costing this client?

What’s worse, if you use SGA_TARGET, not binding will kill your performance more than it’s already being killed by the hard parses. Your SGA gets bloated because all your SQL isn’t getting re-used, and as a result, your DB cache shrinks. It’s adding insult to injury.

2. Don’t run reports off your live tables. EVER.

This is a big one for many start-ups. I see it all the time. The have nightly reports that run and grab stuff like, “Here’s a breakdown of the membership on my site by geographic region,” or something. “Great, I’ll just run select count(*), country from member group by country;.” This kind of thing works great when you have 100 users. Now that you have 10 million, it doesn’t work so great. This is a full tablescan (usually) and it holds read consistency. This type of query will bring your site down.

Get a second server, or another DB on the same hardware, push the data there, and run your reports off it. You can run as many massive queries as you want there, but don’t touch the OLTP part of the app with them. If you need to get the results back to the live site, push them there using PL/SQL or whatever method you fancy.

Most people don’t care to see their last login date that happened three minutes ago. If you have 100k concurrent users and you’re updating columns like this (that are actually useful to see the stickiness of your site month to month) don’t do it live on login. Make it asynchronous using batch jobs.

4. If your site is read-heavy vs. write-heavy, design it appropriately with this in mind.

Do your best to avoid mixing the two types of operations. You can write on a master and read from remote sites. Oracle Advanced Replication is great for this.

5. Use connection pooling.

I/O kills a system. If you have 100k people all logging on and hitting your systems, you can’t expect to spawn a new connection for each of them. In a nutshell, each time a connection is created, Oracle has to stat all the datafiles for that spawned process. This I/O is huge, and doing it 100k times will kill you. Find a way to connection pool.

6. Write smart SQL.

Look at your explain plan and the resources your query is using. The foundation of all query performance is the buffer_get. It doesn’t matter whether it’s a disk buffer_get or a memory buffer_get. If you get your query to request the fewest buffer_gets possible, the overall performance impact will be beneficial. It all starts with the buffer_get.

Make sure you’re indexed properly, don’t use bitmaps on volatile data, use compressed indexes where appropriate, etc. Know the difference between a hash join and nested loop. This one is HUGE. You know your data better than anyone. You need to know how much data you’re grabbing from individual tables, and this should dictate what type of table access to use. Hash joins are extremely useful for very large queries.

7. Use PL/SQL.

The database is hugely powerful. Let it do some of the business logic using PL/SQL. It’s built for it.

8. Set HugePages.

9. Don’t run batch processes off live tables.

When you have tables that are actively being written into, but which you also need to process to get the results out, don’t open a cursor, process it, and then commit three minutes later. You’re holding the application up for those minutes. Create a view on the table. When you need to process, flip the view to another identical table so that the inserts can keep coming in. You’re now free to process the original table at your leisure.

10. Get the data-model right (or at least, not blatantly bad).

This is the guts of your app. Avoid circular references. Unfortunately, many constraints may need to be enforced in the application. Don’t be cheap with your columns. Don’t try to make one column do too many things. Don’t over-normalize. You get speed from having redundancy because you can avoid joins. Index properly and remember in which direction your relationships go. (I think I’ll need to make a blog post — or a few — about just this topic because it’s so important.)

I realize I’m generalizing a little. Okay, a lot. I’ll post again with a little more detail on some of these with examples of good/not-so-good code. In the meantime, I’d love to hear about tricks you have used. I myself have come across so many strange things, I’ll see if I can dig up a few horrors.