
When SQL 2005 was in beta, I clapped my hands and jumped for joy the first time I played with partitioning. I’ve done a lot of data warehouse work, and partitioning lets the DBA break down massive tables into smaller, easy-to-manage chunks. We can put heavily accessed parts of the table onto fast storage, and less-accessed data onto slower, cheaper storage. The cool part is that it’s completely transparent to applications (as long as you don’t have to change primary and foreign keys to include the partitioning column): they don’t have to know the table is even partitioned.

We implemented it almost as soon as SQL 2005 came out, and I kept wondering why more people didn’t take advantage of this feature. Sure, it’s not particularly intuitive, and there’s no GUI for it in SQL Server Management Studio, but it does work.

“Can partitioning help us?”

Fast forward to yesterday: I met with Mark Schmulen and David Lyman of NutshellMail.com. It’s a slick solution for those of us who work behind corporate firewalls that prevent us from accessing personal email sites like Gmail and Yahoo, as well as personal networking sites like Facebook and Myspace. When I worked for JPMorgan Chase, I was frustrated that I couldn’t get my personal email because I had a network of DBA buddies who all knew and used my personal email address. We would ask each other for support from time to time, and I liked being able to help out. At JPMC, though, the only way I could do that was to contact all my friends and have them switch to using my corporate email – which I couldn’t check when I was outside of the office, so that was a lose-lose situation for me. NutshellMail gets around that by checking your personal email accounts on a schedule you select, and then sending digest emails to your work address. Makes perfect sense, and I’d use it in a heartbeat – even if they weren’t using SQL Server for storage. (Bonus!)

We talked scalability, and they’d heard partitioning might be a good answer for future growth. I was pleasantly surprised to hear that the word about partitioning is getting out, but this was the first time I said no, partitioning isn’t the right answer for this particular case. Having talked through that, I want to touch on some of the reasons here.

Partitioned tables require Enterprise Edition.

Standard Edition can’t do partitioning, and Enterprise is a few times more expensive than Standard. On a 2-CPU box, we’re talking about a $30k-$50k difference in costs. If we’re only using Enterprise for the partitioning, there’s a better return on investment if we put that $30k-$50k into faster storage instead. This is especially relevant for servers with direct attached storage, even in a RAID 10 config. I wouldn’t consider partitioning unless the server had at least 10-20 drives for the data array alone.
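Before you even start planning, it’s worth a quick check of which edition you’re actually running – a sketch:

```sql
-- Check the edition of this SQL Server instance.
-- Partitioned tables and indexes require Enterprise (or Developer) Edition.
SELECT SERVERPROPERTY('Edition')       AS EditionName,
       SERVERPROPERTY('EngineEdition') AS EngineEdition; -- 3 = Enterprise
```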

Partitioning best practice means using both fast and slow storage.

In the case of a data warehouse, a sales table is loaded every night with the previous day’s sales data. The most recent data is under heavy load because it’s constantly being accessed by both the ETL processes and the daily reports – because users really want to know what happened in the last few days to a month. We have to keep years and years of sales data on hand for curious users who want to check trends, though, and we don’t want to keep all of that data on extremely fast (and extremely expensive) storage. (Yes, partitioning gurus, I know there’s work involved in moving data from one array to another, but sometimes it’s worth it.) If we partitioned by date range, we could keep the new data on fast storage, and shuffle the older data off to slower storage.
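As a minimal sketch of that date-range approach (the table, function, filegroup, and boundary names here are all hypothetical, and the filegroups are assumed to already exist on the fast and slow arrays):

```sql
-- Hypothetical monthly partition function. With RANGE RIGHT, each
-- boundary date belongs to the partition on its right.
CREATE PARTITION FUNCTION pf_SalesByMonth (datetime)
AS RANGE RIGHT FOR VALUES ('2009-01-01', '2009-02-01', '2009-03-01');

-- Three boundaries create four partitions. Map the older ones to the
-- slow, cheap array and the recent ones to the fast, expensive array.
CREATE PARTITION SCHEME ps_SalesByMonth
AS PARTITION pf_SalesByMonth
TO (FG_Slow, FG_Slow, FG_Fast, FG_Fast);

-- Build the sales table on the partition scheme, keyed on the sale date.
CREATE TABLE dbo.Sales
(
    SaleID   bigint IDENTITY(1,1) NOT NULL,
    SaleDate datetime NOT NULL,
    Amount   money NOT NULL,
    CONSTRAINT PK_Sales PRIMARY KEY CLUSTERED (SaleDate, SaleID)
) ON ps_SalesByMonth (SaleDate);
```

Note that the partitioning column (SaleDate) has to be part of the clustered primary key – that’s the key-change caveat I mentioned earlier.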

Partition functions and schemes are easy to design incorrectly.

Before implementing partitioning for the first time, either get personalized advice from a DBA who’s done it several times, or get an identical development system for repeated testing. I had the luxury of an identical QA environment where I could repeatedly test different partitioning strategies and replay the end user load to see which strategy performed best. Not everybody’s that lucky (come to think of it, I should have played the lotto more) and it’s a mistake to try it on a production system first. When I first started using it, I thought I knew exactly the right strategy for our data, and we’re talking about a schema I knew forwards and backwards. I was wrong – really wrong – but at least I found the right answer before going into production.
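One cheap sanity check while you’re testing is to ask SQL Server which partition a given value would land in, using the built-in $PARTITION function. A sketch with a hypothetical test function:

```sql
-- Hypothetical monthly partition function just for boundary testing.
CREATE PARTITION FUNCTION pf_TestByMonth (datetime)
AS RANGE RIGHT FOR VALUES ('2009-01-01', '2009-02-01');

-- Two boundaries create three partitions. With RANGE RIGHT, the
-- boundary value itself falls into the partition on its right.
SELECT $PARTITION.pf_TestByMonth('2008-12-31') AS P,  -- partition 1
       $PARTITION.pf_TestByMonth('2009-01-01') AS P2, -- partition 2
       $PARTITION.pf_TestByMonth('2009-02-15') AS P3; -- partition 3
```

Getting the RANGE LEFT versus RANGE RIGHT choice backwards is one of the classic ways to design a partition function incorrectly, and this is a fast way to catch it before any data moves.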

In summary, outside of data warehouses, I like to think of partitioning as the nuclear bomb option. When things are growing out of control faster than you can handle with any other strategy, partitioning works really well. However, it’s expensive to implement (Enterprise Edition plus a SAN) and you don’t want to see it in the hands of people you don’t trust.