
Enterprise IT departments must store and process ever-growing volumes of data. For users dealing with this data in cloud-oriented databases, learning sharding best practices is an important step in the cloud planning process.

Sharding is a process in which large tables are partitioned into manageable-sized segments, often stored as separate disk files. Some highly elastic key-value data stores -- such as Amazon SimpleDB, Google App Engine's data store or Windows Azure Tables -- and document databases like CouchDB, MongoDB or RavenDB handle the big data in these tables with aplomb. MongoDB features built-in automatic sharding; RavenDB is expected to gain this capability in the near future. Autosharding balances the size of shards automatically and eliminates the need for DevOps teams to oversee the process. Even so, autosharding MongoDB databases isn't a walk in the park, as Todd Hoff reported in his October 2010 blog post on troubles with sharding.
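The core decision autosharding automates is which shard a given record lands on. A minimal sketch of that mapping, assuming a hash-based shard key (the key names and shard count below are illustrative, not any particular product's scheme):

```python
import hashlib

# Illustrative sketch: route a record to one of N shards by hashing its
# shard key. Autosharding systems also split and rebalance shards as they
# grow; this shows only the basic key-to-shard mapping.
NUM_SHARDS = 4

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard key to a shard index via a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Records with the same key always land on the same shard.
assert shard_for("user:1001") == shard_for("user:1001")
```

A stable hash spreads keys evenly across shards, but note that changing the shard count remaps most keys, which is one reason rebalancing is hard to do by hand.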

All but a few key-value and document databases (collectively called NoSQL databases) lack the transactional data consistency offered by traditional relational database management systems (RDBMSs). You can scale RDBMSs up (vertically) by spending money on more memory, more processors or both. You can fit about 256 GB of RAM into higher-end commodity-grade servers, but adding substantially more CPU cores isn't always practical. If your database is in the cloud, for example, you are limited to the memory and processor combinations on the cloud provider's price list.

Scaling RDBMSs out (horizontally) imposes a substantial technical challenge. Morgan Tocker wrote a detailed blog post on why you don't want to shard MySQL databases in August 2009. Tocker acknowledged that you'll probably need to shard database tables if you encounter either of the following:

Oversized working sets: Your working set, which usually consists of frequently accessed and updated data and indexes, won't fit in the RAM installable in an on-premises server, the amount your hardware budget will cover, or the total available from your cloud service provider. The solution is sharding.

Excessive write frequency: Your database's I/O systems can't process the number of writes per second being requested of your on-premises or cloud-based server(s). The solution is splitting read operations out to read replicas, which might still require sharding to divide the write I/O load across more database servers.
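The read-replica approach described above amounts to routing each operation by type: writes go to the primary, reads rotate across replicas. A minimal sketch, with hypothetical server names (real deployments delegate this routing to a driver or proxy):

```python
import itertools

# Illustrative read/write splitting: writes always hit the primary,
# reads round-robin across replicas. Server names are placeholders.
class ReadWriteRouter:
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replica_cycle = itertools.cycle(replicas)

    def route(self, operation: str) -> str:
        """Return the server that should handle the given operation."""
        if operation in ("insert", "update", "delete"):
            return self.primary
        return next(self._replica_cycle)

router = ReadWriteRouter("primary-db", ["replica-1", "replica-2"])
```

Note that this relieves read pressure only; if the write rate itself exceeds what one primary can absorb, sharding is still the remaining option.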

Simon Munro described sharding issues with relational databases in general, and Microsoft's SQL Azure customized cloud implementation of SQL Server 2008 in particular, in his "trouble with sharding" blog post in August 2009. At that time, SQL Azure's maximum database size was 10 GB; it's now 50 GB.

Scott Guthrie, the new corporate vice president of Microsoft's Azure Application Platform team, stated in a June 8 keynote to the Norwegian Developer Conference (NDC) 2011 that:

…we also do autosharding as part of SQL Azure, which means that from a scale-out perspective, we can handle super-high loads, and we do all of that kind of load-balancing and scale-out work for you.

Today, SQL Azure supports up to 50 GB of relational storage for a database, but you can have any number of databases. In the future, you'll see us support hundreds of gigabytes and terabytes [that] you can take advantage of.


Autosharding with SQL Azure Federations is currently in the private Community Technical Preview (CTP) stage, and it's not evident from Guthrie's statement whether sharding will be required to "support hundreds of gigabytes and terabytes." SQL Azure Federations also promise to handle schema migrations gracefully. SQL Azure includes a primary and two secondary replicas for high availability at a basic pay-as-you-go price of $9.99 per GB per month from 1 to 5 GB (Web Edition) and $99.99 per 10 GB per month from 10 to 50 GB (Business Edition). Fixed monthly prices don't include data transfer charges of $0.15/GB out in North American and European data centers and $0.20/GB out in Asian data centers. Microsoft dropped data ingress charges on July 1. Unlike Amazon RDS, SQL Azure doesn't impose I/O charges.

Microsoft doesn't divulge CPU and memory specifications for SQL Azure, but the company says they are commensurate with database size. You can follow SQL Azure Federations' progress toward commercial release in late 2011 at Cihan Biyikoglu's blog.

Google announced at its I/O 2011 conference in May that the commercial release of the now-beta Google App Engine -- scheduled for the latter part of this year -- will incorporate the RDBMS that was to be included with GAE for Business, but the company hasn't yet divulged feature or pricing details.

Unless Google pulls a high-scalability rabbit out of its hat with its forthcoming RDBMS, be prepared to shard relational databases that must handle big data in the cloud.

ABOUT THE AUTHOR: Roger Jennings is a data-oriented .NET developer and writer, the principal consultant of OakLeaf Systems and curator of the OakLeaf Systems blog. He's also the author of 30+ books on the Windows Azure Platform, Microsoft operating systems (Windows NT and 2000 Server), databases (SQL Azure, SQL Server and Access), .NET data access, Web services and InfoPath 2003. His books have more than 1.25 million English copies in print and have been translated into 20+ languages.
