Archives

Categories

Meta

Redshift: Data Warehousing at Scale in the Cloud

I’ve worked in or near the database engine world for more than 25 years. And, ironically, every company I’ve ever worked at has been working on a massive-scale, parallel, clustered RDBMS system. The earliest variant was IBM DB2 Parallel Edition released in the mid-90s. It’s now called the Database Partitioning Feature.

Massive, multi-node parallelism is the only way to scale a relational database system so these systems can be incredibly important. Very high-scale MapReduce systems are an excellent alternative formany workloads. But some customers and workloads want the flexibility and power of being able to run ad hoc SQL queries against petabyte sized databases. These are the workloads targeted by massive, multi-node relational database clusters and there are now many solutions out there with Oracle RAC being perhaps the most well-known but there are many others including Vertica, GreenPlum, Aster Data, ParAccel, Netezza, and Teradata.

What’s common across all these products is that big databases are very expensive. Today, that is changing with the release of Amazon Redshift. It’s a relational, column-oriented, compressed, shared nothing, fully managed, cloud hosted, data warehouse. Each node can store up to 16TB of compressed data and up to 100 nodes are supported in a single cluster.

Amazon Redshift manages all the work needed to set up, operate, and scale a data warehouse cluster, from provisioning capacity to monitoring and backing up the cluster, to applying patches and upgrades. Scaling a cluster to improve performance or increase capacity is simple and incurs no downtime. The service continuously monitors the health of the cluster and automatically replaces any component, if needed.

The core node on which the Redshift clusters are build, includes 24 disk drives with an aggregate capacity of 16TB of local storage. Each node has 16 virtual cores and 120 Gig of memory and is connected via a high speed 10Gbps, non-blocking network. This a meaty core node and Redshift supports up to 100 of these in a single cluster.

There are many pricing options available (see http://aws.amazon.com/redshift for more detail) but the most favorable comes in at only $999 per TB per year. I find it amazing to think of having the services of an enterprise scale data warehouse for under a thousand dollars by terabyte per year. And, this is a fully managed system so much of the administrative load is take care of by Amazon Web Services.

Fast and Powerful – Amazon Redshift uses a variety to innovations to obtain very high query performance on datasets ranging in size from hundreds of gigabytes to a petabyte or more. First, it uses columnar storage and data compression to reduce the amount of IO needed to perform queries. Second, it runs on hardware that is optimized for data warehousing, with local attached storage and 10GigE network connections between nodes. Finally, it has a massively parallel processing (MPP) architecture, which enables you to scale up or down, without downtime, as your performance and storage needs change.

You have a choice of two node types when provisioning your own cluster, an extra large node (XL) with 2TB of compressed storage or an eight extra large node (8XL) with 16TB of compressed storage. You can start with a single XL node and scale up to a 100 node eight extra large cluster. XL clusters can contain 1 to 32 nodes while 8XL clusters can contain 2 to 100 nodes.

Scalable – With a few clicks of the AWS Management Console or a simple API call, you can easily scale the number of nodes in your data warehouse to improve performance or increase capacity, without incurring downtime. Amazon Redshift enables you to start with a single 2TB XL node and scale up to a hundred 16TB 8XL nodes for 1.6PB of compressed user data. Resize functionality is not available during the limited preview but will be available when the service launches.

Inexpensive – You pay very low rates and only for the resources you actually provision. You benefit from the option of On-Demand pricing with no up-front or long-term commitments, or even lower rates via our reserved pricing option. On-demand pricing starts at just $0.85 per hour for a two terabyte data warehouse, scaling linearly up to a petabyte and more. Reserved Instance pricing lowers the effective price to $0.228 per hour, under $1,000 per terabyte per year.

Fully Managed – Amazon Redshift manages all the work needed to set up, operate, and scale a data warehouse, from provisioning capacity to monitoring and backing up the cluster, and to applying patches and upgrades. By handling all these time consuming, labor-intensive tasks, Amazon Redshift frees you up to focus on your data and business insights.

Secure – Amazon Redshift provides a number of mechanisms to secure your data warehouse cluster. It currently supports SSL to encrypt data in transit, includes web service interfaces to configure firewall settings that control network access to your data warehouse, and enables you to create users within your data warehouse cluster. When the service launches, we plan to support encrypting data at rest and Amazon Virtual Private Cloud (Amazon VPC).

Reliable – Amazon Redshift has multiple features that enhance the reliability of your data warehouse cluster. All data written to a node in your cluster is automatically replicated to other nodes within the cluster and all data is continuously backed up to Amazon S3. Amazon Redshift continuously monitors the health of the cluster and automatically replaces any component, as necessary.

Designed for use with other AWS Services – Amazon Redshift is integrated with other AWS services and has built in commands to load data in parallel to each node from Amazon Simple Storage Service (S3) and Amazon DynamoDB, with support for Amazon Relational Database Service and Amazon Elastic MapReduce coming soon.

Petabyte-scale data warehouses no longer need command retail prices of upwards $80,000 per core. You don’t have to negotiate an enterprise deal and work hard to get the 60 to 80% discount that always seems magically possible in the enterprise software world.You don’t even have to hire a team of administrators. Just load the data and get going. Nice to see.

Thanks for the mention in the list of database vendors. However, what you might have missed, is that unlike the others listed, Amazon licensed ParAccel technology to build out Redshift. In addition, ParAccel will continue to offer a higher performing, more feature rich version of our software at a very affordable price point, especially compared to the old school database vendors. We are excited that Amazon validates our technology and that they are demonstrating the amazing performance and price performance increases provided by our technology. Let me know if you’d like to discuss it further.
Regards,
John Santaferraro
Vice President of Solutions & Product Marketing
ParAccel

Great summary and congrats to both companies involved (nicely done John! ;)). Getting the DW layer commoditized is a must – it’s too hard, too expensive and to slow today. Kudos for getting the space to advance…and offering this at scale.

Now, let’s focus on the space’s bigger problem – Analytics on top of this Data. Today, there is an imbalance in focus and budget spent on the DW while most of the value comes from the Analytics. This will help getting us closer to a world focused on the end game!

Thanks for the comment Mark. I agree that this new service does change the landscape of the data warehousing market. I don’t have the aggregate IOPS and disk throughput numbers handy but my general rule of thumb is just about any single disk will do more than 80 MB/s.

John Sataferraro, the VP of ParAccel product marketing, posted that ParAccel will offer more features and higher performance at what would still be an "affordable" price. The Perspectives blog isn’t the best place to debate relative product merits and I’m not sure what to do with the comment so I’ll hold back from describing the current ParAccel performance in the market nor predicting the future.

I will say its a great time for data warehousing customers with so many choices available. Being a cloud services advocate, I’m particularly excited to see a low cost, fully managed DW offering available in the cloud with the Redshift announcement.

Bruno, I agree that analytic and reporting are also super important. DW services like Redshift make data warehousing more broadly accessible and I expect more broadly used. Customers unwilling to invest in expensive on-premise offerings can still reap the value of data warehousing. And, yes, I totally agree that many will need analytic and reporting. There is huge opportunity out there in what I expect will be a fast growing market.