
Why would anyone want to use PostgreSQL instead of SQL Server? There are a lot of factors to consider when choosing how to store your data. Sometimes we need to look deeper than the standard choice and consider something new. If you’re starting a brand new project, where should you store your data? Here are ten reasons why you might want to consider PostgreSQL over SQL Server.

Releases Every Year

Let’s face it, waiting three to five years for new functionality to roll out in any product is painful. I don’t want to be constantly learning new functionality, but on the flip side I don’t want to write hack solutions to critical business problems because I know a fix is coming down the pipe but can’t wait a few more years for it. A rapid release cycle lets the PostgreSQL development team quickly ship the features that users need and make frequent improvements.

Starting with version 9.0, PostgreSQL switched to a yearly release cycle. Before that, PostgreSQL released whenever the features were done. Looking at the major releases on Wikipedia, though, major releases still rolled out about once every 18 months. An 18-month release cycle isn’t bad for any software product, much less a mission-critical one like a database.

True Serialization

Snapshot isolation guarantees that all reads in a transaction see a consistent snapshot of data. In addition, a transaction should only commit if the ways that it changes data don’t conflict with other changes made since the snapshot was taken. Unfortunately, snapshots allow anomalies to exist. It’s possible to create a situation where two valid transactions occur that leave the database in an inconsistent state – the database doesn’t pass its own rules for data integrity.

Serializable snapshot isolation was added to PostgreSQL in version 9.1. SSI emulates strict serial execution – transactions behave as if they are executing one after another. If there is a conflict, or even a potential conflict, the database engine throws an error back to the caller (who is left to figure out the appropriate next step).

Serializable snapshot isolation sounds painful. The kicker is that it lets the database guarantee a much stronger level of consistency. Applications can be written to assume that data modification may fail and to retry failed transactions. The real benefit is that well-written software avoids data inconsistencies and maintains the illusion that everything is operating as it should.
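A minimal sketch of the classic write-skew anomaly that SSI catches (the doctors table and on-call rule are assumptions for illustration, not from the original post):

```sql
-- Business rule: at least one doctor must always be on call.
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM doctors WHERE on_call = true;  -- sees 2 on call
UPDATE doctors SET on_call = false WHERE name = 'alice';
COMMIT;
-- If a concurrent serializable transaction ran the same check and took
-- the other doctor off call, plain snapshot isolation would let both
-- commit and leave nobody on call. Under SSI, one of the two commits
-- fails with a serialization error, and the caller is expected to retry.
```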

Sane Defaults, Ridiculous Tuning

Okay, to be fair, PostgreSQL ships with some ridiculously conservative shared memory settings. Most other settings are conservative but general enough for typical workloads. Many PostgreSQL deployments need very few changes (probably just increasing shared_buffers to 25% of total RAM to start).

Once a PostgreSQL installation is up and running, there are a number of settings that can be changed. The best part, though, is that most of these settings can be changed at the server, database, user, or even individual query level. Mixed-workload servers are very common – most activity on the server is basic CRUD, but a small percentage is reports that need to be aggressively tuned. Instead of moving those reports somewhere separate (separate servers, databases, or even separate resource pools in the same database), we can simply tune a few queries with the appropriate parameters, including the memory to allocate for sorting and joins.
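As a sketch of what that per-query tuning looks like (the reporting role and orders table are made-up names):

```sql
-- Give one expensive report more sort/hash memory without touching
-- the server-wide default.
BEGIN;
SET LOCAL work_mem = '256MB';   -- applies to this transaction only
SELECT region, sum(total) FROM orders GROUP BY region ORDER BY region;
COMMIT;

-- The same setting can be pinned at the role or database level:
ALTER ROLE reporting SET work_mem = '256MB';
```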

Unlogged Tables

Are you sick of trying to get minimally logged bulk inserts to work? Me too. Instead of fiddling with various mechanisms to minimally log some tables, PostgreSQL gives us the option of creating an unlogged table – simply add the UNLOGGED directive to a CREATE TABLE statement and everything is ready to go.

Unlogged tables bypass the write-ahead log; they aren’t crash safe, but they’re stupid fast. Data in an unlogged table is truncated after a crash or unclean shutdown; otherwise it’s still there. Unlogged tables are also excluded from replication to a standby server. This makes them ideal for ETL or other data manipulation processes that can easily be repeated from source data.
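A minimal sketch of an unlogged staging table for an ETL load (table and column names are assumptions):

```sql
-- Loads into this table skip the write-ahead log, so they're fast,
-- but the contents vanish after a crash and never reach a standby.
CREATE UNLOGGED TABLE staging_orders (
    order_id  bigint,
    customer  text,
    total     numeric(12,2)
);
```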

KNN for Geospatial… and More

Yeah, I hear ya, SQL Server will have this soon, but PostgreSQL already has it. If K Nearest Neighbor searches are critical for your business, you’ve already gone through some pain trying to get this working in your RDBMS. Or you’ve given up and implemented the solution elsewhere. I can’t blame you for that – geospatial querying is nice, but not having KNN features is a killer.

PostgreSQL’s KNN querying works on specific types of indexes (there are a lot of index types in PostgreSQL). Not only can you use KNN querying to find the 5 nearest Dairy Queens, but you can also use a KNN search on other data types. It’s completely possible to perform a KNN search and find the 10 phrases that are closest to “ice cream”.
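A hedged sketch of both flavors of KNN query, using the distance operator against a GiST index (the stores and phrases tables are assumptions; the text search side uses the pg_trgm extension):

```sql
-- Geospatial: five nearest locations to a point.
CREATE INDEX ON stores USING gist (location);    -- location is a point
SELECT name
FROM stores
ORDER BY location <-> point '(41.49,-81.69)'
LIMIT 5;

-- Text: ten phrases nearest to 'ice cream' by trigram distance.
CREATE EXTENSION pg_trgm;
CREATE INDEX ON phrases USING gist (phrase gist_trgm_ops);
SELECT phrase
FROM phrases
ORDER BY phrase <-> 'ice cream'
LIMIT 10;
```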

KNN search capability makes PostgreSQL a serious contender for anyone looking at implementing geospatial querying. The additional flexibility puts PostgreSQL in a leadership position for many other kinds of search driven applications.

Transaction-Controlled Synchronous Replication

One of the easiest ways to keep another copy of your database is to use some kind of database replication. SQL Server DBAs will largely be used to transactional replication – a dedicated agent reads the SQL Server log, collects outstanding commands, and then ships them over to the subscriber where they are applied.

PostgreSQL’s built-in replication is closer to SQL Server’s mirroring than to SQL Server’s replication (with the bonus that PostgreSQL’s standby is readable). Log activity is hardened on the primary and then streamed to the secondary, either synchronously or asynchronously. Up until PostgreSQL 9.1, replication was an all-or-nothing affair – every transaction was either synchronous or asynchronous. Starting with 9.1, developers can control this per transaction by setting the synchronous_commit configuration value for that single transaction. This is important because it makes it possible to write copious amounts of data to logging tables for debugging purposes without paying the performance cost of synchronously committing those writes.
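A sketch of the per-transaction control, assuming a hypothetical debug_log table:

```sql
-- This one transaction commits asynchronously; everything else on the
-- connection keeps the server-wide synchronous behavior.
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO debug_log (logged_at, message)
VALUES (now(), 'entered checkout flow');
COMMIT;  -- returns without waiting for the standby to acknowledge
```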

Any time we have more choice in how we develop our applications, I’m happy.

Writeable CTEs

CTEs are great for reads, but doing something more complex with them raises other issues. An example will make this much easier. Let’s say I want to delete stale comments, but store them in an archive table. To do this with SQL Server, the easiest route (from a development standpoint) is to elevate my isolation level to at least snapshot, if not serializable, and rely on the isolation level to guarantee that no data changes underneath me. I could also load the PK values of the comments to be deleted into a temp table and reference it multiple times.

Both methods work, but both have problems. The first requires that the code run at a specific isolation level, relying on settings that may not be in place. The code could also be copied out of the procedure and run in SSMS, leading to potential anomalies where a few rows are deleted but not archived. That’s no big deal for spam comments, but it could be critical in other situations. The second method isn’t bad, but it adds code noise: the temp table isn’t necessary to solve our problem, it’s a byproduct of working around isolation levels.

PostgreSQL has a different way to solve this problem: writeable CTEs. The CTE is constructed the same way it would be in T-SQL; the difference is that in PostgreSQL, data can be modified inside the CTE. The output is then used just like the output of any other CTE:
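A sketch of the delete-and-archive example described above (the comments and archived_comments tables, with matching columns, are assumptions):

```sql
-- Delete stale comments and archive them in one atomic statement.
WITH deleted AS (
    DELETE FROM comments
    WHERE created_at < now() - interval '90 days'
    RETURNING *
)
INSERT INTO archived_comments
SELECT * FROM deleted;
```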

This can be combined with default values, triggers, or any other data modification to build very rich ETL chains. Under the covers it may be doing the same things that we’ve outlined from SQL Server, but the conciseness is beneficial.

Extensions

Ever want to add some functionality to SQL Server? What about keep up to date on that functionality? This can be a huge problem for DBAs. It’s very easy to skip a server when you roll out new administrative scripts across your production environment. Furthermore, how do you even know which version you have installed?

The PostgreSQL Extension Network (PGXN) is a centralized repository for extra functionality. It’s a trusted source for open source PostgreSQL libraries – no sneaky binaries are allowed. Plus, everything in PGXN is versioned: when you update an extension, it takes care of the upgrade path for you.
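Installing and upgrading an extension is a one-statement affair; hstore here stands in for any packaged extension:

```sql
CREATE EXTENSION hstore;                        -- install
SELECT extname, extversion FROM pg_extension;   -- what's installed, and which version?
ALTER EXTENSION hstore UPDATE;                  -- walk the upgrade path
```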

Pushing this functionality out into extensions makes it easy for developers and DBAs to build custom packages that look and act like core functionality of PostgreSQL without trying to get the package through the PostgreSQL release process. These packages can then be developed independently, advance at their own rate, and provide complex functionality that may not fit within the release plan of the PostgreSQL core team. In short, there’s a healthy ecosystem of software being built around PostgreSQL.

Rich Temporal Data Types

One of my favorite features of PostgreSQL is the rich support for temporal data types. Sure, SQL Server 2008 finally brought some sophistication to SQL Server’s support for temporal data, but it’s still a pretty barren landscape. Strong support for temporal data is critical in many industries and, unfortunately, there’s a lot of work that goes on in SQL Server to work around the limitations of SQL Server’s support for temporal data.

PostgreSQL brings intelligent handling of time zones. In addition to supporting the ISO 8601 standard (1999-01-08 04:05:06 -08:00), PostgreSQL can identify the time zone by an abbreviation (PST) or by a location identifier (America/Tijuana). Abbreviations are treated as a fixed offset from UTC; location identifiers change with daylight saving rules.

On top of time zone flexibility, PostgreSQL has an interval data type. The interval data type is capable of storing an interval of up to 178,000,000 years with precision out to 14 digits. Intervals can measure time at a number of precisions from as broad as a year to as narrow as the microsecond.
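A few one-liners sketching the types discussed above:

```sql
SELECT timestamptz '1999-01-08 04:05:06 -08:00';  -- ISO 8601 offset
SELECT timestamptz '1999-01-08 04:05:06 PST';     -- abbreviation: fixed offset
SELECT now() AT TIME ZONE 'America/Tijuana';      -- location: DST-aware

-- Intervals mix units freely, down to the microsecond.
SELECT interval '3 years 2 months 5 days 04:30:00.000001';
```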

Exclusion Constraints

Have you ever tried to write any kind of scheduling functionality in SQL Server? If you have, you’ll know that a business requirement like “two people cannot occupy the same conference room at the same time” is difficult to enforce in code and usually requires additional trips to the database. There are many ways to implement this purely in application-level code, and none of them lead to happy users or developers.

PostgreSQL 9.0 introduced exclusion constraints. In short, we define a table and then add a constraint specifying a set of comparisons; two rows are allowed to coexist only if at least one of those comparisons is false for the pair. Exclusion constraints are backed by indexes, so these checks are as fast as our disks and the index we’ve designed. It’s possible to use exclusion constraints in conjunction with temporal or geospatial data to make sure that different people aren’t reserving the same room at the same time or that plots of land don’t overlap.
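A hedged sketch of the room-booking example. Note the assumptions: it uses the tsrange type from later PostgreSQL releases (the 9.0-era equivalent used a separate temporal extension), and btree_gist so equality on the room can share the GiST index with the range-overlap operator:

```sql
CREATE EXTENSION btree_gist;

CREATE TABLE reservations (
    room    text,
    during  tsrange,
    EXCLUDE USING gist (room WITH =, during WITH &&)
);

INSERT INTO reservations VALUES
    ('101', '[2011-06-01 10:00, 2011-06-01 11:00)');
-- A second, overlapping booking for room 101 now fails with a
-- constraint violation instead of silently double-booking.
```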

There was a presentation at PGCon 2010 that went into the details of exclusion constraints. While there is no video, the slides are available and contain enough examples and explanation to get you started.

Bonus Feature – Cost

It’s free. All the features are always there. There are no editions of PostgreSQL – the features always exist in the database. Commercial support is available from a number of companies, some of them even provide additional closed source features, but the core PostgreSQL database is always available, always free, and always contains the same features.

I’ve already talked about the implications of the first definition – it leads us down the path of a monolithic master server that must be aware of the other servers in the federation. New servers won’t be a part of the federation until we make the federating server aware of them. The other road, a loose collection of independent servers, is beginning to gain ground as companies bring more databases online in their data centers. When I say “more databases” I’m not just talking about a larger number of databases from a vendor, I’m also talking about databases from different vendors. Many people are exploring this route right now; some of them have attached the name polyglot persistence to this approach.

Why Do I Want a Federated Database?

There are a couple of reasons that you’d want to roll your own federated database. I touched on them in the first part of this series: you may want to query across databases, you might have legacy systems, you may have merged with another company, or you might be using the most appropriate database for the job. Whatever you’re doing, you probably have a number of databases and you need to stitch them together.

Where Do I Get Started?

There are a number of ways we could go about creating a federated database. Many of them are incredibly complex, involving metadata databases and schemes for linking the databases together in an easily queryable way. I’m going to propose something different: instead of designing something new, use the technology you already have and that your programming platform already understands – TCP/IP.

Nearly every programming language is capable of talking to other programs over TCP/IP. Instead of creating custom databases and worrying about meta-data management and cross server querying, create common services that answer common questions. Break your monolithic application down into manageable services and write those services using the most appropriate technology. Over my career, I’ve found that very few users need the ability to run ad hoc reports over the entire corporate data set. If users don’t need to be able to interactively query the entirety of their corporate data set, what do they need?

Almost all users need a small set of reports and data. Even when we expand the definition of “users” to include applications, services, APIs, and protocols, most activities are incredibly limited. Our users are asking the same sets of questions: How many accounts receivable have aged more than 30 days? What do the sales figures for the New England region look like for the last three years? Even when users are adding data to our databases, they’re still performing a limited set of actions like saving an entire order, signing up for a new account, or adding a new accordion to their shopping cart. The activities that users perform on our data are very limited.

Knowing that users only perform a few activities with our live data, we can safely make some assumptions about the type of data access people will need. Keeping that in mind, it’s a lot easier to see how we can build our own federated database: we’re not going to. We’re going to build our own system using what many people call polyglot persistence.

Designing for Polyglot Persistence

The idea behind polyglot persistence is that we keep each kind of data in the database best suited to storing it. That goal is achievable, but if it were the end game, it wouldn’t be very useful to the business – business users want to see reports and combine data across applications and business units.

Going one step beyond the basics of polyglot persistence, we want to add another layer – a caching/service layer. It’s in this layer that we can start to really add rich functionality to the data that the business needs. Instead of having to replicate data across multiple data sources, we can query two separate servers and combine the data together before we return it to the client.

We’ve been doing this for years – it’s nothing new. The only thing that is remotely new is storing our data in the most suitable database. Well, that and telling our caching/service layer to cache as much data as possible while writing in the background. If we keep most of our data in cache, we don’t have to worry as much about write performance in the back end. We can queue writes to make sure they commit during idle times, we can spread them across many servers, and we can write to many reporting databases at once to make sure that reports are up to date. By moving application and reporting logic into an application and reporting tier, we free the database to focus on the tasks that databases excel at: storing and retrieving data. Complex logic and strange data mucking can be handled in the application layer by simple (or highly specialized) algorithms.

Polyglot persistence becomes incredibly valuable when we build mechanisms to load data from all of our disparate line of business systems into a single enterprise data warehouse. Once we have all of our data in a single warehouse, we’re able to write queries across business boundaries. The enterprise warehouse doesn’t need to be a single monolithic RDBMS server; it could use Microsoft SQL Server Parallel Data Warehouse, Oracle Exadata, Postgres-XC, HBase, Cassandra, or any other database that is up to the task.

Wrapping it up

Polyglot persistence seems to be the best answer to building a federated database. It doesn’t provide automated metadata management or automatic query distribution across many servers. Instead, it makes it easier to build a robust system that answers the questions business users want and need answered while remaining fast and flexible. Is it the be-all, end-all solution? No. Is it a step in the right direction? Yes.

Federated databases are a dream that has not materialized. The SQL/MED extension to the ANSI/ISO SQL specification is a step in the right direction. In addition, both SQL Server and Oracle have proprietary extensions that make it possible to query external data sources. If all of this technology is available today, why aren’t more people using it?

Why Don’t We Have a Federated Database?

If federated databases are such a powerful thing, why hasn’t anyone built one? Surely this is something that many businesses are clamoring for, or at least in need of.

There are a number of problems facing anyone attempting to implement a federated database. Frankly, federated databases are fraught with technical difficulties: every database vendor supports a different subset of the ANSI/ISO SQL standard, different vendors use different data types and metadata, concurrency is a huge concern when dealing with remote resources, and technology is a moving target.

Different Dialects of SQL

Anyone who has attempted to port an application from one database engine to another knows the pain of translating queries. Different vendors adhere to the SQL standard to varying degrees. Wrappers can translate sub-queries between dialects, but the problem remains. A federated database could partially sidestep it by limiting participating databases to a small subset of the ANSI/ISO SQL standard, but that doesn’t solve the problem – it merely avoids it by limiting functionality.

Different Metadata

Different databases may have different metadata – different data types are used to represent the same data and different structures are used to describe data. To solve this problem it becomes necessary to create elaborate mappings between columns that represent the same data. There may be situations where such a mapping becomes computationally intensive or functionally impossible. SQL Server allows the creation of .NET data types with custom methods for data searching, access, and manipulation. PostgreSQL features several data types (notably tsvector, hstore, and arrays) that would be difficult to convert to data types in other databases.

Metadata differences don’t just stop at the data type level. It’s possible to model data in a number of different ways; the type of an address could be indicated using an integer key value that references a lookup table in one database or as a VARCHAR column with values constrained by the database. It’s even possible for something as simple as Unicode text encodings to cause problems: SQL Server uses the NVARCHAR data type for storing Unicode strings while other databases do not use a separate data type.

Concurrency

Concurrency, depending on your database, may be a concern. Managing concurrent operations within a single database is a difficult task, much less managing concurrency across multiple databases. Unfortunately, correct handling of concurrency across all components of a federated database is critical.

Many potential problems of a federated database can be solved through trade-offs. Managing concurrency, however, is nearly impossible: the federated database vendor would need to account not only for all possible concurrency issues in relational databases, but also for the potential concurrency issues of any database that integrates with the federated system.

The Moving Target

Even if a database vendor were to take on this task, they’d be constantly aiming at a moving target. New features are added to relational databases all the time, and there are enough major players in the market to make it difficult for users to keep up, much less a federated database vendor. Factor in the wealth of other, non-relational databases, and the idea of a federated database system handling metadata mapping, concurrency control, and query language resolution trends toward impossible.

The State of The Industry

Where we stand now, there is almost no chance of any independent software vendor creating a true federated database. There is hope, but not from where you would expect it.

Enterprise data warehouses can fulfill much of the function of a federated database, but they still require complex ETL and data mapping to be truly useful. Adding additional information to an enterprise data warehouse can require extra work to prepare the data warehouse and ETL processes for the new data. Unfortunately, enterprise data warehouses require too much manual intervention to be a candidate for a federated database.

As we’ve discussed, SQL Server and Oracle provide ways to reference remote database servers, but these methods have their own problems. SQL Server linked server queries are prone to problems with some operations not being remoteable. When we query a remote server, we need to make sure the parts of the query destined for the remote server are actually handled on the remote server. This is difficult to get right; on occasion an entire remote table is streamed across the wire just to be filtered on the originating server, which is exactly what we don’t want. For a federated database to be a tenable product, there must be an easy way to offload queries to a remote server and a guarantee of adequate performance.

Properly remoting queries is incredibly complex. Assume, for a moment, that we have a report that queries data on the sales department’s database server and also needs to include data from human resources. Our query might look something like this:
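A hedged reconstruction of that report query, using the remote object names discussed below (the local sales table, join keys, and columns are assumptions):

```sql
-- Sales data is local; employee data lives on the remote HRDB01 server.
SELECT e.employee_name,
       r.review_score,
       SUM(s.sale_amount) AS total_sales
FROM   sales AS s
JOIN   HRDB01.employee_info.public.employee_history AS e
       ON e.employee_id = s.employee_id
JOIN   HRDB01.employee_info.reviews.review AS r
       ON r.employee_id = e.employee_id
GROUP BY e.employee_name, r.review_score;
```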

Looking at this query, we’re hitting two separate remote objects in one remote database. For this query to be effective, our query optimizer must be able to re-write it in a way that builds an intelligent query against the two remote tables HRDB01.employee_info.public.employee_history and HRDB01.employee_info.reviews.review. But in order to re-write the local query to properly reference remote objects, we need to know everything about them – the calling server must be aware of as much metadata as possible so the remoted portion can be re-written before being sent to the remote server. While this is doable, it puts additional load on the calling server, which now has to maintain information about remote database objects. But that’s not all! If we want our queries to be truly optimal, our federating server also needs to know how data types behave on the remote servers and how remote data types interact with local ones.

Once you examine the intricacies of a federated database, it becomes obvious why the federated database, as a boxed product, is beyond our reach. It’s not that the task is impossible; on the contrary, it’s very possible. The difficulty lies in coordinating all of the information available and using it to deliver data quickly. There are enough moving and potentially unknown parts that it’s non-trivial to create heterogeneous systems capable of fulfilling the promise of federated databases. Faced with this situation, the only viable option is to build something yourself that answers the needs of the business.

One of the biggest problems facing businesses today is integrating data from multiple sources. The holy grail of data integration is the federated database system. Basically, a federated database stores metadata about other databases and makes it easier to integrate them through a single interface. Many relational databases have features for integrating with other relational databases – SQL Server’s linked servers or Oracle’s database links, for example. One problem with these features is that they only let relational databases talk to other relational databases. As the volume of data we collect every day increases, more and more of it is stored outside relational databases: in CSVs, spreadsheets, log files, PDFs, and plain text, as well as in a variety of non-relational databases like MongoDB, HBase, Riak, and Cassandra.

If the only thing we’re looking for is access to metadata about external data, SQL Server will provide the FileTable feature in SQL Server 2011. Admittedly, FileTable isn’t an acceptable solution: it’s really intended to let the database reference files that are managed by external applications through the filesystem and Win32 APIs. Clearly, this doesn’t suit our need for querying external data.

It’s also possible to use ETL tools to move data into relational databases. One of the problems with ETL tools (SSIS, Pentaho Data Integration/Kettle, or Oracle Data Integrator) is that they are effectively batch operations. New data insertions will have to be triggered by some external event; the data isn’t available until it’s migrated into some master system.

Luckily, there’s an extension to the SQL standard to help us: SQL/MED, which stands for Management of External Data. It’s a way to link any external data source to a database server – it doesn’t have to be another relational database; there’s already a Twitter foreign data wrapper library. Unfortunately, PostgreSQL is currently the only major database player on the market with any potential for an implementation of SQL/MED.

A full implementation of SQL/MED just isn’t here yet.

Another promising project is HadoopDB, a project coming out of Yale University. The aim is to make it possible to run analytical workloads in parallel across many commodity RDBMS servers. One of the goals of HadoopDB is to excel in areas where parallel data warehouses simply do not perform well; many of these situations are outlined in the paper HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. To summarize: parallel data warehouses provide near-linear scaling up to several hundred nodes running on homogeneous hardware, and they operate under the assumption that failures are rare. Google and others have demonstrated that hardware failure is inevitable at scale. HadoopDB presents a phenomenal way to scale databases and integrate disparate technologies.

Despite its promise of scaling databases, HadoopDB still doesn’t solve the problems that we face when trying to build a federated database system. The truth is a depressing one – there is currently no solution for building federated databases that incorporate data from across the enterprise. Database vendors, DBAs, and more traditional corporate IT departments will tell you that this is a Good Thing™. I’ve mentioned before that you should choose the database that is best suited for the task at hand.

Where do we go from here? SQL/MED doesn’t meet its own promises – only one vendor is implementing the standard, and that support depends on third parties releasing drivers. HadoopDB isn’t a federated database so much as a way to scale a relational database across thousands of cheap nodes without paying millions of dollars in licensing fees for Teradata, Microsoft’s Parallel Data Warehouse, or Oracle’s Exadata. The unfortunate truth is that if we want a federated database, we’re going to have to build it ourselves.

What sounds like a Sisyphean task isn’t as difficult as it appears. If we’re already collecting data in multiple databases, the problem is almost solved, and some of the technologies are already here. LINQ lets us treat all data sources equally; we can query an array of objects as easily as we can query a database. ARel is a relational algebra for Ruby; while it’s specifically focused on relational databases, it could be extended to work with many different data sources. Business intelligence vendor Jaspersoft recently announced support for a number of non-relational databases to complement their existing business intelligence products. Quest Software makes Toad for Cloud Databases, a tool for querying both relational and non-relational databases.

A federated database may never materialize, but it’s already possible to build a hybrid database solution today.

A while back we talked about getting faster writes with Riak. Since then, I’ve been quiet on the Riak front. Let’s take a look at how we can get data out of Riak, especially since I went to great pains to throw all of that data into Riak as fast as my little laptop could manage.

Key filtering is a new feature in Riak that makes it much easier to restrict queries to a subset of the data. Prior to Riak 0.13, it was necessary to write MapReduce jobs that scanned through all of the keys in a bucket. The problem is that those MapReduce jobs load both the key and the value into memory; with a lot of data, that’s a huge performance hit. Instead of loading all of the data, key filtering lets us look at the keys themselves. We’re pre-processing the data before we get to our actual query. This is good because 1) software should do as little work as possible and 2) Riak doesn’t have secondary indexes to make querying faster.

Here’s how it works: Riak holds all keys in memory, but the data remains on disk. The key filtering code scans the keys in memory on the nodes in our cluster. If any keys match our criteria, Riak will pass them along to any map phases that are waiting down the pipe. I’ve written the sample code in Ruby but this functionality is available through any client.

We’re going to iterate over every key value pair in the animals bucket and look for a key that contains the word ‘spider’. Once we find that key, we’re going to return a single element array containing the number 1. Once the map phase is done, we use the built-in function Riak.reduceSum to give us a sum of the values from the previous map phase. We’re generating a count of the records that match our data – how many spiders do we really have?
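A sketch of both versions in Ruby. The original post's code isn't reproduced here, so the exact riak-client API calls and the key-filter DSL below are assumptions; the shapes of the map and reduce phases follow the description above:

```ruby
require 'riak'

client = Riak::Client.new

# Without key filtering: every key/value pair in 'animals' is loaded
# and the map function checks the key for 'spider' itself.
slow = Riak::MapReduce.new(client).
  add('animals').
  map("function(v) { return v.key.indexOf('spider') != -1 ? [1] : []; }").
  reduce('Riak.reduceSum', :keep => true)

# With key filtering: only matching keys ever reach the map phase, so
# non-matching values are never loaded, and the map just returns [1].
fast = Riak::MapReduce.new(client).
  filter('animals') { matches 'spider' }.
  map('function(v) { return [1]; }').
  reduce('Riak.reduceSum', :keep => true)

puts fast.run.inspect  # a one-element array: the count of spiders
```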

It’s not that much different – the map query has been greatly simplified to just return [1] on success and the search criteria has been moved into the inputs portion of the query. The big difference is in the performance: the key filter query is 26 times faster.
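For comparison, here is a sketch of what the key-filtered version of the job might look like as a MapReduce job specification. The key_filters predicate name ("matches") follows the Riak 0.13 documentation; the exact spelling of the spec is an assumption:

```javascript
// Key-filtered job spec: the search criteria lives in the inputs section,
// so the map phase can unconditionally return [1] for every key that
// survives the filter.
var job = {
  inputs: {
    bucket: "animals",
    key_filters: [["matches", "spider"]]   // filter runs on keys in memory
  },
  query: [
    { map: { language: "javascript",
             source: "function(v) { return [1]; }" } },
    { reduce: { language: "javascript",
                name: "Riak.reduceSum" } }  // built-in summing reduce
  ]
};
```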

This is a simple example, but a 26x improvement is nothing to scoff at. What it really means is that the rest of our MapReduce needs to work on a smaller subset of the data which, ultimately, makes things faster for us.

A Different Way to Model Data

Now that we have our querying basics out of the way, let’s look at this problem from a different perspective. Let’s say we’re tracking stock performance over time. In a relational database we might have a number of tables, notably a table to track stocks and a table to track daily_trade_volume. Theoretically, we could do the same thing in Riak with some success, but it would incur a lot of overhead. Instead, we can use a natural key to locate our data. Depending on how we want to store the data, this could look something like YYYY-MM-DD-ticker_symbol. I’ve created a script to load the stock exchange data. For my tests, I only loaded the data for stocks that began with Q. There’s a lot of data in this data set, so I kept things to a minimum in order to keep this quick.

Since our data also contains the stock exchange identifier, we could even go one step further and include the exchange in our key. That would be helpful if we were querying based on the exchange.
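Building the natural keys described above is a one-liner. The date-first layout keeps keys for the same day together; the exchange-aware variant is the sketch suggested in the text, not necessarily what the loader script does:

```javascript
// Natural key in the YYYY-MM-DD-ticker_symbol format from the post.
function stockKey(date, symbol) {
  return date + "-" + symbol;                      // "2010-02-05-QTM"
}

// Hypothetical variant that also encodes the exchange, useful if we
// wanted to filter keys by exchange.
function stockKeyWithExchange(date, exchange, symbol) {
  return date + "-" + exchange + "-" + symbol;     // "2010-02-05-NASDAQ-QTM"
}
```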

If you take a look at mr_stocks.rb you’ll see that we’re setting up a query to filter stocks by the symbol QTM and then aggregate the total trade volume by month. The map phase creates a single-cell array with the stock volume traded in the month and returns it. We use the Riak.mapValuesJson function to map the raw data coming in from Riak to a proper JavaScript object. We then get the month that we’re looking at by parsing the key. This is easy enough to do because we have a well-defined key format.
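A sketch of that map phase follows. The stand-in for Riak.mapValuesJson lets the sketch run outside of Riak (inside Riak the real built-in is provided), and the stock_volume field name is an assumption about the data set’s schema:

```javascript
// Stand-in for Riak's built-in JSON decoder so this sketch runs outside
// Riak; inside Riak the real Riak.mapValuesJson is supplied for us.
var Riak = {
  mapValuesJson: function (value) {
    return value.values.map(function (v) { return JSON.parse(v.data); });
  }
};

// Map phase: decode the stored JSON, parse the month out of the
// well-defined key, and emit one {month: volume} pair.
function mapMonthlyVolume(value, keyData, arg) {
  var data = Riak.mapValuesJson(value)[0];
  var month = value.key.substring(0, 7);   // "2010-02" from "2010-02-05-QTM"
  var pair = {};
  pair[month] = parseInt(data.stock_volume, 10);
  return [pair];
}
```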

If we were to look at this output we would see a lot of rows of unaggregated data. While that is interesting, we want to look at trending for stock trades for QTM over all time. To do this, we create a reduce function that will sum up the output of the map function. This is some pretty self-explanatory JavaScript:
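A reduce function along those lines might look like this sketch (it assumes the map phase emits {month: volume} pairs, per the description above):

```javascript
// Reduce phase: merge the {month: volume} pairs from the map phase into
// a single object, summing the volumes per month.
function reduceMonthlyVolume(values, arg) {
  var totals = values.reduce(function (acc, pair) {
    for (var month in pair) {
      acc[month] = (acc[month] || 0) + parseInt(pair[month], 10);
    }
    return acc;
  }, {});
  return [totals];
}
```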

Okay, so that might not actually be as self-explanatory as anyone would like. The JavaScript reduce method is a newer addition to the language. It accumulates a single result (the acc variable) across all elements in the array. You could use this to get a sum, an average, or whatever you want.
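A minimal illustration of that accumulator pattern, outside of Riak entirely:

```javascript
// reduce carries a running result (acc) across every element of the array.
var volumes = [120, 300, 80];

var sum = volumes.reduce(function (acc, v) { return acc + v; }, 0);
// sum is 500

// The same machinery works for other aggregates, like a maximum:
var max = volumes.reduce(function (acc, v) { return v > acc ? v : acc; }, 0);
// max is 300
```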

One other thing to note is that we use parseInt. We probably don’t have to use it, but it’s a good idea. Why? Riak is not aware of our data structures. We just store arrays of bytes in Riak – it could be a picture, it could be text, it could be a gzipped file – Riak doesn’t care. JavaScript only knows that it’s a string. So, when we want to do mathematical operations on our data, it’s probably wise to use parseInt and parseFloat.
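A two-line demonstration of the failure mode parseInt protects against:

```javascript
// Riak hands our map/reduce functions strings, and + on strings
// concatenates rather than adds.
var raw = ["10", "20"];

var concatenated = raw[0] + raw[1];                        // "1020"
var summed = parseInt(raw[0], 10) + parseInt(raw[1], 10);  // 30
```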

Where to Now?

Right now you probably have a lot of data loaded. You have a couple of options. There are two scripts on GitHub to remove the stock data and the animal data from your Riak cluster. That’s a pretty boring option. What can you learn from deleting your data and shutting down your Riak cluster? Not a whole lot.

You should open up mr_stocks.rb and take a look at how it works. It should be pretty easy to modify the map and reduce functions to output total trade volume for the month, average volume per day, and average price per day. Give it a shot and see what you come up with.

If you have questions or run into problems, you can hit up the comments, the Riak Developer Mailing List, or the #riak IRC room on irc.freenode.net if you need immediate, real-time help with your problem.

A friend of mine half-jokingly says that the only reason to put data into a database is to get it back out again. In order to get data out, we need to ensure some kind of durability.

Relational databases offer single server durability through write-ahead logging and checkpoint mechanisms. These are tried and true methods of writing data to a replay log on disk as well as caching writes in memory. Whenever a checkpoint occurs, dirty data is flushed to disk. The benefit of a write-ahead log is that we can always recover from a crash (so long as we have the log files, of course).

How does single server durability work with non-relational databases? Most of them don’t have write-ahead logging.

MongoDB currently has limited single server durability. While some people consider this a weakness, it has some strengths – writes complete very quickly since there is no write-ahead log that needs to immediately sync to disk. MongoDB also has the ability to create replica sets for increased durability. There is one obvious upside to replica sets – the data is in multiple places. Another advantage of replica sets is that it’s possible to use getLastError({w:...}) to request acknowledgement from multiple replica servers before a write is reported as complete to a client. Just keep in mind that getLastError is not used by default – application code will have to call the method to force the sync.

Setting a w-value for writes was mentioned in Getting Faster Writes with Riak, although in that article we were decreasing durability to increase write performance. In Amazon Dynamo inspired systems, writes are not considered complete until multiple replicas have acknowledged them. The advantage is that durable replication is enforced at the database level, and clients have to explicitly elect to use less durability for their data. Refer to the Cassandra documentation on Writes and Consistency or the Riak Replication documentation for more information on how Dynamo inspired replication works. Datastores using HDFS for storage can take advantage of HDFS’s built-in data replication.

Even HBase, a column-oriented database, uses HDFS to handle data replication. The trick is that rows may be chopped up based on columns and split into regions. Those regions are then distributed around the cluster on what are called region servers. HBase is designed for real-time, random read/write access. If we’re trying to get real-time reads and writes, we can’t expect HBase to immediately sync files to disk – instead there’s a commit log (RDBMS people will know this as a write-ahead log). When a write comes in from a client, it is first appended to the commit log (which is stored using HDFS), then written to an in-memory structure; when that structure fills up, it is flushed to the filesystem. Here’s something cunning: since the commit log is written to HDFS, it’s available in multiple places in the cluster at the same time. If one of the region servers goes down, it’s easy enough to recover – that region server’s commit log is split apart and distributed to other region servers, which then take up the load of the failed region server.
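The write path described above can be sketched like this. The names (commit log, memstore, flush threshold) mirror HBase’s concepts, but this is an illustration of the flow, not HBase’s actual API:

```javascript
// Toy region server: append to the commit log first, then to the
// in-memory structure, and flush to the filesystem when it fills up.
function RegionServer(flushThreshold) {
  this.commitLog = [];   // stand-in for the HDFS-backed commit log
  this.memstore = [];    // in-memory writes
  this.flushed = [];     // stand-in for files flushed to the filesystem
  this.flushThreshold = flushThreshold;
}

RegionServer.prototype.put = function (row) {
  this.commitLog.push(row);                 // 1. durable log write (HDFS)
  this.memstore.push(row);                  // 2. in-memory write
  if (this.memstore.length >= this.flushThreshold) {
    this.flushed.push(this.memstore);       // 3. flush when full
    this.memstore = [];
  }
};
```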

Microsoft’s Azure environment also has redundancy built in. Much like Hadoop, the redundancy and durability are baked into Azure at the filesystem level. Building the redundancy at such a low level makes it easy for every component of the Azure environment to use it to achieve higher availability and durability. The Windows Azure Storage team has put together an excellent overview. Needless to say, Microsoft has implemented a very robust storage architecture for the Azure platform – binary data is split into chunks and spread across multiple servers. Each of those chunks is replicated so that there are three copies of the data at any given time. Future features will allow data to be seamlessly geographically replicated.

Even SQL Azure, Microsoft’s cloud based relational database, takes advantage of this replication. In SQL Azure when a row is written in the database, the write occurs on three servers simultaneously. Clients don’t even see an operation as having committed until the filesystem has responded from all three locations. Automatic replication is designed into the framework. This prevents the loss of a single server, rack, or rack container from taking down a large number of customers. And, just like in other distributed systems, when a single node goes down, the load and data are moved to other nodes. For a local database, this kind of durability is typically only obtained using a combination of SAN technology, database replication, and database mirroring.

There is a lot of solid technology backing the Azure platform, but I suspect that part of Microsoft’s ultimate goal is to hide the complexity of configuring data durability from the user. It’s foreseeable that future upgrades will make it possible to dial up or down durability for storage.

While relational databases are finding more ways to spread load out and retain consistency, there are changes in store for MongoDB to improve single server durability. MongoDB has been highly criticized for its lack of single server durability. Until recently, the default response has been that you should take frequent backups and write to multiple replicas. This is still a good idea, but it’s promising to see that the MongoDB development team is addressing single server durability concerns.

Why is single server durability important for any database? Aside from guaranteeing that data is correct in the event of a crash, it also makes it easier to increase adoption of a database at the department level. A durable single database server makes it easy to build an application on your desktop, deploy it to the server under your desk, and move it into the corporate data center as the application gains importance.

Logging and replication are critical technologies for databases. They guarantee data is durable and available. There are also just as many options as there are databases on the market. It’s important to understand the requirements of your application before choosing mechanisms to ensure durability and consistency across multiple servers.

Sawzall, WTF?

Apart from a tool that I once used to cut the muffler off of my car (true story), what is Sawzall?

Sawzall is a procedural language for analyzing excessively large data sets. When I say “excessively large data sets”, think Google Voice logs, utility meter readings, or the network traffic logs for the Chicago Public Library. You could also think of anything where you’re going to be crunching a lot of data over the course of many hours on your monster Dell R910 SQL Server.

Spell It Out for Me

At its most basic, Sawzall is a MapReduce engine, although the Google documentation goes to great pains to not use the word MapReduce, so maybe it’s not actually MapReduce. It smells oddly like MapReduce to me.

I’ll go into more depth on the ideas behind MapReduce in the future, but here are the basics of MapReduce as far as Sawzall is concerned:

Data is split into partitions.

Each partition is filtered. (This is the Map.)

The results of the filtering operation are used by an aggregation phase. (This is the Reduce.)

The results of the aggregation are saved to a file.
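The four steps above can be sketched as a single in-process pipeline. Real Sawzall distributes the partitions across many machines; this toy version just shows the shape of the data flow:

```javascript
// Toy pipeline: partition -> filter (the Map) -> aggregate (the Reduce).
function runJob(records, partitions, filter, aggregate) {
  // 1. split the data into partitions
  var parts = [];
  for (var i = 0; i < partitions; i++) { parts.push([]); }
  records.forEach(function (r, i) { parts[i % partitions].push(r); });

  // 2. filter each partition independently (the Map)
  var filtered = parts.map(function (p) { return p.filter(filter); });

  // 3. aggregate the filtered results (the Reduce)
  var flat = [].concat.apply([], filtered);
  return aggregate(flat);  // 4. the caller writes this result out
}

var result = runJob([1, 2, 3, 4, 5, 6], 3,
  function (n) { return n % 2 === 0; },   // keep the even records
  function (xs) { return xs.reduce(function (a, b) { return a + b; }, 0); });
// result: 2 + 4 + 6 = 12
```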

It’s pretty simple. That simplicity makes it possible to massively parallelize the analysis of data. If you’re in the RDBMS world, think Vertica, SQL Server Parallel Data Warehouse, or Oracle Exadata. If you are already entrenched and in love with NoSQL, you already know all about MapReduce and probably think I’m an idiot for dumbing it down so much.

The upside to Sawzall’s approach is that rather than write a Map program and a Reduce program and a job driver and maybe some kind of intermediate aggregator, you just write a single program in the Sawzall language and compile it.

… And Then?

I don’t think anyone is sure, yet. One of the problems with internal tools is that they’re part of a larger stack. Sawzall is part of Google’s internal infrastructure. It may emit compiled code, but how do we go about making use of those compiled programs in our own applications? Your answer is better than mine, most likely.

Sawzall uses something called Protocol Buffers – PB is a cross-language way to efficiently move objects and data around between programs. It looks like Twitter is already using Protocol Buffers for some of their data storage needs, so it might only be a matter of time before they adopt Sawzall – or before some blogger opines that they might adopt Sawzall.

So far nobody has a working implementation of Sawzall running on top of any MapReduce implementation – Hadoop, for instance. At a cursory glance, it seems like Sawzall could be used in Hadoop Streaming jobs. In fact, Section 10 of the Sawzall paper seems to point out that Sawzall is a record-by-record analytical language – your aggregator needs to be smart enough to handle the filtered records.

Why Do I Need Another Language?

This is a damn good question. I don’t program as much as I used to, but I can reasonably write code in C#, Ruby, JavaScript, numerous SQL dialects, and Java. I can read and understand at least twice as many languages. What’s the point of another language?

One advantage of a special purpose language is that you don’t have to worry about shoehorning domain-specific functionality into existing language constructs. You’re free to design the language the way it needs to be written. You can achieve a wonderful brevity by baking features into the language. Custom languages let developers focus on the problems at hand and ignore implementation details.

What Now?

You could download the code from the Google Code repository, compile it, and start playing around with it. It should be pretty easy to get up and running on Linux systems. OS X developers should look at these instructions from a helpful Hacker News reader. Windows developers should install Linux on a VM or buy a Mac.

Outside of downloading and installing Sawzall yourself to see what the fuss is about, the key is to keep watching the sky and see what happens.

You know that you should be testing your code. You even know that you should be testing your SQL. But why? We need to make sure that changes to our code are safe, that we prevent regressions, and that we catch edge cases.

But are you testing your code for performance?

Changes to code can make your code faster or slower, depending on indexing, user-defined functions, and built-in functions. Different computations can result in different execution plans. If changes to your code can cause drastic changes to your application’s performance, why aren’t you monitoring the performance of your code?

Test frameworks, like T-SQL Unit, make it possible to wrap the execution of your stored procedures in other processes. By taking advantage of these hooks it’s possible to time the execution of each procedure and record the results in a table (possibly even correlating each run to the appropriate version from source control). You can see how query performance changes over time.
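The hook idea is language-agnostic; here is a sketch in JavaScript of wrapping a call, timing it, and recording the result. The procedure name is made up, and T-SQL Unit’s real hooks live in T-SQL, not JavaScript:

```javascript
// Record {procedure, ms} rows so performance can be tracked over time.
var timings = [];

function timed(name, fn) {
  var start = Date.now();
  var result = fn();                   // run the wrapped procedure
  timings.push({ procedure: name, ms: Date.now() - start });
  return result;
}

// Hypothetical procedure wrapped by the timing hook:
var answer = timed("usp_GetOrders", function () { return 42; });
// answer is 42, and timings now holds one {procedure, ms} row
```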

Testing your code is important – you can prevent changes from causing both logical and performance problems.

People have chimed in and talked about the Foursquare outage. The nice part about these discussions is that they’re focusing on the technical problems with the current setup at Foursquare. They’re picking it apart and looking at what is right, what went wrong, and what needs to be done differently in MongoDB to prevent problems like this in the future.

Let’s play a “what if” game. What if Foursquare wasn’t using MongoDB? What if they were using something else?

Riak

Riak is a massively scalable key/value data store. It’s based on Amazon’s Dynamo. If you don’t want to read the paper, that just means that it uses some magic to make sure that data is evenly spread throughout the database cluster and that adding a new node makes it very easy to rebalance the cluster.

What would have happened at Foursquare if they had been using Riak?

Riak still suffers from the same performance characteristics around disk access as MongoDB – once you have to page to disk, operations become slow and throughput dries up. This is a common problem in any software that needs to access disks – disks are slow, RAM is fast.

Riak, however, has an interesting distinction. It allocates keys inside the cluster using something called a consistent hash ring – this is just a convenient way to rapidly allocate ranges of keys to different nodes within a cluster. The ring itself isn’t interesting. What’s exciting is that the ring is divided into partitions (64 by default). Every node in the system claims an equal share of those 64 partitions. In addition, each partition is replicated so there are always three copies of a given key/value pair at any time. Because there are multiple copies of the data, the failure of any single node is unlikely to cause data loss or imbalance. In theory, if Foursquare had been using Riak, it is very unlikely that they would have run into a problem where a single node becomes full.

How would this consistent hash ring design have helped? Adding a new node causes the cluster to redistribute data in the background. The new node will claim an equal amount of space in the cluster and the data will be redistributed from the other nodes in the background. The same thing happens when a node fails, by the way. Riak also only stores keys in memory, not all of the data, so it’s possible to reference an astronomical amount of data before running out of RAM.
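A toy sketch of partition assignment on that ring. The hash function here is made up for illustration – Riak actually hashes keys with SHA-1 into a 160-bit space – but the shape is the same: keys hash to one of 64 partitions, and each node claims an equal share of them:

```javascript
var PARTITIONS = 64;   // Riak's default ring size

// Toy string hash; Riak really uses SHA-1 over a 160-bit ring.
function toyHash(key) {
  var h = 0;
  for (var i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) % 1000003;
  }
  return h;
}

function partitionFor(key) {
  return toyHash(key) % PARTITIONS;
}

// Each node claims an equal, contiguous share of the 64 partitions.
function ownerOf(key, nodes) {
  var perNode = PARTITIONS / nodes.length;
  return nodes[Math.floor(partitionFor(key) / perNode)];
}
```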

There’s no need to worry about replica sets, re-sharding, or promoting a new node to master. Once a node joins a Riak cluster, it takes over its share of the load. As long as you have the network throughput on your local network (which you probably do), this operation can be fairly quick and painless.

Cassandra

Cassandra, like Riak, is based on Amazon’s Dynamo. It’s a massively distributed data store. I suspect that if Foursquare had used Cassandra, they would have run into similar problems.

Cassandra makes use of range partitioning to distribute data within the cluster. The Foursquare database was keyed off of the user name which, in turn, saw abnormal growth because some groups of users were more active than others. Names also tend to clump around certain letters, especially when you’re limited to a Latin character set. I know a lot more Daves than I know Zachariahs, and I have a lot of friends whose names start with the letter L. This distribution causes data within the cluster to be over-allocated to one node. This would, ultimately, lead to the same problems that happened at Foursquare with their MongoDB installation.
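A toy illustration of that skew, assuming four nodes that each claim a range of first letters (the names and ranges are made up for the example):

```javascript
// Range partitioner: assign a name to a node by its first letter.
function rangePartition(name, ranges) {
  var c = name[0].toLowerCase();
  for (var i = 0; i < ranges.length; i++) {
    if (c >= ranges[i][0] && c <= ranges[i][1]) { return i; }
  }
  return ranges.length - 1;
}

var ranges = [["a", "f"], ["g", "m"], ["n", "s"], ["t", "z"]];
var names = ["dave", "dan", "dina", "laura", "lee", "liz", "zach"];

var counts = [0, 0, 0, 0];
names.forEach(function (n) { counts[rangePartition(n, ranges)]++; });
// counts: [3, 3, 0, 1] - two nodes carry almost everything, one sits idle
```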

That being said, it’s possible to use a random partitioner for the data in Cassandra. The random partitioner makes it very easy to add additional nodes and distribute data across them. The random partitioner comes with a price. It makes it impossible to do quick range slice queries in Cassandra – you can no longer say “I want to see all of the data for January 3rd, 2010 through January 8th, 2010”. Instead, you would need to build up custom indexes to support your querying and build batch processes to load the indexes. The tradeoffs between the random partitioner and the order preserving partitioner are covered very well in Dominic Williams’s article Cassandra: RandomPartitioner vs OrderPreservingPartitioner.

Careful use of the random partitioner and supporting batch operations could have prevented the outage that Foursquare saw, but this would have led to different design challenges, some of which may have been difficult to overcome without resorting to a great deal of custom code.

HBase/HDFS/Hadoop

HBase is a distributed column-oriented database built on top of HDFS. It is based on Google’s Bigtable database as described in “Bigtable: A Distributed Storage System for Structured Data”. As an implementation of Bigtable, HBase has a number of advantages over MongoDB – write-ahead logging, rich ad hoc querying, and redundancy.

HBase is not going to suffer from the same node redistribution problems that caused the Foursquare outage. When it comes time to add a new node to the cluster data will be migrated in much larger chunks, one data file at a time. This makes it much easier to add a new data node and redistribute data across the network.

Just like a relational database, HBase is designed so that all data doesn’t need to reside in memory for good performance. The internal data structures are built in a way that makes it very easy to find data and return it to the client. Keys are held in memory and two levels of caching make sure that frequently used data will be in memory when a client requests it.

HBase also has the advantage of using a write ahead log. In the event of a drive failure, it is possible to recover from a backup of your HBase database and play back the log to make sure data is correct and consistent.

If all of this sounds like HBase is designed to be a replacement for an RDBMS, you would be close. HBase is a massively distributed database, just like Bigtable. As a result, data needs to be logged because there is a chance that a hardware node will fail and will need to be recovered. Because HBase is a column-oriented database, we need to be careful not to treat it exactly like a relational database.

A Relational Database

To be honest, Foursquare’s data load would be trivial in any relational database. SQL Server, Oracle, MySQL, and PostgreSQL can all handle orders of magnitude more data than the 132GB that Foursquare was storing at the time of the outage. This raises the question: how could we handle the constant write load? Foursquare is a write-intensive application.

Typically, in the relational database world, when we need to scale read and write loads we add more disks. There is a finite amount of space in a server chassis, and these local disks don’t provide the redundancy necessary for data security and performance; software RAID is also CPU intensive and slow. A better solution is to purchase a dedicated storage device – a SAN, NAS, or DAS. All of these devices offer read/write caching and can be configured in a variety of RAID levels for performance and redundancy.

RDBMSes are known quantities – they are easy to scale to certain points. Judging by the amount of data that Foursquare reports to have, they aren’t likely to reach the point where an RDBMS can no longer scale for a very long time. The downside to this approach is that an RDBMS is much costlier per TB of storage (up to ten times more expensive) than using MongoDB, but if your business is your data, then it’s important to keep the data safe.

Conclusion

It’s difficult to say if a different database solution would have prevented the Foursquare outage. But it is a good opportunity to highlight how different data storage systems would respond in the same situation.

I should have written this right when I got back from Hadoop World, instead of a week or so later, but things don’t always happen the way you plan. Before I left to go to Hadoop World (and points in between), I put up a blog post asking for questions about Hadoop. You guys responded with some good questions and I think I owe you answers.

What Is Hadoop?

Hadoop isn’t a simple database; it’s a bunch of different technologies built on top of the Hadoop common utilities, MapReduce, and HDFS (Hadoop Distributed File System). Each of these products serves a simple purpose – HDFS handles storage, MapReduce is a parallel job distribution system, HBase is a distributed database with support for structured tables. You can find out more on the Apache Hadoop page. I’m bailing on this question because 1) it wasn’t asked and 2) I don’t think it’s fair or appropriate for me to regurgitate these answers.

How Do You Install It?

Installing Hadoop is not quite as easy as installing Cassandra. There are two flavors to choose from:

Windows users, keep in mind that Hadoop is not supported on Windows in production. If you want to try it, you’re completely on your own and in uncharted waters.

What Login Security Model(s) Does It Have?

Good question! NoSQL databases are not renowned for their data security.

Back in the beginning, at the dawn of Hadoopery, it was assumed that everyone accessing the system would be a trusted individual operating inside a secured environment. Clearly this happy ideal won’t fly in a modern, litigation-fueled world. As a result, Hadoop has support for Kerberos authentication (meaning all versions of Hadoop newer than version 0.22 for the Apache Hadoop distribution, Cloudera’s CDH3 distribution, or the 0.20.S Yahoo! Distribution of Hadoop). The Kerberos piece handles the proper authentication, but it is still up to Hadoop and HDFS (it’s a distributed file system, remember?) to make sure that an authenticated user is authorized to mess around with a file.

In short: Kerberos guarantees that I’m Jeremiah Peschka and Hadoop+HDFS guarantee that I can muck with data.

N.B. As of 2010-10-19 (when I’m writing this), Hadoop 0.22 is not available for general use. If you want to run Hadoop 0.22, you’ll have to build from the source control repository yourself. Good luck and godspeed.

When Does It Make Sense To Use Hadoop Instead of SQL Server, Oracle, or DB2?

Or any RDBMS for that matter. Personally, I think an example makes this easier to understand.

Let’s say that we have 1 terabyte of data (the average Hadoop cluster is reported as being 114.5TB in size) and we need to process this data nightly – create reports, analyze trends, detect anomalous data, and so on. If we are using an RDBMS, we’d have to batch this data up (to avoid transaction log problems). We would also need to deal with the limitations of parallelism in our OS/RDBMS combination, as well as I/O subsystem limitations (we can only read so much data at one time). SQL dialects are also remarkably bad languages for loop and flow control.

If we were using Hadoop for our batch operations, we’d write a MapReduce program (think of it like a query, for now). This MapReduce program would be distributed across all of the nodes in the cluster and then run in parallel using local storage and resources. So instead of hitting a single SAN across 8 or 16 parallel execution threads, we might have 20 commodity servers all processing 1/20 of the data simultaneously – each server processing 50GB. The results will be combined once the job is done and then we’re free to do whatever we want with them – if we’re using SQL Server to store the data in tables for reporting, we would probably bcp the data into a table.

Another benefit of Hadoop is that the MapReduce functionality is implemented in Java, C, or another imperative programming language. This makes it easy to solve computationally intensive operations in MapReduce programs. There are a large number of programming problems that cannot be easily solved in a SQL dialect; SQL is designed for retrieving data and performing relatively simple transformations on it, not for complex programming tasks.

Hadoop (Hadoop common + HDFS + MapReduce) is great for batch processing. If you need to consume massive amounts of data, it’s the right tool for the job. Hadoop is being used in production for point of sale analysis, fraud detection, machine learning, risk analysis, and fault/failure prediction.

The other consideration is cost: Hadoop is deployed on commodity servers using local disks and no software licensing fees (Linux is free, Hadoop is free). The same thing with an RDBMS is going to involve buying an expensive SAN, a high end server with a lot of RAM, and paying licensing fees to Microsoft, Oracle, or IBM as well as any other vendors involved. In the end, it can cost 10 times less to store 1TB of data in Hadoop than in an RDBMS.

HBase provides the random, realtime read/write access that you might be used to from the OLTP world. The difference, however, is that this is still NoSQL data access. Tables are large and irregular (much like Google’s Bigtable) and there are no complex transactions.

Hive is a data warehouse toolset that sits on top of Hadoop. It supports many of the features that you’re used to seeing in SQL, including a SQL-ish query language that should be easily learned but also provides support for using MapReduce functionality in ad hoc queries.

In short, Hadoop and SQL Server solve different sets of problems. If you need transactional support, complex rule validation and data integrity, use an RDBMS. If you need to process data in parallel, perform batch and ad hoc analysis, or perform computationally expensive transformations then you should look into Hadoop.

What Tools Are Provided to Manage Hadoop?

Unfortunately, there are not a lot of management tools on the market for Hadoop – the only tools I found were supplied by Cloudera. APIs are available to develop your own management tools. From the sound of discussions that I overheard, I think a lot of developers are going down this route – they’re developing monitoring tools that meet their own, internal, needs rather than build general purpose tools that they could sell to third parties. As the core product improves, I’m sure that more vendors will be stepping up to the plate to provide additional tools and support for Hadoop.

Right now, there are a few products on the market that support Hadoop:

Quest’s Toad for Cloud makes it easy to query data stored using HBase and Hive.

Quest’s OraOop is an Oracle database driver for Sqoop – you can think of Sqoop as Hadoop’s equivalent of SQL Server’s bcp program.

What Is The State of Write Consistency?

As I mentioned earlier, there is no support for ACID transactions. At the most basic level, Hadoop processes data in bulk from files stored in HDFS. When writes fail, they’re retried until the minimum guaranteed number of replicas is written. This is a built-in part of HDFS; you get this consistency for free just by using Hadoop.

As far as eventual consistency is concerned, Hadoop uses an interesting method to ensure that data is quickly and effectively written to disk. Basically, HDFS finds a place on disk to write the data. Once that write has completed, the data is forwarded to another node in the cluster. If that node fails to write the data, a new node is picked and the data is written until the minimum number of replicas have had data written to them. There are additional routines in place that will attempt to spread the data throughout the data center.
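A simplified sketch of that pipeline. The node objects and failure handling here are made up for illustration – real HDFS forwards data through a pipeline of datanodes chosen by the namenode – but the flow is the same: write, forward, and keep picking replacements until the minimum number of replicas is on disk:

```javascript
// Write a block to nodes until minReplicas copies exist, skipping any
// node whose write fails and moving on to the next candidate.
function writeWithReplicas(block, nodes, minReplicas) {
  var written = [];
  for (var i = 0; i < nodes.length && written.length < minReplicas; i++) {
    if (nodes[i].write(block)) {       // a node may fail the write...
      written.push(nodes[i].name);     // ...on success, forward to the next
    }                                  // ...on failure, pick another node
  }
  return written;
}

var nodes = [
  { name: "dn1", write: function () { return true;  } },
  { name: "dn2", write: function () { return false; } },  // simulated failure
  { name: "dn3", write: function () { return true;  } },
  { name: "dn4", write: function () { return true;  } }
];
var replicas = writeWithReplicas("block-0001", nodes, 3);
// replicas: ["dn1", "dn3", "dn4"]
```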

If this seems like an overly simple explanation of how HDFS and Hadoop write data, you’re right. The specifics of write consistency can be found in the HDFS Architecture Guide (which is nowhere near as scary as it sounds).