So, last year, Drizzle participated in the Google Summer of Code under the MySQL project organization. We had four excellent student submissions, and Monty Taylor, Eric Day, Stewart Smith, and I all mentored students for the summer. It was my second year mentoring, and I really enjoyed it, so I was looking forward to this year’s Summer of Code.

I have been absolutely floored by the flood of potential students who have shown up on the mailing list and the #drizzle IRC channel. I have been even more impressed with those students’ ambition, sense of community, and willingness to ask questions and help other students as they show up. A couple of students have even contributed code to the source trees before submitting their official applications to GSoC. See, I told you they were ambitious!

This year, Drizzle has a listing of 16 potential projects for students to work on. The projects are for students interested in developing in C++, Python, or Perl.

If you are interested in participating, please do check out Drizzle! For those new to Launchpad, Bazaar, and C++ development with Drizzle, feel free to check out these blog articles which cover those topics:

Today I pushed up the initial patch which adds XA support to Drizzle’s transaction log. So, to give myself a bit of a rest from coding, I’m going to blog a bit about the transaction log and show off some of its features.

WARNING: Please keep in mind that the transaction log module in Drizzle is under heavy development and should not be used in production environments. That said, I’d love to get as much feedback as possible on it, and if you feel like throwing some heavy data at it, that would be awesome!

What is the Transaction Log?

Simply put, the transaction log is a record of every modification to the state of the server’s data. It is similar to MySQL’s binlog, with some substantial differences:

The transaction log is a plugin[1]. It lives entirely outside of the Drizzle kernel. The advantage of this is that development of the transaction log does not need to be linked with development in the kernel and versioning of the transaction log can happen independently of the kernel.

Currently, there is only a single log file. MySQL’s binlog can be split into multiple files. This may or may not change in the future.

Drizzle’s transaction log is indexed. Among other things, this means that you can query the transaction log directly from within a Drizzle client via DATA_DICTIONARY views. I will demonstrate this feature below.
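To give a feel for what “indexed” means here, below is a minimal, self-contained sketch of the idea. All names (IndexedLog, IndexRecord, and so on) are hypothetical illustrations, not Drizzle’s actual internals: each entry appended to the log also gets an in-memory index record, so the DATA_DICTIONARY views can report entry offsets, types, and lengths without scanning the file.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical sketch of an indexed append-only log.
enum class EntryType { TRANSACTION, BLOB };

struct IndexRecord
{
  EntryType type;
  size_t length; // length in bytes of the serialized entry
};

class IndexedLog
{
public:
  // Append a serialized entry; return its offset in the log.
  size_t append(EntryType type, const std::string &payload)
  {
    size_t offset= log_.size();
    log_.append(payload);
    index_[offset]= IndexRecord{type, payload.size()};
    return offset;
  }

  // What a view would read: the index, not the raw file.
  const std::map<size_t, IndexRecord> &index() const { return index_; }

  size_t fileLength() const { return log_.size(); }

private:
  std::string log_;                     // stand-in for the on-disk file
  std::map<size_t, IndexRecord> index_; // offset -> entry metadata
};
```

The index is what makes querying the log from a client cheap: a view over it is just an iteration of in-memory records.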

It is important to also point out that Drizzle’s transaction log is not required for Drizzle replication. This probably sounds very weird to folks who are accustomed to MySQL replication, which depends on the MySQL binlog. In Drizzle, the replication API is different. Although the transaction log can be used in Drizzle’s replication system, it’s not required. I’ll write more on this in later blog posts which demonstrate how the replication system is not dependent on the transaction log, but in this article I just want to highlight the transaction log module.

How Do I Enable the Transaction Log?

First things first, let’s see how we can enable the Transaction Log. If you’ve built Drizzle from source or have installed Drizzle locally, you will be familiar with the process of starting up a Drizzle server. To review, here is how you do so:

cd $basedir
./drizzled [options] &

Where $basedir is the directory in which you built or installed Drizzle. For the [options], you will typically need at the very least a --datadir=$DATADIR and a --mysql-protocol-port=$PORT value. For an explanation of the --mysql-protocol-port option, see Eric Day’s recent article.

To demonstrate, I’ve built a Drizzle server in a local directory of mine, and I’ll use the /tests/var/ directory as my $datadir:

Now let’s start up the server, this time passing the --transaction-log-enable and the --default-replicator-enable options. The --default-replicator-enable option is needed when the transaction log is not in XA mode (more on that later):

Let’s see what each of the views tells us about what is in the transaction log. Remember, we’ve executed a CREATE SCHEMA, a CREATE TABLE, and a single INSERT. Here is what the TRANSACTION_LOG view shows:

The column names should be self-explanatory. The FILE_LENGTH column shows the size in bytes of the log (which matches the output we had from our ls -lha above). The INDEX_SIZE_IN_BYTES column is the total amount of memory allocated for the transaction log index.

The TRANSACTION_LOG_ENTRIES view isn’t that interesting at first glance:

You might be tempted to ask what the heck the TRANSACTION_LOG_ENTRIES view is for. It is a bit of a bridge table that allows one to see the type of entry at each offset. Currently, there are only two types of entries in the transaction log: TRANSACTION entries (basically a serialized Google Protobuffer message) and BLOB entries, which are for storage of large blob data.

The TRANSACTION_LOG_TRANSACTIONS view shows all the transaction log entries which are of type TRANSACTION:

As you can see, there is some basic information about each transaction entry in the log, including the offset in the transaction log, the start and end timestamps of the transaction, its transaction identifier, the number of statements involved in the transaction, and an optional checksum for the message (more on checksums below).

Viewing the Transaction Content

While the above view output may be nice, what we’d really like to be able to do is see precisely what changes a transaction effected. To see this, we can use the PRINT_TRANSACTION_MESSAGE(log_file, offset) UDF. Below, I’ve added two more rows to the lebowski.characters table within an explicit transaction. I then query the DATA_DICTIONARY views using the PRINT_TRANSACTION_MESSAGE() function to show the changes logged to the transaction log:

You may notice that NUM_STATEMENTS is equal to 1 even though there were 2 INSERT statements issued. This is because the kernel packages both INSERTs into a single message::Statement::InsertData message for more efficient storage. If there had been an INSERT and an UPDATE, NUM_STATEMENTS would be 2.
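The packaging rule above can be sketched in a few lines. This is an illustrative simplification (the type and function names here are hypothetical, not Drizzle’s actual message classes): consecutive row events of the same kind are coalesced into one statement, so two back-to-back INSERTs become a single statement with two records, while an INSERT followed by an UPDATE yields two statements.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of how row events get packaged into statements.
enum class StatementType { INSERT, UPDATE, DELETE };

struct Statement
{
  StatementType type;
  std::vector<std::string> records; // one entry per affected row
};

std::vector<Statement> packageRowEvents(
    const std::vector<std::pair<StatementType, std::string> > &events)
{
  std::vector<Statement> statements;
  for (const auto &event : events)
  {
    // Start a new statement only when the event type changes.
    if (statements.empty() || statements.back().type != event.first)
      statements.push_back(Statement{event.first, {}});
    statements.back().records.push_back(event.second);
  }
  return statements;
}
```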

Enable Automatic Checksumming

One final feature I’ll highlight in this blog post is an option to automatically store a checksum of each transaction message when writing entries to the transaction log. To enable this feature, simply use the --transaction-log-enable-checksum command line option. You can view the checksums of entries in the TRANSACTION_LOG_TRANSACTIONS view, as demonstrated below:
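To illustrate the mechanics, here is a small self-contained sketch. I’m using a plain CRC-32 here purely for demonstration; the checksum algorithm the transaction log actually uses isn’t shown in this post, and the LogEntry/makeEntry names are my own invention. The point is simply that when the option is enabled, a checksum of the serialized message is computed and stored alongside the entry at write time, and zero is stored otherwise.

```cpp
#include <cstdint>
#include <string>

// Plain CRC-32 (reflected, polynomial 0xEDB88320), standing in for
// whatever checksum the transaction log actually uses.
uint32_t crc32(const std::string &data)
{
  uint32_t crc= 0xFFFFFFFFu;
  for (unsigned char byte : data)
  {
    crc ^= byte;
    for (int bit= 0; bit < 8; ++bit)
      crc= (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
  }
  return crc ^ 0xFFFFFFFFu;
}

// Hypothetical shape of a log entry with its optional checksum.
struct LogEntry
{
  std::string payload; // serialized Transaction message
  uint32_t checksum;   // 0 when checksumming is disabled
};

LogEntry makeEntry(const std::string &payload, bool checksum_enabled)
{
  return LogEntry{payload, checksum_enabled ? crc32(payload) : 0};
}
```

On read (or during replication), the same function is run over the payload and compared against the stored value to detect corruption.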

DDL is not Statement-based Replication

As a final note, I’d like to point out that even DDL in Drizzle is replicated as row-based transaction messages, and not as raw SQL statements like in MySQL. You can see, for instance, the message::Statement::CreateTableStatement inside the transaction message which contains all the metadata about the table you just created.

Over the past six weeks or so, I have been working on cleaning up the pluggable storage engine API in Drizzle. I’d like to describe some of this work and talk a bit about the next steps I’m taking in the coming months as we roll towards implementing Log Shipping in Drizzle.

First, how did it come about that I started working on the storage engine API?

From Commands to Transactions

Well, it really goes back to my work on Drizzle’s replication system. I had implemented a simple, fast, and extensible log which stored records of the data changes made to a server. Originally, the log was called the Command Log, because the Google Protobuffer messages it contained were called message::Commands. The API for implementing replication plugins was very simple and within a month or so of debuting the API, quite a few replication plugins had been built, including one replicating to Memcached, a prototype one replicating to Gearman, and a filtering replicator plugin.

In addition, Marcus Eriksson had created the RabbitReplication project, which could replicate from Drizzle to other data stores, including Cassandra and Project Voldemort. However, Marcus did not actually implement any C/C++ plugins using the Drizzle replication API. Instead, RabbitReplication simply read the new Command Log, which, being simply a file full of Google Protobuffer messages, was quick and easy to read into memory using a variety of different programming languages. RabbitReplication is written in Java, and it was great to see other programming languages be able to read Drizzle’s replication log so easily. Marcus later coded up a C++ TransactionApplier plugin which replaces the Drizzle replication log and instead replicates the GPB messages directly to RabbitMQ.

And there, you’ll note that one of the plugins involved in Drizzle’s replication system is called TransactionApplier. It used to be called CommandApplier. That was because the GPB Command messages were individual row change events for the most part. However, I made a series of changes to the replication API, and now the GPB messages sent through the APIs are of class message::Transaction. message::Transaction objects contain a transaction context, with information about the transaction’s start and end time and its transaction identifier, along with a series of message::Statement objects, each of which represents a part of the data changes that the SQL transaction made.

Thus, the Command Log now turned into the Transaction Log, and everywhere the term Command was used now was replaced with the terms Transaction and Statement (depending on whether you were talking about the entire Transaction or a piece of it). Log entries were now written at COMMIT to the Transaction Log and were not written if no COMMIT occurred[1].

After finishing this work to make the transaction log write Transaction messages at commit time, I was keen to begin coding up the publisher and subscriber plugins which represent a node in the replication environment. However, Brian had asked me to delay working on other replication features and ensure that the replication API could support fully distributed transactions via the X/Open XA distributed transaction protocol. XA support had been removed from Drizzle when the MySQL binlog and original replication system was ripped out and needed some TLC. Fair enough, I said. So, off I went to work on XA.

If Only It Were Simple…

As anyone who has worked on the MySQL source code or developed storage engines for MySQL knows, working with the MySQL pluggable storage engine API is sometimes not the easiest or most straightforward thing. I think the biggest problem with the MySQL storage engine API is that, due to understandable historical reasons, it’s an API that was designed with the MyISAM and HEAP storage engines in mind. Much of the transactional pieces of the API seem to be a bolted-on afterthought and can be very confusing to work with.

As an example, Paul McCullagh, developer of the transactional storage engine PBXT, recently emailed the mysql internals mailing list asking how a storage engine could tell when a SQL statement started and ended. You would think that such a seemingly basic piece of functionality would have a simple answer. You’d be wrong. Monty Widenius answered like this:

Why not simply have a counter in your transaction object for how start_stmt – reset(); When this is 0 then you know stmnt ended.

In Maria we count number of calls to external_lock() and when the sum goes to 0 we know the transaction has ended.

MySQL never kept a count of which handlers are used by a transaction, only which tables.

So the original logic was that external_lock(lock/unlock) is called for each usage of the table, which is normally more than enough information for a handler to know when a statement starts/ends.

The one case this didn’t work was in the case someone does lock tables as then external_lock is not called per statement. It was to satisfy this case that we added a call to start_stmt() for each table.

It’s of course possible to change things so that start_stmt() / end_stmt() would be called once per used handler, but this would be yet another overhead for the upper level to do which the current handlers that tracks call to external_lock() doesn’t need.
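The counting trick Monty describes can be sketched in a few lines. This is my own minimal illustration (the class and method names are hypothetical, not actual MySQL or Maria source): the engine keeps a counter on its transaction object, increments it on every lock call, decrements it on every unlock, and infers that the statement has ended when the count returns to zero.

```cpp
// Hypothetical sketch of the external_lock() counting idiom an engine
// has to implement to infer statement boundaries on its own.
class EngineTransaction
{
public:
  // Called for each table locked at the start of a statement.
  void externalLock() { ++lock_count_; }

  // Called for each table unlocked; returns true when this unlock
  // brings the count to zero, i.e. the statement has ended.
  bool externalUnlock() { return --lock_count_ == 0; }

private:
  int lock_count_= 0;
};
```

A statement touching two tables produces two lock calls, so only the second unlock signals the end of the statement; this is exactly the bookkeeping the Drizzle changes described below make unnecessary.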

Well, in Drizzle-land, we aren’t beholden to “historic reasons”. So, after looking through the in-need-of-attention transaction processing code in the kernel, I decided that I would clean up the API so that storage engines did not have to jump through hoops to notify the kernel that they participate in a transaction, or just to figure out when a statement and a transaction started and ended.

The resulting changes to the API are quite dramatic I think, but I’ll leave it to the storage engine developers to tell me if the changes are good or not. The following is a summary of the changes to the storage engine API that I committed in the last few weeks.

plugin::StorageEngine Split Into Subclasses

The very first thing I did was to split the enormous base plugin class for a storage engine, plugin::StorageEngine, into two other subclasses containing transactional elements. plugin::TransactionalStorageEngine is now the base class for all storage engines which implement SQL transactions:

/**
 * A type of storage engine which supports SQL transactions.
 *
 * This class adds the SQL transactional API to the regular
 * storage engine. In other words, it adds support for the
 * following SQL statements:
 *
 *   START TRANSACTION;
 *   COMMIT;
 *   ROLLBACK;
 *   ROLLBACK TO SAVEPOINT;
 *   SET SAVEPOINT;
 *   RELEASE SAVEPOINT;
 */
class TransactionalStorageEngine : public StorageEngine
{
public:
  TransactionalStorageEngine(const std::string name_arg,
                             const std::bitset<HTON_BIT_SIZE> &flags_arg= HTON_NO_FLAGS);
  virtual ~TransactionalStorageEngine();
  ...
private:
  void setTransactionReadWrite(Session& session);

  /*
   * Indicates to a storage engine the start of a
   * new SQL transaction. This is called ONLY in the following
   * scenarios:
   *
   * 1) An explicit BEGIN WORK/START TRANSACTION is called
   * 2) After an explicit COMMIT AND CHAIN is called
   * 3) After an explicit ROLLBACK AND RELEASE is called
   * 4) When in AUTOCOMMIT mode and directly before a new
   *    SQL statement is started.
   */
  virtual int doStartTransaction(Session *session, start_transaction_option_t options)
  {
    (void) session;
    (void) options;
    return 0;
  }

  /**
   * Implementing classes should override these to provide savepoint
   * functionality.
   */
  virtual int doSetSavepoint(Session *session, NamedSavepoint &savepoint)= 0;
  virtual int doRollbackToSavepoint(Session *session, NamedSavepoint &savepoint)= 0;
  virtual int doReleaseSavepoint(Session *session, NamedSavepoint &savepoint)= 0;

  /**
   * Commits either the "statement transaction" or the "normal transaction".
   *
   * @param[in] The Session
   * @param[in] true if it's a real commit, that makes persistent changes
   *            false if it's not in fact a commit but an end of the
   *            statement that is part of the transaction.
   * @note
   *
   * 'normal_transaction' is also false in auto-commit mode where 'end of statement'
   * and 'real commit' mean the same event.
   */
  virtual int doCommit(Session *session, bool normal_transaction)= 0;

  /**
   * Rolls back either the "statement transaction" or the "normal transaction".
   *
   * @param[in] The Session
   * @param[in] true if it's a real commit, that makes persistent changes
   *            false if it's not in fact a commit but an end of the
   *            statement that is part of the transaction.
   * @note
   *
   * 'normal_transaction' is also false in auto-commit mode where 'end of statement'
   * and 'real commit' mean the same event.
   */
  virtual int doRollback(Session *session, bool normal_transaction)= 0;

  virtual int doReleaseTemporaryLatches(Session *session)
  {
    (void) session;
    return 0;
  }

  virtual int doStartConsistentSnapshot(Session *session)
  {
    (void) session;
    return 0;
  }
};

As you can see, plugin::TransactionalStorageEngine inherits from plugin::StorageEngine and extends it with a series of private pure virtual methods that implement the SQL transaction parts of a query — doCommit(), doRollback(), etc. Implementing classes simply inherit from plugin::TransactionalStorageEngine and implement their internal transaction processing in these private methods.
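To make the inheritance pattern concrete without dragging in the real Session, savepoint, and plugin machinery, here is a pared-down, self-contained sketch. The base class here is a toy stand-in for plugin::TransactionalStorageEngine (everything about it is simplified), but it shows the shape an engine implementer deals with: override the private do*() methods, and let callers go through the public interface.

```cpp
#include <string>
#include <vector>

// Toy stand-in for the real base class: public non-virtual interface,
// private pure-virtual implementation hooks.
class TransactionalStorageEngine
{
public:
  virtual ~TransactionalStorageEngine() {}
  int commit(bool normal_transaction) { return doCommit(normal_transaction); }
  int rollback(bool normal_transaction) { return doRollback(normal_transaction); }
private:
  virtual int doCommit(bool normal_transaction)= 0;
  virtual int doRollback(bool normal_transaction)= 0;
};

// A hypothetical engine: it only has to fill in the private hooks.
class ToyEngine : public TransactionalStorageEngine
{
public:
  const std::vector<std::string> &events() const { return events_; }
private:
  int doCommit(bool normal_transaction) override
  {
    events_.push_back(normal_transaction ? "commit" : "end-of-statement");
    return 0; // 0 == success
  }
  int doRollback(bool) override
  {
    events_.push_back("rollback");
    return 0;
  }
  std::vector<std::string> events_;
};
```

The private-virtual layout is the non-virtual interface idiom: the kernel controls when the hooks fire, and the engine cannot be invoked in the wrong order from outside.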

In addition to the SQL transaction, however, is the concept of an XA transaction, which is for distributed transaction coordination. The XA protocol is a two-phase commit protocol because it implements a PREPARE step before a COMMIT occurs. This XA API is exposed via two other classes, plugin::XaResourceManager and plugin::XaStorageEngine. plugin::XaResourceManager derived classes implement the resource manager API of the XA protocol. plugin::XaStorageEngine is a storage engine subclass which, while also implementing SQL transactions, also implements XA transactions.

Here is the plugin::XaResourceManager class:

/**
 * An abstract interface class which exposes the participation
 * of implementing classes in distributed transactions in the XA protocol.
 */
class XaResourceManager
{
public:
  XaResourceManager() {}
  virtual ~XaResourceManager() {}
  ...
private:
  /**
   * Does the COMMIT stage of the two-phase commit.
   */
  virtual int doXaCommit(Session *session, bool normal_transaction)= 0;

  /**
   * Does the ROLLBACK stage of the two-phase commit.
   */
  virtual int doXaRollback(Session *session, bool normal_transaction)= 0;

  /**
   * Does the PREPARE stage of the two-phase commit.
   */
  virtual int doXaPrepare(Session *session, bool normal_transaction)= 0;

  /**
   * Rolls back a transaction identified by a XID.
   */
  virtual int doXaRollbackXid(XID *xid)= 0;

  /**
   * Commits a transaction identified by a XID.
   */
  virtual int doXaCommitXid(XID *xid)= 0;

  /**
   * Notifies the transaction manager of any transactions
   * which had been marked prepared but not committed at
   * crash time or that have been heuristically completed
   * by the storage engine.
   *
   * @param[out] Reference to a vector of XIDs to add to
   *
   * @retval
   *   Returns the number of transactions left to recover
   *   for this engine.
   */
  virtual int doXaRecover(XID *append_to, size_t len)= 0;
};

Pretty clear. A plugin::XaStorageEngine inherits from both plugin::TransactionalStorageEngine and plugin::XaResourceManager because it implements both SQL transactions and XA transactions. The InnobaseEngine plugin inherits from plugin::XaStorageEngine because InnoDB supports SQL transactions as well as XA.
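For readers new to XA, the two-phase flow the transaction manager drives across resource managers looks roughly like the sketch below. This is a deliberately simplified, self-contained illustration (the interface is collapsed to three methods and all names here are hypothetical): PREPARE every participant first, COMMIT only if all of them prepared successfully, otherwise ROLLBACK everyone.

```cpp
#include <vector>

// Hypothetical, simplified XA participant interface.
class XaResourceManager
{
public:
  virtual ~XaResourceManager() {}
  virtual int xaPrepare()= 0; // 0 == prepared OK
  virtual int xaCommit()= 0;
  virtual int xaRollback()= 0;
};

// A toy participant used for demonstration.
class ToyResource : public XaResourceManager
{
public:
  explicit ToyResource(bool prepare_ok) : prepare_ok_(prepare_ok) {}
  int xaPrepare() override { return prepare_ok_ ? 0 : 1; }
  int xaCommit() override { committed= true; return 0; }
  int xaRollback() override { rolled_back= true; return 0; }
  bool committed= false;
  bool rolled_back= false;
private:
  bool prepare_ok_;
};

// Phase 1: prepare all; phase 2: commit all, or roll everyone back.
bool twoPhaseCommit(std::vector<XaResourceManager *> &participants)
{
  for (XaResourceManager *rm : participants)
  {
    if (rm->xaPrepare() != 0)
    {
      for (XaResourceManager *other : participants)
        other->xaRollback();
      return false;
    }
  }
  for (XaResourceManager *rm : participants)
    rm->xaCommit();
  return true;
}
```

The recovery methods in the real interface (doXaRecover() and friends) exist for the crash case: participants that prepared but never heard the final verdict must be found and resolved at restart.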

Explicit Statement and Transaction Boundaries

The second major change I made addressed the problem that Paul McCullagh noted in asking why finding out when a statement starts and ends was so obscure. I added two new methods to plugin::StorageEngine called doStartStatement() and doEndStatement(). The kernel now explicitly tells storage engines when a SQL statement starts and ends. This happens before any calls to Cursor::external_lock() happen, and there are no exception cases. In addition, the kernel now always tells transactional storage engines when a new SQL transaction is starting. It does this via an explicit call to plugin::TransactionalStorageEngine::doStartTransaction(). No exceptions, and yes, even for DDL operations.

What this means is that for a transactional storage engine, it no longer needs to “count the calls to Cursor::external_lock()” in order to know when a statement or transaction starts and ends. For a SQL transaction, this means that there is a clear code call path and there is no need for the storage engine to track whether the session is in AUTOCOMMIT mode or not. The kernel does all that work for the storage engine. Imagine a Session executes a single INSERT statement against an InnoDB table while in AUTOCOMMIT mode. This is what the call path looks like:
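A self-contained sketch of that call sequence is below. The engine class here is a recording stand-in I made up for illustration (the real kernel passes Session objects and routes row writes through a Cursor), but the method names doStartTransaction(), doStartStatement(), doEndStatement(), and doCommit() are the ones described above, and the ordering is the point: the engine is told every boundary explicitly, with no call-counting on its side.

```cpp
#include <string>
#include <vector>

// A stand-in engine that just records which hooks the kernel invoked.
class RecordingEngine
{
public:
  void doStartTransaction() { calls_.push_back("doStartTransaction"); }
  void doStartStatement()   { calls_.push_back("doStartStatement"); }
  void doEndStatement()     { calls_.push_back("doEndStatement"); }
  void doCommit()           { calls_.push_back("doCommit"); }
  const std::vector<std::string> &calls() const { return calls_; }
private:
  std::vector<std::string> calls_;
};

// What the kernel does for a single INSERT in AUTOCOMMIT mode.
void runAutocommitInsert(RecordingEngine &engine)
{
  engine.doStartTransaction(); // scenario 4: AUTOCOMMIT, before the statement
  engine.doStartStatement();
  // ... the row write happens here, via the Cursor ...
  engine.doEndStatement();
  engine.doCommit();           // AUTOCOMMIT: end of statement implies commit
}
```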

No More Need for Engine to Call trans_register_ha()

The server has no way to know that an engine participates in
the statement and a transaction has been started
in it unless the engine says so. Thus, in order to be
a part of a transaction, the engine must “register” itself.
This is done by invoking trans_register_ha() server call.
Normally the engine registers itself whenever handler::external_lock()
is called. trans_register_ha() can be invoked many times: if
an engine is already registered, the call does nothing.
In case autocommit is not set, the engine must register itself
twice — both in the statement list and in the normal transaction
list.

That comment, and I’ve read it dozens of times, always seemed strange to me. I mean, does the server really not know that an engine participates in a statement or transaction unless the engine tells it? Of course not.

So, I removed the need for a storage engine to “register itself” with the kernel. Now, the transaction manager inside the Drizzle kernel (implemented in the TransactionServices component) automatically monitors which engines are participating in an SQL transaction and the engine doesn’t need to do anything to register itself.

In addition, due to the break-up of the plugin::StorageEngine class and the XA API into plugin::XaResourceManager, Drizzle’s transaction manager can now coordinate XA transactions from plugins other than storage engines. Yep, that’s right. Any plugin which implements plugin::XaResourceManager can participate in an XA transaction and Drizzle will act as the transaction manager. What’s the first plugin that will do this? Drizzle’s transaction log. The transaction log isn’t a storage engine, but it is able to participate in an XA transaction, so it will implement plugin::XaResourceManager but not plugin::StorageEngine.

Performance Impact of Code Changes

So, that “yet another overhead” Monty talked about in the quote above? There wasn’t any noticeable impact in performance or scalability at all. So much for optimize-first coding.

What’s Next?

The next thing I’m working on is removing the notion of the “statement transaction”, which is also a historical by-product, this time because of BerkeleyDB. Gee, I’ve got a lot of work ahead of me…

[1] Actually, there is a way that a transaction that was rolled back can get written to the transaction log. For bulk operations, the server can cut a Transaction message into multiple segments, and if the SQL transaction is rolled back, a special RollbackStatement message is written to the transaction log.

Although a few folks knew where I and many of the Sun Drizzle team had ended up, we’ve waited until today to “officially” tell folks what’s up. We — Monty Taylor, Eric Day, Stewart Smith, Lee Bieber, and I — are all now “Rackers”, working at Rackspace Cloud. And yep, we’re still workin’ on Drizzle. That’s the short story. Read on for the longer one.

An Interesting Almost 3 Years at MySQL

I left my previous position of Community Relations Manager at MySQL to begin working on Brian Aker‘s newfangled Drizzle project in October 2008.

Many people at MySQL still think that I abandoned MySQL when I did so. I did not. I merely had gotten frustrated with the slow pace of change in the MySQL engineering department and its resistance to transparency. Sure, over the 3 years I was at MySQL, the engineering department opened up a bit, but it was far from the ideal level of transparency I had hoped to inspire when I joined MySQL.

For almost 3 years, I had sent numerous emails to the MySQL internal email discussion lists asking the engineering and marketing departments, both headed by Zack Urlocker, to recognize the importance and necessity of major refactoring of the MySQL kernel, and the need to modularize the kernel or risk having more modular databases overtake MySQL as the key web infrastructure database. The focus was always on the short term; on keeping up with the Joneses as far as features went, and I railed against this kind of roadmap, instead pushing the idea of breaking up the server into modules that could be blackboxed and developed independently of the kernel. My ideas were met with mostly kind responses, but nothing ever materialized as far as major refactoring efforts were concerned.

I remember Jim Winstead casually responding to one of my emails, “Congratulations, you’ve just reinvented Apache 2.0”. And, yes, Jim, that was kind of the point…

The MySQL source code base had gotten increasingly unmaintainable over the years, and key engineers were extremely resistant to changing the internals of MySQL and modernizing it. There were some good reasons for being resistant, and some poor reasons (such as “this is the way we’ve always done it”). Overall, it’s tough to question the strategy that Zack, Marten Mickos, and others had regarding the short term gains. After all, they managed to maneuver MySQL into a winning position that Sun Microsystems thought was worth one billion dollars. Because of this, it’s tough to argue with them.

Working on Drizzle since October 2008 (officially)

I’m not the kind of person who likes to wait years to see change, and so the Drizzle project interested me because it was not concerned with backwards compatibility with MySQL, it wasn’t concerned with having a roadmap dependent on the whims of a few big customers, and it was very much interested in challenging the assumptions built into a 20-year-old code base. This is a project I could sink my teeth into. And I did.

Many folks have said that the only reason Drizzle is still around is because Sun continued to pay for a number of engineers to work on Drizzle as “an experiment of sorts” and that Drizzle has no customers and therefore nothing to lose and everything to gain. This was true, no doubt about it. At Sun CTO Labs, the few of us did have the ability to code on Drizzle without the pressure-cooker of product marketing and sales demands. We were lucky.

10 Months in Purgatory

So, around rolls April 2009. The stock market and worldwide economy had collapsed and recession was in the air. There’s one thing that is absolutely certain in recession economies: companies that have poor leadership and direction and are beholden to the interests of a large stockholder will seek an end to their misery through acquisition by a larger, stronger firm.

And Sun Microsystems was no different. JAVA stock plummeted to two dollars a share, and Jonathan Schwartz and the Sun board began shopping Sun around to the highest bidder. IBM was courted along with other tech giants. So was Oracle.

And it was with a bit of a hangover that I awoke at the MySQL conference in April 2009 to the news that Oracle had purchased Sun Microsystems. Joy. We’d just gone through 14 months of ongoing integration with Sun Microsystems and now it was going to start all over again.

Anyone who follows PlanetMySQL knows about the ensuing battle in the European Commission’s court regarding monopoly of Oracle in the database market with its acquisition of MySQL. Monty Widenius, Eben Moglen, even Richard Stallman, weighed in on the pros and cons of Oracle’s impending control over MySQL.

All the while, we Sun Microsystems employees had to hold our tongues and try to keep our jobs as Sun laid off thousands more workers while the EC battle ensued. Not fun. It was the employment equivalent of purgatory. And the time just dragged on, with many employees, including myself and the Sun Drizzle team, not having a clue as to what would happen to us. Management was completely silent about future plans. Oracle made zero attempts to outline its future strategy regarding software, and thus most software employees simply kept on doing their work, not knowing whether the pink slip was arriving tomorrow. Lots of fun that was.

Oracle Doesn’t Need Our Services — Larry Don’t Need No Stinkin’ Cloud

The acquisition finally closed and very shortly afterwards, I got a call from my boss, Lee Bieber, that Oracle wouldn’t be needing our services. Monty, Eric, and Stewart had already resigned; none of them had any desire to work for Oracle. Lee and I had decided to see what they had in mind for us. Apparently, not much.

Larry Ellison has gone on record that the whole “cloud thing” is faddish. I don’t know whether Larry understands that cloud computing and infrastructure-as-a-service, platform-as-a-service, and database-as-a-service will eventually put his beloved Oracle cash cow in its place or not. I don’t know whether Oracle is planning on embracing the cloud environments which will continue to eat up the market share of more traditional in-house environments upon which their revenue streams depend. I really don’t.

But what I do know is that Rackspace is betting that providing these services is what the future of technology will be about.

Happiness is a Warm Cloud

Our team has landed at Rackspace Cloud. I’ve now been down to San Antonio twice to meet with key individuals with whom we’ll be working closely. Rackspace is not shy about why they wanted to acquire our team. They see Drizzle as a database that will provide them an infrastructure piece that is modular and scalable enough to meet the needs of their very diverse Cloud customers, of which there are many tens of thousands.

Rackspace recognizes that the pain points they feel with traditional MySQL cannot be solved with simple hacks and workarounds, and that to service the needs of so many customers, they will need a database server that thinks of itself as a friendly piece of their infrastructure and not the driver of its applications. Drizzle’s core principles of flexibility and focus on scalability align with the goals Rackspace Cloud has for its platform’s future.

Rackspace is also heavily invested in Cassandra, and sees integration of Drizzle and Cassandra as being a key way to add value to its platforms and therefore for its customers.

Rackspace is all about the customers, and this is a really cool thing to experience. It’s typical for companies to claim they are all about the customer — in fact, every company I’ve ever worked for has claimed this. Rackspace is the first company I’ve worked for where you actually feel this spirit, though. You can see the fanaticism of Rackers and how they view what they do always in terms of service to the customer. It’s infectious, and I’m pretty psyched to be on their team.

Anyway, that’s my story and I’m stickin’ to it. See y’all on the nets.

I’ve been coding up a storm in the last couple of days and have just about completed coding on three new INFORMATION_SCHEMA views which allow anyone to query the new Drizzle transaction log for information about its contents. I’ve also finished a new UDF for Drizzle called PRINT_TRANSACTION_MESSAGE() that prints out the Transaction message’s contents in an easy-to-read format.

I don’t have time for a full walk-through blog entry about it, so I’ll just paste some output below and let y’all take a looksie. A later blog entry will feature lots of source code explaining how you, too, can easily add INFORMATION_SCHEMA views to your Drizzle plugins.

Below are the results of the following sequence of actions:

Start up a Drizzle server with the transaction log enabled, checksumming enabled, and the default replicator enabled.

Open a Drizzle client

Create a sample table, insert some data into it, do an update to that table, then drop the table

Query the INFORMATION_SCHEMA views and take a look at the transaction messages and information the transaction log now contains

This week, I am working on putting together test cases which validate the Drizzle transaction log‘s handling of BLOB columns.

I ran into an interesting set of problems and am wondering how to go about handling them. Perhaps the LazyWeb will have some solutions.

The problem, in short, is inconsistency in the way that the NUL character is escaped (or not escaped) in both the MySQL/Drizzle protocol and the MySQL/Drizzle client tools. And by client tools, I mean not only everyone’s favourite little mysql command-line client, but also the mysqltest client, which provides infrastructure and runtime services for the MySQL and Drizzle test suites.

Even within the server and client protocol, there appears to be some inconsistency in how and when things are escaped. Take a look at this interesting output from the drizzle client program (FYI, output is identical for mysql client, I checked…)

You’ll notice that in the first SELECT statement, the column header is cut off — i.e. the column header is not escaping the \0 NUL character in the string 'test\0me'. However, the result data does not truncate the string but replaces the NUL character with a space character. So, I came to the conclusion that the drizzle client does not escape column headers but does do some sort of escaping for the result data. Given this conclusion, you will understand my raised eyebrow when the following SELECT statement was displayed:

Hmmm…so maybe column headers are being escaped by the MySQL/Drizzle client? Clearly, the NUL character was escaped as the characters ‘\\’ followed by the character ‘0’ in the column header above. Indeed, quite puzzling.

OK, so the above anomaly needs to be investigated. However, a similar issue exists for the mysqltest/drizzletest client program. To see the problem, I create a simple test case containing the following:

That is what you would expect to see in the output of course… Here is what you actually get in the output:

DROP TABLE IF EXISTS t1;
SELECT 'test\0me';
test
test

So, the mysqltest/drizzletest client apparently does not escape the NUL character for the result data at all. It looks like it does do some escaping/replacing for the NUL character in the column header, though, otherwise the second “test” line would not appear. This leads to the result file being essentially truncated as soon as a NUL character is included in any output to the mysqltest/drizzletest client. This essentially makes the mysqltest/drizzletest client useless for testing and validating BLOB data.
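The truncation itself falls straight out of C string semantics: any API that treats a value as a NUL-terminated string stops at the first \0 byte, while length-aware handling does not. A tiny illustration of the difference (not Drizzle code):

```cpp
#include <cassert>
#include <cstring>
#include <string>

/* The length of a value as any NUL-terminated C API perceives it:
 * strlen() stops at the first embedded NUL byte.  std::string, by
 * contrast, is length-aware: std::string("test\0me", 7) reports
 * size() == 7, while this function returns 4 for the same bytes. */
size_t c_string_length(const std::string &value)
{
  return strlen(value.c_str());
}
```

Any code path in a client that passes result data through a `%s`-style formatter will therefore silently truncate BLOB values at the first NUL.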

Possible Solutions?

I think the cleanest solution would be to create a shared library of code responsible for uniformly and consistently escaping data, then link the various clients (and the server) with this library and remove the various ad hoc escaping functions currently in the server. This would, of course, take some time, but it would be the most future-proof solution. Anyone else have ideas on solving the problem of being able to test and validate binary data via the test suite? Cheers!
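To make the idea concrete, here is a sketch of what one routine in such a shared escaping library might look like. The function name and the exact escaping rules (backslash-escape NUL and backslash itself) are my own invention for illustration, not an existing Drizzle or MySQL API:

```cpp
#include <cassert>
#include <string>

/* Hypothetical shared escaping routine.  Every client and the server
 * would link against this one function instead of each rolling their
 * own, guaranteeing that 'test\0me' renders identically everywhere. */
std::string escape_for_display(const std::string &in)
{
  std::string out;
  out.reserve(in.size());
  for (std::string::size_type x= 0; x < in.size(); ++x)
  {
    switch (in[x])
    {
    case '\0':
      out.append("\\0");   /* NUL becomes the two characters \ and 0 */
      break;
    case '\\':
      out.append("\\\\");  /* escape the escape character itself */
      break;
    default:
      out.push_back(in[x]);
    }
  }
  return out;
}
```

With one canonical routine, column headers, result data, and test-suite result files would all agree on how a NUL is rendered.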

In this installment of my Drizzle Replication blog series, I’ll be talking about the Transaction Log. Before reading this entry, you may want to first read up on the Transaction Message, which is a central concept to this blog entry.

The transaction log is just one component of Drizzle’s default replication services, but it also serves as a generalized log of atomic data changes to a particular server. In this way, it is only partially related to replication. The transaction log is used by components of the replication services to store changes made to a server’s data. However, there is nothing that mandates that this particular transaction log be a required feature for Drizzle replication systems. For instance, Eric Lambert is currently working on a Gearman-based replication service which, while following the same APIs, does not require the transaction log to function. Furthermore, other, non-replication-related modules may use the transaction log themselves. For instance, a future Recovery and/or Backup module may just as easily use the transaction log for its own purposes as well.

Before we get into the details, it’s worth noting the general goals we’ve had for the transaction log, as these goals may help explain some of the design choices made. In short, the goals for the transaction log are:

Introduce no global contention points (mutexes/locks)

Once written, the transaction log may not be modified

The transaction log should be easily readable in multiple programming languages

Overview of the Transaction Log Structure

The format of the transaction log is simple and straightforward. It is a single file that contains log entries, one after another. These log entries have a type associated with them. Currently, there are only two types of entries that can go in the transaction log: a Transaction message entry and a BLOB entry. We will only cover the Transaction message entry in this article, as I’ll leave how to deal with BLOBs for a separate article entirely.

Each entry in the transaction log is preceded by 4 bytes containing an integer code identifying the type of entry to follow. The bytes which follow this type header are interpreted based on the type of entry. For entries of type Transaction message, the graphics here show the layout of the entry in the log. First, a 4 byte length header is written, then the serialized Transaction message, then a 4 byte checksum of the serialized Transaction message.
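The layout just described can be sketched as a simple packing routine. Keep in mind this is an illustration only: the byte order is whatever the host uses, and the checksum function below is a placeholder, not the algorithm the transaction log module actually employs:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

/* Illustrative type codes -- stand-ins for the module's real values */
enum { ENTRY_TYPE_TRANSACTION= 1, ENTRY_TYPE_BLOB= 2 };

/* Placeholder checksum; the real log uses its own algorithm */
uint32_t checksum(const std::string &payload)
{
  uint32_t sum= 0;
  for (std::string::size_type x= 0; x < payload.size(); ++x)
    sum= sum * 31 + static_cast<unsigned char>(payload[x]);
  return sum;
}

/* Pack one log entry: 4-byte type code, 4-byte length header,
 * serialized payload, then a 4-byte checksum of the payload. */
std::vector<char> pack_entry(uint32_t type, const std::string &payload)
{
  uint32_t length= static_cast<uint32_t>(payload.size());
  uint32_t sum= checksum(payload);
  std::vector<char> entry(sizeof(type) + sizeof(length)
                          + payload.size() + sizeof(sum));
  char *pos= &entry[0];
  memcpy(pos, &type, sizeof(type));            pos+= sizeof(type);
  memcpy(pos, &length, sizeof(length));        pos+= sizeof(length);
  memcpy(pos, payload.data(), payload.size()); pos+= payload.size();
  memcpy(pos, &sum, sizeof(sum));
  return entry;
}
```

Because each entry carries its own length and checksum, a reader can walk the file entry by entry and detect corruption without any external index.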

Details of the TransactionLog::apply() Method

For those interested in how the transaction log is written to, I’m going to detail the apply() method of the TransactionLog class in /plugin/transaction_log/transaction_log.cc. The TransactionLog class is simply a subclass of plugin::TransactionApplier and therefore must implement the single pure virtual apply method of that class interface.

The TransactionLog class has a private drizzled::atomic<off_t> called log_offset which is an offset into the transaction log file that is incremented with each atomic write to the log file. You will notice in the code below that this atomic off_t is stored locally, then incremented by the total length of the log entry to be written. A buffer is then written to the log file using pwrite() at the original offset. In this way, we completely avoid calling pthread_mutex_lock() or similar when writing to the log file, which should increase scalability of the transaction log.
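Since I can’t reproduce the actual apply() source here, below is a minimal sketch of the lock-free pattern just described, using C++11 std::atomic in place of drizzled::atomic and omitting error handling. Each writer atomically claims a non-overlapping slice of the file, then writes into its slice with pwrite():

```cpp
#include <atomic>
#include <cassert>
#include <fcntl.h>
#include <string>
#include <unistd.h>

class SketchLog
{
public:
  explicit SketchLog(const char *path)
    : fd(open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644)),
      log_offset(0)
  {}
  ~SketchLog() { close(fd); }

  /* Claim a slice of the file atomically, then write into it with
   * pwrite().  No mutex is taken: concurrent appenders receive
   * non-overlapping offsets, so writes never interleave. */
  void apply(const std::string &entry)
  {
    off_t cur_offset=
      log_offset.fetch_add(static_cast<off_t>(entry.size()));
    pwrite(fd, entry.data(), entry.size(), cur_offset);
  }

private:
  int fd;
  std::atomic<off_t> log_offset;
};
```

The key design point is that fetch_add() both reserves space and advances the shared offset in one atomic step, which is why no pthread_mutex_lock() is needed on the write path.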

Reading the Transaction Log

OK, so the above code shows how the transaction log is written. What about reading the log file? Well, it’s pretty simple. There is an example program in /drizzle/message/transaction_reader.cc which has code showing how to do this. Here’s a snippet from that program:
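In the same spirit, here is a hedged sketch of the read side rather than the actual transaction_reader.cc source: read the 4-byte type code, the 4-byte length header, the payload of that many bytes, and finally the 4-byte checksum. Byte order and checksum verification are left out for brevity:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

/* Read one log entry: 4-byte type, 4-byte length, payload, checksum.
 * Returns false on EOF or a short (corrupt/truncated) read. */
bool read_entry(FILE *log, uint32_t &type, std::string &payload,
                uint32_t &sum)
{
  uint32_t length;
  if (fread(&type, sizeof(type), 1, log) != 1)
    return false;
  if (fread(&length, sizeof(length), 1, log) != 1)
    return false;
  payload.resize(length);
  if (length && fread(&payload[0], 1, length, log) != length)
    return false;
  return fread(&sum, sizeof(sum), 1, log) == 1;
}
```

A real reader would additionally recompute the checksum over the payload and compare it against the stored value before trusting the entry.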

Shortcomings of the Transaction Log

So far, we’ve generally focused on a scalable design for the transaction log and have not spent too much time on performance tuning the code — and yes, performance != scalability. There are a number of problems with the current code which we will address in future versions of the transaction log. Namely:

Reduce calls to malloc(). Currently, each write of a transaction message to the log file incurs a call to malloc() to allocate enough memory to store the serialized log entry. Clearly, this is not optimal. We’ve considered a number of alternate approaches to calling malloc(), including a scoreboard approach where a vector of memory slabs is used in a round-robin fashion. This would introduce some locking, however. I’ve also thought about using a hazard pointer list on the Session object so that previously-allocated memory on the Session could be reused for this purpose. But these ideas must be hashed out further.

There is no index into the transaction log. This is not a problem for writing the transaction log, of course, but for readers of the transaction log. I’m in the process of creating classes and a library for building indexes for a transaction log and, in addition, creating archived snapshots to enable log shipping for Drizzle replication. I’ll be pushing code for this to Launchpad later this week and will write a new article about log shipping and snapshot creation.

Each call to apply() calls fdatasync()/fsync() on the transaction log. Certain environments may consider this to be too strict a sync requirement, since the storage engine may already keep a transaction log file of its own that is also synced. For instance, InnoDB has a transaction log that, depending on the setting of InnoDB configuration variables, may call fdatasync() upon every transaction commit. It would be best to have the syncing behaviour be user-adjustable — for instance, a setting to allow the transaction log to be synced every X number of seconds…

Summary and Request for Comments

That’s it for the discussion about the transaction log. I’ll post some more code examples from the replication plugins which utilize the transaction log in a later blog entry.

What do you think of the design of the transaction log? What would you change? Comments are always welcome! Cheers.

Hi all. It’s been quite some time since my last article on the new replication system in Drizzle. My apologies for the delay in publishing the next article in the replication series.

The delay has been due to a reworking of the replication system to fully support “group commit” behaviour and fully transactional replication. The changes allow replicator and applier plugins to understand much more about the actual changes which occurred on the server, and to handle the transactional container properly. The overall goals of the replication system remain the same:

Make replication modular and not dependent on one particular implementation

Make it simple and fun to develop plugins for Drizzle replication

Encapsulate all transmitted information in an efficient, portable, and standard format

This article serves to build on the last article and explain the changes to the Google Protobuffer message definitions used in the replication API. The actual replication API described in the last article remains almost the same. However, instead of being named CommandApplier and CommandReplicator, those plugin base classes are now named TransactionApplier and TransactionReplicator respectively. And, instead of consuming a Command message, they consume Transaction messages.

For my friend Edwin’s benefit, I’ll be including lots of pretty graphics. For my developer readers, I’m including lots of example C++ code to help you best understand how to read and manipulate the Transaction and Statement messages in the new replication system.

The Command Message has become the Statement message, and a new Transaction message serves as a container for multiple Statement messages representing (for most cases) an atomic change in the state of the database server. I’ll discuss later in the article those specific cases where a Transaction message’s contents may contain only a partial atomic change to the server.

The image to the right depicts the Transaction message container. As you can see, the Transaction message contains two things: a TransactionContext message and an array of one or more Statement messages.

The TransactionContext Message

Each Transaction message contains a single TransactionContext message. The TransactionContext message contains information about the entire transaction. The data members of the TransactionContext are as follows:

server_id – (uint32_t) A numeric identifier for the server which executed this transaction

transaction_id – (uint64_t) A globally-unique transaction identifier

start_timestamp – (uint64_t) A nano-second precision timestamp of when the transaction began.

end_timestamp – (uint64_t) A nano-second precision timestamp of when the transaction completed.

Since TransactionContext is simply a Google Protobuffer message, accessing data members is simple and straightforward. If you’re writing a replicator or applier, a reference to a const Transaction message will be supplied to you via the standard API. For instance, let’s assume we’re writing a replicator and we want to filter all messages that are from the server with a server_id of 100. Kind of a silly example, but nevertheless, it allows us to see some example code.

As you may remember, the API for a replicator is dirt simple. There is a replicate() pure virtual method which accepts two parameters, the GPB message and a reference to the Applier which will “apply” the message to some target. The new function signature is the same as the last one, with the term “Command” replaced with the term “Transaction”:

virtual void replicate(TransactionApplier *in_applier,
                       message::Transaction &to_replicate)= 0;

Suppose our replicator class is called MyReplicator. Here is how to query the transaction context of the Transaction message and filter out transactions coming from server #100.
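Since the generated Protobuffer headers can’t be pasted here, the snippet below uses minimal stand-in types with the same accessor shape (transaction_context() on the message, server_id() on the context); the filtering logic is the point, not the types:

```cpp
#include <cassert>
#include <cstdint>

/* Stand-ins mimicking the accessors of the generated message classes */
struct TransactionContext
{
  uint32_t server_id_value;
  uint32_t server_id() const { return server_id_value; }
};

struct Transaction
{
  TransactionContext context;
  const TransactionContext &transaction_context() const
  {
    return context;
  }
};

/* Returns true if the transaction should be replicated onward --
 * i.e. it did not originate on server #100 */
bool should_replicate(const Transaction &to_replicate)
{
  return to_replicate.transaction_context().server_id() != 100;
}
```

Inside a real MyReplicator::replicate(), when this check fails you would simply return without handing the message to the supplied applier.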

See? Pretty darn simple. OK, on to the Statement message, which is slightly more complicated.

The Statement Message

As noted above, the Transaction message contains an array of Statement messages. In Protobuffer terminology, the Transaction message contains a “repeated” Statement data member. The Statement message is an envelope containing the following information:

type – (enum Type) The type of Statement this message represents. Currently, the possible values of the type are as follows:

ROLLBACK

INSERT

UPDATE

DELETE

TRUNCATE_TABLE

CREATE_SCHEMA

ALTER_SCHEMA

DROP_SCHEMA

CREATE_TABLE

ALTER_TABLE

DROP_TABLE

SET_VARIABLE

RAW_SQL

start_timestamp – (uint64_t) A nano-second precision timestamp of when the statement began.

end_timestamp – (uint64_t) A nano-second precision timestamp of when the statement completed.

For certain types of Statement messages, there will also be a specialized header and data message (see below).

To access the Statement messages in a Transaction, use something like the following code, which loops over the Transaction message’s vector of Statement messages:

void MyReplicator::replicate(TransactionApplier *in_applier,
                             message::Transaction &to_replicate)
{
  …
  /* Grab the number of statements in the Transaction message */
  size_t x;
  size_t num_statements= to_replicate.statement_size();

  /* Do something with each statement… */
  for (x= 0; x < num_statements; ++x)
  {
    const message::Statement &stmt= to_replicate.statement(x);

    /* processStatement() does something with the statement… */
    processStatement(stmt);
  }
  …
}

Serialized Polymorphism with the type Member

The type data member is of critical importance to the Statement message, as it allows us to have a sort of polymorphism serialized within the Statement message itself. This polymorphism allows the generic Statement message to contain specialized submessages depending on what type of event occurred on the server.

The above paragraph probably sounds overly complicated, but in reality things are pretty simple. As usual, it’s easiest to see what’s going on by looking at an example in code. For our example, let’s build out our fictional processStatement() method from the snippet above.

The processStatement() method is basically a giant switch statement, switching off of the supplied Statement message parameter’s type data member property. Here is the outline of the processStatement() method, with only our switch statement and some comments visible which should give you an idea of how we deal with specific types of Statements:

void processStatement(const message::Statement &stmt)
{
  switch (stmt.type())
  {
  case message::Statement::INSERT:
    /* Handle statements which insert new data… */
    break;
  case message::Statement::UPDATE:
    /* Handle statements which update existing data… */
    break;
  case message::Statement::DELETE:
    /* Handle statements which delete existing data… */
    break;
  …
  }
}

Let’s go ahead and “fill out” one of the case blocks in the switch statement above. We will handle the case where the Statement type is INSERT. Note that this does not necessarily mean a SQL INSERT statement was executed. All this means is that an SQL statement was executed which resulted in a new record being added to a table on the server. This means that the actual SQL statement could have been any of INSERT, INSERT … SELECT, REPLACE INTO, or LOAD DATA INFILE.

The /drizzled/message/transaction.proto file will always contain lots of documentation explaining how each of the specific submessages in the Statement message class are handled. To the right is a graphic depicting the InsertHeader and InsertData message classes which compose the “meat” of Statements that inserted new records into the database. Whenever the Statement message’s type is INSERT, the Statement message will contain two submessages, one called insert_header and another called insert_data which will be populated with the InsertHeader and InsertData messages. The header message will contain information about the table and fields affected, while the data message will contain the values to be inserted into the table.

Here is some example code which queries the header and data messages and constructs an SQL string from them:
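The original snippet isn’t reproduced here, so below is a stand-in sketch of the same idea. The InsertHeader/InsertData types are simplified stand-ins for the generated message classes in /drizzled/message/transaction.proto, and — exactly as cautioned below — every value is blindly single-quoted with no escaping or type handling:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

/* Simplified stand-ins for the generated InsertHeader/InsertData
 * messages: a table name, the affected field names, and the rows of
 * values to insert. */
struct InsertHeader
{
  std::string table_name;
  std::vector<std::string> field_names;
};

struct InsertData
{
  std::vector<std::vector<std::string> > records;
};

/* Build "INSERT INTO t (a,b) VALUES ('1','2'),('3','4')" from the
 * header and data messages.  Illustration only: no escaping, no
 * per-type formatting, no error handling. */
std::string build_insert_sql(const InsertHeader &header,
                             const InsertData &data)
{
  std::string sql("INSERT INTO ");
  sql.append(header.table_name);
  sql.append(" (");
  for (size_t x= 0; x < header.field_names.size(); ++x)
  {
    if (x) sql.append(",");
    sql.append(header.field_names[x]);
  }
  sql.append(") VALUES ");
  for (size_t x= 0; x < data.records.size(); ++x)
  {
    if (x) sql.append(",");
    sql.append("(");
    for (size_t y= 0; y < data.records[x].size(); ++y)
    {
      if (y) sql.append(",");
      sql.append("'");
      sql.append(data.records[x][y]);
      sql.append("'");
    }
    sql.append(")");
  }
  return sql;
}
```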

The example code above is far from production-ready, of course. I don’t take into account different field types, instead simply enclosing everything in single quotes. Also, I don’t handle errors or escaping strings. The point isn’t to be perfect, but to show you the general way to get information out of the Statement message…

Partial Atomic Transactions

Above, I stated that the Transaction messages sent to Replicators and Appliers represent an atomic change to the state of a server. This is true, most of the time. There are specific situations when a Transaction message will not represent an atomic change, and you should be aware of these scenarios if you plan to write plugins which implement a replication scheme.

There are times when it is simply inefficient or impossible to create a Transaction message that represents the actual atomic change on a server. For instance, imagine a table having 100 million records. Now, imagine issuing an UPDATE against that table that potentially affected every row in the table.

In order to transmit to replicas the atomic change to the server, one gigantic Transaction message would need to be constructed on the master server. Not only is there a distinct chance that the master would run out of memory constructing such a large message object, but it’s safe to say that the master server would suffer from performance degradation during this construction. There must, therefore, be a way to start streaming the changes made to the master server before the actual final commit has happened on the master.

You may have noticed two data members of the InsertData message above named segment_id and end_segment. The first is of type uint32_t and the second is a bool. Together, these two data members fulfill the need to transmit transaction messages that are part of a bulk data modification. When a reader of a Transaction message sees that the end_segment data member is false, then the reader knows that another data segment will follow the current data message and will contain more inserts, updates, or deletes for the current transaction.
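A reader that honours these two fields might buffer incoming segments along the following lines. This is a sketch with stand-in types — segment_id ordering checks and per-statement bookkeeping are omitted:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

/* Stand-in for a data message carrying the segmenting fields */
struct DataSegment
{
  uint32_t segment_id;
  bool end_segment;
  std::string rows;   /* opaque row payload for this sketch */
};

/* Accumulates segments of one bulk operation until a segment with
 * end_segment == true arrives, at which point the reassembled
 * payload represents the complete change. */
class SegmentAssembler
{
public:
  /* Returns true once the final segment has been absorbed */
  bool add(const DataSegment &segment)
  {
    buffer.append(segment.rows);
    return segment.end_segment;
  }

  const std::string &payload() const { return buffer; }

private:
  std::string buffer;
};
```

An applier built this way can begin receiving a 100-million-row UPDATE while the master is still producing it, without either side holding the whole change in memory as a single message.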

Summary and Request for Comments

Hopefully, I’ve explained the changes that have been made to Drizzle’s replication system well enough above, but I understand the changes to the message definitions are substantial and am available at any time to discuss the changes and assist people with their code. You can find me on IRC, Freenode’s #drizzle channel, via the Drizzle discussion mailing list, or via email joinfu@sun.com. I very much welcome comments. The new replication system is just finishing up the valgrind regression tests and should hit trunk later today.

The next article covers the new Transaction Log, which is a serialized log of the Transaction messages used in the replication system.

Sometimes, as Sergei rightly mentioned, I can be, well, “righteously indignant” about what I perceive to be a hack.

In this case, after Sergei repeatedly tried to set me straight about what was going on “under the covers” during a REPLACE operation, I was still arguing that he was incorrect.

Doh.

I then realized that Sarah Sproenhle’s original comment about my test table not having a primary key was the reason that I was seeing the behaviour that I had been seeing.

My original test case was failing, expecting to see a DELETE + an INSERT, when a REPLACE INTO was issued against a table. When I placed the PRIMARY KEY on the table in my test case and re-ran the test case, it still failed because the DELETE still was not in the transaction log. Well, it turns out that the reason was because ha_update_row() was actually called and not ha_delete_row() + ha_write_row(). And, because of the documentation for the REPLACE command, I wasn’t checking that ha_update_row() may have been called — since I didn’t realize a REPLACE could actually do an UPDATE.

Anyway, I wanted to post to say that most of this whole kerfuffle was my fault. Though I think that both the online and code documentation should reflect the fact that a REPLACE can do an UPDATE, the source of the failure was not what I originally wrote. In contrast, ha_write_row() does indeed return ER_FOUND_DUPP_KEY appropriately during a REPLACE call.