Together with Simon J. Mudd of Booking.com, we discussed our perception of what it takes to run online schema migrations on a live, busy system.

While the Oracle/MySQL engineers develop new features and optimize behavior in the MySQL server, we in the industry have the operational expertise and understanding of the flow of working with MySQL. In all topics, and in schema migration in particular, there is a gap between what's perceived to be the use case and what the use case actually is. It is the community's task to provide feedback to Oracle so as to align development with operational needs where possible.

Our meeting included the following:

Need for schema migrations

We presented, based on our experience in current and past companies, and based on the experience of our friends in the community, the case for online schema migrations. At GitHub, at Booking.com and in many other companies I'm familiar with, we continuously deploy to production, and this implies continuous schema migrations to our production databases. We have migrations running daily; sometimes multiple per day, sometimes none.

With continuous deployment, we as Guardians of the Database do not wish to be blockers for the development cycle. On the contrary, we want to be out of the way as soon as possible, other than verifying a requested migration is safe. We wish to be able to deliver a migration at any given time.

Not all companies behave this way; some run a weekly aggregation of migrations. Others still use the Thou Shalt Not Pass DBA model. We tried to depict the various approaches, with strong emphasis on our own approach, which is the most demanding of schema migration solutions.

The MySQL ALTER

We proceeded to discuss the in-house ALTER statement & InnoDB online DDL, and pointed out the limitations those impose on "online" operations, to the effect of rendering these solutions unused by many. The serialization in the replication stream means losing serving capacity and getting lagging replicas. The lack of an escape path means committing to hours' worth of uninterruptible operation. The lack of resource control means degraded performance throughout the operation.
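For illustration, a minimal sketch of such an operation (column name hypothetical); the explicit clauses make MySQL refuse the operation rather than silently fall back to a blocking table copy:

ALTER TABLE tbl ADD COLUMN some_column INT, ALGORITHM=INPLACE, LOCK=NONE;

Even when LOCK=NONE is accepted on the master, the statement replicates as a plain ALTER and is applied serially on each replica, which is where the lag builds up.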

We briefly touched on TokuDB's ALTER and how it works.

Replication solutions

We discussed migrating via replication: running migrations on one or more replicas at a time, finally failing over onto a promoted replica once all replicas are updated.

We know this solution to be in use in companies such as Dropbox, Etsy and others. We illustrated our own reasoning for not using this solution:

Increased clock-time for running a migration: running one-replica-at-a-time or a-few-replicas-at-a-time can double, triple, quadruple the overall migration time, and so forth.

Concurrent migration complexity: since runtime increases, so does the likelihood of needing to run an additional migration at the same time, which greatly complicates the flow in a one-at-a-time or few-at-a-time model.

Serving capacity: in this model some servers, up to half of them, are non-operational. Serving capacity is reduced, and we need more hardware to compensate.

Failover: the failover is not smooth; it includes either some outage or some blocked time, and is in any case noticeable in production. Having a planned failover once in a while is OK, but having a failover multiple times a day is too much of a hassle in our current setup.

Topology complexity: our topologies always have some special cases, such as cross-DC replication with reduced cross-DC network traffic via intermediate masters, testing replicas with newer versions, developer-dedicated servers and others, that make shuffling replicas around difficult to automate.

We have not discussed Galera's Rolling Schema Upgrades as we personally do not have experience working with it. It solves the failover issue above, but given a "normal" replication tree under the cluster, the same problems as above apply.

We concluded with our personal take: like with everything else, we just like to write directly onto our masters, and let the natural replication flow deal with it and bring our entire topology to consistency.

Existing trigger based migrations

We drilled down into the algorithms behind pt-online-schema-change and Facebook's OSC (the latter being rewritten today, not yet released as open source). We elaborated on the pains we saw in trigger based migrations: being unsuspendable, causing lock spaghetti, impacting write latency on the master to the point of a standstill on busy servers, being untestable.

gh-ost

I presented gh-ost, our own, triggerless take on schema migrations. I discussed the logic behind gh-ost and how it decouples migration load from production load; the low impact the triggerless migration has on the master and on the entire replication chain, leading to low, subsecond replication lags throughout the migration and eliminating locking contention on the master. Basically the presentation Tom Krouper and I gave at Percona Live Amsterdam.

Want to Have

We followed up with a list of feature requests we would enjoy. These were largely technical issues gh-ost would benefit from, simplifying its behavior or ensuring its correctness in complex cases. We discussed dropping tables at the end of a migration, getting more info in the binary logs, GTID issues and more.

Acknowledgements

Thank you to Morgan Tocker for officially inviting us to this gathering. There were quite a few familiar faces in the room, and it was a friendly gathering. Thank you to all the engineering managers with whom we met!

The discussion was lively, friendly and receptive. The Oracle engineers laid out the internals of the online DDL; some of their thoughts on the potential of the JSON format; gave advice on technical issues presented. I'd like to thank them for listening to our take on the subject. There was a discussion on the possible paths Oracle can take to improve online schema operations, and I'd like to thank Oracle for sharing their own thoughts and advice!

Three wishes for a new year

(Almost) another new year by the Jewish calendar. What do I wish for the following year?

World peace

Good health to all

Relaxed GTID constraints

I'm still not using GTID, and still see operational issues with working with GTID. As a latest example, our new schema migration solution, gh-ost, allows us to test migrations in production, on replicas. The GTID catch? gh-ost has to write something to the binary log. Thus, it "corrupts" the replica with a bogus GTID entry that will never be met on another server, making said replica unsafe to promote. We can work around this, but...

I understand the idea and need for the Executed GTID Set. It will certainly come in handy with multi-writer InnoDB Cluster. However for most use cases GTID poses a burden. The reason is that our topologies are imperfect, and we as humans are imperfect, and operations are most certainly imperfect. We may wish to operate on a replica: test something, whether by intention or by mistake. We may wish to use a subchain as the seed for a new cluster split. We may wish to be able to write to downstream replicas. We may use a 3rd party tool that issues a flush tables with read lock without disabling sql_log_bin. Things just happen.

I would love to see relaxed GTID modes. For example:

Last executed: a mode where the only thing that counts is the last executed GTID value. If I repoint a replica, all it needs to check is: "hey, this is my last executed GTID entry, give me the coordinates of yours. And, no, I don't care about comparing executed and purged sets; I will trust you and keep running from that point on."

Declarative: GTIDs are generated, are visible in each and every binary log entry, but are completely ignored.

I realize Oracle MySQL GTID has been out for over 3 years now, but I'm sorry - I still have reservations and see use cases where I fear it will not serve me right.

How about my previous years' wishes? World peace and good health never came true, however:

gh-ost 1.0.17: Hooks, Sub-second lag control, Amazon RDS and more

gh-ost version 1.0.17 is now released, with various additions and fixes. Here are some notes of interest:

Hooks

gh-ost now supports hooks. These are your own executables that gh-ost will invoke at particular points of interest (validation pass, about to cut-over, success, failure, status, etc.)

gh-ost will set various environment variables for your executables to pick up, passing along such information as migrated/ghost table name, elapsed time, processed rows, migrated host etc.

Sub-second lag control

At GitHub we're very strict about replication lag. We keep it well under 1 second at most times. gh-ost can now identify sub-second lag on replicas (well, you need to supply it with the right query). Our current production migrations are set by default with --max-lag-millis=500 or less, and our most intensive migrations keep replication lag well below 1sec, or even below 500ms.
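For example, assuming a pt-heartbeat-like table (name and layout hypothetical) that some external process updates with a high-resolution timestamp, such a query could look like:

SELECT UNIX_TIMESTAMP(NOW(6)) - UNIX_TIMESTAMP(MAX(ts)) AS lag FROM meta.heartbeat;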

No SUPER

The SUPER privilege is required to set global binlog_format='ROW' and for STOP SLAVE; START SLAVE;

If you know your replica uses RBR, you can pass --assume-rbr to have gh-ost skip those steps.
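For reference, these are the statements in question, which --assume-rbr avoids:

SET GLOBAL binlog_format = 'ROW';
STOP SLAVE;
START SLAVE;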

RDS

Hooks + No SUPER = RDS, or so it seems to be the case. For --test-on-replica you will need to supply your own gh-ost-on-stop-replication hook, to stop your RDS replica at the cut-over phase. See this tracking issue.

master-master

While active-active is still not supported, you now have greater control over master-master topologies by being able to explicitly pick your master (otherwise gh-ost arbitrarily picks one of the co-masters). Do so by passing --assume-master-host. See the cheatsheet.

tungsten replicator

Similarly, gh-ost cannot crawl your tungsten topology, so you specify --tungsten --assume-master-host=the.master.com. See the cheatsheet.

Concurrent-rowcount

--exact-rowcount is awesomeness, keeping a quite accurate estimate of progress. With --concurrent-rowcount we begin the migration with a rough estimate, and execute select count(*) from your_table in parallel, updating our estimate later on, throughout the migration.

Stricter, safer

gh-ost works in STRICT_ALL_TABLES mode, meaning it would rather fail than write a wrong value to a column.
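A small illustration (table hypothetical; assuming tiny_col is a TINYINT):

SET SESSION sql_mode = 'STRICT_ALL_TABLES';
INSERT INTO t (tiny_col) VALUES (512); -- errors with 'Out of range value' rather than silently clamping to 127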

In addition to unit tests and continuous testing in production, a set of local tests is growing, hopefully to run as CI tests later on.

Fixed problems

Fixed a time_zone-related bug and a bug with high unsigned values; added a strict check for triggers; relaxed config file parsing; and more. Thank you to community contributors for PRs, from IPv6 to typos!

Known issues

Issues come and go at all times -- thank you for reporting them!

We have a confirmed bug with non-UTF charsets at this time. Some other minor issues and feature requests are open -- we'll take them as we go along.

Feedback requests

We are not testing gh-ost on RDS ourselves. We appreciate community feedback on this tracking issue.

We are not testing gh-ost on Galera/XtraDB cluster ourselves. We appreciate community feedback on this tracking issue.

MySQL vs. PostgreSQL, gh-ost perspective

Last week we released gh-ost, GitHub's online schema migration tool for MySQL. As with other open source releases in the MySQL ecosystem, this release was echoed by several "Why not PostgreSQL?" comments. Having been active in open source for many years now, I'm familiar with these responses, and I find this a good time to share my thoughts. Why? XKCD knows the answer:

I picked one post I wish to address (latest commit: 3dfbd2cd3f5468f035ec86442d2c670a510118d8). The author invested some time writing it. It nicely summarizes claims I've heard over the years, as well as some prejudice. Through responding to this post I will be generalizing thoughts and impressions to address the common reactions. Dear @brandur, let's grab a beer some day; I fundamentally disagree with your post and with its claims.

EDIT: linked post has been updated following this writing; I'd like to thank the author for his consideration. Also see his followup post. The version I've responded to in this post is this commit.

This is not an anti-PostgreSQL post

Disclosure: I appreciate PostgreSQL. I always wanted to be a proficient PostgreSQL user/DBA. I think this project is one of the finest examples of quality open source. It sets some high standards for open source in general, and for RDBMS in particular. I am not emotionally attached to MySQL to the extent that I would hate everything that is not called "MySQL". I never understood this approach. I am not interested in religious wars. I'm an engineer and this post follows engineering guidelines.

Background

gh-ost delivers powerful online schema migrations to MySQL, differentiating itself from existing tools by being triggerless, auditable, controllable, testable, and by imposing a low workload on the migrated master. It addresses the same problem existing tools have addressed since 2009.

Feature X

The most basic premise of this post is: MySQL does not have feature X, PostgreSQL does, therefore PostgreSQL.

We'll discuss the truth of the above shortly, but let's first discuss the essence of this guideline.

It should be generally agreed that a statement of the form "A doesn't have feature X therefore B" is incomplete. We understand complex systems have varying feature sets.

MySQL has some features PostgreSQL doesn't. Take, as an example, feature R: MySQL has had it for ages, and yet PostgreSQL has been slow to adopt it, relying on 3rd party solutions for many years. MySQL's implementation of R is far more elaborate than PostgreSQL's.

But if we follow the rule suggested above, we must now migrate from PostgreSQL to MySQL, because PostgreSQL does not have feature R (or one of its variants). Infinite loop!

In practice, we evaluate the pros and cons, the features the products A and B have or do not have. Which feature is more important to us? X or R? Is one of them fundamentally required for our operation? Can we work around it if we don't get it directly from the product? That, and experimentation, is the way an engineer should approach a choice of technology.

In the world of RDBMS we are interested, among others and in no particular order, in write latency and throughput, read scale-out, durability, loss of data in the event of failure, failover and promotion schemes, DR, consistency, SQL features, etc. By this list alone it is impossible to claim "PostgreSQL is better than MySQL" nor "MySQL is better than PostgreSQL".

The particular claim and advice

The author suggests we should be using PostgreSQL because it inherently solves the problem for which we embarked on developing gh-ost; that is, that PostgreSQL supports true online schema changes. That statement is misleading, and I resent the way it is delivered.

The post does not mention that PostgreSQL supports online schema changes for a limited set of operations only. I went ahead to double check with the PostgreSQL documentation. I love the documentation! It is detailed and honest. I'm thoroughly satisfied that PostgreSQL only supports a limited set of online operations. What I object to is that the post does not mention this, and leads us to "understand" PostgreSQL has a magic wand.

Online table operations are supported in PostgreSQL for adding/removing indexes, for removing columns, and for adding columns under certain conditions. As an example, adding a nullable column is an online operation, whereas adding a column with a default value is a locking operation.
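To illustrate (column names hypothetical; behavior as of PostgreSQL at the time of this writing):

ALTER TABLE tbl ADD COLUMN notes TEXT;                   -- online: nullable, no default
ALTER TABLE tbl ADD COLUMN flags INT NOT NULL DEFAULT 0; -- locking: rewrites the entire table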

A very big part of our schema migrations consists of adding/removing indexes and adding columns. Many of these operations fall under the lockless, online operations PostgreSQL would allow. However a large part of our migrations also consists of adding columns with default values, changing data types (e.g. from INT to BIGINT), changing column characteristics (e.g. the length of a text column), space reclamation, and others. These changes are blocking in PostgreSQL.

The premise of the post now turns to: it's a pity you invested time and money in developing a tool that solves 100% of your problems when you could have switched to PostgreSQL which would solve 40% of your problems!

If I were to insist my fellow engineers at GitHub migrate to PostgreSQL in order to solve the migration problem, and then, once this technical transition is complete let them know 60% of the migrations are not at all addressed and that we are still stuck with the very same problem we started with, I would not be a very popular engineer.

Moreover

"the same advancements never happened in MySQL" is a false statement.

As mentioned in the gh-ost announcement, MySQL as of 5.6 does indeed support online, non-blocking ALTER TABLE. In fact, it supports many more variants of online ALTER TABLE than PostgreSQL does (a noticeable difference, however, is that PostgreSQL makes those changes transactional whereas MySQL does not).

Also as mentioned, one of the shortcomings of the MySQL implementation is that it is aggressive, and may cause a high impact on the running master. In my understanding the PostgreSQL implementation is no different. There's nothing to cause throttling, to play nice with the running workload. Yes, in PostgreSQL you can Ctrl-C your ALTER, but who wants to Ctrl-C a 10-hour operation?

gh-ost addresses that as well. Its throttling gives super powers over the migration process, kicking in based on master load, replication lag and various human-controlled criteria, effectively making it very lightweight on the master.

Misdirection?

"there's a level of seemingly willful misdirection here that I just can't wrap my head around"

XKCD to the rescue again:

I dare say this is not the kind of thing a person would say in person, and the accusation is rather severe. It is also ironic. Dear author, consider:

PostgreSQL does not really solve 100% of the problem gh-ost does, and yet you claim we'd be better off with PostgreSQL.

MySQL does indeed provide more variants of online alter table than PostgreSQL does, and yet you claim it has no online alter capabilities.

I might claim there's a seemingly willful misdirection in your post. I might claim nowhere in your write up do you mention the deficiencies in PostgreSQL.

Instead, I'd rather like to think that you, and others, are misinformed, basing your opinion on rumors and prejudice instead of facts.

I also observe that people all around the world like to willfully differentiate themselves from others. Even in tech. This is the topic for another post, but consider explaining to a complete outsider, say your doctor, why people who work in tech, are engineers, work with data, work with databases, work with relational databases, work with open source relational databases, people who have so much shared experience, still insist on "us and them", and seek to see the negative in the other. Sheesh.

Paraphrasing a memorable sarcastic quote from the movie Erin Brockovich: the fact so many of the largest tech companies today choose to use MySQL as their backend database does not mean it's crap.

No. We really think MySQL does a good job. It is not perfect. We work around some of its limitations.

Claims

The claim "you'd be better off with PostgreSQL" (not a quote from aforementioned post) cannot be made without understanding the specific workload of a company/product. It would be presumptuous of me to approach a PostgreSQL based company and say "oh hey why use PostgreSQL? You'd be better off with MySQL!"

It makes perfect sense to say "PostgreSQL handles problem X better than MySQL" or even "if your one most important requirement is X, you should switch to PostgreSQL". Otherwise claiming one database is wholly better than the other cannot be taken seriously.

Deficiencies? Any project of scale has deficiencies. That is a given. We observe and measure, take features and deficiencies into account, and that makes for good engineering.

If you're using PostgreSQL and it works well for you, you're doing the right thing.

If you're using MySQL and it works well for you, you're doing the right thing.

If you found that PostgreSQL works better for you where MySQL does not, and you decided to switch over, you're doing the right thing.

If you found that MySQL works better for you where PostgreSQL does not, and you decided to switch over, you're doing the right thing.

If you found that PostgreSQL works better for you where MySQL does not, but decided to stick with MySQL because migrating would be too costly, you're doing the right thing.

If you found that MySQL works better for you where PostgreSQL does not, but decided to stick with PostgreSQL because migrating would be too costly, you're doing the right thing.

If you pick one over the other because of licensing constraints, you're doing the right thing.

If you choose to switch over because of rumors, prejudice, FUD, politics, religion, you're doing it wrong.

Final personal note, on pride

Oh hey, XKCD again. But I would like to ask an honest question: if some pg-gh-ost were to be released, a tool that would solve 100% of your PostgreSQL migration requirements, doing it better than PostgreSQL does, covering all cases, throttling as your daily sqoop imports kick in, as your rush hour traffic kicks in, giving you far greater control over the migration process, what would you do?

Would you write an offensive post filled with accusations, ranting about the deficiencies of PostgreSQL and how people even consider using such a database that needs a third party tool to do a better job at migrations? Would you tweet something like "Or... Use MySQL!"?

Or would you embrace a project that enriches the PostgreSQL ecosystem, makes it an even greater database to work with, understanding PostgreSQL is not yet perfect and that more work needs to be done?

I take pride in my craft and love making an impact; if we ever do meet for beer I'm happy to share more thoughts.

gh-ost now powers our production schema migrations. We hit some serious limitations using pt-online-schema-change on our large-volume, high-traffic tables, to the effect of driving our database to a near grinding halt, or even to the extent of causing outages. With gh-ost, we are now able to migrate our busiest tables at any time, peak hours and heavy workloads included, without causing impact to our service.

gh-ost supports testing in production. It goes a long way to build trust, both in integrity and in control. Are your databases just too busy for existing online-schema-change tools? Have you suffered outages due to migrations? Are you tired of babysitting migrations that run up to 3:00am? Tired of being the only one tailing logs? Please, take a look at gh-ost. I believe it changes the online migration paradigm.

Solving the non-atomic table swap, Take III: making it atomic

With the unintended impression of becoming live blogging, we now follow up on Solving the non-atomic table swap, Take II and Solving the Facebook-OSC non-atomic table swap problem with a safe, blocking, atomic solution.

As it turns out, there's a simpler solution which overcomes both of the above. As with math and physics, the simpler solution is often the preferred one. But it took those previous iterations to gather a few ideas together. So, anyway:

Safe, locking, atomic, asynchronous table swap

Do read the aforementioned previous posts; the quick-quick recap is: we want to be able to LOCK a table tbl, then do some stuff, then swap it out and put some ghost table in its place. MySQL does not allow us to rename tbl to tbl_old, ghost to tbl if we have locks on tbl in that session.

The solution we offer is now based on two connections only (as opposed to three, in the optimistic approach). "Our" connections will be C10, C20. The "normal" app connections are C1..C9, C11..C19, C21..C29.

Connections C1..C9 operate on tbl with normal DML: INSERT, UPDATE, DELETE

Connection C10: CREATE TABLE tbl_old (a sentry table; see the notes below)

Connection C10: LOCK TABLES tbl WRITE, tbl_old WRITE

Connections C11..C19, newly incoming, issue queries on tbl but are blocked due to the LOCK

Connection C20: RENAME TABLE tbl TO tbl_old, ghost TO tbl
This is blocked due to the LOCK, but gets prioritized on top of connections C11..C19 and on top of C1..C9 or any other connection that attempts DML on tbl

Connections C21..C29, newly incoming, issue queries on tbl but are blocked due to the LOCK and due to the RENAME, waiting in queue

Connection C10: checks that C20's RENAME is in place (looks for the blocked RENAME in the processlist)

Connection C10: DROP TABLE tbl_old (allowed, as C10 holds the WRITE lock on it; this removes the blocker to the RENAME)

Connection C10: UNLOCK TABLES
BAM! The RENAME is first to execute, the ghost table is swapped in place of tbl, then C1..C9, C11..C19, C21..C29 all get to operate on the new and shiny tbl
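Spelled out as statements, a sketch of the full sequence (the sentry table's exact definition is illustrative):

-- C10:
CREATE TABLE tbl_old (id INT) COMMENT 'magic sentry table';
LOCK TABLES tbl WRITE, tbl_old WRITE;

-- C20 (blocks on the lock, prioritized over all waiting DML):
RENAME TABLE tbl TO tbl_old, ghost TO tbl;

-- C10, after applying the last events onto ghost and spotting C20's blocked RENAME in the processlist:
DROP TABLE tbl_old; -- allowed under the WRITE lock; clears the way for the RENAME
UNLOCK TABLES;      -- the RENAME executes first; ghost is now tbl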

Some notes

We create tbl_old as a blocker for a premature swap

It is allowed for a connection to DROP a table it has under a WRITE LOCK

A blocked RENAME is always prioritized over a blocked INSERT/UPDATE/DELETE, no matter who came first

What happens on failures?

Much fun. Just works; no rollback required.

If C10 errors on the CREATE we do not proceed.

If C10 errors on the LOCK statement, we do not proceed. The table is not locked. App continues to operate as normal.

If C10 dies just as C20 is about to issue the RENAME:

The lock is released, the queries C1..C9, C11..C19 immediately operate on tbl.

C20's RENAME immediately fails because tbl_old exists.
The entire operation fails, but nothing terrible happens; some queries were blocked for some time, is all. We will need to retry everything.

If C10 dies while C20 is blocked on RENAME: Mostly similar to the above. Lock released, then C20 fails the RENAME (because tbl_old exists), then all queries resume normal operation

If C20 dies before C10 drops the table, we catch the error and let C10 proceed as planned: DROP, UNLOCK. Nothing terrible happens, some queries were blocked for some time. We will need to retry

If C20 dies just after C10 DROPs the table but before the unlock, same as above.

If both C10 and C20 die, no problem: LOCK is cleared; RENAME lock is cleared. C1..C9, C11..C19, C21..C29 are free to operate on tbl.

No matter what happens, at the end of operation we look for the ghost table. Is it still there? Then we know the operation failed, "atomically". Is it not there? Then it has been renamed to tbl, and the operation worked atomically.
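A sketch of that final check (schema name hypothetical):

SELECT COUNT(*) FROM information_schema.tables WHERE table_schema = 'mydb' AND table_name = 'ghost';

A count of 1 means the ghost table is still there and the operation failed; 0 means it was swapped in as tbl.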

A side note on failure is the matter of cleaning up the magic tbl_old. This is a matter of taste. Maybe just let it live and avoid recreating it, or drop it if you like.

Impact on app

App connections are guaranteed to be blocked, either until ghost is swapped in, or until operation fails. In the former, they proceed to operate on the new table. In the latter, they proceed to operate on the original table.

Impact on replication

Replication only sees the RENAME. There is no LOCK in the binary logs. Thus, replication sees an atomic two-table swap. There is no table-outage.

Conclusion

This solution satisfies all we wanted to achieve. We're unlikely to give this another iteration. Well, if some yet-more-elegant solution comes along I'll be tempted, for the beauty of it, but the solution offered in this post is simple-enough, safe, atomic, replication friendly, and should make everyone happy.

Solving the non-atomic table swap, Take II

Quick, quickest recap:

We are working on a triggerless online schema migration solution. It is based on an asynchronous approach, similar to the FB OSC, and as opposed to the synchronous solution used by pt-online-schema-change.

We asynchronously synchronize (is that even a valid statement?) between some table tbl and a ghost table ghost, and at some time we want to cut-over: swap the two; kick out tbl and put ghost in its place and under its name.

However, we cannot use the single statement rename tbl to tbl_old, ghost to tbl, because we use the asynchronous approach, where at the time we lock tbl for writes, we still have some events we need to process and apply onto ghost before swapping the two.

And MySQL does not allow a lock tables tbl write; ... ; rename tbl to tbl_old, ghost to tbl.

In Solving the Facebook-OSC non-atomic table swap problem we suggested a way that works, unless when it doesn't work. Read the caveat at the end of the post. Premature death of a connection that participates in the algorithm causes a chain reaction that leads to the premature execution of the rename statement, potentially before we've applied those remaining events. This leads to data inconsistency between the old table and the new table, and is unacceptable.

To that effect, we were more inclined to go with the Facebook solution, which takes two steps: lock tables tbl write; alter table tbl rename to tbl_old; ... ; alter table ghost rename to tbl;

This two-step solution is guaranteed not to have data inconsistency. Alas, it also implies an outage. There's a brief moment in between the two renames, during which we apply those last changes, where the table tbl is simply not there.

Not all applications will fail gracefully on such a scenario.

UDF

We looked at a solution based on UDFs, where we would create global wait conditions that are not connection-based.

We don't like UDFs. You need to compile them for every new version. Puppetizing their setup is not fun. We wouldn't like maintaining this. We wouldn't like doing the operations for this. Neither would the community.

We want to make this a community solution. Can we do without UDF?

Rewriting MySQL

We wish to avoid forking our own version of MySQL. It's not what we do and it's a pain.

A pure MySQL solution?

We found a solution to embrace; it is optimistic, and safe. What optimistic means is explained further on, but let's first discuss safe:

The previous solution we came up with was unsafe, because breakage of a single component in the algorithm would lead to inconsistent data. The algorithm itself was fine, as long as no one would break it from the outside. This is the concern: what if some crazy cronjob that cleans up connections (kills idle connections, kills long running transactions) or some unfortunate user command kills one of the connections involved in the cut-over phase? This is not something that would happen every day, but can we protect against it? Our priority is to keep our data intact.

The solution allows breakage. Even in the face of death of connections, data is not lost/corrupted, and the worst case is a FB-like, recoverable outage scenario.

A step towards the solution, a flawed one

I wish to illustrate something that looks like it would work, but in fact has a hidden flaw. We will later improve on that solution.

Let's assume we have tbl, ghost tables. We execute the following by multiple connections; we call them C1, C2, C3, ...:

C1: lock tables tbl write;

C2..C17: normal app DML queries on tbl, blocked due to the lock

We apply those last events we need to apply onto ghost. No new events are coming our way because tbl is blocked.

C18: rename table tbl to tbl_old, ghost to tbl; (blocked as well)

C1: unlock tables; (everything gets released)

Let's consider the above, and see why it is flawed. But first, why it would typically work in the first place.

Connections C2, ..., C17 came first, and C18 came later. Nevertheless MySQL prioritizes C18 and moves it up the queue of waiting queries on tbl. When we unlock, C18 is the first to execute.

We only issue the rename once we're satisfied we've applied those changes. We only unlock once we're satisfied that the rename is in place, waiting.

If for some reason C1 disconnects before we issue the rename - no problem, we just retry from scratch.

What's the flaw?

We rename when C1 holds the lock. We check with C1 that it is alive and kicking. Yep, it's connected and holding the lock. Are you sure? Yep, I'm good! Really really sure? Yep! OK then, let's rename!

"Oh darn", says C1, "now that you went ahead to rename, but just before you actually sent the request, I decided to take time off and terminate". Or, more realistically, some job would kill C1.

What happens now? The rename is not there yet. All those queries get released, and are immediately applied onto tbl, and then the rename applies, kicks all those changes into oblivion, and puts ghost in place, where it immediately receives further writes.

Those blocked queries were committed, but never to be seen again.

So here's another way to look at the problem: the rename made it through even though the connection C1 died just prior to that, whereas we would have loved the rename to abort upon such case.

Is there a way in MySQL to cause an operation to fail or block when another connection dies? It's the other way around! Connections hold locks, and those get released when they die!

Three step, safe, optimistic solution

C1: lock tables tbl write;

C2..C17: normal app DML queries on tbl, blocked due to the lock

We apply those last events we need to apply onto ghost. No new events are coming our way because tbl is blocked.

C18: checking that C1 is still alive, then rename table tbl to tbl_old

C19: checking to see that C18's rename is in place (via show processlist), and that C1 is still alive; then issues: rename table ghost to tbl

(meanwhile more queries approach tbl, it doesn't matter, they all get deprioritized, same as C2...C17)

C1: unlock tables
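In statement form, a sketch of the above sequence:

-- C1:
LOCK TABLES tbl WRITE;

-- C18, after checking C1 is alive (blocked, prioritized over the waiting DML):
RENAME TABLE tbl TO tbl_old;

-- C19, after seeing C18's rename in the processlist (blocked, queued behind C18):
RENAME TABLE ghost TO tbl;

-- C1:
UNLOCK TABLES;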

What just happened? Let's first explain some stuff:

C18's rename gets prioritized over the DMLs, even though it came later. That is how MySQL prioritizes queries on metadata-locked tables.

C18 checks C1 is still alive, but as before, there's always the chance C1 will die just at the wrong time -- we're going to address that.

C19 is interested to see that C18 began execution, but potentially C18 will crash by the time C19 actually issues its own rename -- we're going to address that

C19's query sounds weird. At that time tbl still exists. You'd expect it to fail immediately -- but it does not. It's valid. This is because tbl's metadata lock is in use.

C19 gets prioritized over all the DMLs, but is known to be behind C18. The two stay in same order of arrival. So, C18 is known to execute before C19.

When C1 unlocks, C18 executes first.

Metadata lock is still in place on tbl even though it doesn't actually exist, because of C19.

C19 operates next.

Finally all the DMLs execute.

What happens on failures?

If C1 dies just as C18 is about to issue the rename, we get an outage: tbl is renamed to tbl_old, and the queries get released and complain the table is just not there.

C19 will not initiate because it is executed after C18 and checks that C1 is alive -- which turns out to be untrue.

So we know we have an outage, and we quickly rename tbl_old to tbl; and go drink coffee, then begin it all again.

The outage is unfortunate, but does not put our data in danger.

If C1 happens to die just as C19 is about to issue its rename, there's no data integrity issue: at this point we've already asserted the tables are in sync. As C1 dies, C18 will immediately rename tbl to tbl_old. An outage will occur, but not for long, because C19 will next issue rename ghost to tbl, and close the gap. We suffered a minor outage, but no rollback. We roll forward.

If C18 happens to die just as C19 is about to issue its rename, nothing bad happens: C19 is still blocked for as long as C1 is running. We find out C18 died, and release C1. C19 attempts to rename ghost onto tbl, but tbl exists and the query fails. The metadata lock is released and all the queries resume operation on the original tbl. The operation failed, but harmlessly. We will need to try the entire cycle again.

If both C1 and C18 fail at the time C19 is about to begin its rename, same as above.

If C18 fails as C19 is already in place, same as above.

If C1 fails as C19 is already in place, it's as good as having it issue the unlock tables. We're happy.

If C19 fails at any given point, we suffer an outage. We revert by renaming tbl_old to tbl.

This solution relies on the notion that if a previous connection failed, we would not be able to rename ghost to tbl because the table would still be there. That's what we were looking for; but instead of looking at locks, which get released when a connection terminates, we used a persistent entity: a table.

Conclusion

The algorithm above is optimistic: if no connections get weirdly killed, it's a valid locking solution, and queries & app are unaware that anything happened (granted, the app will notice write latency). If connections do get weirdly killed, we get a table-outage at worst -- an outage that is already considered to be a valid solution anyhow. The algorithm will not allow data corruption.

Solving the Facebook-OSC non-atomic table swap problem

We present a way to use an atomic, blocking table swap in the Facebook Online-Schema-Change solution, as well as in a rumored, other Online-Schema-rumored-Change solution. Update: also a caveat.

Quick recap (really quick)

pt-online-schema-change and facebook-osc are two popular online-schema-change solutions for MySQL. They both use triggers, but in different ways. In the Percona tool the table updates are synchronous: any INSERT|UPDATE|DELETE on the modified table causes an INSERT|UPDATE|DELETE on a ghost table. In the Facebook tool they all cause an INSERT on a changelog table, which is then iterated and read, with entries applied onto the ghost table.

The TL;DR is that DMLs on the table propagate synchronously, within the same transaction, in the Percona tool, and asynchronously, with lag, in the Facebook tool.
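Schematically, the two approaches look something like this (trigger names and columns hypothetical, not the tools' actual definitions):

-- Percona-style, synchronous: mirror the change onto the ghost table within the same transaction
CREATE TRIGGER tbl_ins AFTER INSERT ON tbl FOR EACH ROW
REPLACE INTO ghost (id, data) VALUES (NEW.id, NEW.data);

-- Facebook-style, asynchronous: only record the event; a separate process later replays it onto the ghost table
CREATE TRIGGER tbl_ins_log AFTER INSERT ON tbl FOR EACH ROW
INSERT INTO changelog (tbl_id, dml_action) VALUES (NEW.id, 'INSERT');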

What's the problem with the table swap?

In the Percona tool, once the logic is satisfied that the copy is complete, we issue this query:

RENAME TABLE tbl TO tbl_old, tbl_new TO tbl;

This is an atomic, two table RENAME operation.

However with the asynchronous nature of the Facebook tool, such a RENAME would be a mistake. We must first block writes to the modified table, then make sure we have iterated the changelog table to the point of lock, apply those changes onto the ghost table, and only then do the swap.

Since alter table causes an implicit commit in innodb, innodb locks get released after the first alter table. So any transaction that sneaks in after the first alter table and before the second alter table gets a 'table not found' error. The second alter table is expected to be very fast though because copytable is not visible to other transactions and so there is no need to wait.

What the FB solution means

It means that for a very brief duration the table is not there. Your app will get errors.

Of course, we should be able to handle errors anytime, anywhere. But the honest truth is: we (as in the world) do not. Many apps will fail ungracefully should they get a table not found error.

An atomic swap, by comparison, would make for briefly blocking operations, keeping the app ignorant of the swap.

Rumor

Rumor has it that we at GitHub are developing a new, triggerless, Online Schema Change tool. It is rumored to be based off binary logs and is rumored to have lots of interesting rumored implications.

Such rumored implementation would have to be asynchronous by nature, or so rumors say. And as such, it would fall for the same non-atomic table swap problem.

Solution

Once we heard it was rumored we were working on a triggerless online schema change tool, we realized we would have to solve the non-atomic swap problem. What we did was to gossip about it between ourselves, which led to three different rumors of a solution, eventually manifested as three different working solutions. All three solutions make for blocking queries on the app's side. I will present one of these solutions here, based on voluntary locks.

The idea is to make a table locked without actually issuing a LOCK TABLE statement, such that we would be able to run a RENAME TABLE operation, that would wait until we say it's good to complete.

Let's assume:

Our table is tbl

Ghost table (table onto which we've actually made the changes) is tbl_new

#1:

SELECT GET_LOCK('ding', 0);

#2:

SELECT RELEASE_LOCK('ding') FROM tbl WHERE GET_LOCK('ding', 86400) >= 0 LIMIT 1;

Ignore the RELEASE_LOCK for now; it is merely cleanup. The query attempts to read one row from tbl where GET_LOCK('ding')>=0. But 'ding' is held by connection #1, hence the entire query blocks (the long timeout keeps it waiting).
Otherwise, other queries on tbl (both reads and writes) are running fine.

#3:

RENAME TABLE tbl TO tbl_old, tbl_new TO tbl;

Now the magic begins. The RENAME operation cannot proceed while queries are executing on tbl. In particular, it waits on #2 to complete. But #2 is blocked on #1, so it does not complete. Our RENAME is also blocked!
There are two further consequences that work to our advantage:

Any further incoming INSERT|UPDATE|DELETE on tbl is now unable to proceed; such queries will now wait for the RENAME to complete. So no further updates on tbl are being applied. The app is blocked.

tbl_new is not blocked! This is because of how RENAME works internally. Since it couldn't satisfy the first clause, it doesn't even check the second, and does not place a LOCK on tbl_new.

OSC:
Now that no further writes are possible on tbl, we satisfy ourselves that we've iterated to the last of the changelog entries and have applied changes to tbl_new. Exactly how we satisfy ourselves is a matter of implementation. Rumor is that we use a rumor that the last entry was handled in our rumored solution. That last part is actually not a pun.
We are now content that all changes have been applied to tbl_new.

#1:

SELECT RELEASE_LOCK('ding');

Ding! Connection #2 gets released, reads some row from tbl (but no one is actually interested in the result set) and completes. The #3 RENAME is not blocking on anything anymore. It executes. The tables are swapped. Once they are swapped, any INSERT|UPDATE|DELETEs that were pending on tbl are released, and the app is unblocked.
The atomic swap is complete.

Implementation

Agony. This workaround is agonizing. Is agonization a word? Rumored to be written in Go, our OSC has this implemented via goroutines, and the code is one of those atrocities you are ashamed to look at. Well, it's OK under the circumstances. But really, implementing this is painful, and actually more complicated than the above description. Why is that?

In the above we make for two blocking operations: #2 and #3. We must not proceed to #3 before #2 is in place, and we must not proceed to OSC completion before #3 is in place. But how does our code know that it's being blocked? If it's being blocked, it can't tell me anything, because it's blocked. If it's not blocked yet, it can tell me it's not blocked yet, but I'm really interested in knowing the time it gets blocked.

And preferably the exact time, or near exact, because once we start blocking, the app suffers. Connections accumulate. We really want to make the swap as quick as possible (and by rumor we have a rollback & retry mechanism for this operation if it exceeds X seconds).

Unfortunately the solution involves polling. That is, once we issue #2 (asynchronously, right? It's blocking), we aggressively poll SHOW PROCESSLIST and look for that blocked query. And the same for #3. Polling is a form of necessary ugliness in this flow.
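The polling query itself is simple enough (a sketch; we match on the text of the blocked #3 statement):

SELECT id FROM information_schema.PROCESSLIST
WHERE info LIKE 'RENAME TABLE tbl TO tbl_old%' AND state = 'Waiting for table metadata lock';

A similar poll for #2 looks for the blocked SELECT, whose state is 'User lock'.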

Other solutions

The other two solutions do not use a voluntary lock. Instead:

Use a LOCK on yet another table, and a query involving that table JOINed with tbl

A SELECT ... FOR UPDATE on yet another table followed by a SELECT on the locked row on that table JOINed with tbl.

We leave the implementation as an exercise for the reader.

Can't we just make the RENAME work under LOCK?

Yeah. That's what the Facebook people said. "Hey, we can just fix this".

Update: caveat

Should connection #1 or connection #2 die unexpectedly before we are satisfied the events have all been applied, the `RENAME` gets unblocked due to the collapse of locks, and we end up with a premature swap of the tables, potentially before we have applied the latest entries from the changelog table. This was noted by my colleague Gillian Gunson, and we keep looking into this.

MySQL Community Awards 2016: the Winners

The MySQL Community Awards initiative is an effort to acknowledge and thank individuals and corporates for their contributions to the MySQL ecosystem. It is a from-the-community, by-the-community and for-the-community effort. The committee is composed of an independent group of community members of different orientation and opinion, themselves past winners or known contributors to the community.

The 2016 community awards were presented on April 21st, 2016, during the keynotes at the Percona Live conference. The winners are:

MySQL Community Awards: Community Contributor of the year 2016

Bill Karwin
Bill has been working with the community for years, helping them understand SQL. Bill is the author of the great book "SQL Antipatterns". He has given a large amount of help on sites such as StackOverflow and Quora, and of course in many conference talks. Bill has provided a huge amount of help to the community directly.

Domas Mituzas
Domas Mituzas started in the MySQL ecosystem as a MySQL Support Engineer at MySQL AB. Since he had some spare time, he did a lot of work to scale MySQL at Wikipedia. He is now a small data engineer at Facebook, mostly working with user-facing data systems. He continues to write very interesting blog posts and bug reports. Domas is responsible for giving us MyDumper, PoorMansProfiler, and the infamous Query Cache tuner!

Yoshinori Matsunobu
Yoshinori Matsunobu is currently leading the MyRocks effort to get the RocksDB storage engine for MySQL into production at Facebook. Previously (amongst his other accomplishments) he created HandlerSocket, and implemented MHA to support failover automation for MySQL – both of which have been used at many companies. He is a frequent speaker at community events, and his tutorials and slide decks do a lot to increase expertise in the community. He is a frequent bug reporter with a focus on replication (RBR, semi-sync).

MySQL Community Awards: Application of the year 2016

MaxScale
MariaDB MaxScale is an Open Source dynamic routing gateway. It is widely used as a database load balancer for Galera Cluster deployments, for standard replication setups, and as a replication relay. It has a modular architecture which includes plugins for read-write splitting and query logging. It serves a variety of tasks, from load balancing to database firewall filtering to binlog server and is widely used in production in large topologies.

MySQL Community Awards: Corporate Contributor of the year 2016

Booking.com
Booking.com has been a massive contributor to the MySQL ecosystem, sending many of their excellent DBAs to various conferences to talk. They have provided an innovative test bed for testing out, and giving a wealth of invaluable feedback about new releases (across a wide variety of MySQL and related software projects). Booking.com contributes to Open Source foundations, projects and communities by donating, sponsoring, making code contributions and hosting events. The quality of MySQL is undoubtedly much better today because of their help and input.

Congrats to all winners!

Committee members

Baron Schwartz

Colin Charles

Daniël van Eeden

Davi Arnaut

Frederic Descamps

Geoffrey Anderson

Giuseppe Maxia

Justin Swanhart

Mark Leith

Morgan Tocker

Philip Stoev

Ronald Bradford

Santiago Lertora

Co-secretaries:

Jeremy Cole

Shlomi Noach

Special thanks

Thank you to this year's anonymous sponsor for donating the goblets!

Thank you to Colin Charles for acquiring and transporting the goblets!

MySQL Community Awards 2016: Call for Nominations!

The 2016 MySQL Community Awards event will take place, as usual, in Santa Clara, during the Percona Live Data Performance Conference, April 2016.

The MySQL Community Awards is a community based initiative. The idea is to publicly recognize contributors to the MySQL ecosystem. The entire process of discussing, voting and awarding is controlled by an independent group of community members, typically made up of past winners or their representatives, as well as known contributors.

It is a self-appointed, self-declared, self-making-up-the-rules-as-it-goes committee. It is also very aware of the importance of the community; a no-nonsense, non-political, adhering-to-tradition, self-criticizing committee.

The Call for Nominations is open. We are seeking the community’s assistance in nominating candidates in the following categories:

MySQL Community Awards: Community Contributor of the year 2016

This is a personal award; a winner would be a person who has made a contribution to the MySQL ecosystem. This could be via development, advocating, blogging, speaking, supporting, etc. All things go.

MySQL Community Awards: Application of the year 2016

An application, project, product etc. which supports the MySQL ecosystem by contributing code, complementing its behaviour, supporting its use, etc. This could range from a one-man open source project to a large-scale social service.

MySQL Community Awards: Corporate Contributor of the year 2016

A company which has made a contribution to the MySQL ecosystem. This might be a corporate which released major open source code; one that advocates for MySQL; one that helps out community members by... anything.

Jeremy Cole and myself (Shlomi Noach) are acting as co-secretaries; we will be non-voting (except for breaking ties).

The committee communicates throughout the nomination and voting process to exchange views and opinions.

The awards

Awards are traditionally donated by some party whose identity remains secret. We are now securing the donation, but if you feel an urgent need to be an anonymous donor, please contact us in private, and thank you!

Support

This is a community effort; we ask for your support in spreading the word and of course in nominating candidates. Thanks!