If you’re in Barcelona next week you may be interested in the MySQL Meetup being held there by the Barcelona MySQL Meetup group on Wednesday 7th July at 7pm. I’ll be doing a talk on MySQL Failover and Orchestration and there will be an opportunity to talk about MySQL and related topics afterwards.

More information can be found on their web page. I look forward to seeing you there.

Introduction

There have been several posts on setting up and using orchestrator. Most of these are quite simple and do not discuss in detail some of the different choices you may want to consider when setting up orchestrator in a real production environment. As I have been using orchestrator for some time I thought it would be good to discuss how a real production orchestrator setup might be achieved. So here are some thoughts which I would like to share.

Basics

The basics for setting up orchestrator are to set up the orchestrator app and configure it so that it can write to a backend MySQL database.

Configuration requires telling orchestrator how to find the MySQL instances you want to monitor and perhaps to forget old servers that are no longer being used. For a small setup you may be happy to do this by hand but adding automation hooks to the provisioning or decommissioning process of MySQL hosts can come in handy. You have the choice of using the command line:

    $ orchestrator -c discover [ -i <host>:<port> ]

or

    $ orchestrator -c forget [ -i <host>:<port> ]

or using the http interface and URLs like

    http://orchestrator.mycompany.com/api/discover/:<host>/:<port>/

or

    http://orchestrator.mycompany.com/api/forget/:<host>/:<port>/

according to which method is easiest to set up. Note: discovery of new servers in an existing replication chain should not be necessary as orchestrator will normally be able to figure this out on its own.
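As a rough illustration of the automation hook idea, the sketch below builds the discover and forget URLs shown above for a list of hosts. It assumes the :<host> and :<port> parts of the URL are placeholders to be substituted with real values. The api_url helper and the host names are hypothetical, and actually calling the URLs from a provisioning or decommissioning script is left out.

```python
# Sketch: build orchestrator API URLs for provisioning/decommissioning hooks.
# ORCHESTRATOR_BASE and the host names below are hypothetical examples.
ORCHESTRATOR_BASE = "http://orchestrator.mycompany.com"


def api_url(action, host, port=3306):
    """Return the discover/forget URL for one MySQL instance."""
    if action not in ("discover", "forget"):
        raise ValueError(f"unsupported action: {action}")
    return f"{ORCHESTRATOR_BASE}/api/{action}/{host}/{port}/"


if __name__ == "__main__":
    for host in ["db1.mycompany.com", "db2.mycompany.com"]:
        print(api_url("discover", host))
    print(api_url("forget", "old-db.mycompany.com", 3307))
```

A provisioning system could call the discover URL once a new host is ready, and the forget URL as the last step of decommissioning.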

Handling Master or Intermediate master failover

Failover behaviour also needs to be configured. Orchestrator is able to adjust the replication topology if a master or intermediate master fails, but sometimes, and this is more so for a primary master, additional external tasks may be needed to complete the failover process. It may also be good to notify the appropriate people or systems before and after dealing with the failover.

This is handled in orchestrator.conf.json with the following settings:

    PreFailoverProcesses
    PostFailoverProcesses
    PostUnsuccessfulFailoverProcesses
    PostMasterFailoverProcesses
    PostIntermediateMasterFailoverProcesses

which will run scripts on the active orchestrator node to achieve the desired configuration changes. You can use these hooks to do tasks such as:

notify people or systems of the issue that’s been seen

change the configuration of external systems which need to be aware of a master or intermediate master failure

tell the applications of the failure and where to find the new master

All of these tasks will be specific to your environment but there’s plenty of freedom here to hook orchestrator in even if it is not directly aware of “the outside world”.
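To make the hook settings more concrete, here is a small sketch that prints what such a fragment of orchestrator.conf.json might look like. The script paths are hypothetical, and the {failureCluster}, {failedHost} and {successorHost} placeholders are an assumption that should be checked against the documentation of the orchestrator version you run.

```python
import json

# Sketch of the failover hook settings as they might appear in
# orchestrator.conf.json. The script paths are hypothetical, and the
# {placeholder} names are an assumption to check against your version.
hooks = {
    "PreFailoverProcesses": [
        "/usr/local/bin/notify-oncall 'failover starting on {failureCluster}'"
    ],
    "PostMasterFailoverProcesses": [
        "/usr/local/bin/update-app-config {failedHost} {successorHost}"
    ],
    "PostIntermediateMasterFailoverProcesses": [],
    "PostFailoverProcesses": [
        "/usr/local/bin/notify-oncall 'failover finished on {failureCluster}'"
    ],
    "PostUnsuccessfulFailoverProcesses": [
        "/usr/local/bin/notify-oncall 'failover FAILED on {failureCluster}'"
    ],
}

if __name__ == "__main__":
    print(json.dumps(hooks, indent=2))
```

Each setting holds a list of commands, so several scripts can be chained for the same stage of the failover.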

Selection of servers to be eligible [intermediate] masters

You may have special servers, such as those used for testing, or located in a different part of your network, which you do not want to promote to be a master or intermediate master. Orchestrator lets you indicate this with settings such as

    ProblemIgnoreHostnameFilters
    PromotionIgnoreHostnameFilters
    RecoveryIgnoreHostnameFilters
    RecoverMasterClusterFilters
    RecoverIntermediateMasterClusterFilters
    OSCIgnoreHostnameFilters

This works pretty well and covers almost all situations where some servers need special handling for one reason or another.

Failover PromotionRules

For larger setups where there are more servers in the cluster you may prefer orchestrator to fail over to one or more specific servers, and there are some promotion rules you can apply to adjust the priority of which servers are preferred as a candidate when a failure occurs.

Currently this is configured on a per MySQL instance basis giving it one of the types Prefer, Neutral (default value) or Must Not. (The code does have two other options Must and Prefer Not but these are not implemented.)

Configuration can be done from the command line with:

    $ orchestrator -c register-candidate [ -i <host>:<port> ]

though here the configured default promotion rule (Prefer) is used. You can also use the http interface, where you can explicitly state the required promotion rule.

It is also possible to pull out the promotion rules as a bulk operation using the url:

    http://orchestrator.mycompany.com/api/bulk-promotion-rules

This is convenient if you want to configure this centrally rather than individually on each MySQL instance.

High Availability Setup

If you really care about your MySQL servers not failing you probably also care about orchestrator itself not failing, so what can be done to make this service more reliable?

Orchestrator itself comprises two parts: the orchestrator application and the MySQL backend it writes to.

As far as the orchestrator app is concerned it is easy to configure more than one server. All apps use the same configuration and talk to the same MySQL backend database. They co-operate by writing to a common table in the backend database and electing a leader (or active node) which actively polls all known MySQL instances. The other nodes are running but doing nothing. Should the elected leader stop working another app will be chosen and take over the process of checking all MySQL instances. So setting up more than one app is very straightforward and usually it is good to set up orchestrator app servers in the same locations or datacentres where your MySQL servers are running.

Once you have more than one orchestrator app running it is convenient to use some sort of load balancing technology to make orchestrator visible via a single URL. This process works quite nicely as normal usage of the GUI can work on any of the orchestrator nodes, even if the active monitoring only takes place on one of them. This is where it may be convenient to add an authentication and https layer, neither of which is handled directly by orchestrator but which can easily be added using something like nginx.

The URL

    http://orchestrator.mycompany.com/web/status

is very convenient as it shows you the apps which are running, their version and which node is the active node. You can see an example below on some testing servers I use:

[Image: orchestrator web status]

Orchestrator’s handling of the backend MySQL server going away is something which perhaps deserves a comment. Orchestrator has a backend database and expects it to be there. So configuring a single MySQL server as orchestrator’s backend is probably not ideal. Standard MySQL replication will give you a spare and I think that for most cases this is in practice good enough.

If the “orchestrator db” master fails it is unlikely that orchestrator will be able to fix this. The paranoid may like to consider using something like Galera, MySQL Cluster or even the new MySQL Group Replication (and InnoDB Cluster when it is released), but all that orchestrator really cares about is being able to write to a backend database so it can store state and use that state later. Additional auditing, logging, and history information is kept but none of this is critical and write rates on the backend are generally low unless the number of instances you monitor is very high. So you can adjust the orchestrator configuration to talk to a different MySQL host or, alternatively, have the configuration use a virtual IP or DNS CNAME, which gives you the flexibility to make quick changes without needing to adjust the orchestrator configuration itself.

While I use standard MySQL replication to provide a spare backend I also keep a record of the MySQL instances ( host:port ) so even under some completely strange broken setup I can feed this information into orchestrator via the discovery interface into an empty configuration and have orchestrator working again in a few seconds. A convenient URL http://orchestrator.mycompany.com/api/bulk-instances is designed to simplify this task.

So all in all the HA setup is quite easy to get going and the good thing about that is then it is easy to upgrade any of the nodes just by stopping it, adjusting binaries and restarting, without having to worry about the “MySQL Failover Service” not being available.

People may wonder why this matters so much. If your setup is small then the chances of the master or intermediate master failing are also quite low. As your environment grows so does the chance of a failure occurring. I see failures, sometimes more than once a day, and prefer orchestrator to be running so I do not have to deal with these failures manually.

Monitoring Orchestrator

What’s required to monitor orchestrator? Basically you want to monitor that the orchestrator process is working and that the http web interface responds, and to do so on each of the boxes individually (especially if you run several app servers).

Orchestrator itself also supports graphite and can provide you information on internal activity such as the number of successful or failed discovery processes (polling MySQL servers) and also read and write activity to the backend MySQL store. However if you’re not using graphite this is more tricky.

I have made some code changes to provide further, more detailed metrics on the time taken to poll and check each of the monitored MySQL servers. I had experienced some load issues due to the number of servers being monitored and these timing metrics helped identify where to focus to fix this. These metrics are available via a raw http api call and for simplicity aggregate values can be retrieved for the last few seconds. This makes tying into any external monitoring system much easier.

Some of these patches have been passed back upstream to github and further patches should arrive shortly. Adding these metrics allowed me to identify bottlenecks in orchestrator when monitoring a large number of systems and, together with colleagues, performance enhancements for this sort of situation have been fed back upstream.

Summary

I hope that this article helps provide a bit more insight into what might be worth thinking about when setting up orchestrator for the first time in a production environment. Feel free to contact me if more detail is needed or something is not clear enough.

I rarely talk explicitly about where I work (booking.com). However, I do enjoy it. We are lucky to have just completed our wonderful annual event where we all come together and share some time, not only with direct colleagues, but also with colleagues from other offices we see much less frequently. This video in many ways represents what brings us together and what makes this all so special. I hope you enjoy it.

The Madrid MySQL Users Group has its next meeting on Tuesday, 22nd November 2016. Giuseppe Maxia will be giving a presentation MySQL document store: SQL and NoSQL united and I’ll be providing a brief summary of the new MySQL 8.0 and MariaDB 10.2 beta versions which were announced recently. There will also be an opportunity to discuss topics related to MySQL. Hope to see you there.

Oracle Open World 2016 has just finished in San Francisco and we are now about to embark on Percona Live Europe in Amsterdam.

I offered a presentation in San Francisco on the MySQL X protocol, the new protocol that Oracle is using to make the DocumentStore work. This new protocol also allows you to send normal SQL queries to it, and it looks like Oracle has plans to use it in more scenarios.

I recently posted Lossless RBR for MySQL 8.0 about a concern I have about moving to minimal RBR in MySQL 8.0. This seems to be the direction that Oracle is considering, but I am not sure it is a good idea as a default setting.

I talked about a hypothetical new replication mode lossless RBR and also about recovery after a crash where perhaps the data on the slave may get out of sync with the master. Under normal circumstances this should not happen but in the real world sometimes it does.

Note: I’m talking about an environment that does not use GTID. GTID is good but may have its own issues and it’s probably best to leave those discussions to another post.

So let us talk about the difference between IDEMPOTENT mode (slave_exec_mode=IDEMPOTENT) and what I’ll call AUTO-REPAIR mode, mentioned in feature request bug#54250 to Oracle in 2010. By default the DBA wants to avoid any data corruption, so this should be the default behaviour. Thus I’d prefer auto-repair mode to be off by default, stopping replication if any inconsistencies are found. I could enable it if I see such an issue as it should help me recover the state of the database without adding further “corruption” to the slave.

If I’m confident that this procedure works fine and I’m monitoring the counters mentioned below then it may be fine to leave it enabled all the time.

A slave fails: it may crash and then it recovers. It’s likely that the replication position it “remembers” is behind the actual state in the database.

If we use full RBR (default setting) in these circumstances then we may get a set of changes which the SQL thread tries to apply.

They’ll be in the form of:

before row image / after row image

before row image / after row image

…

where each row image is the set of column values prior to and after the row changes. Traditionally we use the abbreviations BI and AI for this.

Currently the SQL thread will start up and look for the first row to change and, once it has found it, change it. If the exact matching conditions it needs can not be found then an error will be generated and replication stops.

IDEMPOTENT mode attempts to address this and tries to “continue whatever the cost”. To be honest I’m not exactly sure what it does, but it’s clear that it will either do nothing or perhaps it might try to find the row by primary key and update that row. I’d expect it probably does nothing.

See a comment later on. So I did go and check, and the comments on slave_exec_mode say that it suppresses duplicate-key and no-key-found errors. There is no mention of updates where the full AI is unavailable (e.g. when using minimal RBR).

It also looks like it does not “repair” the issue, but it simply ignores it. The documentation is not 100% clear to me.

I made a comment about different options for AUTO-REPAIR mode and when it can work and when it can not. In FULL RBR mode it should always be able to do something. In MINIMAL RBR mode there will be cases when it can not. Let’s see the case of FULL RBR mode:

For an UPDATE when the requested row can not be found:

auto-repair mode would INSERT the row. You have a full AI so you can do this safely.

A counter should be updated to record this action.

For a DELETE row operation when the row can not be found:

auto-repair mode would ignore the error and given the row does not exist anyway the effect of the DELETE has already been accomplished.

A counter should be updated to record this action

For an INSERT row operation when the row already exists (duplicate key insert):

This is what generally breaks replication.

auto-repair mode would treat this as an UPDATE operation (based on the primary key in the table) and ensure the row is changed to have the same primary key and the columns of the AI.

Again a counter should be updated to record this action.

In FULL RBR mode these 3 actions should allow replication to continue. The database is no more corrupt than it was before. In fact it’s in a state that’s somewhat better.
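The three cases above can be sketched as a toy model. This is only an illustration of the proposed AUTO-REPAIR behaviour, not anything MySQL implements: the table is a plain dictionary keyed by primary key, the event format and the counter names are made up for the example, and updates which change the primary key are ignored for simplicity.

```python
from collections import Counter


def apply_event(table, event, counters, auto_repair=False):
    """Apply one full-RBR row event to `table` (a dict keyed by primary key).

    `event` is (kind, pk, after_image). With auto_repair off, an
    inconsistency raises an error, which stands in for replication stopping.
    """
    kind, pk, after = event
    if kind == "INSERT":
        if pk in table:
            if not auto_repair:
                raise RuntimeError(f"duplicate key {pk}")
            counters["insert_as_update"] += 1
        table[pk] = after  # the full after image overwrites any existing row
    elif kind == "UPDATE":
        if pk not in table:
            if not auto_repair:
                raise RuntimeError(f"row {pk} not found for UPDATE")
            counters["update_as_insert"] += 1
        table[pk] = after
    elif kind == "DELETE":
        if pk not in table:
            if not auto_repair:
                raise RuntimeError(f"row {pk} not found for DELETE")
            counters["delete_ignored"] += 1
        table.pop(pk, None)
    else:
        raise ValueError(f"unknown event kind: {kind}")


if __name__ == "__main__":
    slave = {1: {"name": "a"}}          # slave state after a crash
    events = [
        ("INSERT", 1, {"name": "a"}),   # already applied before the crash
        ("UPDATE", 2, {"name": "b2"}),  # row 2 is missing on the slave
        ("DELETE", 3, None),            # row 3 is already gone
    ]
    counters = Counter()
    for ev in events:
        apply_event(slave, ev, counters, auto_repair=True)
    print(slave)
    print(dict(counters))
```

With auto_repair left off the first inconsistent event raises an error, which is the safe default argued for here. With it on, the three counters show how many repair actions were needed.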

In many cases other row events will proceed as expected without issue: INSERTS will happen, UPDATES and DELETEs to existing rows will work as the row is found, and things will proceed as normal.

So should we get in a situation like this we can check the 3 counters and this gives us a clue as to the number of “repair actions” which MySQL has had to execute. It also gives us an idea of how inconsistent the slave seems to be, though those inconsistencies should now have been removed.

As I said I can’t remember exactly what IDEMPOTENT mode does in these 3 circumstances. It may do something similar to my AUTO-REPAIR mode or it may just skip the errors.

Why don’t I know? Well I’m currently in a plane and the mysql documentation is not provided with my mysql server software and I’m not online so I can’t check. I used to find the info file or a pdf of the manual quite helpful in such situations and would love to see it put back again so I don’t need to speculate about what the documentation says.

Yes, I could update this text when I’m back online, but I think I’ll make the point and leave this paragraph here.

So with FULL RBR the situation seems to me to be clear. IDEMPOTENT mode may not do the same thing as the AUTO-REPAIR mode, and whether it does or not there are no counters to see the effect it produces on my server. So I’m blind. I do not like that.

Let’s change the topic slightly and now switch to MINIMAL RBR and do the same thing. In theory now IDEMPOTENT mode and AUTO-REPAIR mode may seem to be the same (assuming IDEMPOTENT mode changes what it can) but that’s also not entirely true.

With minimal RBR mode we get a set of primary key plus changed columns for each row that changes. For INSERTS we get the full ROW and for DELETES we only need the primary key. That should be enough.

What changes here are the UPDATES: if we don’t get the full row image we can not know what was in the table before. We only have information on the new data, so other columns which are not mentioned are unknown to us. If we are UPDATING a row and we can not find it, an INSERT is not possible as we do not have enough information to complete the columns that are unknown to us. So replication MUST stop if we want to avoid corruption.

Additionally, with minimal RBR UPDATES even if you find the ROW to UPDATE you can not be sure you are doing the right thing as you have no reference to the content or state of the before image. My thought here was that the ideal thing would be to send with each row a checksum of the row content on the master. This would be “small” (so efficient) and could be checked against the row content on the slave prior to making the update. If the values match we know the RBR UPDATE is working on expected data. This makes a DBA feel more comfortable.
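A minimal sketch of the checksum idea follows. Nothing like this exists in the binlog format today. The row representation, the use of CRC32 and the function names are all assumptions made for the illustration.

```python
import zlib


def row_checksum(row):
    """CRC32 over the row's column names and values in a fixed order."""
    payload = repr(sorted(row.items())).encode("utf-8")
    return zlib.crc32(payload)


def apply_minimal_update(table, pk, changed, before_checksum):
    """Apply a minimal-RBR style UPDATE only if the slave row matches."""
    if pk not in table:
        # No full after image is available, so the row cannot be rebuilt.
        raise RuntimeError(f"row {pk} missing: replication must stop")
    if row_checksum(table[pk]) != before_checksum:
        raise RuntimeError(f"row {pk} differs from the master's before image")
    table[pk].update(changed)


if __name__ == "__main__":
    master_row = {"id": 1, "name": "a", "qty": 5}
    checksum = row_checksum(master_row)   # would travel with the row event

    slave = {1: dict(master_row)}         # slave in sync: update is applied
    apply_minimal_update(slave, 1, {"qty": 6}, checksum)
    print(slave[1])

    drifted = {1: {"id": 1, "name": "zzz", "qty": 5}}
    try:
        apply_minimal_update(drifted, 1, {"qty": 6}, checksum)
    except RuntimeError as err:
        print("stopped:", err)
```

The checksum is a few bytes per row, so it keeps most of the size benefit of minimal RBR while still catching a slave row which has drifted from the master.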

Table definitions on a master and its slaves are not always identical. There are several reasons for this: different (major) versions of MySQL may be in use, or, because it is impossible to take downtime on the server, some sort of out of band ALTER TABLE may have been run on the slave while that change is still pending on the master. The typical case here is adding new columns, or changing the type, width, character set or collation of an existing column. In these circumstances the binary image on the master and slave may well not be the same so the before row image “checksum” on the master would not be usable. To detect such a situation it may be necessary to also send a table definition checksum with the row before image checksum, though this could be sent for each set of events on a table, not each row. The combination of the two values should be enough to allow us to ensure that minimal RBR changes can be validated even if we do not push down a full before image into the binlog stream. Again, if the definitions do not match it would seem sensible to update a counter to indicate such a situation. We probably do not want to stop replication in this situation. Those who do not expect any sort of differences between master and slave may be paranoid enough to want to not continue, but I know for my usage I’d like to monitor changes to the counter but probably just continue.

Even my proposed LOSSLESS RBR would need this checksum to be safe as it would not contain the full before image but only the PK + all columns for an UPDATE operation, so potentially “slave drift” might happen and go undetected.

I can see therefore that optionally being able to add to minimal- and lossless-RBR such checksums would be a good way to ensure that replication works safely and pushes out changes to the slaves which are expected, and catches unexpected inconsistencies.

The additional counters mentioned would help “catch” the number of inconsistencies that take place and they would be good even with the current replication setup when IDEMPOTENT mode is used. This lack of visibility of errors should make most DBAs rather sleepless, but I suspect there are those that are not aware and those that just have to live without that knowledge. Having these extra counters would help us see when things are not the same and allow us to take any necessary action based on that information should it be necessary.

I hope with this post I have clarified why IDEMPOTENT mode is not the same as my suggested AUTO-REPAIR mode, and when it is safe to continue replicating and when it is not under a variety of different conditions which would normally make RBR stop.

It also seems clear to me that RBR would benefit from some additional checksums to allow the DBA to be more confident that the changes being made on the slave match those made on the master. This is especially so if using minimal RBR.

The use of minimal RBR is an optimisation: it is done deliberately in busy environments where the size of written data is large and it is not convenient/possible to keep all data. Additionally the performance of full RBR can in some cases be an issue especially for “fat/wide” tables etc. It is true that minimal RBR helps considerably here. There are several issues it resolves:

reduces network bandwidth between master and slaves

reduces disk i/o reading / writing the binlog files

reduces disk storage occupied by said binlogs

There was also a comment about enabling IDEMPOTENT mode by default on a slave.

This is a mode which basically ignores most errors. That does not seem wise. As a DBA by default you want the server to not lose or munge data. There are times when you may decide to forgo that requirement, but the DBA should decide and the default behaviour should be safe.

Thus the idea of lossless RBR came to mind. What would this involve compared to the current modes of FULL or MINIMAL RBR?

INSERTs are unchanged (as now): you get the full row

DELETEs are as per minimal RBR: The primary key is sent and the matching row is removed. IFF on a slave the pks differed and more than one row would be deleted this should be treated as an error.

UPDATEs: Send the pk + full new image, thus ensuring that all new data is sent. This reduces the event size by ~ 1/2 so would be especially good for fat tables and tables where large updates go through. If the PK columns do not change then it should be sufficient to send the new row image and pk column names etc
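To compare what each mode would carry for an UPDATE, here is a small sketch. It models a row event as a pair of dictionaries (before image, after image). LOSSLESS is the mode proposed here and does not exist in MySQL, and the function and column names are invented for the illustration.

```python
def update_event(mode, pk_cols, before, after):
    """Return the (before image, after image) logged for an UPDATE.

    FULL: every column before and after.
    MINIMAL: primary key before, only changed columns after.
    LOSSLESS (the proposal here): primary key before, every column after.
    """
    pk = {c: before[c] for c in pk_cols}
    if mode == "FULL":
        return dict(before), dict(after)
    if mode == "MINIMAL":
        changed = {c: v for c, v in after.items() if before.get(c) != v}
        return pk, changed
    if mode == "LOSSLESS":
        return pk, dict(after)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    before = {"id": 7, "name": "x", "notes": "long text", "qty": 1}
    after = {"id": 7, "name": "x", "notes": "long text", "qty": 2}
    for mode in ("FULL", "MINIMAL", "LOSSLESS"):
        bi, ai = update_event(mode, ["id"], before, after)
        print(mode, "columns sent:", len(bi) + len(ai))
    # FULL sends 8 columns, MINIMAL 2 and LOSSLESS 5 for this row
```

For this four column row LOSSLESS sends a little over half of what FULL sends, and the saving approaches one half as tables get wider.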

Related to this behaviour it would be most convenient to implement an existing FR (bug#69223) to require that table definitions via CREATE/ALTER TABLE MUST HAVE a PK. I’ve seen several issues where a developer has not thought a primary key was important (they often forget replication) and this would trigger problems. Inserts would work fine but any updates that happened afterwards would trigger a problem, not on the master but on all slaves. I think that by default this behaviour should be enabled. There may be situations where it needs to be disabled but they are likely to be rather limited.

This new mode LOSSLESS RBR is clearly a mix between full and minimal and it ensures that data pushed into a slave will always be complete. I think that is a better target to aim for with MySQL 8.0 than the suggested MINIMAL RBR.

You may know that I do not like IDEMPOTENT mode much. I have created several FRs to add counters to “lost/ignored events” so we can see the impact of using this mode (usually it is used after an outage to keep replication going even if this may mean some data is not being updated correctly. Usually this is better than having a slave with 100% stale data.)

I would really also like to see you adding a “safe recovery mode” where statements which won’t damage the slave more are accepted.

UPDATEs with non-matching columns: update what you can. (This is likely to happen with full RBR as minimal RBR should never generate this type of error.)

[ For each of these 4 states: add counters to indicate how many times this has happened, so we can see if we’re “correcting” or “fixing” errors or not. ]

You’ll notice that lossless RBR would work perfectly with this even after a crash as you’ll have all the data you need, so you’ll never make the state of the database any worse than it was before.

I would like to see the FRs I’ve made regarding improving RBR being implemented as whether lossless RBR becomes a new replication mode or not they would help DBAs both diagnose and fix problems more easily than now.

It is probably also worth noting that FULL RBR is actually useful for a variety of scenarios, for example for exporting changes to other non-MySQL systems. What is missing for this is the definition of the tables, and current systems need to extract that out of band which is a major nuisance. Exporting to external systems may not have happened that frequently in the past, but as larger companies use MySQL this becomes more and more important. For this type of system FULL RBR is probably needed even though it may not be used on the upstream master. I would expect that in most cases LOSSLESS RBR would also serve this purpose pretty well and reduce the replication footprint. The only environment that may need traditional FULL RBR is where auditing of ALL changes in a table is needed and thus both the before and after images are required.

Is it worth adding yet another replication mode to MySQL? That is a good question and it may not be worth the effort. However the differences between FULL and LOSSLESS RBR should be minimal: the only difference is the amount of data that’s pushed into the binlog so the scope of changes etc should be more limited. Improving replication performance seems to be a good goal: we all need that, but over-optimising should be considered more carefully. I think we are still missing the monitoring metrics which help us diagnose and be better aware of issues in RBR and the “tools” or improvements which would make recovery easier. Unless you live in the real world of systems which break it is hard to understand why these “obscure” edge cases matter that much. The responses like: “just restart mysqld” may make sense in some environments, but really are not realistic in systems that run 24x7x365. With replication it is similar: stopped replication is worse than replication that is working, but where data may not be complete. Depending on the situation you may tolerate that “incomplete data” (temporarily) while gaining the changes which your apps need to see. However, it is vitally important to be able to measure the “damage” and that is why counters like the ones indicated above are so vital. It allows you to distinguish 1 broken row, or 1,000,000 and decide on how to prioritise and deal with that as appropriate.

While I guess the MySQL replication developers are busy I would certainly be interested in hearing their thoughts on this possible new replication mode and would definitely prefer it over the suggested minimal RBR as a default for 8.0. Both FULL and MINIMAL RBR have their place, but perhaps LOSSLESS would be a better default? What do you think?

When trying out new software there are many other questions you may ask and one of those is going to be the one above. The answer requires you to have built your software to capture and record low level database metrics and often the focus of application developers is slightly different: they focus on how fast the application runs, but do not pay direct attention to the speed of each MySQL query they generate, at least under normal circumstances. So often they are not necessarily able to answer the question.

I have been evaluating MySQL 5.7 for some time, but only since its change to GA status has the focus switched to checking for any remaining issues and also to determining if in the systems I use performance is better or worse than MySQL 5.6. The answers here are very application and load specific and I wanted a tool to help me answer that question more easily.

Since MySQL 5.6, the performance_schema database has had a table performance_schema.events_statements_summary_by_digest which shows collected metrics on normalised versions of queries. This allows you to see which queries are busiest and gives you some metrics on those queries such as minimum, maximum and average query times.

I used this information and built queryprofiler to allow me to collect these metrics in parallel from one or more servers and thus allow me to compare the behaviour of these servers against each other. This allows me to answer the question that had been nagging me for some time in a completely generic way. It should also work on MariaDB 10.0 and later though I have not had time to try that out yet.

queryprofiler works slightly differently to just querying P_S once. It takes several collections of the data, computes deltas between each collection thus allowing you to know things like the number of queries per second which events_statements_summary_by_digest does not tell you. (There is no information in performance_schema telling you when the collections start. That is something I miss and would like to see fixed in MySQL 5.8 if possible.)
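The delta idea can be sketched as follows. This is not queryprofiler's actual code, just an illustration under simple assumptions: each collection is held as a dictionary mapping DIGEST to the COUNT_STAR and SUM_TIMER_WAIT columns of events_statements_summary_by_digest, and the function name and sample values are made up.

```python
def digest_deltas(first, second, interval_seconds):
    """Per-digest rates between two snapshots of
    events_statements_summary_by_digest.

    Each snapshot maps DIGEST -> (COUNT_STAR, SUM_TIMER_WAIT), with
    SUM_TIMER_WAIT in picoseconds as performance_schema reports it.
    """
    result = {}
    for digest, (count2, wait2) in second.items():
        count1, wait1 = first.get(digest, (0, 0))
        queries = count2 - count1
        if queries <= 0:
            continue  # no activity (or the table was truncated) in between
        result[digest] = {
            "qps": queries / interval_seconds,
            "avg_latency_ms": (wait2 - wait1) / queries / 1e9,
        }
    return result


if __name__ == "__main__":
    first = {"abc": (100, 5_000_000_000_000)}
    second = {"abc": (150, 7_500_000_000_000)}
    print(digest_deltas(first, second, interval_seconds=10))
    # {'abc': {'qps': 5.0, 'avg_latency_ms': 50.0}}
```

Running the same calculation against collections taken at the same moments from two servers gives directly comparable per-query numbers.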

The other difference of course is that P_S gives you information on one server. If you collect the information at the same time from more than one server with a similar load then the numbers you get out should be very similar and that is what queryprofiler does.

How do you use queryprofiler? Provide it with one or more Go-style MySQL DSNs to connect to the servers and optionally tell it how many times to collect data from the servers (default: 10) and at what interval (default: every second) and it will run and give you the results, telling you the top queries seen (by elapsed time of the query) and the metrics for each server (queries per second, average query latency and how much these values vary).

MySQL 5.7 GA was released a couple of months ago now with 5.7.9, and 5.7.10 was published a few days ago. So far initial testing of these versions looks pretty good and both versions have proved to be stable.

I have, however, been bitten by a couple of gotchas which if you are not aware of them may be a bit of a surprise. This post is to bring them to your attention.

New MySQL accounts expire by default after 360 days

This is as per documentation, so there is no bug here. MySQL 5.7 provides a new more secure environment. One of the changes is to add password expiry and the default behaviour is for passwords to expire after 360 days. This seems good, but you, perhaps like me, may not be accustomed to managing your passwords, checking for expiration and adjusting the MySQL user settings accordingly. The default setting of default_password_lifetime is 360 days, so after upgrading a server to MySQL 5.7 from MySQL 5.6 this setting suddenly comes to life. The good thing is nothing happens immediately so you do not see the time bomb ticking away. I have been testing the DMR versions of MySQL 5.7 prior to the GA release and consequently have been using it for longer than 2 months. Recently a couple of 5.7.9 servers which had been upgraded from 5.6 a year ago decided to block access to all applications at the same time. The quick fix is simple: change the default setting to 0 (no expiry) and we have a configuration that behaves like MySQL 5.6 even if it is less secure than the default MySQL 5.7 setup. We can then look at how to manage the MySQL accounts and take this new setting into account in a more secure manner. If you are starting to use MySQL 5.7 and are not migrating from 5.6 then perhaps you’ll put the right checks in place when you start, but those of us migrating from 5.6 can not push down grants with the new ALTER USER syntax until the 5.6 masters are upgraded so we need to pay more attention to this while in the process of migrating.
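A simple way to see the time bomb coming is to compute the expiry date per account. The sketch below assumes you have already read each account's password_last_changed and password_lifetime values from mysql.user. The function name is hypothetical, and the handling of NULL (use the global default) and 0 (never expire) is my reading of the documented behaviour.

```python
from datetime import datetime, timedelta


def expiry_date(password_last_changed, password_lifetime, default_lifetime=360):
    """Return when an account's password expires, or None if it never does.

    password_lifetime mirrors the per-account value: None means "use the
    global default_password_lifetime", 0 means "never expire".
    """
    days = default_lifetime if password_lifetime is None else password_lifetime
    if days == 0:
        return None
    return password_last_changed + timedelta(days=days)


if __name__ == "__main__":
    changed = datetime(2015, 1, 1)
    print(expiry_date(changed, None))                      # 2015-12-27 00:00:00
    print(expiry_date(changed, None, default_lifetime=0))  # None
```

Feeding the output into your monitoring gives you a warning well before applications are locked out.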

New range optimizer setting might cause unexpected table scans if not set properly

MySQL 5.7.9 GA added a new configuration variable: range_optimizer_max_mem_size, set by default to 1536000. The documentation does not say much about this new setting and it seems quite harmless: “if … the optimizer estimates that the amount of memory needed for this method would exceed the limit, it abandons the plan and considers other plans.” The range optimiser is used for point selects, primary key lookups and other similar queries. What this setting does is, after parsing a query, look at the number of items which may be referenced in a WHERE clause and, if the estimated memory usage is too high, fall back to a slower method.

Let’s put this into context. A query like SELECT some_columns FROM some_table WHERE id IN ( 1, 2, 3, ... big list of ids ... 99998, 99999 ) will trigger this limit being reached for a large enough range of ids. DELETE FROM some_table WHERE (pk1 = 1 AND pk2 = 11) OR (pk1 = 2 AND pk2 = 12) .. OR .. (pk1 = 111 AND pk2 = 121) /* pk1 and pk2 form a [primary] key */ would also potentially trigger this.

The questions that come out of this are (a) “How to figure out the point at which this change happens?”, and (b) “What happens at this point?”

The answer to (b) is simple: MySQL falls back to doing a table scan (per item). The answer to (a) is not so clear. Bug#78752 is a feature request to make this clearer, and further investigation pointed to MySQL 5.6’s previous behaviour where the limit was defined in terms of a fixed number of hard-coded “items” (16,000), whereas 5.7’s new behaviour is in terms of memory usage. The relationship between the two settings is not very clear, and initial guesstimates on systems I saw issues with seem to indicate that maybe 4kB per item is used by MySQL 5.7 at the moment. The point is that what worked quickly as point selects on 5.6 may fall back to table scans per item in 5.7 if the number of entries is too high, and fixing this requires changing range_optimizer_max_mem_size (the setting is dynamic, so no restart is needed). The bad behaviour may also only show up depending on the size of the query.
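To put rough numbers on this, the arithmetic can be sketched as follows. The 4kB-per-item figure is only the guesstimate mentioned above, not a documented value, and the helper names `max_items` and `mem_needed` are made up for the example.

```python
# Figures quoted in the text; the per-item cost is a rough guess, not documented.
DEFAULT_RANGE_OPTIMIZER_MAX_MEM_SIZE = 1_536_000  # bytes, 5.7.9 default
ASSUMED_BYTES_PER_ITEM = 4 * 1024                 # guesstimate: ~4kB per item
MYSQL_56_ITEM_LIMIT = 16_000                      # hard-coded limit in 5.6


def max_items(mem_limit_bytes: int = DEFAULT_RANGE_OPTIMIZER_MAX_MEM_SIZE,
              bytes_per_item: int = ASSUMED_BYTES_PER_ITEM) -> int:
    """Estimate how many items fit before the range optimiser gives up."""
    return mem_limit_bytes // bytes_per_item


def mem_needed(items: int, bytes_per_item: int = ASSUMED_BYTES_PER_ITEM) -> int:
    """Estimate the setting needed to keep range access for `items` items."""
    return items * bytes_per_item


if __name__ == "__main__":
    print(max_items())                      # estimated item threshold in 5.7.9
    print(mem_needed(MYSQL_56_ITEM_LIMIT))  # bytes needed to match the 5.6 limit
```

Under that assumption the default allows only about 375 items, against 16,000 in 5.6, and matching the old limit would need a setting of around 65MB. Treat these as ballpark figures to be checked on your own workload.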

Many people may wonder why anyone would be mad enough to use a SELECT or DELETE statement with several thousand entries in an IN () clause, but this comes from having split data in a single server into two and making the application find a list of ids from one server using some criteria and then using the ids obtained in a different one. I see that pattern used frequently and it is probably a common pattern on any system where data will no longer fit in a single server.
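The fix discussed in this post is the configuration change, but the pattern itself can also be made less sensitive to the threshold by sending the ids in smaller batches. The following is a minimal sketch of that idea and not something taken from a real application: the function name `in_list_batches` is hypothetical, the table and column names are the placeholders used above, and the `%s` placeholder style is assumed because it is the one commonly used by Python MySQL drivers.

```python
from typing import Iterator, List, Sequence, Tuple


def in_list_batches(ids: Sequence[int], batch_size: int) -> Iterator[Tuple[str, List[int]]]:
    """Yield (query, parameters) pairs, each with at most batch_size ids."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ids), batch_size):
        batch = list(ids[start:start + batch_size])
        # One placeholder per id; the values travel separately as parameters.
        placeholders = ", ".join(["%s"] * len(batch))
        query = f"SELECT some_columns FROM some_table WHERE id IN ({placeholders})"
        yield query, batch


if __name__ == "__main__":
    for query, params in in_list_batches(list(range(1, 8)), batch_size=3):
        print(query, params)
```

The cost is more round trips to the second server, so whether this beats simply raising range_optimizer_max_mem_size depends on the application.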

The problem with this particular change in behaviour is that point selects are very fast and efficient in MySQL. People use them a lot. Table scans are of course really slow, so depending on the query in question performance can change from ms to minutes just because your query is a tiny bit bigger than the new threshold. In practice it looks like the old hard-coded limit and the new dynamic limit are at least an order of magnitude different in size so it is quite easy to trip up on good queries in 5.6 failing miserably in 5.7 without a configuration change. Again while migrating from MySQL 5.6 to 5.7 you may see this change bite you.

You may get caught by either of these issues. I got caught by both of them while testing 5.7, and while they are quite simple to fix, they do require a configuration change. I hope this post at least makes you recognise them and know where to poke so you can make your new 5.7 servers behave properly again.

Madrid MySQL Users Group will be holding their next meeting on Tuesday, 10th November at 19:30h at the offices of Tuenti in Madrid. David Fernández will be offering a presentation “MySQL Automation @ FB”. If you’re in Madrid and are interested please come along. We have not been able to give much advance notice so if you know of others who may be interested please forward on this information. Full details of the MeetUp can be found here at the Madrid MySQL Users Group page.

Madrid MySQL Users Group will be holding their next meeting on 17th June at 18:00h at EIE Spain in Madrid. Dimitri Vanoverbeke and Stéphane Combaudon from Percona will be offering two presentations for us:

Practical MySQL optimisations

Galera Cluster – introduction and where it fits in the MySQL eco-system

I think this is an excellent moment to learn new things and meet new people. If you’re in Madrid and are interested please come along. More information can be found here at the Madrid MySQL Users Group page.

I will be presenting (in Spanish) a quick summary of Percona Toolkit and also offering a summary of the new features in MySQL 5.7 as the release candidate has been announced and we don’t expect new functionality.

This is also an opportunity to discuss other MySQL related topics in a less formal manner.

In November last year I announced a program I wrote called pstop. I hope that some of you have tried it and found it useful. Certainly I know that colleagues and friends use it and it has proved helpful when trying to look inside MySQL to see what it is doing.

A recent suggestion provoked me to provide a slightly different interface to pstop, that is rather than show the output in a terminal-like top format, provide a line-based summary in a similar way to vmstat(8), pt-diskstats(1p) and other similar command line tools. I have now incorporated some changes which allow this to be done. So if you want to see every few seconds which tables are generating most load, or which files have most I/O then this tool may be useful. Example output is shown below:

Hopefully this gives you an idea. The --help option gives you more details. I have not yet paid much attention to the output and the output is not currently well suited for a tool to parse, so I think it’s likely I will need to provide a more machine readable --raw format option at a later stage. That said feedback on what you want to see or patches are most welcome.

Madrid MySQL Users Group will have its next meeting on Thursday, the 29th of January.

I will be giving a presentation on the MySQL binlog server and how it can be used to help scale reads and be used for other purposes. If you have (more than) a few slaves this talk might be interesting for you. The meeting will be in Spanish. I hope to see you there.

Madrid MySQL Users Group will have its next meeting on Tuesday, the 18th of December. Details can be found on the group’s Meetup page here: http://www.meetup.com/Madrid-MySQL-users-group/events/219081693/. This will be meeting number 10 of MMUG and the last meeting of the year. We plan to talk about MySQL, MariaDB and related things. An excuse to talk about our favourite subject. Come along and meet us. The meeting will be in Spanish. I hope to see you there.

I have been working with MySQL for some time and it has changed significantly from what I was using in 5.0 to what we have now in 5.6. One of the biggest handicaps we’ve had in the past is not being able to see what MySQL is doing or why.

MySQL 5.5 introduced us to performance_schema. It was a good start but quite crude. MySQL 5.6 gave us a significant increase in stuff that allows you to see what is going on inside MySQL. That’s great, except that it’s hard to read: the documentation is good, but it seems to be oriented more at the MySQL developer than at the DBA. So most of us have ignored it. Others complained about the overhead and said it’s not good to use it.

Mark Leith developed mysql-sys as a way to see this great information in a more usable way. It’s only a set of views so doesn’t really have much overhead. However, one thing I missed was getting the information of what was happening inside performance_schema in real-time, top-like, so I could see where a server was busy, and what it was doing. So inspired by mysql-sys and also as a way for me to start playing with go I have built P_S top, or pstop.

What does pstop show you? It takes some counters from performance_schema and subtracts the values from when it started up. The output is in four different screens which you toggle between using the <tab> key. The idea is to look at the total latency (wait time) and order the tables or files that cause it, heaviest first. Table waits are also then split between read, insert, update and delete and there’s a screen which shows some locking information.
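The "subtract the starting values and sort" idea can be sketched in a few lines. This is not pstop's actual code (pstop is written in Go); the function name `latency_since_start` and the dictionary layout, mapping a table or file name to its cumulative wait time, are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def latency_since_start(baseline: Dict[str, int],
                        current: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (name, wait time accumulated since the baseline), heaviest first."""
    # Names that appear only after startup are measured from zero.
    deltas = {name: value - baseline.get(name, 0) for name, value in current.items()}
    busy = [(name, delta) for name, delta in deltas.items() if delta > 0]
    return sorted(busy, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    at_start = {"db.orders": 100, "db.users": 50}
    now = {"db.orders": 400, "db.users": 60, "db.new": 500}
    for name, waited in latency_since_start(at_start, now):
        print(f"{waited:>8} {name}")
```

Working from differences rather than the raw counters matters because performance_schema counters accumulate from server start, so the raw totals would be dominated by old activity rather than what the server is doing now.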

Access to the db server is currently via a ~/.my.cnf defaults file. I probably need to make this more sophisticated, and allow the credentials to be provided directly but have not done that yet. I have used this on a couple of systems which I monitor for work and it has been most informative in showing where the load is, which table or file generates it and how that varies over time. This information was already in performance_schema but there have not been any tools to get this out.

With the upcoming release of MySQL 5.7 I begin to see a problem which I think needs attention at least for 5.8 or whatever comes next.

The GA release cycle is too long, being about 2 years and that means 3 years between upgrades in a production environment

More people use MySQL and the data it holds becomes more important. So playing with development versions while possible becomes harder. This is bad for Oracle as they do not get the feedback they need to adjust the development of new features and have to best guess the right choices.

Production DBAs do want new features and crave them if it makes our life easier, if performance improves, but we also have to live in an environment which is sufficiently stable. This is a hard mixture of requirements to work with.

In larger environments the transition from one major version to another, even when automated can take time. If any gotcha comes along then it may interrupt that process and leave us with a mixed environment of old and new, or simply in the state of not being able to upgrade at all. Usually that pause may not be long but even new minor versions of MySQL are not released that frequently so from getting an issue fixed to seeing it released and then upgrading all servers to this new version is again another round of upgrades.

I would like to see Oracle provide new features and make MySQL better. They are doing that and it is clear that since I have been using 5.0 professionally up to the current 5.7 a huge amount has changed. The product is much more stable and performs much better, but my workload has also increased so I am still looking for more features and an easier life. I am an optimist that is for sure.

One issue that I believe holds back earlier experimentation is that MySQL is not modular. Even the engines that you can use in it, if built as plugins, do not seem to be switchable from one minor version to another.

This leads to 2 issues:

any breakage or bug (and all software has bugs, that is inevitable) requires you when it is fixed to upgrade to a new version. That new version has changes in many different components. Sometimes that is fine but sometimes that may bring in new bugs which cause their own problems

potentially the developers of MySQL could replace a “GA module” with a more experimental version of that module which maybe has more features, could perform better but maybe breaks. Changing a single module is hopefully much safer than changing a full binary for a development version, and that should be much easier to do on spare machines. A module such as this would be something I could much more easily test than installing 5.7.4 on lots of machines.

However, the problem is that MySQL is not modular and that is where several people have explained to me my madness and how hard it is to achieve things like this. My current employer likes to push out changes in small chunks, look at the result of those small changes and then if they seem good, go ahead and do more. If something goes wrong, back it out and look elsewhere to do things. Doing the same on a database server not designed that way may well be hard, but making small changes along these lines would I think longer term help improve things and give the people that use a GA MySQL the opportunity to try out new ideas, give feedback quickly and allow things to evolve.

Inevitably when you start to build interfaces like this some interfaces need to change to allow for a larger redesign of the innards of a system. That is fine: when it happens we’ll move over to that, a DEV version will have these new, much improved features, and we may have to wait longer for that.

What modules might I be talking about when I talk about modularising MySQL? I’ll agree I do not know the code other than having glanced at it on several occasions but there are some quite clear functional parts to MySQL:

the engines have often been plugins, though now InnoDB is a bit of an exception. I still wonder if that is necessary whatever MySQL’s design. However these plugins do not seem to have a completely clear interface with MySQL as I have seen plugins for example for something like Spider or TokuDB which work for a specific MySQL or MariaDB version. That just shows that whatever this interface is it is not designed to be stable and swappable between different MySQL minor versions. Doing something to make that better would mean that people who build a new engine can build it once for a major version and know that on binaries built the same way the files they produce should just plug in without issue unchanged. Me dreaming? Perhaps but no-one worries if I upgrade my db4 rpm from 4.7.25 to 4.7.29 that all the applications that use it will break: the expectation is clear: it should not make any difference at all. Why does something like this not work with MySQL engine code?

logging has been rather inconsistent for a long time. I think it may improve in 5.7, but however it’s built, build it as a module. If I want to replace that module with something new that stores all my log data in a Sybase or DB2 database MySQL should not care, assuming the module does the right thing and there are settings to configure this appropriately. The point being also that if there is a bug in the logging, the bug can be fixed and the module replaced with a bug-free version, without necessarily requiring me to upgrade the whole server.

Replication is generally split into 2 parts: the writing to binlogs and the reading of those binlogs from a master, storing them locally and reloading the relay logs and processing them.

I have seen bugs in replication, mainly in the more complex SQL thread component where the same change could potentially apply. Swap out the module for a fixed one.

MySQL 5.6 was supposed to make life great with replication and we would not get stuck in a situation where a crashed server would come up, out of sync with its master, and because of that we would need to reclone the server again. Even when moving over to using the master_info_repository and relay_log_info_repository settings to TABLE you can have issues. The quick fix implemented by Oracle of relay_log_recovery = 1 sounds great. It is a quick, cheap and cheerful solution which works assuming you never have delayed slaves. Different environments I maintain do not follow this pattern and I have servers with a deliberate multi-hour delay, which can be useful for recovering from issues. Also copying large databases between datacentres may take several days, triggering after starting the system a need to pull logs and process them for several days. A mistaken restart would lose all that data and require it to be downloaded again which is costly. So I have discussed with colleagues a theoretical improved behaviour of the I/O thread should MySQL crash but there is no way to test it on boxes I currently use. Making the I/O thread into a module would make it much easier to try out different ideas on GA boxes to show whether these ideas are really workable or not.

The query parser and optimiser in MySQL is supposed to be a horrendous beast that everyone must keep clear of. Improvements are happening and posts like this are an indication of progress. My understanding is that this beast is spread all over the server code and thus hard to untangle but certainly from a theoretical point of view doing so would allow alternative optimisers to be usable/pluggable, and for example different optimisers might be better at handling workloads such as batch based workloads with sub queries and such which MySQL is known not to handle well, but which for certain workloads could potentially make a great deal of difference to us all. The MySQL of 5.0 is quite different from the MySQL of today and sharding is the norm, but that requires help from the app to do all the dirty work. Other options are to use something like Vitess, ScaleBase, or Spider, or some built-in new module which knows about this type of thing better and can do this sort of stuff transparently to the application. MySQL Fabric tries to do this at the application level and that’s fine, but it adds much more complexity for the application developers who probably should not really have to worry (too much) about this type of detail. So solving the problem is not the issue here, it’s providing hooks to let others try, or simply to swap out version 1 with version 10, and see if version 10 is better and faster, with everything else unchanged.

The handling of memory in MySQL has always been interesting to us all. Each engine has traditionally managed the memory it needs itself and there is no concept of sharing, or memory pressure, all of which can lead to sudden memory explosions due to a changing workload which may kill mysqld (Linux OOM) or trigger swapping (database servers should never swap…). I have seen in 5.7 that there is now some memory instrumentation and this at least allows looking to see where memory is used. The next step would be to use the same memory management routines, and finally perhaps to add this concept of memory pressure allowing a large query if needed to page out or reduce the size of the innodb buffer pool while it is running, or the heavy use of some MyISAM or Aria tables could do the same. Doing that is hard, but we are no longer using a MySQL “toy” database. Many large billion $ companies depend on MySQL so this sort of functionality would be most welcome there I am sure. Changes in this area would certainly need to be done cautiously but I can envisage swapping out the default 5.8 memory manager for a “new feature” 5.9 version with all the “if it breaks you keep the bits” warnings attached, allowing us to see if indeed problematic memory behaviour is resolved by this new module.

The event scheduler is in theory a small component which does its thing. An early version of 5.5 had some bugs and I had to wait a long time to upgrade the server just to fix this pesky event_scheduler module, when all it does for me is send out heartbeat changes used for measuring replication delay. Had this been a module I could have installed a fixed version and not had to use a workaround for several months.

I am sure there are lots of other components of MySQL which could receive the same treatment.

Making this sort of change is of course a huge project and most managers do not see the gain of this, certainly not short term. However, if care is taken and as different subsystems are modified there is an opportunity for making progress and allowing the sort of experimentation I describe. Also, and while Oracle may not see it this way, having a clearer interface and more modular framework would allow others to perhaps try different things, and replace a module with their own. Oracle do seem to be putting a lot of resources into MySQL and that is good, but they do not have infinite resources and they can not solve every need, specialised or otherwise, that we might see. Making it easier, for those who can, to use this hypothetical modular framework, provides an opportunity for some things to be done which can not be done now. Add a bounty feature and let people pay for that and where something is modularised it will be much easier for them to try to solve problems that may come up. In any case, later testing will be easier if these interfaces exist.

This is the way I would like to see MySQL improve. Notice that I do not actually talk about functional improvements, but about how to make it potentially easier to experiment with and test these new features. This sort of design change would allow those of us that need new features now to test and perhaps include them in our GA versions. Maybe then the definition of GA will become rather vague if I am using 5.7.10 + innodb 5.8.1 + io_thread_5.8.3 + sql_thread_5.8.6 + event_scheduler….. Support will probably hate the suggestion I have just made as it would potentially make their life more challenging, but then again I do not see most people playing this game. It is meant for those of us who need it, and even for those who do not, fixing specific bugs should be much easier than now, where you need to do a full new test on a new version to make sure you do not catch another set of new bugs.

If you have got to the end of this thanks for reading. I need to learn to write less but I do believe that the reasoning I make above makes a lot of sense. This can only be done with small changes and with people seeing the idea and trying it out, and at least initially doing it on parts of the system which are easy to do. If they work further progress can be made.

Oracle and MariaDB both want feedback and ideas of where we want MySQL / MariaDB to go. Independently of some of the technical aspects of new features and improvements this is my 2 cents of one thing I would like to see and why.