Tag: rant

There was one MySQL feature that I was always very afraid to ask for. I sure needed it for the past 15 years of my MySQL practice, and yet I never got to it. I wanted the MySQL processlist (‘SHOW PROCESSLIST’, later – information_schema.processlist) to have more accurate query execution time.

I submitted a patch for that to MongoDB (and it got merged and released really quickly). I couldn’t admit to myself and others that MySQL does not have this functionality, even though it is hard to reason about systems in production without such data.

When 99.999% of queries complete within 1s, one has to resort to statistical analysis of zeroes and ones to determine how long they may be running (that is, if nine queries are at 0s and one is at 1s, there’s a chance that all of them are running for 0.1s). Unfortunately, this kind of statistical analysis is not feasible at runtime when dealing with issues at hand.
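The arithmetic behind that guess can be sketched roughly like this. It assumes the displayed time is a difference of whole-second clock readings, so a query that has been running for e < 1 seconds shows 1 only if a second boundary fell inside its lifetime, which happens with probability e. The function names are made up for illustration.

```python
import math

def displayed_time(start, now):
    """Processlist-style time: difference of whole-second clock readings."""
    return math.floor(now) - math.floor(start)

def estimate_mean_elapsed(displayed):
    """Average of the displayed values. If start times are spread evenly
    within the second, this estimates the mean true elapsed time."""
    return sum(displayed) / len(displayed)

# Nine queries at 0s and one at 1s, as in the example above.
snapshot = [0] * 9 + [1]
print(estimate_mean_elapsed(snapshot))  # 0.1
```

A single snapshot with ten rows is a very noisy estimate, which is the whole problem: it takes many samples before the zeroes and ones say anything useful.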

One of the reasons I did not submit this feature request is that I did not want to be subjected to the embarrassment of not understanding MySQL Release Cycles, Stability and Performance Architecture.

Someone else (Hi, Simon!) decided to ask for this in late 2014.
By spring 2015 the MySQL engineering team responded with “thank you for your feature request”.
A few months later the engineering team wrote that they won’t be improving tools like processlist and instead will change the behavior of performance_schema tables (which were not useful at that time anyway).

So, even though MySQL already shows time based on the arithmetic of subtracting query start time from current time, having a tiny improvement on top of that was not deemed the right way, because, I guess, it doesn’t fit the Performance Vision.

So, I’m confused. I’d like to see “SHOW PROCESSLIST” expanded to have this. Over time you learn the quirks and differences between it and I_S.PROCESSLIST, which was added later in 5.1 (for example, one of them will truncate queries at zero-bytes, the other will truncate queries at invalid unicode, even if data in queries is all binary). The whole truncation hierarchy of “SHOW PROCESSLIST” -> “I_S.PROCESSLIST” -> “SHOW FULL PROCESSLIST” deserves a separate discussion (rant?).

MySQL already has the performance_schema.threads table, which has the same second-level precision on the PROCESSLIST_TIME column. It has a few additional columns over the standard processlist, although it has a very interesting behavior – it doesn’t show prepared statement texts in PROCESSLIST_INFO, so that information is truncated to 0 bytes (regular queries are truncated at 1024 bytes). So we have a third place to look for information; it was added in a newer release and is Much Better (I don’t know in what way though) than the existing INFORMATION_SCHEMA and SHOW options.

Apparently the real place to look at this (with bugs fixed in latest release) is performance_schema.events_statements_current.

Timing data is in picoseconds, so everything has to be divided by a trillion to get a meaningful number. I can mentally handle microseconds or milliseconds, but dealing with 17-digit numbers is not for my simpleton mind. This is not supposed to be used directly, and one has to use specially built tools to access this data or write their own layers of views.
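For the record, the conversion itself is trivial; the annoyance is having to do it everywhere. A minimal sketch (the helper names and the sample value are made up):

```python
PICOSECONDS_PER_SECOND = 10**12

def ps_to_seconds(picoseconds):
    """Convert a Performance Schema timer value to seconds."""
    return picoseconds / PICOSECONDS_PER_SECOND

def ps_to_milliseconds(picoseconds):
    """Same value in milliseconds, for sub-second statements."""
    return picoseconds / 10**9

# A made-up 17-digit timer value of the kind one ends up staring at.
timer_value = 12_345_678_901_234_567
print(f"{ps_to_seconds(timer_value):.3f} s")  # 12345.679 s
```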

It won’t have user or schema information, so you’re forced to join to another table (threads or I_S). The sad part of this is that there’s no indexing/direct access methods provided and MySQL will use same methods as for any other non-indexed joins… The query would look something like:
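Something along these lines, as a rough sketch rather than a tested query: a join between performance_schema.events_statements_current and performance_schema.threads on THREAD_ID, with TIMER_WAIT scaled down from picoseconds. Whether TIMER_WAIT is populated for statements that are still running depends on the server version, so treat that column as an assumption to verify. The Python wrapper and the `format_row` helper are hypothetical and only there to show how the result could be rendered.

```python
# Hypothetical sketch: join current statements to their threads to get
# user/schema next to sub-second timing. Column choice is an assumption.
PROCESSLIST_QUERY = """
SELECT t.PROCESSLIST_ID,
       t.PROCESSLIST_USER,
       t.PROCESSLIST_DB,
       s.TIMER_WAIT / 1000000000000 AS seconds,
       s.SQL_TEXT
FROM performance_schema.events_statements_current AS s
JOIN performance_schema.threads AS t USING (THREAD_ID)
WHERE t.PROCESSLIST_ID IS NOT NULL
ORDER BY s.TIMER_WAIT DESC
"""

def format_row(row):
    """Render one result row (a dict keyed by column name) as a
    tab-separated, processlist-like line."""
    seconds = row["seconds"]
    # TIMER_WAIT may be NULL for unfinished statements on some versions.
    elapsed = "?" if seconds is None else f"{float(seconds):.6f}"
    return (f'{row["PROCESSLIST_ID"]}\t{row["PROCESSLIST_USER"]}\t'
            f'{row["PROCESSLIST_DB"]}\t{elapsed}\t{row["SQL_TEXT"]}')

example = {"PROCESSLIST_ID": 42, "PROCESSLIST_USER": "app",
           "PROCESSLIST_DB": "test", "seconds": 0.123456,
           "SQL_TEXT": "SELECT 1"}
print(format_row(example))
```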

Now that I got to the actual source of data, I was wondering how useful it is in a production environment. The answer is “not much”. Apparently, if you have a few hundred queries running, MySQL will be writing to gigabytes of memory courtesy of the Performance Schema SQL digests feature, even with said feature disabled.
I filed a bug here, but was still confused.

It looks like the way to answer any idea on how to improve the MySQL DBA experience is by throwing in more expensive, buggy, complicated features that are painful or impossible to use, and waving a flag of “nobody complained but you”.

So I guess I can expose myself to more embarrassment and file the same bug again. I really can’t believe that the current implementation is supposed to be helpful and useful to DBAs. I guess someone does. Unfortunately, the only time they try their features is when they have to write a blog post on how to use them.

P.S. We use either instrumentation on client side or our slocket – slow-log-datagram-socket – interface to do ad-hoc aggregations with high precision timings. I have no idea why we didn’t add direct high precision processlist ourselves.

Warning: this is a mixture of historical content, biases, stupid marketing and unknown/proprietary/closed source technologies. Proceed with caution.

NuoDB marketing was sending out this message, encouraging me to blog (they were looking for bloggers too):

And while Facebook sharded MySQL 4000 times, even they call it a “fate worse than death.”

We’ve seen this phrase before and it did not come from us. For whatever reason NewSQL echo chamber is repeating this with less and less truth in it. In various whitepapers (all behind registration walls) they mention some analyst estimates and try to put a parallel between operating costs of large companies and something a new developer would do, as if everyone is living under same constraints.

I don’t know if NuoDB is a good technology for the customers they’re targeting. All their diagrams essentially say “we have blocks of magic and we multiply them”, and if you approach them at a conference, their “tech guy is away at the moment”. Still, the key message is that what they do is the Holy Grail of databases and we should believe in that.

It is still a question whether NuoDB does solve problems of massive scale web deployments. They seem to diss existing operational environment costs with “thousands of servers and storage are required” and I’m not sure what the cost of their alternative is.

We’ve revealed some aggregate numbers of our MySQL based data platform before – there’re tens of millions of queries (billions at cache level), millions of IOPS, and it is somewhat difficult to squeeze that into less than “thousands of servers”.

There’re more than billion users to serve and sheer amount of data in the social graph is also not something you can put on a few thumb drives. If only any of these software vendors could tell how much their platform would cost in such a case.

I am not an expert at optimistic concurrency control that seems to be in there – I have yet to see a large scale system using it. Optimistic concurrency control (the use of “control” here sounds like an oxymoron) means that if users talking to different servers do same operation, one of them is going to get an error on commit (instead of waiting for the lock release and doing his job on top). This also cannot hide any latencies, if consistency is required. Unfortunately, retries in higher latency environments are even more expensive and writing software for such model becomes more complicated than writing software for sharded datasets.
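To make the commit-time failure concrete, here is a toy sketch of the general idea (version check at commit, retry on conflict). It is not a description of how NuoDB or any other product implements it; the class and function names are invented for illustration.

```python
class ConflictError(Exception):
    """Raised at commit when another writer got there first."""

class VersionedStore:
    """Toy single-process store: each key carries a version counter."""
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def commit(self, key, expected_version, value):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            # No waiting on a lock: the late writer simply gets an error.
            raise ConflictError(key)
        self._data[key] = (current_version + 1, value)

def increment_with_retry(store, key, max_attempts=3):
    """Re-read and retry on conflict; returns the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        version, value = store.read(key)
        try:
            store.commit(key, version, (value or 0) + 1)
            return attempt
        except ConflictError:
            continue
    raise ConflictError(key)

store = VersionedStore()
version, value = store.read("likes")   # user A reads version 0
store.commit("likes", version, 1)      # user B commits first
try:
    store.commit("likes", version, 1)  # user A commits against a stale version
except ConflictError:
    print("conflict, retry needed")
```

Every retry is another round of reads and a commit, which is exactly the part that gets expensive when each round trip crosses a high-latency link.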

Software that does not have to be aware of sharding and underlying partitioning is easier to implement. Unfortunately, writing efficient software on top of distributed datasets is not that easy. Fan-out in distributed systems is expensive (especially at thousands of machines level) and is not an operation that should be done a lot in web-facing environments.
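One way to see why fan-out hurts, under made-up numbers: assume each machine independently has a small chance of answering slowly, and a fan-out request is only as fast as its slowest participant. The 0.1% figure and the independence assumption are illustrative, not measurements.

```python
def p_any_slow(p_slow_per_shard, shards):
    """Probability that at least one of `shards` independent machines
    is slow, which makes the whole fan-out request slow."""
    return 1 - (1 - p_slow_per_shard) ** shards

# Assumed: each machine is slow for 1 in 1000 requests.
for shards in (1, 10, 100, 1000):
    print(shards, round(p_any_slow(0.001, shards), 3))
```

With these assumptions a request touching a single machine is almost never slow, while one touching a thousand machines is slow more often than not.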

Usually developers will already have abstractions that allow them to do efficient data retrieval without thinking about shards, yet forcing to think twice when they would be doing something expensive. The cost is always there, visible or invisible, and someone has to deal with it.

At least we know who is behind this. Meet Jim Starkey, the database luminary behind it and a few other database management systems. In the MySQL world he had his fame during the rocky period of Oracle’s InnoDB acquisition. MySQL AB had to do something and acquired Netfrastructure – a Java application-server/database hybrid. Some of that technology was rewritten into C++ and used as a storage engine for MySQL. This whole new development had two names:

MySQL 6 and Falcon

Jim Starkey captivated crowds by dissing status quo (I saw famous people leaving the room) yet not being all correct himself. His storage engine was supposed to be architected for the web. Well, it wasn’t. At that time I was doing some work on Wikipedia and had various ideas on what works and doesn’t work in web facing databases. Jim Starkey had different ideas.

One of them was that RAM was supposed to be cheap, so Falcon was supposed to be memory hungry for various operations. Sure, RAM got cheaper but data volumes got larger. The trend I observed later was that the amount of RAM per byte stored was decreasing rather than increasing.

Another Falcon bet was that CPUs are going to be much faster, so instead of storing/reading data in any ordered fashion one was supposed to read unsorted data then sort it in memory (as RAM is cheap too). Again, a major web pattern (open-ended range reads – ORDER BY … LIMIT) got missed. Lots of web-facing range queries would become very expensive, so in order to be web scale one has to rewrite their applications to fit the new model.
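The difference between the two approaches can be sketched in a few lines. This is a toy model with in-memory lists standing in for an ordered index and an unordered heap of rows; it says nothing about Falcon’s actual internals, and the function names are invented.

```python
import bisect
import heapq

def top_n_from_ordered_index(index, start_key, limit):
    """index: sorted list of (key, row_id). Seek once, then read only
    `limit` adjacent entries, like WHERE key >= ? ORDER BY key LIMIT ?."""
    pos = bisect.bisect_left(index, (start_key,))
    return index[pos:pos + limit]

def top_n_by_sorting(rows, start_key, limit):
    """rows: unordered (key, row_id) pairs. Every row has to be examined
    before the first `limit` results are known."""
    matching = [row for row in rows if row[0] >= start_key]
    return heapq.nsmallest(limit, matching)

rows = [(k * 7 % 20, k) for k in range(20)]  # keys 0..19 in scrambled order
index = sorted(rows)

print(top_n_from_ordered_index(index, 5, 3))
print(top_n_by_sorting(rows, 5, 3))
```

Both return the same three rows, but the first touches three index entries after one seek, while the second touches all twenty rows. With an open-ended range over a large table that gap is the whole cost of the query.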

Random disk access was supposed to go away – and even if an index looks up sparse data, Starkey believed that doing disk reads in physical order was supposed to give better performance. That was not a correct assumption for concurrent workloads, and the design ended up missing a few other important optimizations, such as covering index reads. We did not see too much flash around at that time, and I’m not sure how the Falcon design would’ve worked on flash anyway.

I wrote some of these observations down and sent them to MySQL engineering. Jim Starkey did not reply, someone else close to him did with “let’s wait for benchmarks, then talk”. Unfortunately, decent benchmarks never arrived. I was not the only one who had questions.

There were various performance issues. For a while the word was that one should not implement low-level concurrency primitives and should use OS-provided methods (e.g. pthreads) everywhere instead. Apparently when Falcon tried implementing internal spinlocks, they did not work that well. Mark Callaghan pointed out that the spinlock implementation inside Falcon was not actually spinning (the compiler optimized that loop away) and was just using OS mutexes.

There were some interesting properties of the engine that could have been valuable – it had row cache at the core, kept transactional overhead in memory (you were not supposed to run long running transactions), etc.

Eventually Falcon leadership changed and the remaining team of engineers tried to salvage the product for quite a while (success of the project was measured in how many minutes it can stay up without crashing), but it all became moot once InnoDB and MySQL teams were reunited under Oracle.

And no, I will not make another “fate worse than death” joke, those expired long ago. Though I don’t think that the Falcon record has expired by now, so I will take NuoDB claims with a grain of salt. I don’t know exactly what problems they are solving, so it is difficult to come up with good analysis and comparisons. I won’t and neither should they.

For the past few days (days!!!) I’ve not been able to log into my account at Google (except GMail, thanks for that! – Voice/Blogger/Plus/Docs/… are all broken though). I’m either getting this message:

Sorry, there seems to be a problem.

Or I’m sent into an infinite redirect loop. None of their self-help pages are useful or have this error mentioned.

Apparently if you get such a message, you should read it the opposite way:

“Sorry (or you should be sorry), there seems to be a problem. This service is looking for you and you seem to be temporarily unavailable. We’re not working on this at all, it may show up on some report at the end of the year though as lost revenue, sad, we see you spent few hundred dollars on our services, so you trust us, but of course we have enough money from everyone else. You may try again in few minutes, few hours or few days, it won’t help you though. Though we appreciate your patience, you should start researching and debugging this yourself instead, thank you.”

I cleared all Google cookies on my browser and it started working. Apparently this has happened to others too.

Oh well, I know I shouldn’t poke directly at people, but they deserve that sometimes (at least in my very personal opinion). Heck, I even gave 12h window for this not to be hot-headed opinion.

Those who followed MySQL at Facebook development probably know how much we focus on actual performance on top of mixed-composition I/O devices (flashcache, etc) – not just retreating to the comfortable zone of in-memory (or in-pure-flash) data.

I feel somewhat sad that I have to put this truism out here – disks are way more cost efficient, and if used properly can be used to facilitate way more long-term products, not just real time data. Think Wikipedia without history, think comments that disappear on old posts, together with old posts, think all 404s you hit on various articles you remember from the past and want to read.

Building the web that lasts is completely different task from what academia people imagine building the web is.

I already had this issue with another RDBMS pioneer (there’s something in common among top database luminaries) – he also suggested that disks are things of the past and now everything has to be in memory, because memory is cheap. And data can be whatever unordered clutter, because CPUs can sort it, because CPUs are cheap.

They probably missed Al Gore’s message. Throwing in more and more hardware without fine tuning for actual operational efficiency requirements is wasteful and harms our planet. Yes, we do lots of in-memory efficiency work, so that we reduce our I/O, but at the same time we balance the workload so that the I/O subsystem delivers the long tail as efficiently as possible.

What happens in the real world if one gets a 2x efficiency gain? Twice as much data can be stored, twice as many data-intensive products can be launched.
What happens in academia of in-memory databases, if one gets 2x efficiency gain? A paper.
What happens when real world doesn’t read your papers anymore? You troll everyone via GigaOM.

Though sure, there’s some operational overhead in handling sharding and availability of MySQL deployments, at large scale it becomes somewhat constant cost, whereas operational efficiency gains are linear.

Update: Quite a few people pointed out that I was dissing a person who has done incredible amount of contributions, or that I’m anti-academia. I’m not, and I extremely value any work that people do wherever they are, albeit I do apply critical thinking to whatever they speak.

In my text above (I don’t want to edit and hide what I said) I don’t mean that “a paper” is useless. My colleagues and I do read papers and try to understand the direction of computer science and how it applies to our work (there are indeed various problems yet to solve). I’d love to come up with something worth a paper (and quite a few of my colleagues did).

Still, even if someone does not find that direction useful, that is no reason to portray them the way the original GigaOM article did.

I feel ashamed that I ever wanted you to support 4.0->5.1 replication, and apologize for that. I really understand that it was really egoistic of me even to consider you should be involved in this.

I even understand that 5.0 is running out of active support (I’m not questioning that you’ll stop supporting 4.1 entirely too), and you’ll stop doing pretty much anything to 5.0, except “critical security fixes” (w00t, I managed to get one into 4.1, 8 year old MITM flaw :).

I really understand that supporting more than one release is very very very difficult, and people should only do adjacent-version upgrades.

I’m not asking you much, but, maybe you could then support 5.1 to 5.1 replication? I don’t want much, just:

Gracefully recover after slave crashes.

Don’t have single serial reading of pages for replication stream as a bottleneck – either read-ahead properly (you can do that with RBR!!!), or apply events in parallel (you can do that with RBR too!)

Allow to edit replication filters without restarting servers.

Allow to enable and disable binary logging of events received from master, as well as enabling and disabling binary logging without restarting the instance.

I hope it isn’t too much to ask! It is just supported replication between two same-version instances.