Large Team, Tricky Code

During the MySQL 3.23 and MySQL 4.0 days, a small, tight-knit team worked on the server, and Monty personally reviewed all the code and knew it well enough to know all the side effects. Now MySQL development is done by different teams working on different components, even though MySQL was never designed to be very modular, with clearly defined interfaces and no side effects, so it does not support this development model easily.

Increasing Complexity

Each new MySQL version is getting more and more complicated. The MySQL architecture has to be additionally complicated to support modular storage engines with all their interactions and specifics. Think, for example, how many DBMSs need to do distributed transactions (XA) internally just to keep different components in sync. This is multiplied by the fact that a lot of MySQL features were implemented in a marketing rush, as hacks on top of an architecture which was not intended to support them. Remember that early MySQL versions knew only about table locks and were not initially designed to support transactions. Most other databases are designed with these concepts at their core.

Increased User Base

I’m not sure if the MySQL user base is growing any more, but it surely has grown dramatically since the MySQL 3.23 or 4.0 times. More users naturally find more bugs. This actually has a good side too: even though a lot of bugs are being found and reported, only a small fraction of them apply to any given user. The bad thing is that in many cases we do not really know which ones apply to us 🙂

Low Feature Usage Density

MySQL 3.23 was simple, so most of its features were used by the majority of users. This was even the early marketing strategy for MySQL. Marten used to frequently say that Oracle is overkill for most users and that most users use only 10% of its features or less. You know what? MySQL is becoming overkill for an increasingly large number of users too. A lot of people in the Web world would be just as happy with MySQL 4.1 if it were actively maintained now. And hey, this is exactly one of the reasons behind the Drizzle project.

Lower usage density means features are less tested by the community during release, and they also get less attention from developers, because the development team (and especially its efficiency) is not growing at the same pace. It also means fewer customers use the same features, so the revenue associated with a feature is not high enough to justify fixing all the bugs reported for it. So bugs sometimes stay open for years.

As an example of low usage density features I would mention the MySQL 5.1 features. During my talks I often asked people how many of them use 5.1 versus how many use row-level replication or events. I think fewer than 5% are using these two features. MySQL 4.1 has been out for years, and yet I would say no more than 5% of users use prepared statements.

Bug Reporting and Tracking

The perceived quality of MySQL 3.23 was to some extent due to the fact that there was no bugs.mysql.com. Bugs were reported to bugs@lists.mysql.com, which meant both that it was more complicated for users to report a bug, so fewer people did, and that it was easy to lose bugs, especially boring and less critical ones. Monty did a very good job keeping track of these, but still not as good as after the bug tracking system was implemented, when a bug would never disappear unless you fixed it or called it a feature 🙂

MySQL Enterprise Madness

For years MySQL QA policy relied heavily on the community, and I think this is still the case. This meant the latest MySQL code was instantly downloaded and tested by thousands of users, who would find bugs quickly. With MySQL Enterprise it is the customers who get the bleeding-edge code, while the community was often offered a binary which is over half a year old.

Enterprise customers, who get the latest and the most bug-free/buggy code (fewer old bugs, more new ones), are a fraction of a percent of the MySQL user base, and they are also more conservative: few would just grab the latest release as soon as it comes out. Many will only upgrade on a schedule or stay on the same release until they run into some issue. This means bugs in the Enterprise version have a tendency to stay for a long time.

The good thing about this item is that it is easily fixable. Just get somebody in their right mind to define the release policy, preferably someone who understands engineering, not just marketing.

Innodb/Oracle Relationship Complexities

If a bug happens to affect Innodb, things become even more complicated and slow. Even though MySQL and Innodb form a tight-knit system, these are two different code bases controlled by different companies. MySQL can’t just make changes to the Innodb codebase, and Innodb can’t do the same for MySQL. Both require the other side’s approval. This is why, I guess, all new Innodb development is going into the Innodb Plugin, which is separate from the Innodb included in MySQL by default. Why is this not part of stock MySQL 5.1? I guess because the MySQL 5.1 release was closed to any significant changes for years.

Now how does all this affect bugs and not features? Well, if you want to use the latest Innodb you need to use the plugin, which is not part of the standard MySQL distribution, so it, just like MySQL Enterprise, gets significantly less testing by the community than standard MySQL.

Another problem for Innodb Plugin quality is the lack of full disclosure about bugs. Check this change log for example. You will see that some of the bugs fixed have bug numbers and a large number do not. This means there are some bugs which are discovered by the Innodb/Oracle team but are not disclosed via a searchable public bug database.

I should note that, despite all these process issues, Innodb remains a remarkably high-quality component. The initial version of the Innodb Plugin was very good in terms of quality for a first public release when we tested it back in April.

The Good News in the End

The good news is that you really do not need general “quality” or “being free from bugs” to run MySQL in production. MySQL as it is now works for a lot of applications, in particular because most of them use only a small portion of MySQL features.

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master’s Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

I think it should be commonly accepted that the more complex a product gets, the more buggy it will be.
I actually do not have a problem with that. Obviously, the growth in new features is much larger than, and not proportional to, the number of developers, testers, or other MySQL employees.

I do, however, expect a product to keep the “old” or “conservative” code, as stable as before. While MySQL 5.1 offers new features, I’m not at all sure I will now start partitioning everything, or use log tables. More important to me is that everything keeps working as before, and, frankly, I also expect to see performance improvements.

Row based replication, in this respect, can be considered a “problematic” feature, as it affects so many aspects of the code, including replication, and may lead to problems in already stable systems.

I personally have seen two 5.1RC installations in production, both repeatedly crashing with a nasty signal 11. In both cases, downgrading to 5.0 solved the problem.

I follow the policy mentioned above of waiting a few months before upgrading to a new major release, and this is something I consider natural for all products. It is OK to be conservative, and it is OK to let others, who can afford the risk, try out 5.1 in production. Having said that, it may also be a good idea to try out 5.1 in a testing environment, and report bugs as they are found.

Are you judging quality based on anecdotes and bug counts? Will MySQL adapt to this by being less open about bugs? Note that InnoDB doesn’t open their bug database to the community. I prefer a more personal metric for quality that might be interesting when aggregated over many customers. Does version A crash more or less than version B in production? 5.0 (with my bug fixes) is much more stable than 4.0. Of course, I have had more time to fix bugs for 5.0. We reduced the crash rate to ~0 per day over many machines and replication doesn’t cause problems like it used to.

The good old days weren’t perfect. In MySQL 4.0, a crash on the master was likely to make the binlog and InnoDB out of sync. The sync_binlog option was added in 4.1 to make this less likely, but it wasn’t until 5.0 that this was solved by the use of internal XA which unfortunately broke group commit (but I am willing to give that up for safety).

I never used 3.X in production when MyISAM was THE storage engine. I remember silly claims about how you didn’t need transactions if you had a UPS, careful programming and default column values.

Also, the golden days have set the stage for the problems of tomorrow. Too much code in the server is needlessly difficult, complex, poorly documented and not encapsulated. The code is much harder to change and support than it should be. Documentation and comments are missing for implementations and interfaces. There are few architectural comments in the code or external documents. There is too much code that does a bad job of reinventing the wheel. New programmers are guaranteed to make big mistakes. And a lot of the code that has been added in the past few years has used the same style. This is a common outcome when you have a closed development community (much more likely for proprietary products). InnoDB is a gem compared to this.

I did not mention MySQL 4.1 because it already had its issues, though feature-wise it was indeed the best release for Web applications. And a good development would have been to have subselects and prepared statements fixed for the next release rather than adding more half-baked features.

The major features in MySQL 4.1 were subselects, prepared statements and character sets. None of them were done by Monty, and these were changes which the MySQL architecture was not well designed for.

Subqueries had a number of bugs, in particular in non-trivial cases, and still have severe performance limitations as of MySQL 5.1 (so many years and releases later). Prepared statements had to be pretty much redesigned and rewritten from the way they were originally done to make them work, and they also had their fair share of bugs with repeated execution etc.

Finally, character sets required changes across all parts of the server and are also generally quite tricky in cases where you mix collations and use different character sets in the application. It took a good while before the 4.1 release to make the majority of functions work well with charsets and to avoid breaking apps which worked normally on 4.0 when moved to 4.1 because of charset-related glitches.

I really should clarify something. I should have spoken about “Perceived Quality” rather than just Quality. You can find a lot of bugs which were fixed in 4.0 or 3.23 well after they were released.

Really, I respect MySQL a lot for having a public searchable bug database with bugs found internally reported in it too. This honesty, however, can scare people: oh, there are hundreds of unfixed bugs in MySQL.

Thanks to Peter and Mark for the compliments about the quality of InnoDB. Heikki certainly did do a remarkable job in designing and developing InnoDB, and its high quality has been preserved as we have added new developers to the team and as we have added new features to the product (as Peter notes).

In the past year, we have significantly improved our quality assurance (QA) process, resulting in more thorough and systematic testing, including specifically a focus on recovery testing. Many of the problems we have identified … and fixed … in the InnoDB Plugin probably never would have been found and reported by users. Thankfully, most of the time, users don’t run recovery operations, but when they do, they must work. Many recovery-related bugs are not easily reproducible and it is difficult to capture the information required to do so. The kind of stress tests needed to identify timing problems (like race conditions) are not typical of early-adopter workloads. The performance testing we did before the initial release of the InnoDB Plugin led to the implementation of the adaptive LRU mechanism that adjusts the behavior of the system based on whether the workload is CPU or I/O bound.

While community and user testing of released software is helpful, it is never a replacement for good design and implementation and robust testing methodology. Often, early adopter users don’t report bugs (because they may think the problem has been reported by other users or has already been fixed in the next release). Although the InnoDB Plugin was released in “early adopter” status last April, only two InnoDB Plugin bugs were reported by users in the MySQL bug database. A small number of other bugs were reported by users on the InnoDB forums. Most of the bugs were found through extensive internal testing. And, on the other hand, some of the earliest tests of the InnoDB plugin were extremely positive both in terms of performance and reliability.

Just a note on our internally-detected and fixed bugs and the idea of publishing them. Most of our internally-detected bugs involve complex testing scripts and workloads, and logs containing debugging information. This sort of information is helpful for developers, but is not meaningful to users.

We continue to work closely with the Sun/MySQL engineers to maintain the reliability of InnoDB as distributed by MySQL as well as in the InnoDB Plugin. We hope that together we can ensure that available fixes (some 22 of them at the moment) are incorporated more quickly into the standard MySQL distribution.

“Are you judging quality based on anecdotes and bug counts? Will MySQL adapt to this by being less open about bugs? Note that InnoDB doesn’t open their bug database to the community. ”

Hey Mark.

I think this does bring up a good point. I think Peter brought it up before.

MySQL really needs to be commended for their open bug database and open source repository. Even if it’s read only. I think we often forget how open they really are even if we want to see improvement.

I’d really like to see Oracle/InnoDB do the same thing. At least with MySQL 5.1 I can look at the bugs to judge the quality of the release before I upgrade.

With InnoDB 5.1 I don’t really have this luxury.

Peter.

I agree that 4.1 wasn’t perfect but for my needs it was nearly perfect. I didn’t use subqueries or charsets (we use UTF-8 encoding via the JDBC driver).

Ken,

“While community and user testing of released software is helpful, it is never a replacement for good design and implementation and robust testing methodology.”

While true, the reverse is also true.

While a robust testing methodology is helpful, it is never a replacement for an open and vibrant open source community and a distributed array of inexpensive unit testers (customers) with access to an open bug database and source repository.

I think we’re a good example. Very little of the work done by MySQL and Oracle/InnoDB over the last few years has been applicable to my company. The work from Google, Percona, and OurDelta has been a breath of fresh air.

“Although the InnoDB Plugin was released in ‘early adopter’ status last April, only two InnoDB Plugin bugs were reported by users in the MySQL bug database.”

This ends up becoming a catch-22… You have to work towards developing an open environment for reporting bugs and interacting with the community.

If you don’t receive extensive bug reports at first don’t give up… The MySQL database in and of itself is good evidence that people are willing to report bugs.

“Just a note on our internally-detected and fixed bugs and the idea of publishing them. Most of our internally-detected bugs involve complex testing scripts and workloads, and logs containing debugging information. This sort of information is helpful for developers, but is not meaningful to users.”

… so then publishing this information can’t hurt anything. 🙂

We share all the internal statistics about Spinn3r with our customer base. Every statistic you can think of is exported, graphed, and recorded. Believe me it’s appreciated 🙂

Crashing is only one kind of bug. It is the most critical, but there are also a lot of other things, from wrong results to bad plans and annoyances.

I also do not think a single application, or even a single company, is enough. If your application uses a pretty static set of features, sooner or later you have them worked out or worked around and the version will run stably for you.

There are also some tricky bugs, like race conditions or memory corruption, which occur only under very special conditions that rarely arise during operation; in this case running thousands of boxes makes a difference.

It is also a good question how many MySQL 5.0 features you use. I have a feeling you’re mainly using the MySQL 4.0 set of features?

The fact that you split developers = employees and users = everybody else means you seem to be thinking about the standard closed-source model. There is indeed only a fraction of a percent of people who can make sense of the technical data in a bug database, but with millions of users that already gives thousands of people. I know there are such people at Google, Facebook, Percona, OpenQuery.

But more than that, a lot of bugs actually have some meaning (or potential meaning) which is clear to users, something like “Innodb could fail to recover with certain log file sizes”. The detail that it only happens for prime Innodb log file numbers because of an unexpected hash collision may not be needed by users.

Another thing: you’re speaking about various tests you’ve recently developed. I assume these are kept internal and not released as open source?

One more note about the quality I’m speaking about here. I think the quality of a release should be measured by its “weakest link”: if Partitions or Row Level Replication is unstable in MySQL 5.1 GA, it is unstable.

At the same time this unstable release may work for your application, or actually for a lot of applications which simply do not touch these unstable features.
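
As a rough illustration of the “weakest link” view, here is a minimal Python sketch. The feature names, the stability scores and the `weakest_link` helper are all hypothetical, made up for illustration and not taken from any measurement. It only shows how the same release can look unstable as a whole and still look fine to an application that stays away from the shaky features.

```python
# Hypothetical stability scores (0.0 = unusable, 1.0 = rock solid).
# These numbers are illustrative assumptions, not measurements.
FEATURE_STABILITY = {
    "myisam": 0.99,
    "innodb": 0.98,
    "statement_replication": 0.97,
    "row_replication": 0.60,
    "partitioning": 0.55,
}


def weakest_link(features, scores=FEATURE_STABILITY):
    """Return (feature, score) for the least stable of the given features."""
    name = min(features, key=lambda f: scores[f])
    return name, scores[name]


# Quality of the release as a whole: judged over every feature it ships.
release_quality = weakest_link(FEATURE_STABILITY)

# Quality as one application sees it: judged only over the features it touches.
app_quality = weakest_link(["myisam", "innodb", "statement_replication"])

print("release:", release_quality)
print("application:", app_quality)
```

With these made-up numbers the release is dragged down by its least stable feature, while the application that avoids it sees a much better weakest link.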

This brings us to another factor: whether you should use a given release for your application in production depends on whether it works well for it or not. I see GA status not as a seal of quality but rather as a promise that no serious changes are going to be made any more, so the probability of things getting broken is lower than it was earlier on.

For example, my application may well work with a given Alpha release but not with the next one 🙂

Probably it’s a good time for a project fork. Leave only MyISAM engine, throw away every feature that is not used, concentrate on quality, easy updates and issue tracking. It’s open source, isn’t it? Who wants to take the task?