The First Development Milestone for MySQL 8.0

Two new features looked important to me: modern UCA collations, and roles. So I downloaded and tried them out.

Modern UCA collations

MySQL is going to switch to utf8mb4 for the default character set and add collations based on the the latest version of the Unicode Collation Algorithm (UCA 9.0.0). I still see messages indicating the default is still latin1, but they're incorrect, I can put 4-byte UTF-8 characters in columns that I created without explicitly saying utf8mb4.

The new collations are only for utf8mb4. That's suboptimal. People still have good reasons to use other character sets (I discussed some of them in an earlier blog post). And in any case, a collation that works for utf8mb4's repertoire will work for every character set that has a pure subset of that repertoire, which is to say, every character set.

The new collations are only for the generic "Default Unicode Collation Element Table" (DUCET), and for Latin-based alphabets. So there are no updates for "persian" or "sinhala".

For an example, the following table shows changes between the old Swedish collation (utf8mb4_swedish_ci) and the new one (utf8mb4_sv_0900_ai_ci). The "Rule" column has what Unicode says about certain Swedish primary (level-1) comparisons, the "Example" column has what an SQL comparison would look like, the "Old" column has the results I got with utf8mb4_swedish_ci, the "New" column has the results I got with utf8mb4_sv_0900_ai_ci.

Most Swedes don't know about these rules, they apply to medieval texts or foreign names. But most Swedes do know that rules should cover the edge cases, not just the Stockholm phone book. Because it follows the Unicode rules, the new collation is better.

But the new collation's name is worse, for two reasons.

(1) The "_ai" suffix, meaning "accent insensitive", is Microsoftish. There is such a thing, but the definition of "accent" varies between languages and the factors that influence collations can be other things besides accents. Clearer suffixes for extra-sensitive collation names would be "w2" or "l2" (for weight=2 or level=2), and they're for sorting rather than searching unless you're Japanese, but a default = no-suffix-for-accents would have been okay.

(2) The "_sv" suffix, meaning "Swedish", is an unnecessary change. Compatibility with the previous suffix -- "swedish" -- would not have violated UCA specifications and would have been clearer for people who have used MySQL before.

For a second example, I looked at the new "Latin" collation, utf8mb4_la_0900_ai_ci. This time I couldn't find any rules file in the Unicode standard directory. There is a UCA chart for Latin but utf8mb4_la_0900_ai_ci obviously isn't following it at all. Instead it's like MySQL's old and silly "Roman" collation, where i=j and u=v. This is not an important collation. But MySQL claims the new collations follow UCA rules, and here is one that doesn't, so I worry about the others.

This has to be taken in context -- MySQL has far better support for character sets and collations than any other open-source DBMS, except sometimes MariaDB. And now it's a weenie bit more far better. Observations about paucity of new UCA collations, bad names, or standard non-compliance won't change that fact.

Roles

The first thing I noticed is that the syntax rules for roles are, too often, the same as the syntax rules for users. This is especially obvious when I ask for things that make no sense for roles, for example:

Because of this, some non-standard limitations exist: maximum name length is 32, names are case sensitive, role names cannot be the same as user names, and there is no separate information_schema table

However, the DML statements that I tested for MariaDB do work with MySQL as well, and are often exactly the same:

(The last GRANT example surprised me. MariaDB has trouble granting multiple roles in one statement, it's Bug#5772. MySQL appears to be "solving" it by making certain role names illegal unless they're delimited; I'm not sure that's the right way to solve it.)

Circular roles (GRANT r1 TO r2; GRANT r2 TO r1;) are allowed but I expect they'll be disallowed in a later version.

To generalize: so far MySQL 8.0.0 allows creation of roles but they have to look like users. So the syntax is undesirable, but they work properly.

Again, remember the context. There's nothing wrong with a feature that's not ready, until MySQL declares that it's ready.

Typos

MySQL's announcement, buried in a section about minor fixes, says "Foreign key names as stored in the foreign_keys and foreign_key_column_usage tables are a maximum of 64 characters, per the SQL standard". Er, up to a point. The SQL standard says "In a regular identifier, the number of identifier parts shall be less than 128."

Us Too

We have a new-version announcement too. Version 1.0.3 of the Ocelot Graphical User Interface (ocelotgui) for MySQL and MariaDB came out on Tuesday September 27 2016. Some new items are ...

As well as getting result sets in the result-set widget, one can get them added to the history widget, with the same format as what the mysql client outputs.

As well as predicting what the next word should be, Ocelot's syntax recognizer makes it possible to show hints if the user hovers over a word.

Finally, there is a rudimentary formatter. Clicking the edit menu item Edit|Format will change indentation, make keywords upper case, etc. I say "rudimentary" because, without a standard to follow, one must depend on taste, and nobody shares the taste that's on display here.

2 Comments

"But most Swedes do know that rules should cover the edge cases, not just the Stockholm phone book"=>
You shouldn't underestimate the influence of the Stockholm phone book. As a country boy I've surely experienced what access to big city directories can do.

"Circular roles (GRANT r1 TO r2; GRANT r2 TO r1;) are allowed but I expect they'll be disallowed in a later version." =>

Role hierarchies are resolved so that they form a tree using a breadth first strategy (see class Get_access_maps).
Since an authorization ID can be either a USER or a ROLE or /both/, there can be cases where these seemingly circular references make sense. It depends on your business process basically.
Consider authid feature_a(gen_feature_dev(contracted_dev))

1) Sinhala was dropped from 8.0.0 and will be implemented together
with a more general and improved implementation of collations for Indic
scripts.

2) Persian was dropped from 8.0.0 and will be implemented together
with a more general and improved implementation of collations for Arabic
scripts.

4) We have chosen to use ISO-639 language codes instead of English
names on the collations because we consider that more precise.

3) utf8mb4_la_0900_ai_ci is, as you say, the "old and silly "Roman"
collation" and is kept for legacy. The classical Latin language have
no entry in CLDR. The table you point to, however,
(http://unicode.org/charts/collation/) is not for the Classical Latin
language but defines the basic order of the Latin *letters*. So all
the new collations is based on UCA/CLDR with the exception of
utf8mb4_la_0900_ai_ci. You may consider that a documentation bug.