On reflection, attachment:21435.diff​ isn't exactly the right solution. An index on (comment_author,comment_author_email) would be better, but it could get pretty big, and would be slow to create on big wp_comment tables.

Maybe a 50 character prefix index would be a good trade off between size and query performance, but doesn't solve the problem of creating the index.

Seems like (comment_author_email,comment_author) would be more generally useful.

Aye, that sounds like a better idea.

In wp_allow_comment(), comment_author_email is optional, so having comment_author first would be better. On the other hand, I assume the vast majority of sites require an email address.

In check_comment_flood_db(), comment_author_email is used, but comment_author isn't. I'd expect the comment_date_gmt index to be more useful here, though.

In WP_Comment_Query::query(), comment_author_email is an allowed param, but comment_author isn't, so having comment_author_email first would probably be more useful, depending on how WP_Comment_Query::query() is being used.

That's interesting, I'm surprised it gets that high a cardinality from 2 bytes. I'm curious, is the content of comment_author on your site generally in Japanese? I'm not at all familiar with character distribution in Japanese names or writing, but assuming a random distribution, there are certainly enough characters in Japanese to provide a better cardinality than in the English alphabet.

That's interesting, I'm surprised it gets that high a cardinality from 2 bytes. I'm curious, is the content of comment_author on your site generally in Japanese? I'm not at all familiar with character distribution in Japanese names or writing, but assuming a random distribution, there are certainly enough characters in Japanese to provide a better cardinality than in the English alphabet.

Yes. Japanese.
In this case, the column is not binary so index is constructed with 2 words.
Japanese characters are almost 50,000 words so that index tree would have maximum 50,000^2 node when using depth 2 BTree index.

If other users does not have this kind of performance problem, index length should be minimum.

I'd like to fix this. However, it is too late in the 3.9 cycle for a schema change.

I'm curious about a few things. Namely:

Which index is generally going to be the best balance of speed and size, for what we do now and what we may do in the future? There are some good discussions here and possibly even a consensus, but yet a lot of data to back up a decision.

Why haven't others reported this? 600,000 records in wp_comments isn't that unusual. Surely WordPress.com would have noticed this by now? Do giant sites just actively or incidentally avoid the comment_whitelist option? ("Comment author must have a previously approved comment")