UTF8 data stored from a Latin1 Connection, how to check if my data is OK?

Latin1 Connection, instead of UTF8

I've recently had reports from people using my website that they are unable to create content in Chinese or Arabic. This led me to try creating content on my site using Chinese characters, and I observed that the data being stored for these characters is a question mark (?).

I realised from other questions and articles that I've read that I'm probably in 'character set hell'.

It seems that I've been connecting to the database through Entity Framework over a Latin1 connection, as this is the default for MySQL, but the columns in my database are UTF8.

How is my data encoded?

From what I have read of the process, I think that the data in my database is UTF8 data from the browser, encoded as Latin1 during transit to the database, which is then encoded again as UTF8 just before it is stored in the database.

I have been using a PHP script to determine whether the data comes out correctly when I set my connection to use UTF8, and when I select data from my database, it doesn't look like there is any difference. Existing data read through this PHP script displays on the screen just as it did over the Latin1 connection when using Entity Framework. There doesn't appear to be any issue with the data.

If I request a string of Chinese characters that appear to have been entered into the database as question marks, it displays as question marks when retrieved over a UTF8 connection.

I expected that when I connected to the database using a UTF8 connection, the data would be displayed as garbage, as I was previously using a Latin1 connection, but it isn't.

I used Entity Framework to interrogate the MySQL variables that were being used for the connection before and after adding CharSet=utf8; to my connection string. Hopefully this gives you an idea of how the connection was being established before and how it is now:

Connection before:

Connection with connection string charset updated:

How can I determine whether the data in the database is encoded incorrectly (that is, whether it is Latin1 data encoded as UTF8), so that I can decide if I can just change my connection string to use UTF8 and everything will work OK?

When trying to use utf8/utf8mb4, if you see Question Marks (regular ones, not black diamonds) (? is hex 3F), check the following.

The bytes to be stored are not encoded as utf8. Fix this.

The column in the database may not be CHARACTER SET utf8mb4 (or utf8). Check this and fix it if necessary.

Also, check that the connection during reading is utf8mb4.
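To illustrate why stored question marks cannot be repaired afterwards, here is a small Python sketch that mimics converting Chinese text to Latin1. It only simulates the effect with Python's own codecs (the "replace" error handler is an assumption standing in for whatever the connector or server does); it does not talk to MySQL.

```python
# Simulate converting text to Latin1 when the characters have no
# Latin1 equivalent. Python's "replace" handler substitutes "?" (hex 3F),
# which resembles what ends up in the database.
text = "新浪新闻"

stored = text.encode("latin-1", errors="replace")

print(stored)        # b'????'
print(stored.hex())  # 3f3f3f3f

# Every character collapsed to the same byte, so the original
# text cannot be recovered from what was stored.
```

If HEX(col) shows 3F where a Chinese character should be, that value was lost at write time and has to be re-entered; changing the connection charset later will not bring it back.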

æ–°æµªæ–°é—» is Mojibake for 新浪新闻
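The Mojibake above can be reproduced, and undone, with a few lines of Python. This is only a sketch: it assumes that Python's cp1252 codec is a close enough stand-in for MySQL's latin1, which holds for this sample string but may not for every byte value.

```python
# Reproduce the Mojibake: UTF-8 bytes misread as cp1252
# (used here as an approximation of MySQL's latin1).
original = "新浪新闻"
mojibake = original.encode("utf-8").decode("cp1252")
print(mojibake)

# Reversing the misreading recovers the text, so unlike the
# question-mark case, nothing was actually lost.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```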

When trying to use utf8/utf8mb4, if you see Mojibake, check the following.
This discussion also applies to Double Encoding, which is not necessarily visible.

The bytes to be stored need to be utf8-encoded.

The connection when INSERTing and SELECTing text needs to specify utf8mb4. (set_charset)

The column needs to be declared CHARACTER SET utf8mb4. (Check with SHOW CREATE TABLE.)

HTML should start with <meta charset=UTF-8>. (You did this.)

To verify, do SELECT col, HEX(col) FROM .... If the hex output for 新 is E696B0, then it is correctly encoded with utf8/utf8mb4. If you get C3A6E28093C2B0, it is "double-encoded". Usually, if the hex begins with E or F, it is probably correctly encoded. Also, the hex for a single correctly encoded Chinese character will be 6 or 8 digits long in all cases.
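If there are many rows to check, the hex values can be classified with a small script. The following Python sketch is a heuristic only: classify_hex is a hypothetical helper name, and cp1252 is again assumed to approximate MySQL's latin1. It takes one HEX(col) value copied from the query output.

```python
def classify_hex(hex_str: str) -> str:
    """Heuristically classify one HEX(col) value (hypothetical helper)."""
    raw = bytes.fromhex(hex_str)
    if raw.isascii():
        # Plain ASCII, which includes 3F (a literal question mark).
        return "ascii"
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not utf8"
    try:
        # If the decoded text maps back to cp1252 bytes that are themselves
        # valid UTF-8, the value was probably encoded twice.
        text.encode("cp1252").decode("utf-8")
        return "double-encoded"
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "utf8"


print(classify_hex("E696B0"))          # utf8
print(classify_hex("C3A6E28093C2B0"))  # double-encoded
print(classify_hex("3F"))              # ascii
```

A result of "ascii" for a value that should contain Chinese text suggests the question-mark case, where the original characters are already gone.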
