I just noticed that a user changed some correct tracklists and album titles back to mojibake! I cannot believe that such a veteran user (actually a trusted user) could do this. These tracklists had already been corrected by users who know the languages, and it seems that this user simply copied garbled info from v1 MA to here. Apparently this user does not know the language in question, or even its alphabet, so he changed some correct Polish tracklists to garbled ones full of non-Polish letters. He also changed a Chinese album to complete shit. (I have already changed it back.)

I have sent a warning to this user, but I do not know how widespread this problem is. If one veteran user could do this, then there is no guarantee that others won't.

=============================================================
EDIT: After a close examination of that user's modification history, I think it's not the user's mistake that the mojibake came into being. It seems that the cause is some encoding incompatibility. This is really weird, since this site already uses UTF-8 as its default encoding. It would be great if a computer expert could look into this.
=============================================================

So I propose the following policy: Never copy info from v1 without checking it first. If you do not have experience with decoding, do not rely on v1 for non-English info. For veteran users who can modify the tracklist/lyrics/album title, make sure you are familiar with the language. You don't need to know how to write or speak that language, but you need to know at least what it looks like. For languages that use letters, you need to know the alphabet. It's very easy since we have Wikipedia and Google. For Far Eastern languages (Chinese, Japanese, Korean), whose scripts are logographic or written in syllable blocks, or for languages that you have absolutely no idea about, you should be extremely careful. Better to use "report" and let a knowledgeable user do it than to do it yourself.

I propose that past ignorance shall be forgiven. But in the future, such stupid actions should not be tolerated.

I don't know what happened with this user exactly, but it seems that the data got garbled when he was entering song lengths. I don't think he intentionally modified the album/song titles at all. The site's encoding is indeed UTF-8 by default, but if the user manually changes his browser's encoding, it will garble everything. Other than manually changing his browser settings (and that would be -after- the page was loaded, too), I don't know how else it could have happened, especially considering one of the albums he edited has no v1 version (added in late 2011).

tl;dr, I fully endorse the above post in every way.
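The browser-override scenario described above can be reproduced in a few lines. A minimal sketch, assuming the wrong encoding was Windows-1252 (`cp1252`); the actual encoding involved is not known, and `garble` and `repair` are hypothetical helper names.

```python
def garble(text, wrong_encoding="cp1252"):
    """Simulate mojibake: UTF-8 bytes read back as a legacy encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(text, wrong_encoding="cp1252"):
    """Reverse garble(). May raise UnicodeError if the text does not round-trip."""
    return text.encode(wrong_encoding).decode("utf-8")


broken = garble("żółć")
print(broken)          # Å¼Ã³Å‚Ä‡
print(repair(broken))  # żółć
```

The repair only works when the guess about the wrong encoding is right and no bytes were lost along the way, so a human who knows the language still has to check the result.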

_________________

Von Cichlid wrote:

I work with plenty of Oriental and Indian persons and we get along pretty good, and some females as well.

Markeri, in 2013 wrote:

a fairly agreed upon date [of the beginning of metal] is 1969. Metal is almost 25 years old

Since we're on this topic, can someone fix this report? A trusted user (most likely the same one) changed the mojibake'd title to a transliteration instead of the native album name, while leaving the title tracks broken.

Maybe he is using some Soviet era web browser that is set to override encoding, possibly only for submitted form data. I'm not a "computer expert" either but this is just an idea I found in my butt. You should ask him about his browser.

Not strictly about mojibake, but more about HTML entities. Sorry if I derail this thread a little.

Some months ago I started correcting HTML entities in song titles, using the following search: *#*;*
I don't know what happened, but it's not working anymore: the search never stops and never gives a result. Too many items found, maybe? It used to work, giving something like 20,000 results in quite a short time (10s or so).

It's still working for band names though, but not for album or song titles.
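For anyone unsure what the search string *#*;* actually catches, here is a small sketch. It assumes the site's `*` behaves like a shell-style wildcard (which is how it appears to work, but is not confirmed here), and the titles are made up.

```python
from fnmatch import fnmatchcase

PATTERN = "*#*;*"  # the search string quoted above

# Made-up titles for illustration only.
titles = ["Caf&#233; Noir", "Track #1", "Song #2; Part II", "Plain Title"]
for title in titles:
    print(f"{title!r}: {fnmatchcase(title, PATTERN)}")
```

The third title shows the limit of the pattern: anything with a `#` followed later by a `;` matches, entity or not, so the results still need a look by eye.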

May someone of higher knowledge enlighten me on this subject, please?

Well I just checked that out, and apparently, it was one specific result in the result set that bugged the datatable from displaying correctly (caused an unknown json error). You didn't see it before, probably because that result wasn't on page 1 then. I fixed the invalid entries (including one Animetal release with 42 tracks, ugh) and now the results show up correctly, but it might very well happen again.

I don't know why it causes JSON errors, as I escape all the data before displaying it. Of the two JSON validators that I use, one says the JSON is valid, and the other says there's an error but shows the wrong line...

Anyway, if it happens again, let me know and I'll manually fix it I guess. -_-
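The actual cause of the error is not known from the above. Purely as an illustration, one common way for data that looks escaped to still break a JSON parser is a raw control character inside a string. The sketch below uses Python's `json` module, not whatever the site runs, and a made-up title.

```python
import json

# Made-up title containing a vertical-tab control character.
raw_title = "Intro\x0bOutro"

# Building the JSON by hand leaves the control character unescaped.
hand_built = '{"title": "' + raw_title + '"}'
try:
    json.loads(hand_built)
except json.JSONDecodeError as err:
    print("rejected:", err.msg)

# json.dumps escapes it, so the round trip succeeds.
safe = json.dumps({"title": raw_title})
print(safe)
print(json.loads(safe)["title"] == raw_title)
```

Parsers differ in how strictly they treat such characters, which would be one possible reason for two validators to disagree.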

And thank you very much for taking care of those html entities. You've certainly got your work cut out for you. I might add links to these search results as new "todo" lists, come to think of it. You certainly could use the help.

Bands with HTML entities in their additional notes: http://dev.metal-archives.com/todo/html-entities
Those are easy to fix, since the textarea usually renders the entities and saves the text correctly. Copy/paste is rarely required; just save to overwrite.

I've added a link to Alhadis's tool on the todo page. Keep in mind that you need the bookmark toolbar enabled for it to work (you can't drag the tool otherwise). I don't generally use the bookmark toolbar but I've enabled it just for that while I work on those lists. Neat tool, especially for cleaning song titles in one swoop. Thanks Alhadis.

Just a reminder that HTML entities do not necessarily stand for the correct letters. While most of them are correct, I have seen examples in which the HTML entities actually render as mojibake. In that case, you still have to figure out what the correct text should be.
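A short sketch of what that looks like in practice. The stored string is a made-up example, and the assumption that the damage was UTF-8 read as Latin-1 is only a guess for this one case.

```python
import html

stored = "Caf&Atilde;&copy;"  # made-up example of a wrong entity pair

decoded = html.unescape(stored)
print(decoded)  # CafÃ©  (perfectly valid entities, still mojibake)

# Assumption: the text was UTF-8 misread as Latin-1 before the entities were made.
repaired = decoded.encode("latin-1").decode("utf-8")
print(repaired)  # Café
```

Converting the entities is therefore only the first step; the decoded text still has to be compared with what the title should be.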

I compiled some basic information on the main languages likely to appear on MA. The list should be complete enough for the purpose of removing mojibake. Many writing systems are in wide use: Cyrillic, Latin, Greek, Hangul, Chinese characters, abjads (Arabic, Hebrew) and abugidas (Thai, etc.). I did not include the Armenian and Georgian scripts, which are quite different from the above, because MA has only 20 bands from those two countries, and it seems none of them use their own script.

Abjads and abugidas: I don't have any knowledge of these scripts, so I can say nothing about Arabic, Hebrew, Thai and others. I know Thai is used a lot on MA. As for Arabic and Hebrew, I have no idea how much they are used.

Hangul (Korean) and Chinese characters (Chinese, Cantonese, Taiwanese, Japanese): these are written as character blocks rather than as a row of letters. In Chinese there are 1500-2000 very common characters, and one needs 3500-4000 characters to read newspapers and books without difficulty, so it's impossible to list the characters here.

Latin-derived alphabets: The following lists only the "exceptional" letters in each language. Digraphs (combinations of ordinary Latin letters) are not listed separately, even if they are considered letters in their own right in some languages, e.g. the letter lj in Croatian. Some languages do not use every Latin letter. For example, there is no letter Q in Icelandic. This will not be indicated in the following.

Special notes:
(i) Turkish has two versions of the letter i, one dotted and one dotless. The capital of the dotted i is İ, e.g. İstanbul.
(ii) The Romanian letters ș, ț (s-comma, t-comma) are NOT ş, ţ (s-cedilla, t-cedilla). The latter are still widely (and incorrectly) used because s-comma and t-comma were not part of early Unicode and lacked proper font support.
(iii) Vietnamese is a tonal language with 6 tones. The first is unmarked, and the rest are marked with different diacritics: second tone by grave accent, e.g. Huyền; third tone by hook, e.g. Hỏi; fourth tone by tilde, e.g. Ngã; fifth tone by acute accent, e.g. Sắc; sixth tone by dot below, e.g. Nặng.
(iv) The upper-case form of Đ, đ (d with stroke) in Bosnian and Vietnamese looks identical to the upper-case form of Ð, ð (eth) in Icelandic and Faroese. They are not the same letter and have different Unicode code points.
(v) Serbian is the only European language with active digraphia, using both the Cyrillic and Latin alphabets.
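When two letters look the same on screen, their Unicode names settle which one is actually in the title. A minimal sketch using Python's standard `unicodedata` module; the code points are written as escapes so that the look-alikes cannot get mixed up when the snippet is copied.

```python
import unicodedata

# Look-alike pairs from the special notes above, written as escapes.
pairs = [
    ("\u0219", "\u015f"),  # Romanian s-comma vs. s-cedilla
    ("\u021b", "\u0163"),  # Romanian t-comma vs. t-cedilla
    ("\u0110", "\u00d0"),  # d with stroke vs. eth (upper case)
    ("\u0130", "\u0131"),  # Turkish capital dotted I and small dotless i
]

for first, second in pairs:
    for ch in (first, second):
        print(f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch)}")
    print()
```

Pasting a single suspicious character from a title into `unicodedata.name()` is usually enough to tell which of the two it is.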

Last edited by sofeshue on Thu Feb 09, 2012 3:21 pm, edited 1 time in total.

Wow that's quite the thorough list. Good work. Out of curiosity... are you a professional linguist? You seem to really know your shit.

I'll note that in French, the characters Œ œ, Æ æ are not necessarily used per se; it's perfectly acceptable to write "oe" and "ae" instead. Luckily, the search engine understands both, so they can be used interchangeably.
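How the site's search treats these letters internally is not shown here. As an illustration only, the same "both spellings match" behaviour can be sketched with a hand-written folding table; `LIGATURES` and `search_key` are hypothetical names.

```python
# Assumption: a hand-written table; the site's real search folding may differ.
LIGATURES = str.maketrans({"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE"})


def search_key(text):
    """Fold ligatures and case so both spellings compare equal."""
    return text.translate(LIGATURES).casefold()


print(search_key("Cœur") == search_key("Coeur"))
print(search_key("Æther") == search_key("aether"))
```

The folding is applied only for comparison; the stored title keeps whichever spelling the band actually uses.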

More HTML entities

Some HTML entities slipped through the queries posted on the "to do" page: for example, when &eacute; is used instead of &#233;. Alhadis' script also does not work on them. Those can be found using *&*;* in the search field.

I am correcting all of those in song titles, but a few are also to be found in band names (1) and album titles (21).
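Both kinds of entity can be caught with one pattern. A minimal sketch, assuming a plain regular expression is good enough for titles; `ENTITY` and `find_entities` are hypothetical names and the title is made up.

```python
import html
import re

# Numeric (&#233; or &#xE9;) and named (&eacute;) character references.
ENTITY = re.compile(r"&(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def find_entities(title):
    """Return every entity-looking substring in `title`."""
    return ENTITY.findall(title)


title = "Caf&eacute; &#233;toil&#xE9; & Co; #1"  # made-up title
print(find_entities(title))  # ['&eacute;', '&#233;', '&#xE9;']
print(html.unescape(title))  # Café étoilé & Co; #1
```

Unlike the wildcard search, the pattern ignores a bare "&" or "#" that merely happens to be followed by a ";" later in the title.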

WAIT A MOMENT. Please do not use Alhadis' tool to clean up HTML entities in the BAND'S ALTERNATIVE NAME field. This list is valuable because it points us to lots of bands whose names MIGHT need to be changed to the original language. Of course NOT all such names should be changed. Furthermore, from my first look, some HTML entities in the alternative names are completely wrong! If we clean them up now, it will be difficult to find these bands again, especially the ones with wrong HTML entities.

Sofeshue, fixing the HTML entities isn't going to introduce any problems with fixing mojibake. Any HTML entities that existed in band ANS fields in V1 were probably already wrongly encoded when they were submitted as HTML entities.

I know. What I mean is, since there are HTML entities in the bands' ANS fields, these bands may need a close look rather than a simple conversion of entities into letters. Once the list is gone, how do we find these bands easily?