BLOG LANGUAGES.

The NITLE Blog Census has a page listing the languages used (or apparently used: “our algorithm decides what language a blog is in by looking at the text content, and not at any language attributes in the markup”) in the 1,449,515 weblogs they’ve indexed. English, unsurprisingly, is far in the lead, but below that things get interesting; in particular, I find it hard to believe Russian is so far down the list. (Via the Sidelights column of Electrolite.)

Could someone please also direct me to even *one* weblog in Breton? Apparently there are 648 of them and I’m not even sure there are 648 Breton speakers with access to the internet in the first place!
(…but I’d love to know what language they’re confusing with Breton based on the statistical analysis they’re doing!)
Similarly I don’t really think that there are actually over 800 blogs in *Latin*.
Probably the “statistical techniques” they’re using have a few glitches – though it would be very interesting to learn what their error rate is.
(Also, what sort of statistical analysis could they be using to identify the languages – does anyone have any idea?)

There are several language identification systems out there on the web. The one that seems to be best known (and the one which is built in to the NITLE system) is Gertjan van Noord’s TextCat; another is SILC, and there are several more. Most do something along the lines of Cavnar and Trenkle’s N-Gram-Based Text Categorization, as van Noord points out.
The thing is, TextCat has to be trained on sample text. If it hasn’t been trained on a certain language, then the chance of that language being identified is zero.
That’s probably the case here. Maciej Ceglowski is the guy who set up the system; I’ll ask him about the cases of Russian and Japanese.
Interestingly, he pointed out that the reason that Latin shows up with such a high score is not that Elvis has started blogging, but rather that there are many inactive blogs that contain nothing but lorem ipsum text.
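
For anyone curious what “along the lines of Cavnar and Trenkle” means in practice, here is a minimal Python sketch of the rank-order n-gram idea. It is not TextCat’s actual code: the function names, the toy training samples, and the defaults (n-grams of length 1 to 5, top 300 kept) are my own assumptions for illustration. Each language gets a ranked list of its most frequent character n-grams, and a document is assigned to the language whose ranking is least “out of place” relative to its own.

```python
from collections import Counter

def ngram_profile(text, max_n=5, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries so they form n-grams too
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language get a fixed penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )

def identify(text, profiles):
    """Return the trained language whose profile is closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Toy training samples (hypothetical, and far too short for real use).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the early bird catches the worm",
    "french": "le vif renard brun saute par-dessus le chien paresseux et pierre qui roule n'amasse pas mousse",
}
profiles = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

# The answer is always one of the trained languages, never "unknown".
print(identify("lorem ipsum dolor sit amet", profiles))
```

Note that `identify` can only ever return a key of `profiles`, which is the point made above: a language with no training profile has zero chance of being reported, and text in it gets filed under whichever trained language happens to look nearest.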

Answers – Yes, the program is TextCat, which uses statistical analysis (basically looks at three-byte-long snippets to guess at the language and encoding). Breton, Catalan and other regional languages are usually misidentified French and Spanish blogs, although the algorithm has an unusual predilection for classifying Russian as Catalan. Go figure.
I’m mystified by the absence of Japanese here, too. It’s probably because the algorithm needs a better Japanese training set. For that matter, it needs a better set for several languages. I’m planning to do that Real Soon Now ™.
For what it’s worth, the Icelandic blogs are genuine (part of a set I’ve verified by hand), and reflect a real predilection for blogging on the part of our Nordic friends. Go figure some more.
The English, French, Portuguese, Italian, and Polish IDs are in the 99% accuracy range for false positives, as determined by a very bored student worker doing statistical sampling. For the rest, “work remains to be done”, as they say.

Couldn’t someone Google a simple, unique Breton sentence or phrase to find Breton blogs?
A while back, when I would occasionally go to a “blogs recently updated” site to see what was new, I did see a tremendous number first of Portuguese and then of Persian sites. So many that the site became useless to me.
Speaking of Icelandic, I have an Icelandic translation of the Tao Te Ching I got off the net around here somewhere.

Maciej, you may also be interested in this algorithm presented last year by Dario Benedetto, Emanuele Caglioti, and Vittorio Loreto which used good old ZIP compression to identify languages. I can’t vouch for its effectiveness but the researchers claimed some impressive results.
The method is mind-bogglingly simple:
1. Find a number of long exemplar texts written in known languages.
2. Assume that the text we want to analyse is written in the same language as one of the exemplar texts.
3. Find the exemplar text with the smallest distance (relative entropy) to the test file.
In order to define the relative entropy between two source texts A and B:
1. Extract a long sequence A from the source A and a long sequence B as well as a small sequence b from the source B.
2. Create a new sequence A + b by simply appending b after A.
3. Compress the new sequence A + b using gzip, and measure the compressed length of b (which has been compressed using an encoding optimised for A):
D_{Ab} = L_{A+b} - L_A,
where L indicates compressed length in bits.
4. Do the same for B and b:
D_{Bb} = L_{B+b} - L_B.
5. Estimate the relative entropy S_{AB} between A and B by:
S_{AB} = (D_{Ab} - D_{Bb}) / C_b,
where C_b is the number of characters in the sequence b.
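
The steps above can be sketched in a few lines of Python. This is only my reading of the recipe, not the researchers’ code: it uses zlib (the same DEFLATE compression gzip uses) rather than the gzip program itself, the function names are made up, and the two exemplar texts are toy strings far shorter than the “long exemplar texts” the method calls for.

```python
import zlib

def compressed_bits(text):
    """Length in bits of the zlib-compressed UTF-8 encoding of text."""
    return 8 * len(zlib.compress(text.encode("utf-8"), 9))

def delta(reference, fragment):
    """Extra bits needed for fragment when appended to reference: L(ref + frag) - L(ref)."""
    return compressed_bits(reference + fragment) - compressed_bits(reference)

def relative_entropy(a, b, b_small):
    """Estimate per character: (delta for A and b, minus delta for B and b) / len(b)."""
    return (delta(a, b_small) - delta(b, b_small)) / len(b_small)

def closest_language(sample, exemplars):
    """Pick the exemplar after which the sample compresses into the fewest extra bits."""
    return min(exemplars, key=lambda lang: delta(exemplars[lang], sample))

# Toy exemplars (hypothetical; real use needs much longer texts).
EXEMPLARS = {
    "english": "The quick brown fox jumps over the lazy dog. She sells sea shells "
               "by the sea shore. A stitch in time saves nine.",
    "french": "Le vif renard brun saute par-dessus le chien paresseux. Pierre qui "
              "roule n'amasse pas mousse. Petit a petit, l'oiseau fait son nid.",
}

sample = "She sells sea shells by the sea shore."
print(closest_language(sample, EXEMPLARS))
print(relative_entropy(EXEMPLARS["french"], EXEMPLARS["english"], sample))
```

Since every candidate is compared against the same sample, dividing by its length does not change which exemplar wins, so `closest_language` just compares the raw differences. Compressed sizes come in whole bytes, so with texts this short the numbers are very coarse.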
Here are some articles on the subject:
UniSci: Zip Programs Can Identify Language Of Any Document
Nature: Algorithm makes tongue tree
Economist: The elements of style
Physical Review Letters: Language Trees and Zipping
Computing in Science & Engineering: Zipping Out Relevant Information

Assuming for the moment that their algorithm is accurate, I think it’s quite possible that the absence of expected languages from their statistics has to do with the way they find the blogs. With the exception of linguabloggers, most bloggers link to sites in only one or two languages, so a crawling approach may be less likely to cross over from blogs in one language to those in another. I presume they’re using recently updated lists and sites like Daypop, but I’d still expect the prevalence of English to skew the results.
But you can help! They have a submission form that allows you to add blogs to their crawl list, so add your favorite Russian blogs, Japanese blogs, and Breton blogs, and check back later to see what they find.

-
Never mind the absence of Japanese weblogs – how plausible is the low rating for Korean weblogs?
South Koreans are the most webactive nation in the world, spending, if I remember correctly, at least one and a half times as much time online as the second most webactive. They have cheap fat pipes.
I bet there are at least thirty thousand Korean weblogs.
-

I’m less surprised by the high number of Icelandic weblogs.
Since their whole country [not just a northern portion of it as in Scandinavia or Russia or Canada] has the six-month day and the six-month night, Icelanders have long developed a rather extreme and distinctive indoor culture for the winter.
Iceland used to lead world statistics in the 70s for newspaper and book reading per head, and for other lonely hobbies like postal chess for exactly this reason.
They’re a completely natural nation for compulsive Internet use.
-

I was going to say, I bet they’re not indexing the Russian community on LiveJournal. The trouble is, friends lists aren’t stored on the same page as the journal, so I’m hard pressed to figure out a way to make the software start indexing it.

There are probably a number of national vernaculars which are un- or underrepresented in the table, just because there’s one company that hosts most of that country’s blogs. This is the case for Hebrew, for example.

Blogspot has a “most recently published” feature listing some of the blogs that have been updated in the last minute. The titles of the blogs are predominantly in English, but my impression is that Icelandic and Portuguese vie for second place. Frequency of updating blogs on blogspot is probably not representative of netwide incidence of blogs, but my data does support NITLE’s survey.
I want to know how many Klingon language blogs there are out there. I’ve only found three. There must be more.

Thinking about this a bit more, I wonder whether teenspeek and atrocious grammar and spelling lead to misidentification of languages. Is the Breton equivalent of u r sooooo kyt!!!! pm me l8r i hv 2 tl u my news!!!! going to be identified as anything?

