10 Comments

Ubuntu (and Debian) includes an iso_3166_2.xml file as part of its iso-codes package. It appears to be manually maintained by tracking news on the ISO website, but it's probably a reasonably reliable source. What's more, the package includes lots of translations.

I'd be happy to take over Locale::Country::SubCountry and port it to run on top of Ubuntu/Debian's data.

I run Debian, so I'll look into the iso-codes package you mention. Thanx for the offer to take it over. I've found /usr/share/xml/iso-codes/iso_3166_2.xml on my machine, so I'll examine that. Pause... Nope - it does not use native (Chinese, Arabic, etc) scripts, and contains only slightly more info than I get from Wikipedia. Nice try, though. My code will run on any OS/Perl installation with DBD::SQLite. I see no point in using XML for this, or in limiting myself to Debian. Of course, I /could/ make a case for limiting everybody to Debian, but I digress :-)). Also, I will supply scripts to export the data as CSV and HTML. Hmmm - I might even ship those output files too.

2 of 2: Hi Mark

No, I have not heard of geonames.org. Sigh - the internet is damned huge, I can tell you that. Pause... Ack! Chrome says I bookmarked it some time in the distant past. $repeat_comment_on_size_of_internet; Yes, I remember it now. A little bit more complex than Wikipedia to scrape, but a very nice resource admittedly. Still no native scripts.

3 of 2: $many x $thanx;
Or, as Adam Kennedy thinks I should write:
$thanx x $many;
Obviously, the poor guy has gotten too much Padre on the brain!

The Unicode CLDR contains all the information you need, even in the correct script. Go to http://unicode.org/Public/cldr/latest/ and download core.zip, look in e.g. common/main/ar.xml (for generic Arabic) and you'll find information about languages, scripts, territories, date formats and more.
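For anyone who wants to see what that looks like in practice, here is a minimal sketch in Python of pulling territory names out of a CLDR locale file. It assumes the `ldml/localeDisplayNames/territories/territory` layout, runs on a tiny inline sample rather than the real ar.xml, and the function name `territory_names` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny inline sample shaped like common/main/ar.xml. Assumption: the real
# file uses the same ldml/localeDisplayNames/territories layout; the
# entries here are illustrative, not copied from a CLDR release.
SAMPLE_LDML = """<?xml version="1.0" encoding="UTF-8"?>
<ldml>
  <localeDisplayNames>
    <territories>
      <territory type="EG">مصر</territory>
      <territory type="SA">المملكة العربية السعودية</territory>
      <territory type="PS">الأراضي الفلسطينية</territory>
      <territory type="PS" alt="short">فلسطين</territory>
    </territories>
  </localeDisplayNames>
</ldml>
"""


def territory_names(ldml_text):
    """Return {territory code: display name} from an LDML document.

    Hypothetical helper: entries carrying an alt attribute (short or
    variant forms) are skipped so each code maps to its default name.
    """
    root = ET.fromstring(ldml_text.encode("utf-8"))
    names = {}
    for node in root.findall("./localeDisplayNames/territories/territory"):
        if node.get("alt") is not None:
            continue  # keep only the default form of the name
        names[node.get("type")] = node.text
    return names


if __name__ == "__main__":
    for code, name in sorted(territory_names(SAMPLE_LDML).items()):
        print(code, name)
```

To try it on the real data, read common/main/ar.xml from the unpacked core.zip and pass its contents to the function in place of the sample string.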

I have, for a long time, been planning on writing a parser for this huge data source that is the CLDR, using XML::Rabbit, but I never seem to find the tuits. We even need parts of it for $dayjob. I got kinda stuck on the API design, as it contains such a big amount of information. If someone would like to pitch in, get in touch.

This highlights a very sad fact about the state of such basic information: there is no single, complete, up-to-date, authoritative source for country names, divisions, timezones, currency, etc in a relationally consistent format, translated, easily update-able, exportable, synchronize-able, partition-able, API-able...

Each of the sources we've identified in this thread (not to mention openstreetmap etc) has some kind of issue, but more importantly they all appear to be independent: presumably every project is either listening to the news-wire for changes, or waiting till someone else makes a change and then manually integrating.

It is almost enough to make me start a GitHub for geodata. Imagine a tool with the following usage:

Good point. I could use or ship iso-codes/iso_3166_2.xml. Sorry for the mis-understanding.

2 of 4: Hi Ben

But for how many subcountry names is the native script version available? And how much work can be put into such a scheme, given the varying formats of the pages in question? The few I checked would be PITA.

Luckily I have plenty of time available, but also a number of projects I'd like to work on...

3 of 4: Hi Mark

A geodb, eh? Hmmm. I'm heading in that direction too. My module stuffs the Wikipedia data into an SQLite db, and I have scripts which export the data as HTML and CSV.
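The general shape of that load-then-export idea can be sketched as below. This is a rough illustration in Python, not the module's actual code or schema: the table name `subcountries`, its columns, and the helpers `build_db` and `export_csv` are all assumptions, and the rows are a hand-typed sample.

```python
import csv
import io
import sqlite3

# Hand-typed sample rows: (country code, subcountry code, name).
SAMPLE_ROWS = [
    ("AU", "NSW", "New South Wales"),
    ("AU", "VIC", "Victoria"),
    ("CH", "ZH", "Zürich"),
]


def build_db(rows):
    """Create an in-memory SQLite db holding the rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subcountries ("
        " country_code TEXT NOT NULL,"
        " code TEXT NOT NULL,"
        " name TEXT NOT NULL,"
        " PRIMARY KEY (country_code, code))"
    )
    conn.executemany("INSERT INTO subcountries VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def export_csv(conn):
    """Return the whole table as CSV text: a header row, then sorted rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["country_code", "code", "name"])
    writer.writerows(
        conn.execute(
            "SELECT country_code, code, name FROM subcountries"
            " ORDER BY country_code, code"
        )
    )
    return buffer.getvalue()


if __name__ == "__main__":
    print(export_csv(build_db(SAMPLE_ROWS)), end="")
```

An HTML export would reuse the same SELECT and differ only in how each row is written out.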

One pain is that the SQLite web site and Oracle both ship an exe called sqlite3, which are incompatible, unless - I assume - the db was created with their own tools. Perhaps there's a command line switch which deals with this issue. I didn't check.

The current cost of the 3166 db from ISO is about 200 Swiss francs, or roughly 222 Australian dollars. I can afford it but don't feel like paying for it. And updating is an issue too.

So, yes, I have an interest in it.

As for funding, I'm living off my savings, and will be for months, while I care for my mother (who has Alzheimer's) until I have to put her in a home, so in a sense extra funding is desired but not necessary.

But, as I said in a previous reply, various projects contend for my time. This is good, since the intellectual stimulation is important, but is also a type of complexity, and complexity is always a red flag for me.

4 of 4: Hi Robin

Thanx for the URL. I was not aware of that. Of course, this whole process is a big learning curve for me, but I do realise Unicode is not going away so I'm absorbing it in stages.

I may well shift my data source over to that file.

As for the API, I think it'd better be the classic one-small-step-at-a-time API.

Ideas/etc very welcome. Perhaps also a more convenient discussion forum would be an idea.

Cheers
Ron

About Ron Savage

I try to write all code in Perl, but find I end up writing in bash, CSS, HTML, JS, and SQL, and doing database design, just to get anything done...