Archive for June, 2004

The thing that really strikes me most here at JavaOne is how 聰明 – Cong1 Ming2 – Clever Bright – Smart many of Sun’s Java people are. Joshua Bloch, Neal Gafter, Gilad Bracha, and several of the people who appeared at “Meet the Hotspot dev team” (I don’t have their names, unfortunately; I think one of them might have been Peter Kessler) just seemed to be really incredibly amazingly super smart. Scary smart. Alien supercomputer mind smart. They also looked like they were really enjoying themselves presenting their work, defending their work (against criticism from both the audiences and each other!), and fielding all types of questions quickly and smoothly, as if they had rehearsed the answers. They turned “stupid” questions into chances to explain things in a new way, or a simpler way, and dove instantly into unbelievably arcane details when somebody finally asked a “good” question.

For some reason the guys who work on the core Java APIs, like Collections, were the most impressive to me. I go to work, I fire up my editor, and I load com/boring/Stupid.java, some interface that is implemented in like three places, all of which either I, or somebody who sits within three cubes of me controls, and change it. Oooooh, scary. They go to work, they load java/util/Collection.java. Every change has implications in millions? billions? well, LOTS of lines of code, most of which they have never seen, and many of which are probably using the APIs in very… let’s say “creative” ways. But yet they still manage to do things like retrofit the entire Collections API for Generics. While keeping full backwards compatibility! Really really impressive.

Too bad none of them have weblogs! I’d love to hear about this stuff all the time, rather than just once a year. Maybe they’d like a constant stream of feedback and new ideas too. Hey Tim Bray! Get on these guys! Their weblogs would probably be a lot more interesting, at least to me, than certain other new ones by Sun employees.

I’ve mentioned Hakka, one of the Chinese ethnic groups, before. Today we tried a Hakka restaurant: Mon Kiang. It was good! We had salty baked chicken, and some clay pot tofu seafood thing. Too bad we didn’t have room for more dishes, but our stomachs were jet lagged.

Well, that’s what my dictionary says, anyway. You can see this character on this listing of fish species names, so I guess it isn’t too far off. But I chose this one because of its Japanese meaning: Sushi!

And if you take the simplified version of the first character, you get an alternate (actually more common according to this survey) way to write sushi in Japanese: 寿司. As far as I can tell, the word 鮨, pronounced sushi in Japanese, came first. It was then superseded (still in Japanese) by 寿司, which is pronounced the same as 鮨, and each character has some auspicious meaning or something. This was then, I’m guessing, backported to Chinese as 壽司.

She forgot why, but for some reason Judy looked up the expression 斤斤計較 – Jin1 Jin1 Ji4 Jiao4 – Pound Pound Haggle on her favorite new toy, an electronic pocket Chinese/English dictionary. It translated it to English as “Calculative”. Asking her and Jenny about it, they said it means “somebody who haggles over every pound”, or somebody who is very picky and makes sure they always get the good end of the deal. Searching for this expression led me to this site, which has a huge long awesome list of all these Chinese four word expressions. It translates that expression as “be particular about every point”.

Today you are supposed to eat 粽子 – Zong4 Zi5 – Zongzi because it’s 端午節 – Duan1 Wu3 Jie2 – Dragon Boat Festival. There’s a bunch of different stories of why this is the custom. They mostly revolve around this poet 屈原 – Qu1 Yuan2, who killed himself by drowning in a river on the fifth day of the fifth month (by the Chinese calendar) in 278 BC. One story says that people threw Zongzi into the water so the fish would eat them instead of Qu Yuan. A variation on that says that after death Qu Yuan appeared in people’s dreams and said he was hungry, and people threw rice in, wrapped in leaves, so that the fish couldn’t eat it, but Qu Yuan could! This version goes on to say that Qu Yuan later told them to throw them into the water from dragon shaped boats, so the fish will think that the zongzi are meant for the Dragon King.

This page has even more on Dragon Boat Festival, with more stories of how it originated, and two other interesting side points: the fifth month on the Chinese calendar is also known as “POISON MONTH” (btw: the seventh is “GHOST MONTH”, also cool). And, they used to celebrate Dragon Boat Festival in Taiwan by having massive rock throwing fights.

This version fixes that stupid mistake with htmlspecialchars in the last one that broke the forward and back navigation links on view.php. And! It includes an experimental pre-alpha version of Magpie that really tries its best to get the whole character encoding thing right. It uses this method, except it tries iconv (thanks Phil!) before trying mb_convert_encoding. Anecdotal evidence has shown that if iconv is installed, it may be more reliable than mbstring. It’s still not clear just what percentage of servers out there is going to have either one of them installed. Future versions of FoF will analyze your system during install.php and let you know what you’ve got and what encodings are going to work. For now, if you find feeds that are still getting corrupted, send me the links!
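The "iconv first, then mbstring" order can be sketched roughly like this. It is a minimal illustration, not the code that shipped in FoF or Magpie: the function name sketch_to_utf8 is made up for the example, and it assumes only the standard iconv and mb_convert_encoding functions, either of which may be missing on a given host.

```php
<?php
// Hypothetical helper (not FoF/Magpie code): convert $text from the charset
// $from into UTF-8. Returns null when no installed extension can do it.
function sketch_to_utf8($text, $from) {
    // Try iconv first, if the extension is installed.
    if (function_exists("iconv")) {
        $out = @iconv($from, "UTF-8", $text);
        if ($out !== false) {
            return $out;
        }
    }
    // Fall back to mbstring, if that one is installed.
    if (function_exists("mb_convert_encoding")) {
        try {
            $out = @mb_convert_encoding($text, "UTF-8", $from);
            if ($out !== false) {
                return $out;
            }
        } catch (\Throwable $e) {
            // Newer PHP versions throw on an unknown encoding name.
        }
    }
    return null; // neither extension could convert this charset
}
```

A caller would treat null as "leave the feed as it is and hope for the best".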

It depicts a world where all pregnant women are followed around by ghosts, because the ghosts want to get the chance to be reincarnated during childbirth. The ghosts are kind of spooky looking, sure, but there’s a benefit: they make sure the women don’t get hurt, or else they won’t be reincarnated. And this is a horror movie? Shu Qi sure doesn’t like the idea though, even after it is calmly explained to her by Philip Kwok as a natural part of the Buddhist cycle of life. She keeps trying to kill herself, so the ghost won’t “get” her baby, but instead the ghost keeps saving her. It actually gets sort of funny towards the end, as she jumps off higher and higher floors of the hospital she’s in but just… won’t… die!

I’m going to wait another day, at least, to release FoF 0.1.7. I’m still working with Kellan to figure out the approach to translating those charsets that is likely to work on the largest number of hosts. To iconv or to mbstring? That is the question!

Please note that since posting the original article, I’ve found a few more bugs and enhancements in the charset handling code. FoF 0.1.7 will incorporate them all, and I’ll go back and update the article with the final code.

Even more incredible: within 20 or 30 years your common desktop PC will come with a hard disk that size… and big archives like these will have smashed through the exabyte range and be up in the yottabytes. One yottabyte is 9,671,406,556,917,033,397,649,408 bits.

This prediction is not insane: my first computer, about 24 years ago, had 1 KB, about 8,000 bits, of RAM. A new computer today has 1 GB: a million-fold increase. You get about a thousand-fold increase in RAM size, hard disk size, and CPU speed almost every 10 years.

Update: This code has been finalized and debugged, and is now shipped as part of MagpieRSS 0.7! Sadness and rage no more!

So I have this little program, called Feed on Feeds. It’s an RSS and Atom aggregator. For a long time I’ve known that it doesn’t quite handle international characters that well, so I set out to fix it. I knew that somewhere between input feed and output HTML page, characters were getting messed up. I adopted a policy of “UTF-8 Everywhere”: since FoF has to deal with feeds in lots of different charsets, but display them all on one page, I’d translate everything into UTF-8. I UTF-ized everything in the display code, and made sure that the DB wasn’t mucking with the characters, finally closing in on the place where it seemed characters were being munged: the XML parser itself, called by MagpieRSS, the RSS and Atom parser used by FoF.

Here’s how Magpie was creating the XML parser:

$parser = xml_parser_create();

Nice! Simple! But it munges characters, especially numeric entities. After reading some PHP docs, I found that there are two things you can set in PHP’s XML parser: the source encoding, and the target encoding. You can set the target encoding this way:
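A minimal sketch of that call, assuming the standard xml_parser_set_option function and its XML_OPTION_TARGET_ENCODING option. The character data handler and the tiny test document are only there to make the effect visible; they are not part of Magpie.

```php
<?php
// Collect character data so the effect of the target encoding is visible.
$text = "";

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

xml_set_character_data_handler($parser, function ($p, $data) use (&$text) {
    $text .= $data;
});

// &#233; is the numeric entity for "é"; the handler should receive it
// as UTF-8 bytes once the target encoding is set.
xml_parse($parser, '<?xml version="1.0"?><t>caf&#233;</t>', true);

echo bin2hex($text), "\n";
```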

This means: “Whatever charset the XML is in, I want you to translate it into UTF-8. And if you happen to find any numeric entities in there, resolve them into UTF-8 characters, too.”

So I tried that. But it still wasn’t working. Some feeds were translated into UTF-8 properly, but others weren’t. Feeds already in UTF-8 were re-encoded, resulting in gibberish. Reading some more documentation and bug reports, I found that if you don’t set the source encoding, PHP assumes your XML is in ISO-8859-1! I was amazed that PHP’s XML parser didn’t examine the XML prolog to determine the encoding, and further shocked that they chose such an insane default. But anyway. You can set both source and target encodings this way:
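A minimal sketch, assuming the encoding name is passed as the argument to xml_parser_create. "ISO-8859-1" stands in for whatever charset the feed is really in, since the parser only accepts a short list of names. Note that the manual for current PHP versions describes this argument differently (the input encoding is detected automatically), so treat this as an illustration of the PHP 4 behavior discussed here.

```php
<?php
// The argument names the encoding of the incoming XML.
// "ISO-8859-1" is only a stand-in for the feed's real charset.
$parser = xml_parser_create("ISO-8859-1");

// The target encoding says what the handlers should receive.
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
```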

This means, “I’m about to give you some XML, in EBCDIC. I want you to translate all those characters into UTF-8 while you’re parsing it. Don’t forget to turn any numeric entities you find into UTF-8, too.”

That works… but presents a problem. How do you know the charset the XML is in? The only answer I could come up with: scan the XML myself, and find the encoding!
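A sketch of such a scan, under some assumptions: the function name sketch_detect_xml_encoding and the exact pattern are made up for this example and are not the regex Magpie uses. It only looks at an ASCII-compatible prolog at the very start of the document, so a byte order mark or a UTF-16 feed would slip past it.

```php
<?php
// Hypothetical helper: return the encoding named in the XML prolog,
// upper-cased, or "UTF-8" (XML's default) when none is declared.
function sketch_detect_xml_encoding($xml) {
    $pattern = '/^\s*<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/';
    if (preg_match($pattern, $xml, $m)) {
        return strtoupper($m[1]);
    }
    return "UTF-8";
}

$xml = '<?xml version="1.0" encoding="iso-8859-15"?><rss/>';
$encoding = sketch_detect_xml_encoding($xml);
```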

That regex finds the charset declaration in the XML prolog itself, and if found, saves it in the variable $encoding. If it wasn’t found, it assumes the XML is in UTF-8 already, which is the default for XML.

That, finally, worked. All my feeds were reliably translated into UTF-8. But that was just by coincidence. All the feeds I subscribe to were already in UTF-8 or ISO-8859-1. After making this release, people complained that feeds in ISO-8859-15 and BIG-5 weren’t working. Consulting the PHP docs again, and double checking in the source code because it was just so surprising, I found that PHP 4.x only supports UTF-8, ISO-8859-1, and US-ASCII. So anybody out there who wants to subscribe to a feed in ISO-8859-anything-but-1 or BIG5 or SHIFT-JIS is still screwed.

Even PHP 5 won’t help here, when it is released: It sort-of supports a longer list of encodings, but not BIG5 or GB2312, the two main Chinese encodings.

So I searched the PHP docs some more, and came up with a potential solution: mbstring! The mbstring family of functions supports a huge long list of encodings, and can translate between them. So here’s the final solution: use a regex to find the source encoding. If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding="utf-8" and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it. If mb_convert_encoding blows up (which it will if the source encoding is not recognized, or if the function completely doesn’t exist, which I’m told is highly probable, since it is an optional extension) just give up and pass the XML straight to the parser and avert your eyes as it makes mincemeat of the characters. At least I tried.
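The steps above can be sketched as one function. This is an illustration under stated assumptions, not the code that went into FoF or Magpie: the name sketch_prepare_xml, the regex, and the return shape (document plus the source encoding to hand to the parser) are all invented for the example.

```php
<?php
// Hypothetical helper. Returns array($xml, $encoding): the (possibly
// converted) document and the source encoding to give the parser.
function sketch_prepare_xml($xml) {
    $native  = array("UTF-8", "ISO-8859-1", "US-ASCII");
    $pattern = '/^(\s*<\?xml[^>]*\bencoding\s*=\s*["\'])([A-Za-z0-9._-]+)(["\'])/';

    if (!preg_match($pattern, $xml, $m)) {
        return array($xml, "UTF-8");          // no declaration: XML's default
    }
    $source = strtoupper($m[2]);
    if (in_array($source, $native, true)) {
        return array($xml, $source);          // the parser handles these itself
    }
    if (function_exists("mb_convert_encoding")) {
        try {
            $converted = @mb_convert_encoding($xml, "UTF-8", $source);
        } catch (\Throwable $e) {
            $converted = false;               // newer PHP throws on unknown names
        }
        if (is_string($converted) && $converted !== "") {
            // Rewrite the prolog so it matches the converted bytes.
            $converted = preg_replace($pattern, '${1}utf-8${3}', $converted, 1);
            return array($converted, "UTF-8");
        }
    }
    // Give up: pass the XML through untouched and let the parser do its worst.
    return array($xml, "UTF-8");
}
```

The second element of the result is what you would pass to xml_parser_create, with the target encoding still set to UTF-8 afterwards.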

Surprisingly, this hack on top of a hack wrapped up in a hack with extra hack on top… worked! It was able to parse ISO-8859-15, BIG-5, even GB2312 feeds just fine, and translate them all into UTF-8 for display on a single page. I have these changes in my local copy of FoF now, and I’m going to let them burn in for a few days before I release them to the wider world, who will probably point out, within minutes, the multiple and tragic ways that even this solution fails. But until then, I proclaim that this is the state of the art in PHP XML charset-aware parsing. I think this is as good as it gets in PHP 4.x.

Footnote: when I say PHP5 sort-of supports more encodings, this is what I mean: PHP5 (I looked at RC3, maybe these bugs will be fixed by the final release) is completely nuts. The XML parser supports a bunch more encodings, but they are really hard to get to. If you try to explicitly set the input encoding, the PHP code limits you to UTF-8, ISO-8859-1, or US-ASCII, even though libxml2, the underlying parser, supports many more. But, if you know the super secret codes, you can construct the parser this way:

$parser = xml_parser_create("");

Notice the difference? In PHP5 bizarro world, passing in an empty string means “do what you should have done all along, auto detect the stupid encoding!” But, there’s another problem: if you auto-detect the stupid encoding this way, the stupid target encoding is stupidly set to ISO-8859-1. I don’t know who would want that. And it goes against the documentation, which says by default the target encoding is set to the source encoding. And again, you are restricted artificially from setting the output encoding to anything other than UTF-8, ISO-8859-1, or US-ASCII. So you could, if you want, use a regex (yuck!) to find the source encoding, but you wouldn’t be allowed to set the target encoding to match. But, at least, you can do this:
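A minimal sketch of that combination, assuming the empty-string constructor shown earlier plus the standard XML_OPTION_TARGET_ENCODING option:

```php
<?php
// Empty string: let the parser work out the source encoding by itself.
$parser = xml_parser_create("");

// Then ask for UTF-8 on the way out, whatever the source turned out to be.
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
```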

Meaning, “Auto detect the source encoding, and then translate everything, including numeric entities, into UTF-8.”

At least you will be able to… when PHP5 comes out, and is installed on the server where your application needs to run, which for me (I get complaints that FoF won’t work on PHP 3) probably won’t be for several years.

0.1.5 still had some pretty major bugs. I’ve fixed those, and tested this release pretty thoroughly. Hopefully this one has no “SEVERITY ONE” issues.

By the way, I’m very very impressed with the “many eyes” effect on these last few releases. The community of FoF users found all the bugs almost immediately… within hours! And in the middle of the night! (at least, it was night here). And not only had they posted good bug reports to SF, but they even found the root causes, and produced patches! I wish the QA group at work was one tenth this good.

“Dear Steve, noticed you just checked in Boring.java. You’re using an unsynchronized HashMap at line eight billion and seven. Funny thing, that will work most of the time, but it turns out that in cluster mode it will cause silent data loss. Checked in the fix for you, k thx bye.”

So thanks, everyone, for bearing with us during the technical difficulties, and helping out to solve them!

A tractor-trailer overturned on a curve on a highway, spilling its load of hundreds of bee hives and unleashing some nine million angry honey bees.

The bees buzzed furiously as driver Lane Miller, his arm scraped to the bone, struggled to flee his rig after it overturned Monday in Bear Trap Canyon west of Bozeman. The truck slid across the highway before coming to a stop between guardrails.

“I had to kick the windshield out of the front of the cab and the bees were on me from that moment,” said Miller, 41. “I’ve never felt so much fear in my life.”

O woe is us! After all these months our dishwasher finally came, but the guy couldn’t install it! The counter top has a lip that extends down just a tiny bit too far, making it impossible to remove the old one, or put the new machine in. It’s like the kitchen was built around the dishwasher. We’ll have to A) carefully unfasten and lift up the entire old counter top, or 2) just cut it to bits. I still don’t know which one we’re going to do, or if there’s any hope we can do it ourselves.

CSS / XHTML. The only remaining table is the feed list. As I said before, don’t worry, the much praised FoF look and feel has been scrupulously replicated in CSS. Or, you could use…

… The new frames-based, one page viewer.

Better charset handling – I’ve hacked MagpieRSS so that it does its very best to always return UTF-8, and that is used as the internal format, as well as the charset for all output pages. Kellan is looking at these changes now to see if he wants to add them, or something like them, to the mainline Magpie.

By default, read items will be purged completely (during the update) from the DB after 30 days. You can adjust this timeout, or shut this off entirely to go back to the old “never delete anything” behavior. Warning: If you’ve been using FoF to build up some kind of huge awesome database of feed items that you are very proud of, be very careful with this new version! You could very easily find them all deleted.

The cache directory is located in a smarter way. People trying to include FoF or call it from the command line should have better luck.

Continuing code cleanups: more and more “logic” pulled into init.php, leaving the viewer pages purely “presentation”. More careful about namespace pollution, everything is prefixed with fof_ or FOF_. FoF should be much more “includable” now.

The meagre beginnings of a PHP API: look at fof_get_feeds and fof_get_items.

I think this time I didn’t completely screw up checking my changes into CVS. I did last time, nobody noticed.

To upgrade: install to a new, clean directory, and copy your settings over from your old config.php. There’s a lot more stuff in there now, so don’t just copy the whole file. And again, if you don’t want all your old, read items to be deleted, be very careful. Make a DB backup or something.

Toto and Effie leave soon for theirs. It’s not yet clear if they will be really cool and post up to the minute reports and pictures from Las Vegas and San Francisco at PhoFeta, or if they’ll be totally lame and leave all the computer stuff at home and just enjoy themselves.

We saw a very cute skunk family trying to cross a road (Route 66 in Portland) today. They had made it across two lanes of a four lane road when we got there: one big one leading the way, and four tiny ones trying to keep up. As I came up on them, the last one got scared by my headlights and started to creep backwards! No! Go the other way! I went around them slowly. I wonder if they made it all the way to the other side.

Sorry to keep you all waiting, but 0.1.4 is almost done. It’s taking a bit longer because for some reason it has twice as many features as I expected. They just keep creeping in there somehow: CSS, XHTML, much better charset handling, refactoring, includability, easier to use PHP API, new frames based display (shown above, but don’t worry: the award winning original FoF look and feel has been painstakingly replicated in pure CSS and is still included), database size management, …

(by the way, our stupid American system has some good points: a foot (12 inches) can be divided into two, three, four, and six equal parts. try that with your so-called “meters”. (sure, you can do it, if you like repeating decimals.) an even cooler one (even more off the topic now) is degrees. 360 isn’t arbitrary: you can divide it evenly by 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 18, 20… the list goes on. pretty good for day to day usage. i wonder who thought of that? i can’t find a satisfying answer.)