I was always a fan of CACert but unfortunately, Gentoo recently decided to no longer trust CACert by default, which caused our overlay to become unavailable for a few weeks without us noticing it (thanks for not notifying users about that unexpected move…). Gentoo thus forced us to switch to Let’s Encrypt immediately (or pay for a commercial certificate) if we wanted to keep our repository easily accessible. While we could have issued an LE certificate manually, that method is completely impractical as certificates issued by LE expire after just 90 days, while CACert’s certificates lasted 6 months or even 2 years if you passed the assurance tests.

Choosing an ACME client, aka “bot”

The implied requirement for automated renewal means you have to install a client for the so-called ACME protocol and allow it to generate a key & CSR, submit it to LE, somehow provide a domain-based verification, finally reload your server application and repeat that whole process periodically on its own. As we want to install certificates into an Apache web server but there’s no module to handle LE certificate issuance directly from within Apache yet, we have to use some tool to perform the necessary tasks for us. The first client published by LE themselves (“Certbot” or app-crypt/acme on Gentoo) had a terrible reputation as it was heavily modifying the system it was being installed on. Soon after the initial start last year, multiple alternate implementations became available, more or less system-specific and more or less disruptive in the way they hook into the system.

I wanted to avoid that. The client I would choose for our server should not manipulate any configs on its own and it shouldn’t install extra dependencies outside the regular package management provided by the system (portage on Gentoo). It should just handle certificate creation/registration and renewal, nothing else. And it should have comprehensive documentation.

Judging from just the documentation, my personal favourites so far have been acme.sh and acmetool. While acme.sh would have had the big advantage that it requires basically no (unusual) dependencies as it’s “just a shell script”, that was also the reason why I decided against deploying it on my server. While there’s no rational reason against that script, I, as well as the friend I’m sharing the server with, had a gut instinct that didn’t allow us to trust an externally developed shell script to perform some periodic task in processing external data (i.e. interaction with LE’s ACME server). I therefore decided to try acmetool instead, which has been developed in Go.

This article thus details the installation of acmetool on Gentoo but most steps should apply to other clients as well.

Wow, it’s been a very long time since I last published any post. I prepared quite a few in the meantime but never polished them up for publication. However, since this issue appears to be recurring right now (March 2016), I’ve decided to finally put it online. Please do not mistake any time-related information in this post for up-to-date, as I last updated this post 2 years ago and I’m going to publish it largely unrevised now.

I originally wrote this post back in March 2014 with full names in it. I hoped that the service I encountered problems with would get fixed but it wasn’t as of May 2014. I still won’t name it directly to avoid any legal trouble but it should be enough to just check your config for these points if you encounter problems with any servers on the Internet. In the meantime, I’ve seen more than just this one service having the same issues.

There’s a web service that caught my interest lately (ehm… back in 2014 😉 ), so it came that I wanted to access it from my home computer having an IPv6 connection via SixXS/NetCologne. To my surprise, I was unable to establish a connection via IPv6, only IPv4. I didn’t find the problem at that moment so I just forced access via IPv4 to work around that issue. A few weeks later, after having deployed IPv6 at work (also SixXS via NetCologne), I noticed the same effect there as well. Strangely, a friend had no problems accessing it via either IPv6 or IPv4 so I figured that there might be a routing issue or the company (or their CDN) may be blocking certain connections that do not match geo-lookups done via DNS. Since I had the (slightly unprofessional) impression that “the Internet was broken” around early March (at least if you were using Deutsche Telekom as ISP), I put that issue aside and revisited it only later.

The service was still inaccessible via IPv6 from my networks but the friend, having native IPv6 from Deutsche Telekom, could access it without problems. We compared DNS but Telekom DNS, Google DNS and NetCologne DNS always resolved to the same addresses, so there shouldn’t be any issue with geo-lookups. Finally, I found a thread where other people experienced the same issue and suspected the MTU size and missing ICMPv6 to be a problem. Oookay…?

So apparently, the operator deployed IPv6 to their servers and missed that ICMPv6 is mandatory for IPv6 to work properly. The issue appears to not have made it to their network operations department yet, so nobody fixed it on their end so far. And indeed: Setting the MTU to 1280 locally made the service instantly reachable. Let’s investigate what happened here as it’s mainly (but not solely) the operator’s fault:

Fixing the MTU on your local network

On your local end, you are using a higher MTU than 1280 (the minimum MTU every IPv6 link must support). That is a bit unfortunate if the first hops of your upstream provider already use lower MTUs than your local defaults (usually 1500 on Linux or 1400 on Windows). What happens at this point is Path MTU Discovery, since IPv6 routers no longer fragment packets on their own (as opposed to IPv4): If your clients are sending packets that do not fit through a router’s outbound interface for the route to be taken, the router discards your packet and replies with an ICMPv6 “Packet Too Big” message which includes the MTU for its outbound interface. Your client saves that Path MTU (“pmtu”) in its route cache (Linux: ip -6 route show table cache) and retransmits the failed packet, fragmented to match that individual MTU. This repeats until the route is fully traversable and your packets reach their actual destination. If your upstream provider is set to use an MTU of 1280 (changeable default for SixXS tunnels) and your clients try an MTU of e.g. 1500, “Packet Too Big” is being sent by your local router or – at the latest – by your provider’s gateway for almost every connection you try to establish (since 1280 bytes are easily exceeded). Let’s see what that discovery looks like to an unrelated website with tracepath6 after forcing the MTU to 1500:

We can see that PMTU starts with 1500 (the device’s default) but my local router (IP masked with xxxx) replied with “Packet Too Big”, indicating that the MTU for that path should be 1280, hence a PMTU of 1280 is being used to continue.

As I said, that’s a bit unfortunate as it means almost every connection attempt is delayed by “Packet Too Big” and the path cache for all external connections sooner or later starts filling up with PMTUs of 1280:

Apart from manually setting the interface MTUs on all your clients this can be fixed via Router Advertisements by announcing the MTU of your Internet uplink or, if unsure, simply by announcing the minimum MTU of 1280. It’s done by AdvLinkMTU if you are using radvd or by setting the router’s local interface’s MTU to the size to be used if you are using dnsmasq for instance. Upon receiving those RAs, your clients should reconfigure to that MTU immediately. When using MTU 1280, your clients should not need to rely on path MTU discovery any more (at least unless you hit routers that violate standards even further).

Back to the broken service: Why is it a problem that they block ICMPv6 if I can fix that issue locally?

This left me puzzled for a moment until I compared three packet dumps in Wireshark (service with MTU 1500 and broken PMTU discovery, service with MTU 1280, youtube.com with MTU 1500 and working PMTU discovery). You can see the packet dump of a stalled connection attempt with broken PMTU discovery below:

The initial TCP SYN packet the local client sends to create a new connection contains the Maximum Segment Size (MSS) which is equal to the MTU of the outbound interface to be used minus some bytes for packet headers. This declares the MTU to be used by the other end initially. If the client uses a small MTU such as 1280, MSS will be set accordingly which means the other end will fragment packets appropriately right from the beginning, no PMTU discovery required.
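To make the “minus some bytes for packet headers” part concrete, here is a minimal sketch of the arithmetic for TCP over IPv6. It assumes the fixed 40-byte IPv6 header and a 20-byte TCP header without options or extension headers; the function name ipv6_tcp_mss is made up for this illustration.

```shell
#!/bin/bash
# Sketch: MSS a host would advertise for TCP over IPv6 on a link with the given MTU.
# Assumption: 40 bytes fixed IPv6 header + 20 bytes TCP header, no extension
# headers and no TCP options counted.
ipv6_tcp_mss() {
    local mtu=$1
    echo $(( mtu - 40 - 20 ))
}

ipv6_tcp_mss 1500   # prints 1440
ipv6_tcp_mss 1280   # prints 1220
```

So a client on an MTU 1280 link announces an MSS of 1220, and the server never builds a segment that would need more than 1280 bytes on the wire.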

At some point, a packet sent by either side may be too big and a router replies to the sender with ICMPv6 “Packet Too Big” and the MTU to be used instead. The end that receives that message adjusts its Path MTU and retransmits the packet fragmented to match the new PMTU. Apparently, the server tries to send more than 1280 bytes in reply to my SSL “Client Hello”. That’s actually too big for my gateway or some other router in between, so the server is being sent a “Packet Too Big” message which is ignored and thus the connection stalls on both ends as the server can’t get past the router with lower MTU.

This leads to two issues:

If the other end is blocking ICMPv6 (for related connections), it cannot adjust its Path MTU although routers reply to it with “Packet Too Big” messages. Thinking further, this may lead to an accumulation of dead connections on the server side, which is most likely nothing you would like to have resource-wise. There should be two easy server-side fixes: Either allow ICMPv6 for related connections, so PMTU Discovery can work as it should, or always transmit with the minimum MTU of 1280 bytes instead of a higher local MTU and regardless of the TCP MSS. If connections commonly fail with any MTU higher than 1280 bytes, it may be good practice for heavy-load servers to use a fixed MTU of 1280 anyway. (Please note that this assumption may not be correct and was one of the reasons I did not publish this post back in 2014 – I just didn’t find time to verify my claims…)

Unfortunately, it appears that the other end isn’t notified about the change in PMTU on one side, so unless it hits a limitation itself, it does not adjust its PMTU as well (which makes some sense since asymmetric routing is not uncommon, so there may be different PMTUs for each direction). In theory, if one end were able to notify the other about a lowered MTU after TCP SYN, this would still require one end to discover the correct PMTU before the other. In this case it would not have helped as the client may request the web page with a smaller packet than 1280 bytes (in my case the largest packet sent prior to the connection stall had 516 bytes). One workaround that could be implemented on clients is to retry the connection with an MTU of 1280 if the connection stalls. Note that this may not work for all application protocols in all cases; in particular no remote action must have been triggered before a connection retry (which would have worked in this case as the SSL handshake for HTTPS failed, so no action should have been taken by the server yet).

I have yet to figure out why, but today one of my hard drives hit multiple CRC errors and went offline. It is part of a software (mdraid) RAID 1 and I had seen such errors before, so I did the usual procedure: Shutdown and power off for a few minutes, check that all drives come up on boot, boot to a rescue system, stop RAIDs, run a long self-test on the lost drive and then resync the RAID by running mdadm --add using the drive that remained online as source. Sounds okay? Too bad I had errors on the source drive…

When the first RAID partition’s resync neared completion around 95%, it suddenly stopped and marked the target drive as spare again. I wondered what happened and looked at the kernel log which told me that there have been many failed retries to read a particular sector which made the resync impossible to complete. Plus, S.M.A.R.T. now listed one “Current Pending Sector”.

What could I have done to avoid this?

First of all, I should have run check/repair more regularly. If a check is being run, uncorrectable read errors can be noticed and the failed sectors can be re-written from a good disk. However, check/repair can be a rather lengthy task which means it is not very suited to desktop computers that are not running 24/7.
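As a rough sketch of how such a check can be started by hand: the Linux md driver exposes a sync_action file per array in sysfs, and writing “check” to it starts a read-only scrub; mismatch_cnt reports the result afterwards. The function names and the array name md0 below are only placeholders, and the optional second argument (an alternative sysfs root) exists purely so the functions can be tried against a fake directory tree. Writing to the real file requires root.

```shell
#!/bin/bash
# Sketch: trigger an mdraid consistency check through sysfs.
# $1 = array name (e.g. md0, placeholder), $2 = sysfs root (defaults to /sys).
start_md_check() {
    local md=$1 sysfs=${2:-/sys}
    local action_file="$sysfs/block/$md/md/sync_action"
    # Bail out if the array does not exist or we lack permission.
    [ -w "$action_file" ] || { echo "cannot write $action_file" >&2; return 1; }
    echo check > "$action_file"
}

# Print the number of mismatches counted by the last check/repair run.
md_mismatch_count() {
    local md=$1 sysfs=${2:-/sys}
    cat "$sysfs/block/$md/md/mismatch_cnt"
}

# Usage (as root):
#   start_md_check md0
#   cat /proc/mdstat          # shows the progress of the running check
#   md_mismatch_count md0
```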

Another option could have been to use --re-add instead of --add, which might have synced only recently modified sectors, thus skipping the bad one I hit on the full resync caused by --add. However, since I had my system in use about one hour before I noticed the emails indicating RAID failure, I doubt this could have helped much. Plus, it would likely have been too late to run that after a resync has already tried and failed as the data on the lost disk was already partially overwritten.

What I did to work around the issue

WARNING!

The following steps can cause irreparable damage to your data. Only continue if you fully understand what you are doing and you either have a working backup or can afford losing the data. This post is omitting information you should know if you are going to follow these steps. Please make up your own mind about them before running any commands.

THE AUTHOR WILL NOT BE HELD LIABLE FOR ANY DAMAGE CAUSED BY FOLLOWING THE BELOW STEPS. FOLLOWING THIS BLOG POST IS AT YOUR OWN RISK.

NOTE: The failed sector will be called 12345 from now on. The broken sector resides on /dev/sdb and the drive that went offline and has a partial resync but good sector is /dev/sdc.

A quick search turned up some helpful sites ([1], [2]). First, I verified that I had the correct sector address by running hdparm --read-sector 12345 /dev/sdb, which returned an I/O error just as expected. I then checked the sectors immediately before and after the failed one. I was lucky to find a strangely uniform pattern that simply counted up – I’m not sure if that is some feature of either ext4 or mdraid or just random luck. I tried to ask debugfs what’s stored there (could have been free space, as in [2]) but I wasn’t sure if I had the correct ext4 block number, so I didn’t rely on that information.

Since this is a RAID 1, I thought, maybe I could just selectively copy the sector over from /dev/sdc. What I needed to do was to get the correct sector address for sdc and then run some dd command. Since the partition layouts differ on sdb and sdc, the sector numbers don’t match 1:1 and have to be calculated. I ran parted and set unit s to get sector addresses, then ran print to get the partition tables of both disks. All that has to be done is to subtract the partition’s start offset from the failed sector address and then add the other partition’s start offset. Let’s say the resulting address was 23456.
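The calculation can be sketched as a tiny shell function. The partition start offsets used here (2048 on sdb, 13159 on sdc) are invented so that the numbers line up with the example addresses in this post; take the real ones from parted’s output. The sketch also assumes both RAID members keep their data at the same offset inside their partitions.

```shell
#!/bin/bash
# Sketch: translate an absolute sector address on one disk to the matching
# absolute sector on the other RAID 1 member.
# $1 = failed sector on the bad disk
# $2 = start sector of the RAID partition on the bad disk
# $3 = start sector of the RAID partition on the good disk
translate_sector() {
    local failed=$1 src_part_start=$2 dst_part_start=$3
    echo $(( failed - src_part_start + dst_part_start ))
}

# Hypothetical offsets: partition starts at 2048 on sdb and at 13159 on sdc.
translate_sector 12345 2048 13159   # prints 23456
```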

Since I knew what the sector should look like, I could verify it directly with hdparm. Additionally, I checked a few sectors below and above that address and the data matched perfectly.

Next, I had to assemble a dd command to display and then copy the sector. Using dd if=/dev/sdc bs=512 count=1 skip=23456 | hexdump (bs should match the logical sector size) and comparing it to the hdparm output, I could verify that I read the correct sector. I also tried a few sectors above/below again. When I was ready, I finally copied the sector: dd if=/dev/sdGOOD of=/dev/sdBAD bs=512 count=1 skip=23456 seek=12345 oflag=direct (replace sdGOOD and sdBAD by the actual drives – just making this post copy&paste-proof 🙂 ) oflag=direct is required or you will likely get an I/O error.
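If you want to get a feeling for skip and seek before touching real drives, the same copy can be rehearsed on two image files. This is only a sketch: copy_sector is a made-up helper, it assumes GNU dd (for status=none), and it uses conv=notrunc because dd would otherwise truncate a regular output file after the written sector (which is not an issue on block devices). oflag=direct is left out here since not every file system supports it for regular files.

```shell
#!/bin/bash
# Sketch: copy exactly one 512-byte sector from one file/device to another.
# $1 = source, $2 = destination, $3 = sector to read, $4 = sector to write
copy_sector() {
    local src=$1 dst=$2 src_sector=$3 dst_sector=$4
    # skip= counts input blocks, seek= counts output blocks (both in units of bs).
    dd if="$src" of="$dst" bs=512 count=1 \
       skip="$src_sector" seek="$dst_sector" \
       conv=notrunc status=none
}

# Rehearsal on throw-away image files (names are placeholders):
#   head -c 4096 /dev/zero > bad.img
#   head -c 4096 /dev/urandom > good.img
#   copy_sector good.img bad.img 5 2
#   dd if=bad.img bs=512 count=1 skip=2 status=none | hexdump
```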

To be sure that everything went fine, I checked the result with hdparm again. After restarting the RAID, the resync ran fine this time.

I’m currently migrating large parts of a Windows NTFS partition to Ext4 on my desktop system and accidentally messed the modification times of all directories I moved with Dolphin (KDE) by aborting the operation and restarting it over the already created directories from the first run. I decided to hack a small script to reconstruct the modification times (getting close to the original ones). It works by recursively finding the latest, deepest nested file modification time for a directory (as those have been copied correctly) and using touch to set the same timestamp on the target directory. I would advise to call it with find on the directories it should operate on, for example if it has been saved as ~/fix-copied-dirtimes.sh:
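The following is not the original script but a minimal sketch of the same idea, assuming GNU find, sort and touch. The function name fix_dir_mtime is invented for this illustration, and file names containing newlines are not handled.

```shell
#!/bin/bash
# Sketch: set a directory's modification time to that of the newest regular
# file found anywhere below it.
fix_dir_mtime() {
    local dir=$1 newest
    # GNU find prints "<mtime as epoch seconds> <path>" per file; the numeric
    # sort puts the newest file last. Paths may contain spaces, hence -f2-.
    newest=$(find "$dir" -type f -printf '%T@ %p\n' | sort -n | tail -n 1 | cut -d' ' -f2-)
    # Directories without any files below them are left untouched.
    [ -n "$newest" ] || return 0
    # -m: change only the modification time, -r: take it from the reference file.
    touch -m -r "$newest" -- "$dir"
}

# Process every directory given on the command line.
for dir in "$@"; do
    fix_dir_mtime "$dir"
done
```

With a sketch like this, a call could look like find ~/moved-data -type d -exec ~/fix-copied-dirtimes.sh {} + (where ~/moved-data is just a placeholder for the directory you copied).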

It is by far not optimized (being called this way will perform highly redundant queries on the file system) but it works very fast nevertheless and you usually don’t run it frequently, so that’s okay…

A little word of warning: Verify the correct behaviour on your own! That means: Please manually confirm that all calls performed by this script and method turn up with the correct result. This comes without any warranty and although touch shouldn’t do any more than changing timestamps you never know… 😉 So… do your checks please and try it on some unimportant test directories first.

In the past months I experienced random unresponsiveness of several websites such as Facebook, Netvibes, Microsoft, wetteronline.de (a German weather site), just to name a few. All those sites have one thing in common: they use (completely or partly) Akamai as a CDN. Traceroutes show that I end up in Amsterdam, being blocked behind xe-2-0-0.ams21.ip4.tinet.net or xe-3-0-0.ams21.ip4.tinet.net. Letting my router “redial” to get another IP address does not help; after a few days the problem usually disappears again. In the meantime, I helped myself by using Opera Turbo as a proxy to access those sites.

Yesterday, the problems started again. Having done some traceroutes and confirmed that there were again unreachable Akamai hosts in Amsterdam, I decided to open a trouble ticket at my ISP. Just before I was about to send it, I tried from another computer on our home network, just to be sure. Strangely, all websites worked well. A traceroute ended in what seems to be a colocation of Akamai at our ISP’s data center in Hamburg. Then I remembered that I had my desktop set up to use OpenDNS. I ran dig +trace on the domains and confirmed that if Akamai is queried directly, it points me to Hamburg, not Amsterdam. After changing the nameservers back to my ISP’s, I could access all sites again.

The basic problem with public DNS services such as OpenDNS or Google (the famous, easy to remember, 8.8.8.8 and 8.8.4.4) not being able to provide the geo-location-based DNS resolution in the way CDNs require it for their load balancing and low latency is nothing new. If you search for opendns akamai you will find a lot of forum posts, blogs and articles about it. Under the title “In a CDN’d world, OpenDNS is the enemy!” Sajal Kayan made a nice comparison matrix of how latency is affected by using OpenDNS and Google to resolve the Akamai CDN. What was new to me was that in my case Akamai or some hop close to it seems to take active counter-measures to avoid the (really so large?) extra traffic that would be directed elsewhere if DNS resolution worked as intended. It even goes as far as to block one of Germany’s largest ISPs, whose customers I would not suspect to use external DNS resolvers in such large numbers that it seriously impacts CDNs. I don’t intend to point fingers at either CDNs or public DNS resolvers on this issue since both sides have their points for working the way they do and there won’t be any practical solution to this situation other than to avoid the problem from a user perspective by using local resolvers.

So, bottom line: If you have trouble reaching popular websites and use OpenDNS or Google DNS, try again with local nameservers and if the public DNS resolver was causing the problem, resort to a local resolver instead.

Being new to developing Java web applications using a lot of dependencies, I encountered a few issues when deploying an update onto our Geronimo app server. Since some were hard to find solutions for, I decided to write them down in this blog post. I didn’t have any of these issues while running the application for development by mvn jetty:run (even on the same machine as Geronimo).

Doubled dependencies
We use Quartz Scheduler in our application while Geronimo itself also uses Quartz (however, in an older version). That resulted in a module conflict indicated by an IncompatibleClassChangeError since both libraries cannot be loaded at the same time within the same classloader. The solution to this was rather simple: all that was necessary was enabling inverse classloading in the deployment plan (which may better be placed in a separate directory so you can easily switch between multiple deployment targets). If our application had been using Quartz 1.6 (the version Geronimo is using) I might have simply added a dependency to my deployment plan and could have shared the classes.

Missing XML implementation
Our application also writes PDFs using the Apache XSL-FO processor. For a reason I still don’t understand, there appeared to be only stubs available but no implementations although it worked happily with apparently the same configuration on the same machine when run from mvn jetty:run instead of Geronimo. The message I got was something like “org.apache.xerces.jaxp.SAXParserFactoryImpl not found”. After a lot of searching and failed attempts I figured out that I simply had to add xerces/xercesImpl as a dependency. Another option may have been to dig deeper into property handling and figure out how to properly solve the problem by specifying an existing implementation as suggested on an older StackOverflow question. (However, I’m unsure if that really was the problem as it appeared that the classes were missing but the class name was correct and it worked fine from command line so the classes – to my understanding – should have been available through the default class loader.)

LinkageError
The last problem I had to deal with took me much longer to figure out. Take a look at this part of a stack trace:

Caused by: java.lang.LinkageError: loader constraint violation: when resolving method "javax.imageio.metadata.IIOMetadata.getAsTree(Ljava/lang/String;)Lorg/w3c/dom/Node;" the class loader (instance of org/apache/geronimo/kernel/config/MultiParentClassLoader) of the current class, org/apache/xmlgraphics/image/loader/impl/imageio/ImageIOUtil, and the class loader (instance of <bootloader>) for resolved class, javax/imageio/metadata/IIOMetadata, have different Class objects for the type org/w3c/dom/Node used in the signature
at org.apache.xmlgraphics.image.loader.impl.imageio.ImageIOUtil.extractResolution(ImageIOUtil.java:54)
at org.apache.xmlgraphics.image.loader.impl.imageio.PreloaderImageIO.preloadImage(PreloaderImageIO.java:101)
at org.apache.xmlgraphics.image.loader.ImageManager.preloadImage(ImageManager.java:175)
at org.apache.xmlgraphics.image.loader.cache.ImageCache.needImageInfo(ImageCache.java:128)
at org.apache.xmlgraphics.image.loader.ImageManager.getImageInfo(ImageManager.java:122)
at org.apache.fop.fo.flow.ExternalGraphic.bind(ExternalGraphic.java:81)
at org.apache.fop.fo.FObj.processNode(FObj.java:123)
at org.apache.fop.fo.FOTreeBuilder$MainFOHandler.startElement(FOTreeBuilder.java:282)
at org.apache.fop.fo.FOTreeBuilder.startElement(FOTreeBuilder.java:171)
at org.apache.xalan.transformer.TransformerIdentityImpl.startElement(TransformerIdentityImpl.java:1020)
at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:432)

Let’s take a step back from that clutter of information, take a deep breath and examine the top-most sentence a bit closer:

we have some sort of class loader conflict

the method getAsTree in class javax.imageio.metadata.IIOMetadata refers to org.w3c.dom.Node

the class loader we are coming from is “MultiParentClassLoader”, provided by Geronimo

we come from org.apache.xmlgraphics.image.loader.impl.imageio.ImageIOUtil

the class loader being used by the method we try to call is called “bootloader”

the access to “bootloader” originates from javax.imageio.metadata.IIOMetadata

both class loaders link to incompatible and thus conflicting signatures of org.w3c.dom.Node, so we cannot continue

Let’s interpret these facts: org.apache.xmlgraphics.….ImageIOUtil calls a method from javax.imageio.metadata.IIOMetadata. We have at least two class loaders that have different understandings of what org.w3c.dom.Node should look like and both classes we access are using a different one of these signatures. Apparently the javax packages are provided by the JDK and are being used by xmlgraphics, so our dependency of xmlgraphics appears to be incompatible with the JDK we are running. We are confident that both the JDK and xmlgraphics are up-to-date, so what next?

I searched for over an hour and couldn’t find anything relevant except some voodoo stuff. At first I tried to hide classes by adding a filter in the deployment plan; then I tried to enforce one specific version by adding a direct dependency to org.w3c.dom. It was a bug report saying something like “strange, org.w3c.dom hasn’t been touched for years” plus another report saying “use xml-apis-1.3.04” that got me on the right track: Today’s Java versions seem to ship with at least some org.w3c.dom classes. However, Maven pulled in xml-apis as a dependency nevertheless, which includes its own versions of the org.w3c packages. That doesn’t seem to be a problem unless the application is deployed to the app server. Maybe that’s a side-effect of the inverse classloading we enabled to get Quartz running. The solution was to simply exclude that dependency in the pom file. If you are using NetBeans you can simply right-click xml-apis and select “Exclude Dependency” which will automatically add the corresponding exclusion to the pom file.

Taking a look at where my disk space went, I was quite surprised to find 11GiB in a directory at ~/.wine/drive_c/windows/profiles/username/Local Settings/Temporary Internet Files/Content.IE5/

It appears that this is an already reported issue with wine, as downloads made through calls to wininet.dll are supposed to be deleted by Windows sooner or later. As wine caches its downloads the same way but without ever removing old files, everything ever downloaded through wine’s wininet.dll currently gets cached indefinitely. To work around that problem, it’s currently necessary to clear that folder once in a while (or to automate that on reboots or similar). Simply deleting that directory should work just fine as it is supposed to be created automatically if needed.
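A clean-up along those lines could be sketched like this. The helper name clear_wine_inet_cache is made up, the prefix defaults to ~/.wine, and the sketch simply removes every directory named Content.IE5 below drive_c, so check what it would delete before automating it.

```shell
#!/bin/bash
# Sketch: delete wine's wininet download cache directories.
# $1 = wine prefix (defaults to ~/.wine)
clear_wine_inet_cache() {
    local prefix=${1:-$HOME/.wine}
    # -prune stops find from descending into a directory it is about to delete.
    find "$prefix/drive_c" -type d -name 'Content.IE5' -prune -exec rm -rf -- {} +
}

# To see how much space the cache currently takes before deleting anything:
#   find ~/.wine/drive_c -type d -name 'Content.IE5' -prune -exec du -sh {} +
```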

It’s probably best to also search for other temp directories inside ~/.wine from time to time. Apart from cached downloads I could free another 6GiB of unnecessarily wasted space at the usual locations. It’s easy to forget about “C:” when running wine… 🙂

When I decided to try a custom ROM for my HTC Hero about one and a half months ago (official ROMs are still stuck at Android 1.5…) I chose to set LauncherPro as my default launcher which came preinstalled with the ROM. A few minutes ago I started to get a popup notice telling me “This version of LauncherPro has expired.” Nice to know (why does it expire anyway?!), but unfortunately it effectively locked me out of my phone: I was unable to switch back anywhere I could have done something useful (like opening a browser or the app drawer or switching back to a different launcher). To unlock the phone again, I was forced to install the recent version from their homepage. Luckily, I had USB debugging already turned on (else I would have been lucky if I could have got into the settings to activate it) and was at home where I have the Android SDK installed. After downloading the latest version, all I had to do was plug the phone into a USB port and run: (-r is important or installation fails with “INSTALL_FAILED_ALREADY_EXISTS”)

adb install -r LauncherPro-0.7.1.0.apk

After a few seconds the install completed and after disconnecting the phone from USB I got the app chooser where I could re-select LauncherPro. A few moments later it finished reloading all widgets and the phone worked as always.

While I really like LauncherPro and would consider buying it, this incident scared me: I may have been able to get to the Market using the “back” button to download an update (had it open sometime yesterday) but imagine not being able to get there from the phone itself while being nowhere near a computer or with USB debugging turned off – I usually disable it and only had it turned on because I forgot that last week. There’s at least one guy at their forums who has USB debugging disabled. The developer apologizes in that thread but having had that unexpected sudden (remote?) deactivation, I wonder what else LauncherPro may do.

I bought a HTC Hero last year in September and was quite happy with it. HTC’s custom Sense UI looked far better than the default Android UI. It also shipped with better applications, less Google bundling and pioneered some basic multitouch support (only in the browser and the photo app) and integration with Flickr, Facebook and Twitter.

Sense’s initial release was very sluggish (see older reviews on YouTube) but that had just been fixed when I ordered my phone. After that, there was one more update in November without information on what had been fixed with it. Since it required yet another wipe and was still Android 1.5 (Codename Cupcake) and everything worked fine for me so far, I decided not to install it.

It’s the end of May now, 6 months since that last update, 8 months since the UI fix and when I got my phone. Android 1.6 (Donut) was released the day after I bought my phone, providing new features such as VPN connections, text-to-speech support, a new market application and multiple screen resolutions. I would call the last feature the most important one since newer devices required it and soon applications started requiring that API, resulting in older Android versions not being able to run them (they are simply hidden from the market). In preparation for the Droid release in November, Android 2.0 (Eclair) was released just one month later. Along with support for newer hardware (camera flash, OpenGL ES 2.0) it also added more Bluetooth profiles (up to that point Android phones could only interact with headsets) and a better Bluetooth API as well as Google Voice Search, among others. If I remember correctly, this was mainly used by the Droid and no other phone at that time. It was Android 2.1 (also Eclair) that shipped these new features to phones from other manufacturers from January on, but not without adding some more features such as live wallpapers. HTC said they would skip 1.6 and go directly to 2.0/2.1 for the Hero. Android 2.1’s release date was timed with the release of the Nexus One which was manufactured by HTC. Last week, Android 2.2 (Froyo = Frozen yogurt) was released, bringing (among others) WiFi and USB tethering without the need to root your device.

Sprint released a 2.1 update for their customized Hero last week, one week after another customized revision called “Droid Eris” got the update. Hero owners using the plain GSM phones (in contrast to CDMA by US operators) are still waiting in vain for a release. There were multiple release dates, both rumored and official, each cancelled a few days before or simply missed. According to phandroid.com HTC now has yet another release date for us: A first preliminary update should roll out someday in June and the final 2.1 update should follow “a couple of weeks later” which could easily mean July or August considering their previously announced dates. HTC also said that 2.2 would come to all phones released in 2010 – that excludes the Hero for now…

Admittedly, Google released Android 1.6, 2.0 and 2.1 much too fast for any manufacturer to keep up. Also, HTC’s custom UI, called Sense, has to be ported to each Android revision they release an update for. But HTC has already released a phone running Sense on 1.6. So assuming they stopped work on Sense for 1.5 in November at the latest, they have now had 6 months to work on ports to newer Android versions and prepare an update for the Hero. And they did: not only did the Droid Eris and the Sprint Hero get an update in the past weeks, HTC also released a lot of other phones in the meantime and is developing even more. All have similar hardware, and starting with the Hero all of their Android phones have the Sense UI.

So why hasn’t HTC released a 2.1 update for the Hero yet? Considering that even the Sprint Hero may have received its update so “soon” only because the network operator Sprint built its own release, I’ve got the notion that HTC is building throw-away phones: if one gets outdated, just dispose of it and buy a new one; maintenance only runs for a few months and then suddenly stops. This may have worked for conventional phones in the past, but since smartphones run a common operating system that is often maintained by another company, it is completely the wrong way to keep the Android platform healthy and customers as well as developers satisfied. By healthy I mean that a device should usually support all public API levels until the hardware becomes insufficient, which shouldn’t be much of a problem with smartphones, which are nothing but PDAs with a cellular radio chip. However, since Android is open source and modifications to the Linux kernel have to be published under the GPL, Android phones should be open to custom upgrades, and so is the HTC Hero.

Unfortunately I’m legally unable to link to or name any of the custom ROMs I’ve found, but if you do a standard Google search you will most likely find them quite easily. They either run the plain Android system or incorporate (pirated) copies of the Sense UI from leaked or previously released images. Since I’m not the only one who is sick of HTC’s poor excuses, there seems to be a variety of ROMs specifically targeting the Hero. While reading up on what’s necessary to get an update, I found out that HTC is actually making it difficult to flash an unofficial ROM onto the phone. It involves unlocking a special diagnostic mode and dealing with signed files; things I would not have expected at all from a badly maintained phone based on open source! Although I didn’t try it myself, it seems that even versions containing ripped Sense binaries run surprisingly well (with some minor bugs and inconveniences) on the Hero, so the question remains: why does HTC delay updates if a community of a few unrelated developers is able to build almost completely working releases?

To make things worse, it seems that at least some of the released 2.1 images introduced a jail lock that apparently has not been completely broken yet. So you are strongly advised to check the news on this topic before installing any official HTC updates from now on, or you may never be able to go beyond Android 2.1.

Why does this happen? What are they thinking? I’m starting to regret having bought my Hero with the mistaken expectation of getting a modifiable phone that wouldn’t become outdated for a few years thanks to an evolving open source operating system. All I can do now is warn other people against buying HTC’s phones without first investigating possible issues. Other Android phones may also lag several months behind Google’s SDK releases, and that may be excusable up to a point (especially since Google has sprinted ahead with its releases since 1.6). But by the day the Hero finally receives its 2.1 update, other phones will already be getting their updates to 2.2, and I doubt HTC will continue to update the Hero any further. Since a jail lock may be introduced to the normal Hero by the preliminary update in June, I would strongly advise either getting a custom ROM instead or living with 1.5 and waiting for more information (is HTC serious about 2.1, and will the Hero get 2.2?). Once you have installed that update you may not be able to go any further without buying a new phone, although the hardware would be capable of it. All I know for sure is that I won’t buy another HTC phone in the near future.

As of May 17, Android’s installed base is split almost equally across 1.5, 1.6 and 2.1. With the advent of 2.2 and the final updates to 2.1, I would expect that we are one or two months away from finally calling 1.5 (the third API level) “legacy”. This means that the number of devices running 1.5 will drop significantly, and therefore fewer and fewer applications will run on 1.5, because more current applications may prefer to use newer APIs (with Android 2.2 we have reached API level 8). The 34.1% share of Android 1.5 may in fact contain a large percentage of Heros, since almost every other phone has had updates and the Hero sold pretty well. Developers have to choose, now or soon, whether they want to maintain a legacy “Hero revision” of their software or not. Thus by holding back OS updates, HTC is not only upsetting its customers but annoying developers as well, and it’s highly unlikely that they will support a 1.5 user share of 10% or less. Running out of updates in less than one year isn’t what Android was meant to be, nor what I would expect from a 400€ device.
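To make the API level argument a bit more concrete, here is a minimal sketch in Python of the kind of filtering I mean. It is a simplification and not how the Market is actually implemented: I simply assume that an app declares a minimum Android version and is hidden from any device below it. The function names are made up for illustration; the API level numbers are the official ones for the releases mentioned in this post.

```python
# API levels of the Android releases mentioned in this post.
API_LEVELS = {"1.5": 3, "1.6": 4, "2.0": 5, "2.1": 7, "2.2": 8}


def visible_in_market(device_version: str, app_min_version: str) -> bool:
    """Return True if a device on device_version would see an app that
    requires at least app_min_version (simplified assumption)."""
    return API_LEVELS[device_version] >= API_LEVELS[app_min_version]


def versions_that_can_run(app_min_version: str) -> list:
    """List all Android versions from the table that would see the app."""
    return [v for v in API_LEVELS if visible_in_market(v, app_min_version)]


if __name__ == "__main__":
    # A Hero stuck on 1.5 would not even see an app that requires 1.6.
    print(visible_in_market("1.5", "1.6"))
    # An app built against 2.1 features leaves out everything older.
    print(versions_that_can_run("2.1"))
```

In other words, every app that raises its minimum version by just one step, from 1.5 to 1.6, disappears for every Hero that is still waiting for its update.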