Lesson Learned

As a follow-up to my last post about Mojibake character encoding corruption, I want to distinguish "intermediate encoding corruption."

In a post on the JoelOnSoftware Discussion forum, someone asked why about 50% of the characters in his UTF-8 strings inserted into a MySQL database were getting corrupted (and 50% weren't). This is typical of intermediate encoding corruption, where some characters are corrupted while others survive.
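One way this kind of partial corruption can arise, as a minimal sketch: the text passes through an intermediate encoding that cannot represent every character. The snippet assumes Latin-1 as the intermediate encoding and uses a made-up sample string. Neither comes from the forum post, whose actual MySQL configuration isn't known here.

```python
# Hypothetical sample: "café" fits in Latin-1, the two kanji do not.
original = "café 文字"

# Simulate a lossy intermediate step. Anything Latin-1 cannot
# represent is replaced with "?" and cannot be recovered afterwards.
intermediate = original.encode("latin-1", errors="replace")
round_tripped = intermediate.decode("latin-1")

print(round_tripped)  # café ??
```

The characters that exist in the intermediate encoding survive the round trip, while the rest are destroyed. That matches the "some corrupted, some not" symptom.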

Picking up where last week’s introduction left off, I’m going to explain how I added Excel export support to one of my products.

As a little background, in case you missed the introduction, I had a couple of feature requests from customers which involved exporting to Excel. The most important was a simple data table export. The files were probably going to be imported into other systems, so no formatting was necessary and compatibility was the priority. The second was less important, but its output was going to be viewed rather than processed, so it relied on formatting.

Say "moajee bockay!" (MOH-JEE-BAH-KAY) when you've got a string of characters displaying incorrectly, all scrambled and corrupted. It is a great exclamation like Eureka! and Geronimo! (though unlikely ever to gain broad usage outside the programmer community, of course!). The Wikipedia entry for Mojibake 文字化け gives a good definition:

Mojibake is a Japanese loanword which refers to the incorrect, unreadable characters shown when a piece of computer software fails to render a text correctly according to its character encoding.
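To make the definition concrete, here is a small sketch of one common way mojibake is produced: bytes written as UTF-8 are read back as if they were Latin-1. The helper names garble and ungarble are made up for this illustration.

```python
def garble(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def ungarble(text: str) -> str:
    """Undo garble() by reversing the two steps."""
    return text.encode("latin-1").decode("utf-8")


garbled = garble("café")
print(garbled)            # cafÃ©
print(ungarble(garbled))  # café
```

In this particular case no information is lost, because Latin-1 maps every byte to a character, so the damage can be reversed. That is not true of every wrong decoding.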

When my company receives a new project, we are often asked to establish a baseline set of data, which is used as a comparison to make sure nothing is amiss with our setup.

The customer can use this information to conclude either that we are matching their own data well (meaning our setup must be fairly similar to theirs), or that we are totally off and need to find the problem.

I can tell you that the latter is no fun at all. But it's even LESS fun to get three months into a project only to find that one very small installation error has completely ruined all of the generated data.

I’ve spent the last five weeks on Codesnipers writing about some of the mistakes I made when starting my company, and during the development and marketing of my products. I doubt they’ll be the last, and I’m sure I’ve already made a few more that I just haven’t identified yet, but I’m done for now.

So, to round off this batch (and in case you missed any), I decided to do a quick rundown of the mistakes series so far.

With the explosion of international text resources brought by the Internet, the standards for determining file encodings have become more important. This is my attempt at making the text file encoding issues digestible by leaving out some of the unimportant anecdotal stuff. I'm also calling attention to blunders in the MSDN docs.

For Unicode files, the BOM ("Byte Order Mark", also called the signature or preamble) is a short sequence of bytes at the beginning of the file (2 to 4 bytes, depending on the encoding) used to indicate the type of Unicode encoding. The key to the BOM is that it is generally not included with the content of the file when the file's text is loaded into memory, but it may be used to affect how the file is loaded into memory. Here are the most important BOMs and the encodings they indicate:
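A minimal sketch of BOM sniffing, using the constants that Python's standard codecs module defines for the common Unicode BOMs. The function names sniff_bom and load_text and the UTF-8 fallback are assumptions made for illustration, not something taken from the original post.

```python
import codecs

# Checked in order. UTF-32 LE must come before UTF-16 LE because its
# BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no known BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def load_text(data: bytes, fallback: str = "utf-8") -> str:
    """Decode data, using the BOM to pick the encoding and then dropping it."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)


print(sniff_bom(b"\xef\xbb\xbfhi"))       # ('utf-8', 3)
print(load_text(b"\xfe\xff\x00h\x00i"))   # hi
```

Note how the BOM decides the decoding but is sliced off before the text reaches memory. One known ambiguity in this ordering: a UTF-16 LE file whose first character is U+0000 would be taken for UTF-32 LE.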

This is the fifth and final instalment of my series on Micro ISV mistakes. If you missed any of my earlier mistakes, they are all still available here on CodeSnipers: #1, #2, #3, and #4.

Starting a business on your own or with a small number of people will be a delicate balancing act. It’s very difficult to stay focused on the things that matter, avoid the plentiful unnecessary distractions, and maintain your momentum. It gets even harder when you realise that it’s not obvious what matters and what can wait.

I had a professor once (actually the finest I’ve ever known) who insisted to his students that “programming is all about the little details”. In my view, what he meant by this is that regardless of the project, methodology, people involved, or what-have-you, if the software doesn’t do what it’s supposed to do, it’s most likely the result of someone, somewhere, failing to account for a few little details. Not that those few details are necessarily easy or hard to identify (or rectify)—the point is that tiny discrepancies in code can have tremendous impact.

Case in point: our ship date is nigh, looming upon us like a brick wall in the middle of an interstate. We’re about ready to let this thing out the door, when we discover oops, that module X that we built two months ago, the one that we’ve had the requirements signed off on three separate times, isn’t quite right. Essentially,

Behavior ‘foo’ should result when incoming_date is > existing_date

should read:

Behavior ‘foo’ should result when incoming_date is >= existing_date
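As a small illustration of how much that one character matters, here is a sketch of the two versions of the rule. The function names and the sample date are hypothetical, and only incoming_date and existing_date come from the requirement itself.

```python
from datetime import date


def foo_as_built(incoming_date: date, existing_date: date) -> bool:
    """The rule as originally written: strictly later."""
    return incoming_date > existing_date


def foo_as_intended(incoming_date: date, existing_date: date) -> bool:
    """The corrected rule: later or the same day."""
    return incoming_date >= existing_date


same_day = date(2005, 6, 1)  # arbitrary example date
print(foo_as_built(same_day, same_day))     # False
print(foo_as_intended(same_day, same_day))  # True
```

The two rules agree on every input except the boundary case where the dates are equal, which is exactly the kind of case that slips through sign-off.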

There are some additional details which only add to the headache, but I’ll spare you those.

This is the fourth in a series of posts on common Micro ISV mistakes. I’ve been using the series as an opportunity to identify where I went wrong and figure out how to get back on track. I’m also hoping someone out there can learn from my mistakes and start out leaner, faster, and stronger.

Have you ever said “We”, when you meant “I”? Have you ever worried what would happen if a customer found out how few staff you actually have?