Hi, Google. You kind of failed to help me out earlier when I was asking
about "how to set a global mail profile for database mail in Microsoft SQL
2005." Here's what I wish you had said:

First of all, "Database mail" ("DBMail" or "Sysmail") is not the same as
"SQL mail" ("SQLMail"). They're both stupid and overly complex, but DBMail
is newer and slightly less stupid.

SQLMail uses an installed MAPI provider on your system to send mail, which
means you need such a thing, possibly Outlook. DBMail apparently ignores
your MAPI provider entirely. So if you find an article that says you need
to install Outlook first, just ignore it; it's not true.

People keep posting articles like "Why China Needs US Debt." I think most of
us know enough to disregard random opinion pieces written by lobbyists, and
most of us who read such an article will get a "wait, that can't be right"
feeling. But what exactly is wrong? China obviously does need us, right?
Or why would they trade with us?

I first started to understand the problem while I was watching the Canadian
federal election debates and someone (it might have been the Green Party
leader) said something like, "We have to cut back on our oil exports! It's
killing Canada's manufacturing industry!"

...and I did a double take. Wait. What?

I had to look into it a bit before I understood. What was happening was
that the increased oil prices were causing a flood of activity into Canada's
Oil Sands projects, and thus a massive increase in oil exports. Increased
exports were raising the value of the Canadian dollar (which, importantly,
is not pegged to any other currency). A higher Canadian dollar makes
it harder for people from other countries to buy Canadian stuff: not just
oil, but anything. And unlike with oil, our other industries didn't have a
massive natural competitive advantage (read: Canada is really big). Which
means that if our oil exports expand massively, it kills our manufacturing
sector.

The success of one industry, unrelated except by trading in the same
currency,(1) can harm another industry. And that
realization, to me, was a new and important one.

Now, back to China. Their currency, by virtue of being pegged to the US
dollar, is essentially the same as the US currency. What does that
mean? Success in exports in one sector (Chinese manufacturing) can damage the
market in another sector (US manufacturing) even if they manufacture totally
different things, simply because the successful sector artificially raises
the prices of the unsuccessful one.

Now, pegging your currency can be kind of expensive. China does it by
stockpiling truckloads of US dollars. Well, more precisely, they buy US
debt, which is essentially the same thing. What this really means is that
China takes much of the profit from its exports and mails it back to the
US (as "debt"), so that the US can afford to buy more Chinese stuff.

In the article I linked to above, the claim is that
China needs US debt to keep increasing, because there's simply
nothing else in the world big enough to spend all those US dollars on.
And that's true, in a sense, if you believe that money has
intrinsic value. Of course, China is smart enough to know that it
doesn't.

...which is where it gets even stranger.

Even though China knows money is worthless, they keep shipping their
perfectly valuable manufactured goods to us in exchange for worthless pieces
of paper.(2) How dumb is that?

Not dumb. Brilliant.

Our whole theory of economics is based on two axioms, one of which is that
human
wants are unlimited. But we're starting to figure out that's not really
true. As a society, we're slowly realizing that more consumption doesn't
lead to more happiness. So what does?

For a lot of people, maybe the secret to daily happiness is just this: a
stable job and the feeling that you're doing it well and helping society.

By exporting stuff to us by the crapload - and, "oh darn, poor
us, we're such victims," denominating their wealth in US dollars - they
ensure that they have jobs and happiness. We're the helpless,
unproductive, soulless consumers.

Call it victory by superior philosophy.(3)

Footnotes

(1) Of course, our manufacturing industry also uses a lot of
energy, and high energy prices are bad for them too. But that's true for
everyone's manufacturing industry, so it's not automatically a
competitive disadvantage.

(2) Thought experiment: imagine China as a black box with inputs
and outputs. From the point of view of China, sending us useful goods
(which we'll use up and then dump in our landfills) is a lot like just
taking those goods and dumping them in the ocean. As far as China is
concerned, nothing would be very different if all the ships just sank
before they arrived here.

(3) It's a strange war, though: you don't have to worry about
them invading us if they win. What would they steal? Our consumers?

My friend from high school and one-time employer, David Slik, made a
presentation about the company he founded and still works for. Bycast makes high-end clustered "cloud" storage
systems that are apparently so reliable that some of their enterprise
customers have stopped making backups altogether... after thoroughly
testing Bycast's fault recovery mechanisms, of course.

Call me crazy, but I've never really seen the point of so-called "parser
generators" and "lexical analyzer generators" in real life. Almost any file
has syntax that's so simple, it's easy to just parse it yourself. And
languages that are more complicated or have tight parser performance
requirements - like C++ compilers or Ruby interpreters - tend to have
hand-rolled parsers because the automatic parser generators can't do
it.

So who benefits from automatic parser generators? I don't know. I feel
like I'm missing something.

This feeling came up again the other day when I found I had to parse and
produce some XML files at work - in Delphi. Having seen lots of advice in
lots of places that "the first thing a new programmer does with XML is to
try, and fail, to write his own XML parser," I was hesitant. Okay, I
thought. Why not look into one of those well-known, fancy pants XML parsers
that will surely solve my problem in two seconds flat?

Well, I looked into them. Much more than two seconds later, I emerged,
horrified. How can you guys possibly make parsing a text file so
complicated? Why, after adding your tool, does my project now seem
more difficult than it did when I started?

I still don't know. Look guys, I really tried. But I just don't understand
why I'd want to use a DTD. Or twelve layers of abstraction. Or the
"structured" way you completely reject (with confusing error messages)
almost-but-not-quite valid XML files, in clear violation of Jon Postel's robustness
principle.

So I broke down and wrote an XML parser myself. In Delphi. In about 500
lines. In an afternoon. I'm sure I left out major portions of the XML
spec, but you know what? It parses the file the customer sent me, and the
Big Fancy Professional XML Library didn't, because it said the file was
invalid.

I guess that makes me a clueless newbie.

But back to tokenizers

As an odd coincidence, someone I know was doing some (much less redundant)
work on parsing a different file format at around the same time. As anyone
who has written parsers should know, most parsers are divided into two main
parts: lexical
analysis (which I'll call "tokenizing") and parsing.

I agree with this distinction. Unfortunately, that seems to be where my
formal education ends, because I just can't figure out why lexical analysis
is supposed to be so difficult. Almost all the lexical analyzers I've seen
have been state machines driven by a single main loop, with a whole
bunch of if statements and/or a switch/case statement and/or function
pointers and/or giant object inheritance hierarchies.

Sure enough, the person I was talking to was writing just such a tokenizer
in Python - with lambdas and all the rest.

The problem is I just don't understand why all that stuff should be
necessary. Traditional lexical analysis seems to be based on the theory
that you need to have a single outer main loop, or you'll be inefficient /
redundant / impure. But what I think is that loop constructs are generally
only a single line of code; it doesn't cost you anything to put loops in
twelve different places. So that's what I did.
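To make the contrast concrete, here's what the "loops in twelve different places" style might look like for a simplified XML-ish input, sketched in Python (purely an illustration; the tokenizer under discussion is in Pascal). Each token type gets its own tiny scanning loop instead of one master state machine:

```python
# A sketch of the "one small loop per token type" style. Each branch has
# its own one-line scanning loop; no switch table or state variable needed.

def tokenize(s):
    """Yield (kind, text) pairs for a simplified XML-ish input."""
    i, n = 0, len(s)
    while i < n:
        if s[i] == '<':
            start = i
            while i < n and s[i] != '>':    # tiny loop #1: scan a tag
                i += 1
            i += 1                          # consume the '>'
            yield ('tag', s[start:i])
        elif s[i].isspace():
            start = i
            while i < n and s[i].isspace(): # tiny loop #2: whitespace
                i += 1
            yield ('space', s[start:i])
        else:
            start = i
            while i < n and s[i] != '<':    # tiny loop #3: plain text
                i += 1
            yield ('text', s[start:i])
```

Three loops, three lines of looping. Nothing is lost by not unifying them.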

I suppose that makes me a newbie. But it works. And my code is more
readable than his. In fact, when I showed him a copy, he was amazed at how
simple it is. He actually called it brilliant. Seriously.

To be honest, I still feel like I must be missing something. And yet here
we are.

...so without further ado, my XML tokenizer, in 62 lines of Pascal. For
your convenience, I have highlighted the blasphemous parts.

Quick! Take a look at the following snippet of HTML, and tell me what's
wrong with it.

<div align=right>Hello, world!</div>

If you said, "It's completely invalid and unparseable because you forgot the
quotes around 'right'!" then you're... hold on, wait a second.(1)

Unparseable? Every web browser in history can parse that tag.
Non-conforming XML, yes, but unparseable? Hardly. There are millions of
web pages where people forgot (or didn't bother) to quote the values of
their attributes. And because those pages exist, everyone who parses HTML
has to support that feature. So they do.
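To make "so they do" concrete, here's a sketch (Python, my illustration, not any browser's actual code) of an attribute scanner that accepts quoted or unquoted values, and always puts the quotes back on output:

```python
import re

# Permissive attribute scanner: accepts align=right, align="right",
# or align='right', and always re-emits the quoted form.

_ATTR = re.compile(r"""(\w+)\s*=\s*(?:"([^"]*)"|'([^']*)'|(\S+))""")

def parse_attrs(tag_body):
    attrs = {}
    for m in _ATTR.finditer(tag_body):
        # whichever alternative matched carries the value
        value = next(g for g in m.group(2, 3, 4) if g is not None)
        attrs[m.group(1)] = value
    return attrs

def emit_attrs(attrs):
    # Liberal in what we accept, conservative in what we send:
    # the quotes always go back in on output.
    return ' '.join('%s="%s"' % (k, v) for k, v in attrs.items())
```

Parsing `align=right` and printing it back out yields `align="right"`: the input error quietly disappears.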

That's the difference between HTML and XML. With HTML, programmers answer
to end users. And the end users are very clear: if your
browser can't parse the HTML that every other browser can parse, then I'm
switching to another browser.

XML is different. XML doesn't have any "end users." The
only people who use XML parsers are other programmers. And programmers,
apparently, aren't like normal people.

Real, commercial XML parsers, if fed a tag like the one above, would give me
an error message. They would tell me to go back and fix my input.
Apparently, I'm a bad person for even suggesting that we should
try parsing that file.

Now, as it happens, the XML parser I
wrote in 500 lines of Pascal a few days ago would not reject this
input. It would just pretend the quotes were there. In fact, if my program
parses the file and then you ask it to print the XML back out, it'll
helpfully add the missing quotes in for you.

Let's phrase this another way. The painstakingly written,
professional-grade, "high quality" XML parser, when presented this input
that I received from some random web site, will stab me in the back and make
me go do unspecified things to try to correct the problem by hand. Avery's
cheeseball broken XML parser, which certainly doesn't claim to be good or
complete, would parse the input just fine.(2)

This, an innocent bystander might think, would imply that my parser is the
better one to use. But it's not, you see, because, as cdfrey points
out:

Interoperability is hard. Anyone can write their own parsers. And
everyone has. That's why the monstrosity called XML was invented in the
first place.

It all starts with someone writing a quick and dirty parser, thereby
creating their own unique file format whether they realize it or
not.(3) And
since they probably don't realize it, they don't document it. So the next
person comes along, and either has to reverse engineer the parser code, or
worse, guess at the format from existing examples.

Got it? By creating a permissive parser that just corrects simple input
errors, I've made things worse for everybody else. I would make the
world a better place if my parser would just reject bad XML, and then
everyone would be forced to produce files with valid XML, and that
would make life easier for people like me! Don't you see?

Well, no. There's a fallacy here. Let's look at our options:

Option 1: Bob produces an invalid XML file and gives it to Avery. Avery
uses professional-grade fancy pants parser, which rejects it. Avery is sad,
but knows what to do: he phones up Bob and asks him to fix his XML producer.
Bob is actually a guy in Croatia who hired a contractor five years ago to
write his web site for him, and doesn't know where to find that contractor
anymore, but because he knows it's better for the world, he finds a new
contractor who fixes the output of his web site. Three weeks later, Bob
sends a new XML file to Avery, who is now able to parse it.

Option 2: Bob produces an invalid XML file and gives it to Avery.
Avery's permissive parser that he wrote in an afternoon reads it just fine.
Avery goes on with his work, and Bob doesn't need to pay a contractor.

Option 3: Bob produces valid XML in the first place, dammit, because
he made sure his contractor ran his program's output successfully through a
validator before he accepted the work as complete. Avery parses it easily,
and is happy.

Now, obviously option 3 is preferable. The problem is, it's also not a real
option. Bob already screwed up, and he's producing invalid XML. Avery has
received the invalid data, and he's got to do something with it. Only
options 1 and 2 are real.

Now, XML purists are telling me that I should pursue option 1. My question
is: why? Option 1 keeps me from getting my work done. Then I have to go
bother Bob, who wouldn't care except that I'm so obnoxious. And now he has
to pay a contractor to fix it. The only reason I would take option 1 is if
I enjoy pain, or inflicting pain on others. Apparently, lots of programmers
out there enjoy pain.

Meanwhile, option 2 - the one that everybody frowns upon - is painless for
everyone.

The usual argument for option 1 is that if enough people do it, then
eventually people will Just Start Producing Valid XML Dammit, and you won't
ever have this problem again. But here's the thing: we have a world
full of people trying option 1. XML is all about the people
who try option 1. And still Bob is out there, and he's still
producing invalid XML, and I, not Bob, am still the one getting stabbed in
the back by your lametarded strict XML parsers. Strict receiver-side
validation doesn't actually improve interoperability, ever.

As programmers, we've actually known all this for a long time. It's called
Postel's Law, in
honour of Jon Postel, one of the inventors of the Internet Protocol. "Be
liberal in what you accept, and conservative in what you send."

The whole Internet runs on this principle. That's why HTML is the way it
is. It's why Windows can talk to Linux, even though both have lots of bugs.

I have my own way of phrasing Postel's law: "It takes two to
miscommunicate."

As long as either side of any transaction is following Postel's law -
either the sender strictly checks his XML for conformance or the
receiver doesn't - the transaction will be a success. If both sides
disregard his advice, that's when you have a problem.

Yes, Bob should have checked his data before he sent it to me. He didn't.
That makes him a bad person - or at least an imperfect one. But if I refuse
the data just because it's not perfect, then that doesn't solve the problem.
It just makes me a bad person too.

Footnotes

(1) People who, instead, were going to complain that I should
avoid the obsolete HTML 'align' attribute and switch to CSS would be the
subject of a completely different rant.

(2) Note that there's plenty of perfectly valid XML that my
cheeseball incomplete XML parser wouldn't parse, because it's cheeseball and
incomplete. The ideal parser would be permissive and complete. But
if I have to choose one or the other, I'm going to choose the one that
actually parses the files I got from the customer. Wouldn't you?

(3) If you didn't catch it, the precise error in cdfrey's
argument is this: You don't create a new file format by parsing
wrong. You create a new file format by producing wrong. Ironically,
a lot of people use strict professional-grade XML parsers but seem
to believe that producing XML is easy.

Side note

By the way, even strict XML validation doesn't actually mean the receiver
will understand your data correctly. It's semantics vs. syntax. You can
easily write a perfectly valid HTML4-Strict compliant document and have it
render differently in different browsers. Why? Because they all
implement CSS differently. Web browser interoperability problems
actually have nothing to do with HTML parsing; it's all about the rendering,
which is totally unrelated. It's amazing to me how many people think
strict HTML validation will actually solve any real-world problems.

Unsurprisingly, my earlier comments about XML and Postel's Law
caused a bit of a flamewar in the various places that have flamewars about
these things.

What did surprise me, though, is that people have mostly heard of
Postel's Law... they just think it's wrong. That's actually not what I
expected; I naively assumed that people were doing things the way they do
because they simply didn't know any better.

Oh well, live and learn.

As for me, I'm a huge Postel fan, and I have been for years. For example, a
little over five years ago, I wrote my own replacement for the curses
(text-mode display handling) library for the same reason that I recently
wrote my own permissive (albeit incomplete) XML parser: because nobody else
was following Postel's Law.

You can still read my
original article about it. Probably people will flame me now for being
stupid enough to write my own replacement for curses, just like I was stupid
enough to write my own XML parser; however, the fact remains that my
terminal management code remains in production to this day, and since day 1
of its deployment, it has greatly reduced the amount of time Nitix tech support people spend dealing with
terminal emulator problems when people use our command-line interface.
Better still, as far as I know, it has never caused a single
problem.

A couple of years ago I started to revisit the problem in a more general
way, when I made a patch
to ncurses to let it use liberal input handling. However, I got
sidetracked and never had time to fine-tune the patch to get it
integrated. Oh well.

Next time you press DEL or HOME or END in a Unix program and it doesn't
work, think of Jon Postel.

Sorry for the late notice, all! Today, if all goes well, I'll be at StartupCampWaterloo in
Waterloo, Ontario, Canada. Assuming there's space, I'll see about
presenting the unannounced new project I've been working on. Come see!

My recent posting of some "controversial"
source code seems to have piqued people's interest, what with its coverage
on YCombinator and Reddit. In the interests of calming things down a
bit, here's some hopefully non-controversial code that I've found very
useful.

runlock
is a simple perl script that creates and locks a lockfile, then runs
whatever command line you give it. If the lockfile is already locked, it
doesn't run the program and exits immediately. We can use this in a
frequently-running cron job, for example, to ensure that if the job
occasionally takes a long time to run, we don't accidentally cause a backlog
by starting it over and over again.

Sounds simple? Well, it's short, but the details are pretty tricky. Let's
go through the source code and look at some of its interesting parts.

#!/usr/bin/perl -w
use strict;
use LockFile::Simple;

You probably want to know about the LockFile::Simple perl module. It does
pretty much everything you want to do with lockfiles. Unfortunately, its
defaults are insane and will get you into a ton of trouble if you use them.
It's pretty obvious that the author of this module has learned a lot over
time.

Above we check to make sure the argument list is okay. Nothing too special
here, except for one thing: we return 127 in case of an error, because the
more common error codes might be returned by the subprogram we're running.
runlock is intended to be used whenever you might normally run the given
program directly, so it's important not to eat the return code of the
subprogram.
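A rough Python equivalent of that check (the real runlock is Perl, and the usage text here is invented for illustration):

```python
import sys

# Reserve 127 for runlock's own errors so that the subprogram's exit
# codes (0..126) pass through to the caller unambiguously.

def check_args(argv):
    if len(argv) < 3:
        sys.stderr.write("usage: runlock <lockfile> <command> [args...]\n")
        return 127    # runlock's error, not the subprogram's
    return None       # arguments look fine; carry on
```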

Here's the first tricky bit: the correct options to LockFile::Simple.
"-stale=>1" means that we should "automatically detect stale locks." Now,
this sounds like it's obviously a good thing, but is for some reason not the
default.

The way this sort of lockfile works is that you use a set of atomic
operations to write your pid (process id) to the lockfile. Then, other
programs that want to check if the lock is valid first check if the file
exists, then open it and read the pid, then "kill -0 $pid" (send a no-op
signal to the process) to see if it's still running. If the process is
dead, they delete the lockfile and try to create a new one.

If you don't enable "-stale=>1", the LockFile library will just abort if the
file exists at all. This means your system will require manual intervention
if the locking process ever dies suddenly (eg. by "kill -9" or if your
system crashes), which is no fun.

The next option, "-hold=>0", disables an extremely evil trojan-horse option
that is enabled automatically when you set "-stale=>1". The "-hold" option
sets the maximum time a lock can be held before being considered
stale. The default is 3600 seconds (one hour). Now, this sounds like
it might be a useful feature: after all, you don't want to let a lock file
just hang around forever, right?

No! No! It's a terrible idea! If the "kill -0 $pid" test works, then you
know the guy who created the lock is still around. Why on earth
would you then consider it stale, forcibly remove the lock, and start doing
your own thing? That's a course that's pretty much guaranteed to get
you into trouble, if you consider that you've probably created the lockfile
for a reason.

So we set "-hold=>0" to disable this amazing feature. The only way we want
to break a stale lock is if its $pid is dead, and in that case, we can
happily break the lock immediately, not after an arbitrary time limit.

Instead of using $lm->lock(), we use $lm->trylock(), because we want to exit
right away if the file is already locked. We could have waited for the lock
instead using $lm->lock(), but that isn't what runlock is for; in the above
cronjob example, you'd then end up enqueuing the job to run over and over,
when (in the case of cron) once is usually enough.

The above is the part where we run the subprocess, wait for it to finish,
and then unlock the lockfile.

Why is it so complicated? Can't we just use system(@ARGV) and be done with
it? (Perl has a multi-argument version of system() that isn't insecure,
unlike in C.)

Unfortunately not. The problem is signal handling. If someone kills the
runlock program, we need to guarantee that the subprocess is killed
correctly, and we can't do that unless we know the subprocess's pid. The
only way to get the pid is to call fork() yourself, with all the mess that
entails. We then have to capture the appropriate signals and pass them
along when we receive them.
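Here's a sketch of that fork-and-forward dance in Python (illustrative, not the actual Perl): fork so we know the child's pid, relay fatal signals to it, and propagate its exit status as our own.

```python
import os
import signal

def run_forwarding_signals(argv):
    pid = os.fork()
    if pid == 0:                       # child: become the subprogram
        try:
            os.execvp(argv[0], argv)
        except OSError:
            os._exit(127)
    def forward(sig, frame):           # parent: relay the signal
        os.kill(pid, sig)
    signal.signal(signal.SIGTERM, forward)
    signal.signal(signal.SIGINT, forward)
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)  # pass the child's code through
    return 128 + os.WTERMSIG(status)   # conventional "died by signal" code
```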

The "# NOTREACHED" section simply indicates that that section of the code
will never run, because both branches of the about if statement terminate
the process. It's an interesting historical point, however: the comment
"NOTREACHED" has been used in programs for years to indicate this. The
practice started in C, but seems to have migrated to perl and other
languages. I think it used to be a signal to the ancient "lint" program in
C that it should shut up and not give you a warning.

print STDERR "Still locked.\n";
exit 0;

Finally the very last part of the program exits and returns a success code.
We only get here if we didn't manage to create the lockfile.

It seems a little weird to return success in such a case, but it works: the
primary use of runlock is in a cron job, and cron sends you annoying emails
if the job returns non-zero. Since the fact that the previous run is still
running is not considered an error, it works much better to return zero
here.

If you use cron2rss, your
captured output will include the "Still locked" message anyway.

Now that I've done a presentation at StartupCampWaterloo, which
was recorded (presumably to show up on the Internet sometime), and I've been
accepted to do it again at DemoCampGuelph8 next week, I
guess the cat's out of the bag.

EQL=Data is a project that we've been
working on for a little while. The concept is
simple: easy, one-click replication of your Microsoft Access databases
between any users who have a copy of the same database file.

Why Microsoft Access? Good question! Because:

it's used by (at least) hundreds of thousands of people;

to this day, it remains the easiest way in the world to create a database-driven app, as far as I know;

it's severely underserved as an application platform. Even Microsoft seems more like they're trying to kill it than support it.

Now, in fact, Microsoft Access already includes a "replication" feature that
reputedly works fairly well. However, "fairly well" comes with quite a lot
of stipulations: you can never, ever move the "replication master" file from
one place to another; you have to sync either on a LAN via the Windows file
sharing (samba) protocol or by installing a weird server and opening ports
on your firewall; and it's slow.

The short version is that almost nobody uses Access replication because it's
too complicated.

So with EQL=Data, you can replicate your Access databases all around, with
no need for weird firewall settings, no reliance on a central fileserver,
and access from anywhere on the Internet. Your copy of the database
continues to work (both readable and writable) even if you're offline.

But here's the weird thing:

Nobody really cares about that.

I mean, they all think it's cool, but it doesn't grab anyone's attention.
Perhaps because a UI consisting of a single pushbutton isn't really all that
exciting.

What people actually seem to care about is that when you sync your
data to our servers and back, our server keeps a copy under version control,
imports it into a sqlite database, and lets you search/browse it (read only)
on the web.

Apparently, people would absolutely love to be able to maintain their data
in Access, then publish it automatically (without any programming) to the
web. This is a Big Thing. Product lists, price lists, reseller
lists, and so on. They change constantly, and you want to edit them in a
Windows UI, and you can't afford to hire programmers, but you'd really like
to see them on the web.

Okay, we can do that:

In fact, I'd show you an example right now, but we're currently still in
beta and the super-cool, AJAXy, searchable, queryable, linkable "embed this
dynamic table in a web page" widget isn't quite ready for you yet. It will
be pretty soon.

In the meantime, if you think this is remotely interesting, please visit our download/signup page and add
yourself to the waiting list. It'll make me feel all squishy. You could
also send me an email. Thanks!

Random side note

When you replicate your data to our server, it also gets backed up and
revision controlled automatically using git. This
was a pretty fun project, since it involved extracting useful stuff out of
all-in-one binary database+code blobs and turning it into diffable/mergeable
text. Version control of Access databases is also something Microsoft
has already done, albeit in a startlingly half-assed way. Packaging
git's user interface for the Microsoft Access demographic would be...
challenging... so it's unclear how far we'll take the UI for this. Less is
more, and all that.