Thu, 01 Jan 2015

CPAN Pull Request Challenge: A call to the CPAN authors

The 2015
CPAN Pull Request Challenge is ramping up, and so far nearly two hundred
volunteers have signed up, pledging to make one pull request for a CPAN
distribution for each month of the year.

So here's a call to all the CPAN authors: please be supportive, and if you
don't want your CPAN distributions to be part of the challenge, please
send an email to neil at bowers dot com, stating your PAUSE ID and the fact
that you want to be excluded.

How to be supportive? The first step is to act on pull requests. If you
don't have time for a review, please say so; getting some response, even if
it's "it'll be some time 'till I get around to reviewing this" is much better
than none.

The volunteers have varied backgrounds; some are seasoned veterans, others
are beginners who will make their first contribution to Open Source. So please
be patient and encouraging.

Sat, 15 Feb 2014

The Fun of Running a Public Web Service, and Session Storage

One of my websites, Sudokugarden,
recently surged in traffic, from about 30k visitors per month to more than 100k
visitors per month. Here's the tale of what that meant for the server
side.

As a bit of background, I built the website in 2007, when I knew a lot less
about the web and programming. It runs on a host that I share with a few
friends; I don't have root access on that machine, though when the admin is
available, I can generally ask him to install stuff for me.

Most parts of the website are built as static HTML files, with Server Side
Includes. Parts of those SSIs are Perl CGI scripts. The most popular part
though, which allows you to solve Sudoku in the browser and keeps hiscores, is
written as a collection of Perl scripts, backed by a mysql database.

When at peak times the site had more than 10k visitors a day, lots of
visitors would get a nasty mysql: Cannot connect: Too many open
connections error. The admin wasn't available for bumping the
connection limit, so I looked for other solutions.

My first action was to check the logs for spammers and crawlers that might
have hammered the page, and I found and banned some; but the bulk of the
traffic looked completely legitimate, and the problem persisted.

Looking at the seven year old code, I realized that most pages didn't
actually need a database connection, if only I could remove the session
storage from the database. And, in fact, I could. I used CGI::Session, which
has pluggable backends. Switching to a
file-based session backend was just a matter of changing the connection string
and adding a directory for session storage. Luckily the code was clean enough
that this only affected a single subroutine. Everything was fine.

For a while.

Then, about a month later, the host ran out of free disk space. Since it is
used for other stuff too (like email, and web hosting for other users) it took
me a while to make the connection to the file-based session storage. What
happened was 3 million session files on an ext3 file system with a block size
of 4 kilobytes. A session is only about 400 bytes, but since a file uses up a
multiple of the block size, the session storage amounted to 12 gigabytes of
used-up disk space, which was all that was left on that machine.
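The arithmetic behind that number can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constants are the rounded figures from the text, the function name disk_usage is made up for illustration, and inode and directory overhead are ignored.

```python
import math

BLOCK_SIZE = 4096          # ext3 block size from the text: 4 kilobytes
SESSION_SIZE = 400         # approximate payload of one session, in bytes
SESSION_COUNT = 3_000_000  # approximate number of session files


def disk_usage(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Bytes occupied on disk when a file is rounded up to whole blocks."""
    return math.ceil(file_size / block_size) * block_size


payload_bytes = SESSION_SIZE * SESSION_COUNT
occupied_bytes = disk_usage(SESSION_SIZE) * SESSION_COUNT

print(f"payload:  {payload_bytes / 1e9:.1f} GB")
print(f"occupied: {occupied_bytes / 1e9:.1f} GB")
```

Each 400 byte session occupies a full 4096 byte block, so roughly nine tenths of the space used is padding.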

Deleting those sessions turned out to be a problem; I could only log in as
my own user, which doesn't have write access to the session files (which are
owned by www-data, the Apache user). The solution was to upload a
CGI script that deleted the sessions, but of course that wasn't possible at
first, because the disk was full. In the end I had to delete several gigabytes
of data from my home directory before I could upload anything again.
(Processes running as root were still writing to reserved-to-root portions of
the file system, which is why I had to delete so much data before I was able
to write again.)

Even when I was able to upload the deletion script, it took quite some time
to actually delete the session files; mostly because the directory was too
large, and deleting files on ext3 is slow. When the files were gone, the empty
session directory still used up 200MB of disk space, because the directory
index doesn't shrink on file deletion.

Clearly a better solution to session storage was needed. But first I
investigated where all those sessions came from, and banned a few spamming
IPs. I also changed the code to only create sessions when somebody logs in,
not give every visitor a session from the start.

My next attempt was to write the sessions to an SQLite database. It uses
about 400 bytes per session (plus a fixed overhead for the db file itself),
so it uses only a tenth of the storage space that the file-based storage used.
The SQLite database has no connection limit,
though the old-ish version that was installed on the server doesn't seem to
have very fine-grained locking either; within a few days I saw errors saying
that the session database was locked.

So I added another layer of workaround: creating a separate session
database per leading IP octet. So now there are up to 255 separate session
databases (plus a 256th for all IPv6 addresses; a decision that will have to
be revised when IPv6 usage rises). After a few days of operation, it seems
that this setup works well enough. But suspicious as I am, I'll continue
monitoring both disk usage and errors from Apache.
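The sharding rule can be sketched in a few lines. This is a Python illustration of the idea, not the site's actual Perl code; the function name session_db_name and the file naming scheme are assumptions made up for this example.

```python
import ipaddress


def session_db_name(remote_addr: str) -> str:
    """Map a client address to a session database file (hypothetical names)."""
    addr = ipaddress.ip_address(remote_addr)
    if addr.version == 6:
        # All IPv6 clients share a single database for now.
        return "sessions-ipv6.sqlite"
    leading_octet = addr.packed[0]  # first byte of the IPv4 address
    return f"sessions-{leading_octet:03d}.sqlite"


print(session_db_name("192.168.1.5"))
print(session_db_name("2001:db8::1"))
```

Since a lock only blocks writers of the same file, spreading the sessions over many files lowers the chance that two requests contend for the same lock. A session has to be looked up by the same rule that stored it, so a visitor whose address changes to a different leading octet would lose the session under this scheme.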

So, what happens if this solution fails to work out? I can see basically
two approaches: move the site to a server that's fully under my control, and
use redis or memcached for session storage; or implement sessions with signed
cookies that are stored purely on the client side.
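To illustrate the second approach, here is a minimal Python sketch of a signed cookie, using only the standard library. The names SECRET_KEY, encode_session and decode_session are hypothetical, and a real deployment would also need an expiry time and a properly generated secret. Note that the data is only signed, not encrypted, so the client can read it.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

# Hypothetical secret; it must stay on the server and be long and random.
SECRET_KEY = b"replace-with-a-long-random-secret"


def _signature(payload: bytes) -> bytes:
    """Hex-encoded HMAC-SHA256 of the payload."""
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest().encode("ascii")


def encode_session(data: dict) -> str:
    """Serialize the session data and append a signature."""
    payload = base64.urlsafe_b64encode(json.dumps(data, sort_keys=True).encode("utf-8"))
    return payload.decode("ascii") + "." + _signature(payload).decode("ascii")


def decode_session(cookie: str) -> Optional[dict]:
    """Return the session data, or None if the cookie was tampered with."""
    payload_text, separator, signature = cookie.rpartition(".")
    if not separator:
        return None
    payload = payload_text.encode("utf-8")
    # compare_digest avoids leaking the match position through timing.
    if not hmac.compare_digest(signature.encode("utf-8"), _signature(payload)):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


cookie = encode_session({"user": "alice"})
print(decode_session(cookie))
print(decode_session("x" + cookie))
```

The server stores nothing per visitor; it only recomputes the signature on each request, which is why this approach would sidestep both the connection limit and the disk space problem.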

Mon, 31 Dec 2012

iPod nano 5g on linux -- works!

For Christmas I got an iPod nano (5th generation). Since I use only Linux
on my home computers, I searched the Internet for how well it is
supported by Linux-based tools. The results looked bleak, but they were mostly from 2009.

Now (December 2012) on my Debian/Wheezy system, it just worked.

The iPod nano 5g presents itself as an ordinary USB storage device, which you can mount without problems. However, simply copying files onto it won't make the iPod show those files in the playlists, because there is some metadata stored on the device that must be updated too.

There are several user-space programs that allow you to import and export music from and to the iPod, and update those meta data files as necessary. The first one I tried, gtkpod 2.1.2, worked fine.

Other user-space programs reputed to work with the iPod are rhythmbox and amarok (which both not only organize but also play music).

Although I don't think anything really depends on some particular versions here (except that you need a new enough version of gtkpod), here is what I used:

Correctness in Computer Programs and Mathematical Proofs

The standard of correctness and completeness necessary to get a computer
program to work at all is a couple of orders of magnitude higher than the
mathematical community’s standard of valid proofs. Nonetheless, large computer
programs, even when they have been very carefully written and very carefully
tested, always seem to have bugs.

I noticed that mathematicians are often sloppy about the scope of
their symbols. Sometimes they use the same symbol for two different meanings,
and you have to guess from context which one is meant.

This kind of sloppiness generally doesn't have an impact on the validity of
the ideas that are communicated, as long as it's still understandable to the
reader.

I guess one reason is that most mathematical publications still stick to
one-letter symbol names, and there aren't that many letters in the alphabets
that are generally accepted for usage (Latin, Greek, a few letters from
Hebrew). And in the programming world we snort derisively at FORTRAN 77, which
limited variable names to a length of 6 characters.

Mon, 22 Nov 2010

Harry Potter and the Methods of Rationality

What if Harry Potter had been raised by a loving stepmother? What if his
stepfather was a scientist? What happens when somebody tries to analyze magic
with scientific methods? What happens if an eleven year old boy is too smart
for his own good?

Tue, 08 Dec 2009

Keep it stupid, stupid!

How hard is it to build a good search engine? Very hard. So far I thought
that only one company has managed to build a search engine that's not only
decent, but good.

Sadly, they seem to have overdone it. Today I searched for tagged dfa. I was
looking for a technique used in regex engines. On the front page three out
of ten results actually dealt with the subject; in the others,
dfa meant dog friendly area, department of foreign
affairs or other unrelated things.

That's neither bad nor unexpected. But I wanted more specific results, so I
decided against using the abbreviation, and searched for the full form: tagged
deterministic finite automaton. You'd think that would give better
results, no?

No. It gave worse ones. On the first result page only one of the hits
actually dealt with the DFAs I was looking for. In fact, the first hit
contained none of my search terms. None. It just contained a phrase which is
also sometimes abbreviated dfa.

WTF? Google seemed to have internally converted my query into an
ambiguous, abbreviated form, and then used that to find matches, without
filtering. So it attempted to be very smart, and came out very stupid.

I doubt that any Google engineer is ever going to read this rant. But if
one is: Please, Google, keep it stupid, stupid.

I'm fine with getting automatic suggestions on how to improve my search
query; but please don't automatically "improve" it for me. I want to find what
I search for. I'm not interested in dog friendly areas.

Sat, 05 Dec 2009

Doubt and Confidence

As a programmer you have to have confidence in your skills, to some
extent, and at the same time you have to constantly doubt them. Weird, eh?

Confidence

You need some level of confidence to do anything efficiently.
Planning ahead requires confidence that you can achieve the steps
on your way.

As a programmer you also need some confidence with the language,
libraries and other tools you're using.

If you program for money, you also have to assess what kind of programs
you can write, and where you might have problems.

Doubt

In the process of programming you make a lot of assumptions, some of them
explicit, some of them implicit. If you want to write a good program, it's
essential that you are aware of as many assumptions as possible.

When you find a bug in your program, you have to challenge previous
assumptions, and that's where doubt comes in. You not only suspect, but
you know that at least one of the assumptions was false (or maybe
just a bit too specific), and you know that you did something
wrong.

Sometimes programmers make really stupid mistakes which are rather tricky
to track down. That's when you have to question your own sanity.

One example (that luckily doesn't happen all that often to me) is when I
edit my program, and nothing seems to change. Nothing at all. Depending on
the setup it might be some cache, but sometimes it is even more
devious - for example I didn't notice that the console where I edit and
the console where I test are on different hosts - and thus the edits
actually have no effect at all.

After having done such a thing once or twice I adopted the habit of just
adding a die('BOOM'); instruction to my code, to verify that
the part I'm looking at is actually run.

These are moments when I question my own sanity, thinking "how could I
have possibly done such a stupid thing?". Doubt.

The same phenomenon applies when doing scientific research: since you
usually do things that nobody has done before (or at least nobody has
published about yet), you can't know the results beforehand -- if you could,
your research would be rather boring. So you have no external reference for
verification, only your intuition and discussion with peers.

Sat, 10 Oct 2009

Fun and No-Fun with SVG

Lately I've been playing a lot with SVG, and all in all I greatly
enjoyed it. I wrote some Perl 6 programs that
generate graphical output, and since Perl 6 is a new programming language, it
doesn't have many bindings to graphic libraries. Simply emitting a text
description of a graphic and then viewing it in the browser is a nice and
simple way out.

I also enjoy getting visual feedback from my programs. I'd even more enjoy
it if the feedback was more consistent.

I generally test my svg images with three different viewers: Firefox
3.0.6-3, inkscape (or inkview) 0.46-2.lenny2 and Opera
10.00.4585.gcc4.qt3. Often they produce two or more different renderings of
the same SVG file.

This SVG file first defines a path, and then references it twice: once a
text is placed on the path, the second time it is simply referenced and given
some styling information.
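The SVG file itself is not reproduced in this text. As a rough stand-in, here is a Python sketch that emits a file with the structure described above: a path defined once, then referenced by a textPath and by a use element. The function name svg_text_on_path, the coordinates and the styling are assumptions for illustration, not the file that was used for the renderings below.

```python
from xml.sax.saxutils import escape


def svg_text_on_path(label: str, text_length: int = 300) -> str:
    """Return an SVG document that puts `label` on a curve and draws the curve."""
    return f"""<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink"
     width="400" height="200" viewBox="0 0 400 200">
  <defs>
    <!-- the path is defined once and carries the styling -->
    <path id="curve" d="M 20 150 Q 200 10 380 150"
          style="fill: none; stroke: black; stroke-width: 2"/>
  </defs>
  <text font-size="20">
    <!-- first reference: text placed on the path -->
    <textPath xlink:href="#curve" textLength="{text_length}">{escape(label)}</textPath>
  </text>
  <!-- second reference: the path itself is drawn -->
  <use xlink:href="#curve"/>
</svg>
"""


if __name__ == "__main__":
    print(svg_text_on_path("Hello, SVG"))
```

Redirecting the output to a file with an .svg extension and opening it in several viewers is enough to compare their renderings.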

Rendered by Firefox:

Rendered by Inkview:

Rendered by Opera:

Three renderers, three outputs. Neither Firefox nor Inkview supports the
textLength attribute, which is a real pity, because it's the only
way you can make a program emit SVG files where text is guaranteed not to
overlap.

If you scale text in Inkscape and then put it onto a path, the scaling is
lost. I found no way to reproduce Opera's output with Inkscape without
resorting to really evil trickery (like decomposing the text into paths, and
then cutting the letters apart and placing them manually). (Equally useful is
the dominant-baseline attribute, which Inkscape doesn't support
either.)

The second difference is that only Firefox shows the shape of the path.
Firefox is correct here. The SVG specification clearly
states about the use attribute:

For user agents that support Styling with CSS, the conceptual deep cloning of the referenced element into a non-exposed DOM tree also copies any property values resulting from the CSS cascade [CSS2-CASCADE] on the referenced element and its contents. CSS2 selectors can be applied to the original (i.e., referenced) elements because they are part of the formal document structure. CSS2 selectors cannot be applied to the (conceptually) cloned DOM tree because its contents are not part of the formal document structure.

Sadly it seems to be a coincidence that Firefox works correctly here. If
the styling information is moved from the path to the
use element the curve is still displayed - even though it should
not be.

Using SVG feels like writing HTML and CSS for 15 year old browsers, which
had their very own, idiosyncratic idea of how to render what, and what to
support and what not.

Just like with HTML I have high hopes that the overall state will improve;
indeed I've been told that Firefox 3.5 now supports the
textLength attribute. I'd also love to see widespread support
for SVG animations, which could replace some inaccessible Flash
applications.

Tue, 04 Aug 2009

Goodbye Iron Man

<update> (from 2009-08-23)
It turned out that my disappearance
on the ironman blog feed was due to a broken RSS feed. Matt S. Trout tried to
inform me by blog comment, my blog marked it as spam and swallowed it.

So now we talked on IRC, clarified things, and I'm back in the game.
</update>

So I accepted the Iron
Man blogging challenge a few months ago. And last week I discovered that my
blog was gone from their feed. For the second time. Without any
notification.

The first time they had a good reason: the date tags in my RSS feed were
goofed; still, I'd have thought it would be nice to at least notify me of
such a removal. After some mails back and forth I was able to fix it; after
the second removal without any notification I'm simply fed up and don't want
to invest any more energy into this.

Still I'll continue to follow the collected RSS feed, there are still many
interesting blogs to be read there.

Iron Man Challenge - Am I a Stone Man?

I'm missing the
things announced on their website: a way to find out to which level you
made it, a monthly selection of best blog posts, and all these other things
that were designed to create some competition, and more fun.

Don't get me wrong, I like to read the blogs of my fellow Perl programmers,
and it motivates me to write more often myself. But that's not all that was
promised to us.

One thing I'd like to add about the content, though: So far most of
what I read was
very good and informative, but it was all text. I know it's not easy to find
nice on-topic programming pictures, and use.perl.org doesn't even allow the
inclusion of pictures in posts, and I don't do it often myself, but having
more pictures or charts would be nice.

Mon, 01 Jun 2009

Why Design By Contract Does Not Replace a Test Suite

"Design By Contract" (DBC) usually refers both to very sophisticated
assertion systems (for example in which assertions are inherited along with
the methods to which they belong), and to the practice of using such
assertions extensively, not only for quality assurance but also as a form of
documentation.

When I was mostly programming in Eiffel some years ago, I liked DBC very
much, and I still think that it's a very good idea, and that more programming
languages should offer good support for it.

However there's one comment that I've seen frequently on the web, in blogs
and on IRC. Often DBC evangelists say something along these lines: "We
have DBC, we don't need a test suite". I find such comments incredibly
stupid, and here I want to write down why.

Code needs to run

If you want to verify that the code does what you want, you have to
actually run it - otherwise the assertions won't be triggered, and are
worthless as a verification tool.

It's not enough to just run it; you should, when possible, cover every code
path - just like you would when writing tests. Doing that manually
requires much work, so you still need a test suite that you can run to verify
that some changes didn't break anything.

Examples are easy, general rules are hard

Test cases are just example input, paired with the expected output. Usually
it's pretty easy to come up with examples, so writing tests is also easy, even
for corner cases.

On the other hand assertions are rules that have to hold for all possible
input data, so to formulate them, you have to consider the general case -
that's usually rather hard, so the lazy programmer leaves out the hard
cases.

A simple example: suppose you've written a subroutine that adds two numbers
(for example for a bignum library). Writing assertions for the general case of
addition is quite hard if you can't trust your subtraction routine; so the
only thing you can really do is to check the signs (positive number plus
positive number is positive etc.), but that won't catch any off-by-one
errors.

So you should also write tests; tests like add(3, 4) == 7 are
trivial to come up with, and catch potential errors.
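The difference can be made concrete with a small Python sketch. The function buggy_add and its deliberate off-by-one error are made up for illustration, and plain assert statements stand in for a real contract system such as Eiffel's.

```python
def buggy_add(a: int, b: int) -> int:
    """Addition with a deliberate off-by-one bug and sign-check postconditions."""
    result = a + b + 1  # the bug: one too many

    # Contract-style postconditions: the general rules that are easy to state.
    if a > 0 and b > 0:
        assert result > 0, "positive + positive must be positive"
    if a < 0 and b < 0:
        assert result < 0, "negative + negative must be negative"
    return result


def example_test(add) -> bool:
    """A test case: one example input paired with its expected output."""
    return add(3, 4) == 7


# The postconditions hold, so the call completes without complaint ...
print(buggy_add(3, 4))
# ... but the example-based test exposes the bug.
print(example_test(buggy_add))
```

The sign checks are true for every input, buggy or not, so they can never fail here; the single concrete example fails immediately.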

Conclusions

Design by Contract and testing should go hand in hand so that the tests
exercise as many code paths as possible, and should cover those areas that are
hard to validate with assertions.

Thu, 18 Dec 2008

My Diploma Thesis: Spin Transport in Mesoscopic Systems

Sometimes people ask me what I'm doing right now, and I tell them
"I'm writing my diploma thesis on mesoscopic spin transport", and they
know just as much as before. So here I want to explain what that means.

Mesoscopic systems

A mesoscopic system is one that is larger than a few nanometers, but
still small enough that you have to care about quantum effects.

That's not a very precise definition, so I'll try again: Consider a
metallic wire. For macroscopic systems (i.e. the ones that we are used to
in day-to-day life) you might know that the electrical resistance of
such a wire increases linearly as you increase its length, and decreases
in inverse proportion as you increase its cross section.

This is very intuitive, because electrical resistance describes how
hard it is for an electron to travel through our wire. If the wire is
longer, it sees more obstacles, so the resistance is higher. If the wire
has a larger cross section, it's easier for the electron to find a way
that's not blocked, so the resistance is smaller. That's called
Ohm's law.

These relations aren't true anymore for rather small systems. If you
have a very thin wire, say 20 nanometers, and increase its diameter by
another nanometer, the resistance might not change at all. Then you
increase its diameter by another nanometer, the resistance suddenly jumps
down by a few percent.

All these systems that are too small for Ohm's law to apply are
called mesoscopic. All mesoscopic effects have to be explained
with quantum physics, at least at some point.

Electron Spin

Electrons have something called Spin. Everybody knows that
an electron has a charge; it also acts as if it rotated around its own axis
very fast. So it looks like a current which runs in a circle, and that
creates a small magnetic field.

If you try to measure the magnetic field of one electron, you will
only ever get two possible values, which we call spin up and
spin down.

Spin Optics

In a semiconductor, one can split up a beam of electrons into two beams
of spin-up and spin-down electrons, just like in optics with polarized
light. That splitting can be influenced by an external voltage, like a
classical transistor.

The topic of my diploma thesis is to figure out how such spin polarized
electron beams behave in certain semiconductor systems.