Month: January 2015

Bugs in software have many sources, but those caused by erroneous assumptions are among the most difficult to find. The problem is that these assumptions are baked into the code, so the reader tends to absorb them along with the code itself.

Consider the following code that builds a simple histogram of input values. It seems reasonable to assume that the resulting dictionary will have a bounded size, with at most steps entries. This assumption is incorrect.
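The code in question is not reproduced here, so here is a minimal sketch of what such a function might look like — the name bucket, the parameter names and the exact bucketing formula are my assumptions, not the original code:

```python
def bucket(values, min_v, max_v, steps):
    """Build a histogram dict mapping a bucket's lower edge to a count.
    A hypothetical reconstruction of the code discussed in the text."""
    width = (max_v - min_v) / steps
    histogram = {}
    for v in values:
        # Clamping: anything not below min_v and not above max_v is
        # assumed to be in range -- the flawed assumption discussed below.
        if v < min_v:
            v = min_v
        elif v > max_v:
            v = max_v
        key = v - ((v - min_v) % width)  # lower edge of the bucket
        histogram[key] = histogram.get(key, 0) + 1
    return histogram
```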

This code can return dictionaries that are much larger than steps entries. The problem lies in the clamping logic: the assumption is that a value that is not smaller than min and not larger than max is a value in the range ⟦min…max⟧.

This assumption is incorrect in Python (and probably many other languages), because floating point has NaN. NaN in Python has several properties that break the assumptions of the code above:

Any arithmetic floating point operation that has NaN as an argument returns NaN.

float('NaN') < X → False ∀ X

float('NaN') > X → False ∀ X

float('NaN') == float('NaN') → False
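These properties are easy to check in an interactive session:

```python
nan = float('NaN')

print(nan + 1.0)   # nan -- arithmetic propagates NaN
print(nan < 1.0)   # False
print(nan > 1.0)   # False
print(nan == nan)  # False: NaN is not equal to itself
```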

The last property means that using floating point values as keys in dictionaries is a bad idea, as the equality operator is used to determine whether two keys are the same. This means that while you can add a value with a NaN key, you can never access it by key, because that key is not equal to itself. You can also add multiple values with NaN keys, and they will each get a different entry in the dict:

d = {}
d[float('NaN')] = 4
d[float('NaN')] = 5
d → {nan: 4, nan: 5}

Python will let you use any hashable object as a dictionary key, even if the key is mutable or does not implement the equality operator properly. So if you feed 1000 distinct NaN values to the bucket function above, you will get back a dictionary with 1000 entries, regardless of the value of the steps parameter.
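The effect can be reproduced without any histogram code at all; every distinct NaN object ends up as its own key:

```python
# Each float('NaN') call below produces a distinct object; since
# NaN is never equal to NaN, each one becomes a separate dict key.
histogram = {}
for v in (float('NaN') for _ in range(1000)):
    histogram[v] = histogram.get(v, 0) + 1
print(len(histogram))  # 1000
```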

Programmers of my generation grew up with 8-bit computers. By today’s standards, the resources available on these machines were very scarce: memory was measured in kilobytes, processor speed in a few megahertz. This meant that very little was kept in memory; things were instead generated on the fly.

The typical example was the content of the screen, which was not buffered, but instead composited from various sources, typically the video memory and the sprite registers. This also meant that you could change these sources during the rendering to get richer results (or bugs). It also meant that a screen capture was difficult: you could not just grab the content of the memory; instead you needed to reconstruct the state of the screen.

Making sure your code was in sync with the screen’s refresh rate was difficult, which is why newer systems use one or more buffers to store the intermediate renderings: the image on the screen is never live, but instead one recorded a few milliseconds earlier.

Having enough memory to store things made a lot of things possible, and loading everything into one place to process it is much easier, but this is also a trap: however much RAM you have, there are datasets that won’t fit. Even with virtual memory, automatic garbage collection and similar mechanisms, you still need to have a clue whether stuff will fit in main memory.

Most systems have mechanisms to handle data structures that are ephemeral, partially in memory, or fragmented: generators, iterators, memory-mapped containers, and ropes instead of strings. While it certainly does not make sense to use these all the time, knowing about them is really useful if you want to process non-trivial amounts of data.
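As a sketch, a Python generator lets you fold over a file's lines without ever holding the whole file in memory — the sample file here is fabricated for the example:

```python
import os
import tempfile

def values(path):
    """Yield one float per line, reading lazily: the whole file
    is never held in memory at once."""
    with open(path) as f:
        for line in f:
            yield float(line)

# Create a small sample file for the demonstration.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('1.0\n2.0\n3.0\n')
    path = f.name

total = sum(values(path))  # consumes the generator lazily
print(total)               # 6.0
os.remove(path)
```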

Maybe my generation was a privileged one, as we were not lulled by starry-eyed teachers into believing that those problems would be handled by garbage collection and high-level languages. Come to think of it, I had those teachers, I just never believed them…

I used to think of Perdido Street Station as a Steampunk version of Terry Pratchett’s universe, but over the course of the Discworld® novels, that world has undergone its own industrial revolution. After a light-based telegraph system (the clacks), Raising Steam tells the story of the appearance of trains.

It is easy to draw parallels with Going Postal, which got its own movie: same main character (Moist von Lipwig) and same background theme (a new technology emerging). The problem is, Going Postal had original ideas: the clacks are not just a simple telegraph, the community associated with them was inspired by programmers rather than Victorian telegraph operators, and most of the involved characters were new.

There is no such originality in Raising Steam. The central character is again Moist von Lipwig, but nearly every Discworld® character pops up in the story. The engineer behind the steam engine, Simnel, is a very straight engineer, except when he could be trapped, at which point he reveals himself to be smart about the world. The steam engine is just that, a regular steam engine, no variation; the same goes for the people of the railway: they come straight out of the Victorian book of clichés.

The book could be divided into two parts. The first is just a description of the building of the railway; it is well written and pleasant to read, but there is no real suspense because there is basically no antagonist. The second part is basically a simple conspiracy plot which seems to have been bolted onto the previous dissertation to make it look like there is an actual story. This part could also be called the Parade of Mary Sues: all of Pratchett’s main characters are on a train beating the shit out of the weak opposition, and even the crook reveals himself to be a tough fighter. Of course, there is not much tension, as the opposition seems to be, at best, an afterthought.

The second part also contains a set of sudden reveals, but they concern secondary characters and feel very artificial, as does the final deus ex machina moment. In general I felt that pushing all the interesting characters from previous books into this story watered them down or changed them in contradictory ways.

One annoying pattern in science fiction and fantasy books is that as the authors get older, they often try to make all the bits and pieces of their work fit together. This seems motivated more by a will to make their life’s work a coherent whole than by the wish to write a good story. Raising Steam clearly falls into that category of books.

Terry Pratchett often had trouble with his books’ endings, but this is the first time I really thought come on while reading the last part. So while the writing style is as enjoyable as ever, and I really liked the first part, I can’t really recommend this book, which is basically a pale shadow of Going Postal.

When the HTTP protocol was initially defined, it was extremely simple: the client opens a socket, sends the GET command followed by the URL, and gets in return headers, an empty line, and then the content of the page. A simple protocol for a simple job: distributing academic content. The protocol was refined a bit in 1999, but stayed essentially the same.
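A sketch of that wire format — the response bytes here are made up for illustration:

```python
# An early-style HTTP request: the GET command and the URL.
request = b'GET /index.html HTTP/1.0\r\n\r\n'

# A matching (fabricated) response: status line and headers,
# a blank line, then the content of the page.
response = (b'HTTP/1.0 200 OK\r\n'
            b'Content-Type: text/html\r\n'
            b'\r\n'
            b'<html>Hello</html>')

# The blank line is what separates headers from content.
headers, body = response.split(b'\r\n\r\n', 1)
print(body.decode('ascii'))  # <html>Hello</html>
```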

Version 2.0 of the HTTP protocol is taking shape, and it is largely based on SPDY. Where version one was simple and ended up changing the world, version two is rather complex and aims at being more efficient. This makes sense: the goal is to keep the web while making better use of network resources, but there is a certain irony in seeing HTTP adopt features that were present in some older networking protocols but deemed too complicated: binary format, multiplexing.

For sure, the environment where HTTP 2 will be released is very different from the one version 1 saw. The protocol that underpins HTTP is TCP, but that protocol is becoming less and less visible, and less and less usable. In 1991 any new protocol would just use a new port; nowadays the only port that is universally usable is 80, which is reserved… for HTTP. HTTP 1.0 delegated multiplexing to TCP, i.e. a browser would open multiple connections in parallel; HTTP 2.0 is all about squeezing the maximum out of one connection.

As often in computer science, the lower levels of the software stack fossilize, and new systems are built on top of them while the lower levels are slowly forgotten. Maybe this is the case with all human-built structures?

With each crop of new languages comes a flurry of articles explaining how much better they are than the incumbent (C++). The fact that the incumbent is still the same after all these years is an interesting sign, but I find it more interesting that the big argument in favour of these languages is the absence of pointers and the related errors.

This has been the main argument of new languages for as long as C++ has been king of the hill, and while it is technically correct, it is also an issue that can be well mitigated using modern C++ constructs like strings and smart pointers.

Object references and arrays are only two of the many data structures used in code. What I would really like from new languages is that they handle or prevent other common data bugs. Instead of asking whether a new language lets the user address memory, it would be nice to ask about other bug-prone operations:

Does the language allow the equality operator for floating point values?

Does the language allow substring operations on text strings?

What kind of string comparison does the language offer?

The first two operations are allowed in many languages (including C++), but most of the time, using them is a bug. I ranted about floating point equality before. Cutting a substring out of a string is a dangerous operation because most languages have an implicit representation of strings: in Go it is UTF-8 (strings are just arrays of bytes), in Java it is UTF-16. UTF-8 contains multi-byte characters, so cutting a string at an arbitrary byte can result in invalid UTF-8 data; UTF-16 has characters spanning multiple 16-bit words (surrogate pairs), so cutting at an arbitrary 16-bit word boundary can result in invalid UTF-16 data. String comparison is similar in essence to floating point comparison: bitwise equality, as implemented by most languages, is not what you want. You want to normalise the text representation in some way before comparing the strings.
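Both traps are easy to demonstrate in Python:

```python
import unicodedata

# Cutting UTF-8 at an arbitrary byte boundary splits a character:
data = 'héllo'.encode('utf-8')  # b'h\xc3\xa9llo' -- 6 bytes
try:
    data[:2].decode('utf-8')    # cuts 'é' (0xc3 0xa9) in the middle
except UnicodeDecodeError:
    print('invalid UTF-8')

# Bitwise string comparison vs. comparison after normalisation:
a = '\u00e9'   # 'é' as a single code point
b = 'e\u0301'  # 'e' followed by a combining acute accent
print(a == b)  # False, although both render as 'é'
print(unicodedata.normalize('NFC', a) ==
      unicodedata.normalize('NFC', b))  # True
```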

Now you could argue that these problems should not be handled in the language, that people should use some library that does the right thing. Guess what, this is how C++ solved memory management problems…