Whatever I find interesting.

A few weeks ago, a colleague came to me with an interesting bug: When running a child process with Python’s subprocess module, no exception is thrown when the child process fails. In essence, what happened was the following (typed at the interactive Python prompt):

>>> import subprocess
>>> subprocess.check_call("false")
0

The false command always exits with a nonzero exit code. The expected behavior, as confirmed on another machine, would be as follows:

So, what happens is that subprocess thinks that the child process exited successfully, even though it did not.

This behavior, of course, wreaks total havoc with the application. Instead of an exception being thrown due to the failed child process, the application goes on and fails at a later point when it tries to do something based on the data received from the child process. This goes against the good design principle of failing early, as embodied by the way Python uses exceptions instead of error codes.

So, what is going on here? You may want to think it through and see if you can find the problem.

A colleague asked me to look into a problem with him, mentioning that “Tlib hangs when we run it”. Tlib is a fairly large project that is written in Python. His initial analysis showed that it hangs at a very early phase, during with it tries to fetch the latest version to run from a git server. Various users complained about the same problem, suggesting that it is not a local issue.

Running the code, and interrupting it with ^C when it hangs, turned up a result similar to the following:

Lately I’ve been working on modifying an architectural aspect of an existing software project. This project makes heavy use of remote execution of code on several hosts. To accomplish this feat, it uses several different methods for remote execution: SSH for running general shell commands, RPyC for executing arbitrary Python code remotely, as well as a couple of proprietary interfaces.

One issue I encountered with the current design is that it is quite difficult to make a clean separation between code running locally and code running remotely. This turns debugging any problem involving execution of remote code into an incredibly complicated endeavor.

Much clarity could be gained by changing the architecture of this project to be more explicitly distributed. This would involve several agents running on multiple hosts that communicate amongst themselves to get the work done. The agents would have a clean and documented API, making them usable and testable as standalone components, as well as allowing them to act on behalf of a central process. All remote execution would be explicit, using one single method for any kind of execution – be it shell commands, Python code or anything else.

Disclaimer: This post is about software development, but it is more about the human side of it than about the technical side. This means you may enjoy it even if you’re not a software developer (or a technical person at all).

One of the recurring issues that I encounter as a software developer is that of having to decide where on the scale between YAGNI and NIH I want to be. You are probably wondering what the hell I am talking about (unless you’re one of those people who are nodding your head now and smiling), so I’ll explain.

YAGNI stands for “You ain’t gonna need it”, and refers to a software development principle according to which you don’t want to start writing a huge bunch of code up front until you know exactly what you are going to need. Since software requirements change all the time, and one of the common maladies of many code bases is superfluous complexity due to over-engineering and over-abstraction, it makes a lot of sense to implement only what you know you are going to need right now, and leave the rest for later.

NIH stands for “Not Invented Here”, and refers to a reluctance among some software developers to base their work on code written by someone else: “Hey, I can do this myself, so why should I use this existing project”? This leads people to reinvent the wheel time and again, creating a proliferation of half-baked solutions to problems that have been solved fairly well at other times and places. In addition to the waste of resources in the creating the solution in the first place, it also creates yet another maintainability headache.

At work, I encountered an interesting problem: While testing the Fibre Channel (FC)
scalability of a storage product, we needed to create a lot of FC connections between hosts and the storage system. This would in turn require a large number of FC Initiators, each of which having a unique World-Wide Port Name (WWPN).

The easiest and cheapest method to set up a lot of initiators without actually purchasing zillions of FC HBAs would be to use N_Port ID Virtualization, a.k.a. NPIV. This method allows a single FC HBA to present itself to the FC fabric with multiple WWPNs. This, in turn, allows the creation of many connections to the target storage device from a small number of hosts.

The Problem

WWPNs can’t just be pulled out of thin air. They are allocated – in chunks – by a central authority, the IEEE Registration Authority. Just making up random WWPNs could cause trouble for two reasons:

The WWPN must be unique on the fabric, which means it must be generated in a deterministic way so that two hosts won’t be using the same WWPN and thereby confuse the fabric.

The WWPN should not have a chance of clashing with official WWPNs of purchased HBAs.

With physical (as opposed to virtual) HBAs, this is managed by allocating a OUI (Organizationally Unique Identifier) to every vendor, who in turns tacks on his own vendor-specific serial number to come up with a unique WWPN. This is similar to the MAC address allocation of Ethernet, Wi-Fi and Bluetooth devices.

The textbook solution for our problem would have involved the use of an officially allocated OUI to generate legal WWPNs, but that seemed like overkill for lab project which would never be used on a production SAN.

Recently, a colleague and I refactored a piece of existing code that had new
behavior added to it. During the process, we managed to improve the readability of the
code using several techniques that I’ll describe below.

Where We Started

The original code was fairly simple: It decides whether certain “dead”
components need to be “revived”, and presents the user with a prompt to choose
from one of several actions.
Depending on the user’s choice, the code then proceeds to take the appropriate action:

defrevive_dead_components():choice=choose('"I see dead components..."\n'"Do you wish to (r)evive them, (c)ontinue without reviving, or (q)uit?",{"r":"revive","c":"continue","Q":"quit"},default="quit")ifchoice=="revive":revive_components()elifchoice=="quit":raiseTestCannotRunException("Dead components exist")elifchoice=="continue":pass

Adding New Behavior

We now wanted to add some new functionality to the above code, namely the ability to
allow the user to select a subset of the components that he wants to revive.

To make things more foolproof, in case the user chose to revive selected
components but then neglected to select any components from the list, the code
would not proceed blindly but rather send the user back to the menu so that he could try again.

defrevive_dead_components():whileTrue:choice=choose('"I see dead components..."\n'"Do you wish to (r)evive them all, (s)elect components to revive, ""(c)ontinue without reviving, d(i)sable reviver, or (q)uit?",{"r":"revive","s":"select","c":"continue","Q":"quit"},default="quit")ifchoice=="revive":revive_components()breakelifchoice=="quit":raiseTestCannotRunException("Dead components exist")elifchoice=="continue":breakelifchoice=="select":selected=show_menu("Which components would you like to revive?")ifselected:revive_components(selected)breakelse:logger.info("Nothing selected...")

To get the required behavior, we used an infinite while True loop that terminates with an
explicit break when a valid choice is made by the user and reiterates otherwise.
This ensures that we don’t continue until a valid choice is made.

Can We Do Better?

The problem with the above method is that the flow control is not immediately
apparent when looking at the code: It’s not obvious that the infinite loop
should actually terminate in all but one case. A future developer could easily
break this behavior.

In addition, the if/elif ladder becomes a bit too long to read
comfortably.

defrevive_dead_components():classInvalidChoiceError(Exception):passdefchoice_revive():revive_components()defchoice_quit():raiseTestCannotRunException("Dead components exist")defchoice_continue_():passdefchoice_select():selected=show_menu("Which components would you like to revive?")ifnotselected:logger.info("Nothing selected...")raiseInvalidChoiceErrorrevive_components(selected)whileTrue:choice=choose('"I see dead components..."\n'"Do you wish to (r)evive them all, (s)elect components to revive, ""(c)ontinue without reviving, d(i)sable reviver, or (q)uit?",{"r":"revive","s":"select","c":"continue","Q":"quit"},default="quit")try:locals().get('choice_%s'%choice)()breakexceptInvalidChoiceError:continue

We use several techniques here to improve the clarity of the code:

We used internal functions to encapsulate the possible actions to take. The
advantage of using internal functions is that it keeps the external namespace
clean, and the naming of each function makes its purpose quite clear.

We used a dictionary instead of the if/elif construct. Since Python
doesn’t have a switch or case statement, this is a more readable replacement.

We decided to use an exception to signify, well, exceptional flow control: If the user hasn’t
selected any components, this warrants exceptional behavior. This technique is
much debated, but we felt like it was appropriate in this case.

The function name to be called is determined dynamically at runtime from the
user’s selection. The idea was to avoid code duplication by needing to specify the names
of the functions yet again (but see below).

Removing Some Coolness For Readability

My colleague pointed out that the locals().get('choice_%s' % choice)() trick
is not quite readable. I agreed, and was happy to accept his improved proposal: