Wrapping command-line programs, part III

In the second
article in this series I showed how to use OpenEye's
mol2nam program as a coprocess from Python. To make it work I
had to edit the original source code to add an fflush after writing
the name. Otherwise the output was buffered and inaccessible to the
Python wrapper. Sadly, programs in this field rarely come with
recompilable source code. What could be done if I couldn't add the
fflush?

When the C stdio library initializes stdout it checks if the output is
a terminal. If so it sets the output mode to line buffered.
Otherwise it is put into block buffered mode. A console window is a terminal
but files and pipes are not. Not many people use actual terminals
these days ("tty" is short for "teletypewriter"). Instead they are
emulated using what are called pseudo-ttys. We can create and use our
own pty to communicate with the original OpenEye mol2nam in line
buffered mode.
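
The terminal test is easy to reproduce from Python. This is a minimal sketch using only the standard library's os and pty modules on a Unix-like system; the function name is my own. It asks the same question stdio asks, first of a pipe and then of the child side of a freshly created pseudo-tty.

```python
import os
import pty


def is_terminal_report():
    """Return whether a pipe and a pty look like terminals to os.isatty()."""
    pipe_read, pipe_write = os.pipe()
    pty_master, pty_slave = pty.openpty()
    try:
        return {
            "pipe": os.isatty(pipe_write),  # what a piped stdout looks like
            "pty": os.isatty(pty_slave),    # what stdout looks like under a pty
        }
    finally:
        for fd in (pipe_read, pipe_write, pty_master, pty_slave):
            os.close(fd)


print(is_terminal_report())
```

The pipe reports False and the pty reports True, which is why a program attached to a pty chooses line buffering.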

This gets into an aspect of Unix that I don't know well. There's a
30+ year history of terminal control that I've never had to worry about and
was never interested in learning. I have only vague ideas of what
ioctl, fcntl and tcgetattr/tcsetattr do. What I'm about to describe
works, but there may be ways to make it work better. Please let me
know if there's a better way.

Instead what I do is let someone else provide a higher-level interface
to the terminal control functions. Pexpect is a Python library
influenced by Don Libes' venerable Expect package. It opens a pty
connected to a process, sets the terminal modes correctly, provides a
Python file-like interface, and a few bits of extra functionality.
For more details you should read the documentation and scan the source
code.

Using pexpect is quite simple, except when aspects of the archaic,
baroque pty interface appear. The main interface is the
spawn class, which takes the command to run and an optional
timeout. The newly created instance implements file-like methods, so
you can still read, readline and write to
the interface. Here's the code to connect to mol2nam and
skip the three lines of the header.
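
A rough sketch of that connection step, not the original listing. It assumes a current pexpect release, where encoding="utf-8" makes the spawn object work with text instead of bytes. The helper names connect and skip_header are mine, and the command line is left as a parameter because I don't want to guess at the exact mol2nam arguments.

```python
def skip_header(child, num_lines=3):
    """Read and return the first num_lines lines so later reads start clean."""
    return [child.readline() for _ in range(num_lines)]


def connect(command, timeout=30):
    """Start command under a pty with pexpect and discard its header."""
    import pexpect  # third-party; imported here so skip_header works without it

    # encoding="utf-8" makes the spawn object accept and return str, not bytes
    child = pexpect.spawn(command, timeout=timeout, encoding="utf-8")
    skip_header(child)
    return child
```

Since skip_header only needs a readline method, it can be tried against any file-like object before pointing it at the real program.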

One difference between this and the subprocess interface is
that everything is communicated over a single bidirectional
connection. There is no difference between stdin, stdout and stderr.
That's why the previous code snippet used only readline() and
didn't specify which input to read from.

By default terminals echo the input so something written to the
spawn instance will also be read. When I write the SMILES
line to the process I need to skip the echoed response, like this
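
The write-then-skip pattern might look like the following sketch. The function name ask is hypothetical, and child stands for a text-mode spawn instance or anything else with write and readline methods.

```python
def ask(child, smiles):
    """Send one SMILES line and return the raw reply line."""
    child.write(smiles + "\n")
    child.readline()         # discard the terminal's echo of what was just written
    return child.readline()  # the next line is the program's actual reply
```

The reply is returned untouched here, line ending included, so the caller decides how to clean it up.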

The spawn class has a setecho method which I hoped
would prevent the echoing. When I toggled it I no longer got output
from mol2nam. I don't know why. I wrote a simple program that
implements enough of the mol2nam protocol to pass the self test but
was not able to make it reproduce the problem. Did I mention I don't
like working with ptys?

The readline method gets both the systematic name sent to stdout and
the error messages sent to stderr. Luckily they are easy to
distinguish because "Warning:" is not in any systematic name. As
before, if there is an error I want to restart the connection. I
can't simply close the input stream then read from the output stream
because they are the same connection. If I close one then I close the
other. Instead I need to use a pexpect method called sendeof
which tells mol2nam that there is no more input. The relevant code is:
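
Here is a sketch of how that error handling could be arranged, not the original code. Mol2NamError, restart and name_from_reply are names I made up for illustration, and connect is assumed to be a zero-argument callable that starts a new process. The sendeof and close methods are part of pexpect's spawn class.

```python
class Mol2NamError(Exception):
    """Raised when the reply is a warning instead of a name (hypothetical)."""


def restart(child, connect):
    """Signal end of input, drop the old child and return a fresh one."""
    child.sendeof()  # stdin cannot be closed on its own when everything is one pty
    child.close()
    return connect()


def name_from_reply(line):
    """Return the name in a reply line, or raise if the line is a warning."""
    if "Warning:" in line:
        raise Mol2NamError(line.rstrip())
    return line.rstrip()
```

A caller would catch Mol2NamError, call restart to get a clean process, and carry on with the next structure.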

Bear in mind that lines coming from a pty end with "\r\n" and not just
the "\n" character. This is an aspect of using an API designed to
support typewriter printer carriages. If the last line had been
return line[:-1] then it would have the extra "\r"
character. The rstrip() method removes all whitespace on the
right so it works just fine.
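
The difference is quick to check at the interpreter. The name used below is only an example value.

```python
line = "ethanol\r\n"        # a reply as it arrives from the pty

print(repr(line[:-1]))      # slicing removes only the final "\n"
print(repr(line.rstrip()))  # rstrip removes the "\r" as well
```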

With these in place I ran the regression test code. It didn't pass.
The problem was with the very long SMILES string meant to force
mol2nam to give a segmentation fault. I couldn't figure out what was
going on so I finally wrote a new program that implements the mol2nam
API and should be able to pass the regression test. This let me watch
what was going on with that side of the connection. Because its
stdout and stderr went back to the wrapper code, I opened
/dev/ttyp4 and wrote debugging output there. I had another terminal
window open and from the tty command I knew it was using that
pseudo-tty device. (In Unix nearly all I/O is a file.) By writing to
that file the output goes to that terminal's display. Another help is
the try/except block around the call to main(). That let me
see the exceptions during testing which otherwise would have ended up
somewhere in the pexpect interface.
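
The debugging arrangement can be sketched like this, assuming Python 3. The function name run_with_debug_log is mine, and main is assumed to be a callable that takes the open debug file. The path would be a terminal device such as /dev/ttyp4, though any writable file works.

```python
import traceback


def run_with_debug_log(main, debug_path):
    """Run main(debug); write any traceback to debug_path and report success."""
    # Python 3 only allows buffering=0 in binary mode, so the log takes bytes.
    with open(debug_path, "wb", buffering=0) as debug:
        try:
            main(debug)
        except Exception:
            # Without this the traceback would vanish into the pty connection.
            debug.write(traceback.format_exc().encode("utf-8"))
            return False
    return True
```

Because the file is unbuffered, each write shows up in the other terminal window immediately, even if the process later hangs or dies.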

Here's my test code. I open the debug file in unbuffered mode and use
the "-u" option on the #! line to tell Python to use unbuffered stdin
and stdout. The function _W constructs an appropriate
"Warning:" message that will pass the smi2name test code.

When I used this I found that the long SMILES was never getting to the
coprocess, though the short SMILES strings were. Through
experimentation I found that if the SMILES was 1000 characters or
smaller then it would be sent to the coprocess but anything longer
caused problems. When I tried to write a long SMILES I found that I
got several chr(7) bytes if I read from the spawn interface.
ASCII character 7 is for BEL, which should ring the terminal bell.
This strongly suggests the terminal is buffering the input and the buffer filled up.

Terminals have two major modes: cooked and raw. When you type
something on the command-line you expect to be able two^H^Ho edit the
line before pressing enter. Various characters get treated as editing
characters, like backspace (which is often either ASCII 8/^H or ASCII
127/^?) and "kill line" (ASCII 21/^U). You also expect control-C
(ASCII 3) to kill a process and control-Z (ASCII 26) to suspend it.
When the terminal supports these conversions it is in cooked mode
because it is processing the input. Otherwise it is in raw mode.
Programs like vi and emacs use raw mode to capture each character as
it's pressed and to change the meaning of things like control-C.

To test if this was the case I used the test string
"CC"+chr(21)+"S". In cooked mode the special character in
the middle kills the line; it erases everything before it on the input
line. If the terminal is in cooked mode then the result should be
"hydrogen sulfide". And indeed it is. I also tried using chr(8) for
backspace but had to switch to chr(127) which is what the terminal
actually uses. The "stty -a" command lists all of the
special characters.

The backspace worked as did chr(3) for control-C and chr(26) for
control-Z.
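
The same experiment can be run without mol2nam or pexpect. This is a sketch using the standard library's pty and termios modules, and it assumes a system such as Linux where a new pty starts in cooked mode. The function names are my own. Rather than hard-code chr(21) and chr(127), it asks the terminal for its kill and erase characters, which is the same information "stty -a" prints.

```python
import os
import pty
import select
import termios


def special_chars():
    """Return the kill-line and erase characters of a freshly created pty."""
    master, slave = pty.openpty()
    try:
        cc = termios.tcgetattr(slave)[6]  # the list of control characters
        return {"kill": cc[termios.VKILL], "erase": cc[termios.VERASE]}
    finally:
        os.close(master)
        os.close(slave)


def cooked_line(raw, timeout=5.0):
    """Write raw bytes to the master side; return the line the child side reads."""
    master, slave = pty.openpty()
    try:
        os.write(master, raw)
        # In cooked mode the child side is readable only once a full line exists.
        ready, _, _ = select.select([slave], [], [], timeout)
        if not ready:
            raise TimeoutError("no complete line arrived on the child side")
        return os.read(slave, 1024)
    finally:
        os.close(master)
        os.close(slave)


chars = special_chars()
print(cooked_line(b"CC" + chars["kill"] + b"S\n"))
print(cooked_line(b"CCO" + chars["erase"] + b"\n"))
```

With the kill character in the middle the child side should see only the "S" line, and with the erase character it should see "CC", because the terminal did the editing before the child ever read anything.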

The problem is we're in cooked mode so the terminal saves a 1000 byte
buffer to allow for editing. I want to switch into raw mode but I
don't know how. When I try "import tty" then
"tty.setraw(mol2nam.child_fd)" then the interface just hangs.
Like I said, I don't know the details of ptys well enough.

Luckily for me I don't need to know them. For this interface it's okay
to limit the SMILES string to no more than 1,000 characters and to
prohibit anything other than the printable ASCII characters. I
mentioned in the first essay of this current series that I don't like
checking for incorrect data at this level of the API. The exception
is for cases like this where bad input can cascade and cause big or
unexpected problems.
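
A check along those lines could be as small as this sketch. The names check_smiles and MAX_SMILES_LENGTH are hypothetical, and I take "printable ASCII" to mean codes 32 through 126.

```python
MAX_SMILES_LENGTH = 1000  # the limit found by experiment above


def check_smiles(smiles):
    """Return smiles unchanged, or raise ValueError if the pty could mangle it."""
    if len(smiles) > MAX_SMILES_LENGTH:
        raise ValueError(
            "SMILES is longer than %d characters" % MAX_SMILES_LENGTH)
    for ch in smiles:
        # Control characters such as chr(21) would be treated as editing commands.
        if not 32 <= ord(ch) <= 126:
            raise ValueError("SMILES contains a non-printable character: %r" % ch)
    return smiles
```

Calling this before every write keeps control characters and over-long lines away from the terminal's line editing.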

At 80 seconds there is definitely a performance hit using a pty,
probably from terminal cooking though I didn't try to track it down.
Compare that to the subprocess pipe code which runs in 25 seconds and
the command-line interface at 15 seconds. Still it's better than the
first version which restarted mol2nam for every call and took nearly
370 seconds. Unlike the subprocess version it uses the mol2nam
provided by OEChem (no recompile needed) and unlike mol2nam by itself
it reports exactly which structures could not be parsed.

Andrew Dalke is an independent consultant focusing on
software development for computational chemistry and biology.