I’m releasing Cobe 2.0, with a major conceptual change in its data
model and a more flexible scoring system. To explain the changes,
let’s dig a little deeper into Cobe 1.x.

Cobe 1.x

When Cobe 1.x learns an input phrase, say “this is a test”, it does
the following:

First, it splits the phrase into tokens. Each series of alphanumeric
characters becomes its own token, and each series of characters
between those becomes another.

So “this is a test” becomes seven tokens:

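As a rough sketch of that split (a stand-in for Cobe’s actual tokenizer, not its real code), a regular expression that alternates between runs of word characters and runs of everything else does the job:

```python
import re

def tokenize(phrase):
    # Runs of alphanumeric characters become tokens, and so do the
    # runs of whitespace and punctuation between them.
    return re.findall(r"\w+|\W+", phrase)

print(tokenize("this is a test"))
# ['this', ' ', 'is', ' ', 'a', ' ', 'test']  -- seven tokens
```
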
Cobe 1.x groups these tokens by the Markov Order m (default five)
and adds special END tokens that signal the beginning and end of a
phrase:

And then each group is annotated with the next token that follows.

And the previous token:

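A minimal sketch of that bookkeeping (hypothetical structures and END padding, not Cobe’s real storage) might keep two maps, one counting the tokens that follow each m-token context and one counting the tokens that precede it:

```python
from collections import defaultdict

END = "<end>"   # stand-in for Cobe's special END token
ORDER = 5       # the default Markov order

next_counts = defaultdict(lambda: defaultdict(int))
prev_counts = defaultdict(lambda: defaultdict(int))

def learn(tokens):
    # Pad with END markers so contexts at the phrase boundaries exist too.
    padded = [END] * ORDER + tokens + [END] * ORDER
    for i in range(len(padded) - ORDER):
        context = tuple(padded[i:i + ORDER])
        next_counts[context][padded[i + ORDER]] += 1   # the next token
        if i > 0:
            prev_counts[context][padded[i - 1]] += 1   # the previous token

learn(["this", " ", "is", " ", "a", " ", "test"])
```
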
As more phrases are learned, Cobe keeps track of all the words that
have ever followed these contexts. After a while, one might look like:

Here you can also see each link annotated with its probability in the
learned text. For phrases beginning “this is ”, “this is a” occurred
75% of the time, “this is another” occurred 17% of the time, and “this
is third” occurred 8%.
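
The percentages are just each link’s count divided by the context’s total (the counts below are made up to match the example):

```python
# Hypothetical counts of tokens seen after the context "this is ":
follow_counts = {"a": 9, "another": 2, "third": 1}

total = sum(follow_counts.values())
for token, count in follow_counts.items():
    print(token, round(count / total, 2))
# a 0.75, another 0.17, third 0.08
```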

When generating a reply, Cobe 1.x picks a random next word, shifts it
into the context, and looks up the words that follow that new
context.
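
A sketch of that loop, building on the hypothetical next_counts map above (and ignoring the count-weighted selection a real implementation would use):

```python
import random

def walk_forward(context):
    # Follow the forward chain: pick a random next word, shift it into
    # the context, and repeat until an END token or a dead end.
    words = list(context)
    while True:
        choices = next_counts.get(tuple(context), {})
        if not choices:
            break
        word = random.choice(list(choices))
        if word == END:
            break
        words.append(word)
        context = list(context)[1:] + [word]   # shift it into the context
    return "".join(w for w in words if w != END)
```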

In functional terms, this works exceedingly well. But there’s no
denying that it’s complex. Each context is clearly related to all of
the others, but they’re not explicitly connected to each other—the
connections are to the next word.

Maintaining the forward and backward chains independently means it’s
difficult to visualize the brain as a whole, and since spaces are
separate tokens, half of the useful context is taken up tracking
whitespace.

Cobe 2.0

Cobe 2.0 introduces two major changes to brain storage that reduce
complexity, ease visualization, improve processing speed, and open the
door to a world of existing general-purpose algorithms.

My first change was to connect contexts directly to each other. Since
each is already stored in the database, this is just a matter of
creating a table that links the previous context to its next context.
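
In SQLite terms, that might look something like the sketch below (the column names are assumptions for illustration, not Cobe’s exact schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (
    id INTEGER PRIMARY KEY,
    context TEXT NOT NULL              -- the tokens that make up this context
);

CREATE TABLE edges (
    id INTEGER PRIMARY KEY,
    prev_node INTEGER NOT NULL REFERENCES nodes(id),
    next_node INTEGER NOT NULL REFERENCES nodes(id),
    count INTEGER NOT NULL DEFAULT 1   -- how many times this link was learned
);
""")
```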

Astute readers will note that this creates a graph. Each context is
a graph node, and the graph’s edges are the links between them. Using
a directed graph, Cobe ends up learning “this is a test” like this:

Take note: since the edges of the graph are directed, this same data
structure can be used to walk either forward or backward. No more
training two chains!
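
Against the hypothetical schema above, walking in either direction is just a matter of which column you filter on:

```python
def forward_nodes(conn, node_id):
    # Contexts reachable by following an edge out of this node.
    return [row[0] for row in conn.execute(
        "SELECT next_node FROM edges WHERE prev_node = ?", (node_id,))]

def backward_nodes(conn, node_id):
    # Contexts that lead into this node, for extending a reply to the left.
    return [row[0] for row in conn.execute(
        "SELECT prev_node FROM edges WHERE next_node = ?", (node_id,))]
```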

At this point, we’re still looking at the equivalent of a 5th order
Markov chain. I wanted to reduce the Markov order by eliminating the
whitespace that’s being stored in the contexts.

Other Markov-based chatbots have taken different strategies
here. Some group spaces and punctuation together as the tokens
that separate words, like Cobe 1.x. Some tokenize punctuation and
spaces separately, but still treat spaces as normal tokens. Others
attach the spaces as leading or trailing properties of words and
reconstruct them when joining the words back together.

Cobe 2.0 takes a different strategy. It eliminates all spaces from the
nodes and makes them a property of the graph edges. Combined with
the graph layout, this is a major breakthrough. It keeps the database
size small and the nodes much denser with information.
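
On the schema sketched earlier, that amounts to one extra column on the edges table: a flag recording whether the learned text had whitespace between the two contexts.

```python
# Whitespace lives on the edge, not in the nodes (a sketch, not Cobe's real schema).
conn.execute("ALTER TABLE edges ADD COLUMN has_space INTEGER NOT NULL DEFAULT 0")
```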

Below is a dump of a 2nd order Cobe 2.0 brain that has been trained
with four phrases:

this is a test

this is another test

another test would help

hal hal hal

The empty node at the top is a special node which signals the end of
the graph. It serves the same purpose as the special END tokens in
Cobe 1.x. Open arrowheads show nodes that were learned with spaces
between them.

Looking at this, you can easily see how a reply might be
generated. Cobe follows a random arrow out of each node it visits,
building up a list of reply words as it goes.
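
A minimal sketch of that walk, assuming the has_space column from earlier and a special end node with id 0 (both assumptions for illustration):

```python
import random

END_ID = 0   # hypothetical id of the special empty end node

def random_walk(conn, start_id):
    # Follow random outgoing edges until the walk reaches the end node,
    # collecting (node, has_space) pairs along the way. A real
    # implementation would turn each node into the single new token it
    # contributes, joining with a space only when has_space is set.
    node_id, path = start_id, []
    while True:
        rows = conn.execute(
            "SELECT next_node, has_space FROM edges WHERE prev_node = ?",
            (node_id,)).fetchall()
        if not rows:
            break
        node_id, has_space = random.choice(rows)
        if node_id == END_ID:
            break
        path.append((node_id, has_space))
    return path
```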

When generating a reply with 2.0, all of the database queries are on
the edges table alone. It doesn’t even matter what Markov order the
contexts are anymore. Indeed, different contexts could have different
numbers of tokens. “Raining cats and dogs” could be identified as a
single concept (or
collocation), and it could
be treated as a single word for replies.

If you follow the graph above, it’s clear that you could get a reply
like “this is another test would help” or “hal hal hal hal hal hal hal
hal hal hal hal”.

Being able to say quantitatively that the first is a bad reply and
the second is a great reply is a question for reply scoring, and I’ve
laid the groundwork for some changes there too.

Scoring changes

Cobe (both old and new) will generate as many candidate replies as
possible within half a second. These are evaluated by a scoring
routine, which determines which of the candidates is best. At the end
of the half second, the best candidate found so far becomes the result.
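
The surrounding loop is easy to sketch (the function names here are hypothetical; only the half-second budget comes from Cobe itself):

```python
import time

def best_reply(generate_candidate, score, budget=0.5):
    # Produce candidates until the time budget runs out, keeping the
    # highest-scoring one seen so far.
    deadline = time.monotonic() + budget
    best, best_score = None, float("-inf")
    while time.monotonic() < deadline:
        candidate = generate_candidate()
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```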

Cobe 1.x used a scorer based on MegaHAL’s evaluate_reply(), which adds
together the
self-information of
all reply tokens that also appeared in the input. This means that more
surprising replies score higher.

MegaHAL’s scorer then applies a couple of dividers to discourage
overly long replies.
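
The core of that idea looks roughly like this (a sketch assuming a token_probability() helper; MegaHAL’s exact dividers are glossed over):

```python
import math

def information_score(input_tokens, reply_tokens, token_probability):
    # Sum the self-information (surprisal) of reply tokens that also
    # appeared in the input, so rarer shared words count for more.
    shared = [t for t in reply_tokens if t in set(input_tokens)]
    score = sum(-math.log2(token_probability(t)) for t in shared)
    # A MegaHAL-style scorer then divides by factors of the reply length
    # to discourage overly long replies (details omitted here).
    return score
```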

Recently I’ve been looking more closely at the candidate replies. Many
of them are subjectively better than the result, and I’d like to
explore scoring for the next few releases.

For Cobe 2.x, I have introduced reply scoring based on several
signals. The information content of a reply is one good metric, but
there are a lot of other things that might contribute to a good
reply. Some examples:

Performance

Cobe 2 is quite a bit faster than 1.x, and a Cobe 2 brain is roughly
half the size of the same 1.x brain.

Learning the Brown
Corpus, 500 samples of
English language text, Cobe 2.0 is about five times faster than
1.2. I adjusted the Cobe 1.2 code to use the same database
parameters as 2.0, to get a more accurate comparison of the
core learning algorithm:

Reply speed is difficult to measure, as it’s dependent on random
factors, but I’m seeing at least double the number of candidate
replies in the new code.

My largest brain in production has learned about 230,000 phrases, for
1.8 million edges between 1.4 million nodes. This is better than Cobe
1.x could manage, but there’s still room for improvement. Learning one
million phrases is my next scaling goal.

Next steps

Where to go from here?

First, explore new scorers. There are a few listed above, plus
general-purpose ideas like a Bayesian classifier scorer. That would
allow you to train Cobe based on a list of good and bad replies.

The scorers are sometimes at odds with each other—you might want to
encourage reply length in one case, brevity in another. It might make
sense to build reply profiles out of a set of weighted scorers and
score each set of candidate replies based on a randomly chosen
profile. (e.g. “SHORT AND SHOUTY” or “long and repeating and repeating
and repeating and repeating and repeating”)
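
One hypothetical shape for that (nothing like this exists in Cobe yet; the scorers and weights below are made up):

```python
import random

# Toy individual scorers; real ones would be much smarter.
def length_scorer(reply):
    return float(len(reply.split()))

def shoutiness_scorer(reply):
    return float(sum(1 for c in reply if c.isupper()))

def profile_score(reply, weighted_scorers):
    # A profile is just a list of (scorer, weight) pairs combined linearly.
    return sum(weight * scorer(reply) for scorer, weight in weighted_scorers)

profiles = {
    "SHORT AND SHOUTY": [(length_scorer, -1.0), (shoutiness_scorer, 2.0)],
    "long and rambling": [(length_scorer, 1.0), (shoutiness_scorer, -0.5)],
}
profile = random.choice(list(profiles.values()))
score = profile_score("THIS IS A TEST", profile)
```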

Second, continue improving the speed of candidate replies. Assuming
the scorers are going to select the best response, Cobe needs to make
candidates as quickly as possible.

I have two strategies planned for faster replies. Currently, Cobe
performs a random walk from scratch for every candidate. This could be
made a lot faster by turning the walk into a generator, searching the
tree for the END token and yielding a reply every time it’s found.
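
A sketch of that generator (hypothetical names, reusing the edges table from earlier, with a depth cap standing in for real cycle handling):

```python
import random

def reply_paths(conn, node_id, end_id=0, path=(), max_len=50):
    # Depth-first walk that yields a candidate path every time it reaches
    # the end node, instead of restarting each candidate from scratch.
    if len(path) >= max_len:
        return
    edges = conn.execute(
        "SELECT next_node, has_space FROM edges WHERE prev_node = ?",
        (node_id,)).fetchall()
    random.shuffle(edges)
    for next_node, has_space in edges:
        if next_node == end_id:
            yield path
        else:
            yield from reply_paths(conn, next_node, end_id,
                                   path + ((next_node, has_space),), max_len)
```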

The second strategy is to start generating replies in parallel. Cobe
is easily distributed across a large number of computers.
