The first rescue handles exceptions that can be raised when reading
data would cause the system to block. We handle this situation by doing
nothing and waiting until later to try reading again. The second
exception we address is an EOFError, which isn’t necessarily
an error. For example, if the client has told the server it is all done
and intends to disconnect, the server may shutdown the connection which
can cause this exception to be raised. Even when an EOFError is
unexpected, it always tells us that it’s time to shutdown the connection
and move on.

This brings us to the final rescue block, which handles every other
kind of exception. The system responds by shutting down the connection,
firing a terminated event, and then re-raising the exception. I chose
to re-raise the exception to give additional feedback to developers,
which might have been handy if it weren’t for the fact that the
exception is being raised within a separate IO processing thread.
Thus, the exception can only be rescued when the thread is joined, which
doesn’t happen until the developer has called disconnect on the
OnStomp::Client object. By that time, it is far too late to handle
the error in any meaningful way.

So, it’s pretty obvious that re-raising exceptions in this fashion is
useless, now I want to show you why it’s also problematic. Suppose that
we’ve got this bit of code:

If an Errno::ECONNRESET exception is raised before all of the
messages have been sent, the developer won’t know it until
client.disconnect has been called. Now, from the server’s perspective
everything behaves as expected: the transaction starts with a BEGIN
frame, some messages are received, and the connection is closed before
a COMMIT frame is received, so the received messages are discarded.
The developer, on the other hand, gets screwed: the exception is raised
in a separate thread from where the transaction block is evaluated.
All of the messages have been deleted from the persistent storage and
not one of them has actually been accepted by the server, which is a
dick move on my part. Now, to be fair, I never intended the
transaction block to work in this way. The intent of transaction
was to deal with exceptions generated within the block itself and
either commit or abort the transaction automatically. In other words, if
remove_message_from_persistent_storage raised an exception, the
transaction block would be there to rescue it, send an ABORT frame
to the server, and then halt any further transaction processing. My
original intentions aside, it sure would be nice if the developer had
an easy way to verify that the transaction had been committed before
deleting those persisted messages. This is the crux of
a comment by celesteking,
and I think it’s a pretty good idea.

However, this only works if we’re performing a single transaction. If
we have multiple transactions, then we need to keep track of transaction
IDs and our on_commit call turns into a giant case statement.
Clearly this is a less than desirable solution.

A better approach might take the following form:

# Pull messages from a file, database, etc.messages=get_messages_from_persistent_storageclient.transactiondo|t|t.on_abortdo# Called when the transaction is explicitly aborted or an IO# exception prevents sending the final COMMIT frame.endt.on_commitdoremove_all_messages_from_persistent_storageendmessages.eachdo|msg|t.sendmsg.destination,msg.bodyendendclient.disconnect

I suppose these blocks could also be supplied as parameters to
the transaction method (note: I’m using a Ruby 1.9 style hash)

Personally, I prefer the former over the latter, but it requires a bit
more effort to implement if I want to guarantee that any on_abort
callback will be called even if it is defined after a the ABORT frame
has already been generated (or if an IO exception has occurred while
processing the transaction.)

Hopefully this illustrates why triggering a terminated event on the
connection and re-raising the exception is both a useless and
inadequate way of handling errors, and why something better is needed.

IO processing stops until the callbacks are completed. A
long-running callback could very easily choke the connection.

If an exception is raised within a callback it may
percolate up to the threaded processor, which will kill the IO loop
and ruin everyone’s day.

I don’t think I really need to explain this problem further, so let’s
move on to the last issue.

The Code Base

There is a whole lot of code within OnStomp that has nothing to do
with the actual Stomp protocol, which makes it difficult to figure out
exactly what OnStomp is doing. Also, some of the library’s packaging
makes no sense to me now – fortunately, that’s mostly my problem and
not something anyone else should have to worry about.

Fixing It

I’ve begun work on the proper fixes for these problems. The first step
will be factoring out all the non-blocking IO stuff into a separate
gem: io_unblock. Rather than
a connection that dispatches events, this will be working with
lambdas, procs or blocks that are always invoked from within
the IO processor thread. Within OnStomp, these callbacks will not be
available to developers, instead they will be used to enqueue events
that an event dispatcher will invoke later. I don’t know if a callback
invoking, threaded IO-ish object will be of use to anyone else, but
as it will in no way be tied to the OnStomp gem or the Stomp protocol,
I see no harm in releasing it on its own.

I’m toying with the idea of putting all of the event dispatching code
into a separate gem, as well. However, I won’t know if this is
warranted until I actually start mucking about with it. In any case,
I do plan on running the dispatcher within yet another thread to
resolve the issues surrounding the current event dispatch process.

There are a couple of things I need to keep in mind while making these
changes. First, being able to safely reconnect within a
on_connection_closed event callback should not be the exercise in
coding gymnastics that it is right now. Second, I need to preserve
the ability developers have to use the before_transmitting or
before_<frame command> events to change frames before they are
serialized and sent on to the broker. Right now this is trivial because
the client triggers those events and after the callbacks complete, it
passes the frame off to the connection. If event dispatching is
handled within a separate thread, some care must be taken to ensure
that the dispatcher has triggered those events before the frames are
serialized.

I think that’s enough for now, I’ll post more when I start making some
actual headway.

To be completely honest, I’m not much of a reader. My reading speed lies
somewhere between “painfully slow” and “potentially illiterate,” so reading
fiction, in particular, is more chore than hobby. I love works of reference
when they have well constructed indices or, better yet, hyperlinks
between related topics — my bookshelves and wikipedia usage can
attest to this. I’m also fascinated with “meta-reading” fiction, where I read
summaries, critiques and so forth of some work to gain insights into what
it’s all about, but without reading the actual source material. This
fascination is probably the only reason I didn’t fail every high school
English class. The point of this, which I’m dangerously close to completely
losing, is that my thoughts on a book are generally not of much use to anyone
else.

With all of that in mind, I will say that
Learn You a Haskell for Great Good! is
probably the best book I’ve purchased in a while. I’m only into the 6th
chapter now — again, painfully slow reading speed — but the
tone, narrative, examples and pretty pictures make it an enjoyable read for
me. There are externalities working in LYAH’s favor, namely my strong
interest in Haskell as it appeals to “mathematics degree” brain, and a recent
uneasiness I’ve developed regarding some elements of the Ruby/Rails world;
however, I think the book would be just as engaging without these factors.

So, if you’re the kind of person who buys books based upon reviews that
could have been written by an 11 year old, pick up a copy
of Learn You a Haskell for Great Good! right now.
Thank you, Miran Lipovača, for such a unique Haskell introduction.

On the topic of books, but not directly related to Learn You a Haskell,
if you want a great introduction to combinatory logic, grab a copy of
To Mock a Mockingbird
by Raymond Smullyan. It’s entertaining, a fairly easy read and, unfortunately,
out of print, so you’ll probably have to buy it second hand.

While lurking on the #ruby-lang channel on freenode.net,
oddmund suggested creating a method_missing
handler that would “auto-correct” misspelled method. His suggestion was of
course a joke, but it was a pretty good one, so
he implemented it. I provided
my own implementation, along with some other
terrible ideas.

I will never push terrible_things to RubyGems for two reasons:

It is a joke, obviously.

Its features are incompletely implemented.

The goal of implementing this gem was to have a bit of fun and maybe explore
something potentially useful along the way. One useful bit came from
Ryan Davis, Gem::Text includes a Levenshtein
distance calculator, so that’s handy. The “semi-monadic” treatment of
NilClass actually wasn’t all that interesting. The idea has been done
a lot, and simply returning itself for method_missing doesn’t make it
a real “Maybe monad.” The fun and interesting part has been the lazy
evaluation bits. By overloading Object::new, all object instantiation can
be evaluated lazily. With a bit of enumerator handling, we can work with
infinite lists in Ruby by virtue of lazy evaluation.

Others have done similar things
with Ruby, but most of them tend to only defer evaluation once, if further
methods are invoked on the lazy “proxy”, then its fulfilled to a real object.
To be fair, a useful lazy evaluation library really can’t be both correct and
indefinitely lazy in Ruby — special care has to be taken when dealing
with methods that mutate their object versus ones that return new objects.

Regardless, the whole exercise has been quite a bit of fun, and being able
to transform, filter, and take from infinite lists is pretty sweet. I can
definitely think of a cleaner implementation than the one derived from lazy
evaluation, though it ultimately relies on the same trick of using
Enumerator#next so that finite numbers of elements can be plucked from
these lists that would be impossible to directly compute.

This post is a bit light, both in length and in technical content, my
apologies. I would like to say that so far I’m impressed with so-called
“e-cigarettes.” They still provide nicotine, so I doubt they’ll be much help
in breaking my addiction, but at least I’m not getting my fix by breathing
deeply of burning organic matter. Now if only I could get nicotine in an
easily digested gel tablet.

The Overview

What I’m going to talk about isn’t new or novel, but it’s something that
interests me, so there. I want to look at using Bayes’ theorem to
classify documents. This has been done to death with regard
to classifying spam, but I’m going to take a more general look. For our
purposes, a document is a sequence of terms such as:
“This cat isn’t made of butterscotch”; a paragraph, or the HTML source of this
page. The idea is to classify a bunch of documents by hand to “train” our
classifier so that when new documents come along, we can ask the classifier
where they belong. One way to accomplish this is through the use of Bayes’
theorem.

The Real Work

To start, we’ll need a way to tokenize a document into terms. The
tokenizer’s implementation will depend upon the type of documents under
consideration. To keep things somewhat simple, let’s assume we’re working
with English phrases; thus, the terms produced by our tokenizer are simply
words. Let’s suppose we have a set of categories,
.
So, what we’re then after is the probability that a document belongs
in a particular category, , given these extracted terms:

We’re letting represent our document, and then
expanding the document into its individual terms, .
You’ll notice that we’ve skipped the expansion of , the
justification for this will be explained shortly.

We can expand this using the rules of conditional probability:

This formulation is available on the Wikipedia page on Naive Bayes classifiers.
As we are taking a naive approach, we assume there is no interdependence
between terms, ie:

Now, when it comes to words, this is a obviously poor assumption. For instance, it is far
more likely that the word “poor” would appear before “assumption” than the
word “Faustian” in day-to-day phrases. However, we’re going to go ahead with
the naive assumption for now, which simplifies our equation:

For any term, , we take to
mean the probability of that term occurring given we’re working with
category , and we calculate it thusly:

where is the number of times term
occurs in category . This brings us to ,
meaning the probability of choosing category without taking
the document into consideration. There are a number of ways we could
calculate , such as assuming that all categories are
equally likely:

where denotes the number of categories in our
classifier. While this works, a better measure of might take into account the
number of documents a category has been trained with:

where indicates the number of documents belonging to
category . All other things being equally,
will favor categories that have been fed the most
documents. If each category is trained with the same number of documents,
then we’re back to the uniform probability given earlier.

Basing on document counts does pose a problem. Suppose
the following documents belonged to the same category:

this document is pretty small
small this document is pretty
pretty small this document is
is pretty small this document
document is pretty small this

These are all distinct permutations of the same string, and thus distinct
documents. The problem is that the classifier we are in the process of
constructing deals with terms. Our assumption of the independence between
terms means that their ordering has no bearing on the classifier. From this
position, we could argue that each of the five documents listed above are
actually the same document as far as our classifier is concerned, and if we
relied on document counts for , this set of documents
would give some category an unfair bias. So, let’s consider another
alternative for calculating :

This approach is calculating the probability of a given category as the
sum of occurrences of all terms in the category and the sum of occurrences
of all terms across all categories. Unfortunately, this form may also be
unfairly biased by document repetition (e.g. if we fed a category the
five documents shown earlier.) Let’s consider one more alternative:

where and are defined
as:

These definitions may seem complicated, but they are both very simple
concepts: is just counting the number of distinct terms
in category and is counting the total
number of distinct terms from all categories.

While each of the four possibilities presented for are
very different from one another, they are all valid approaches. Personally,
I feel is too naive in its assumption that all
categories are equally likely to be selected, and as I intend on keeping
track of only terms and categories, I can also rule out
. As mentioned earlier, can
be biased by repeated terms, so I’m going to opt for
. Again, I wish to stress that this is a
personal choice, your Bayesian needs may differ from mine.

With that all taken care of, we can now expand our classifier to:

Our problem is now just one of counting. The expression may look horrible,
but that is largely a result of each expression being explicitly defined. Now
that we know what the expressions mean, let’s do an informal clean up of
this warlock by substituting simpler symbols for our sweet mess of expressions.
Before we do, we need to take a closer look at , which in
earlier revisions of this post was explicitly, and incorrectly, defined.
For a given document, will be the same, regardless of the
category under consideration. When our classifier is asked to classify a
document, it will iterate over each of its categories and perform this
calculation, which each of the final category probabilities will be scaled
by this same term. So, our first simplification will be to substitute
with , which we will not
explicitly define as we will eventually discard it.

Now, let’s let be the number of unique terms in the
category we are considering and be the number of
unique terms in all of our categories. Finally, we let
be the number of occurrences of term in the current
category and be the occurrences of all terms in all of our
categories. With these simplifications, our equation becomes:

Hopefully it is now painfully obvious that each category will be scaled
by a common factor, namely . The purpose
of our classifier is to pick the category a given document most likely
belongs in, and if they are all scaled by this common term we can disregard
it. In an effort to preserve mathematical accuracy, I have opted to multiply
both sides of the equation by .

All we’ve gained from this simplification is some mental manageability. I
wanted to derive the explicit calculations involved in Bayesian
classification, but products of fractions containing summations don’t tend
to be very memorable. Any additional calculations we may need to perform
on the explicit form would have made things even messier, which serves
as a nice segue into our next step.

In a world where we are free to perform these calculations without loss of
precision, this expression is just fine. However, given that we are
multiplying a series of numbers all in the range , a
computer may eventually round a product to 0, and totally ruin our day. We
can correct this with a little help from a
logarithm*:

Is this logarithm business really necessary? Consider a document with
324 unique terms where, thanks the awesome power of contrived examples,
for each term. When we take the
product of all these terms, we end up with . That’s
a pretty small number, so small that Ruby mistakes it for 0:

0.1**324# => 0.0

Now, we did factor out of our product earlier, but
is also bound to be very small. This
post alone is up to 608 unique terms, just to give an example of how likely
it is that our classifier might experience floating-point underflow.

Digging Deeper

There are a few edge cases to consider, here are five I can think of off of
the top of my head:

What if there are no known terms for a particular category?

What if there are no known terms for the classifier as a whole?

What if a term is unknown to a particular category?

What if a term is unknown to the classifier as a whole?

How do we handle repeated terms (i.e. ) in a
document that our classifier is attempting to classify for us?

We can easily address the first two cases. If any category has not been trained, we
assume , thus if none of our
categories have been trained (case #2), we assume all categories are equally
likely (or unlikely) and either return an arbitrary one or none at all.
The third and fourth cases are consequences of the fact that we are
estimating probabilities. Through training, our classifier is building
approximations for these various probabilities. It is
entirely possible, perhaps even very likely, that we will encounter documents
we wish to classify containing terms that were not included in our training.
If we rely solely on the calculations presented here, a foreign term will
result in zeroing the overall probability or produce an error —
either from a division by zero or from the undefined
calculation. To prevent garbage results, we will need to assign a non-zero
probability, , to terms unknown to a particular
category.

The fifth case deserves its own paragraph. To handle this situation, we
defer to simple rules for conditional probabilities. Consider the following:

If and are describing the same event, then
the expression simplifies to:

The same holds true for documents we wish to classify. We could argue that
repeated terms do not describe the same event because the terms are occurring
in distinct positions; however, we have not worked positional information into
our classifier. It is important to note that when we are training our
classifier with documents, we do take repeated terms into consideration.

Final Business

Now that virtually every email client has, at one point or another, made
use of some form of Naive Bayes classification, I realize it’s no longer
the hot topic it once was. However, while reviewing the Ruby
classifier, I discovered that their
Bayes implementation was wrong. This was independently confirmed by
Brian Muller. While working on a fork of it
with Zeke, I discovered that while the approach is
well known, there are nuances that can significantly impact the results
— examples being the choice in and the default
probability, .

Foot Noted

Note: Logarithms

I would like to thank Brian Muller for
explaining the use of logarithms in a
Naive Bayes classifier
he’s developing in Ruby, even if we don’t share the same views on
. My Numerical Analysis professors would probably be
displeased that I forgot about this trick.
[ jump back ]

I’ve written the first follow-up to Properties of Code: Functional Complexity
about 3 times now and have scrapped it 3 times. Every attempt has been less
“mathy” than the start of the series; each has contained an interesting
point or two, but those nuggets get buried under a mountain of meandering.

The biggest hinderance has been that there has been too much “feeling” and
“guess work” in my efforts. My desire to show a relationship between
Functional (or, more precisely, Satisfaction) complexity and Kolmogorov
complexity provided a pretty clear path for me to follow. Readability is a
bit more vague.

What I need is more empirical data to direct my thoughts and efforts. I have
some hypotheses, some suspicions and very little data. Espousing a position
without the backing of mathematical formalisms or strong data makes me uneasy,
and that tends to lead to a rambling and incoherent narrative. I do find
something amusing in the fact that my efforts at writing on readability have,
thus far, been virtually unreadable.

As I am a single-task kind of guy, I feel compelled to work out my thoughts
on readability before moving on to anything else. However, it is pretty
clear to me that doing so will totally stall this site. So, rather than
fixate on writing a narrative around claims I cannot support at this time,
I’m going to list my thoughts and start collecting data. Once I have some
meaningful data to work with, I’ll dig into it and see what comes out. If
I find support for my claims, great. If the data refutes some or all of my
thoughts, even better. This will give me something to work from and
hopefully produce a much more meaningful essay than the shit I’ve attempted
to cobble together so far.

At any rate, here are my suspicions, naked and, for all intents and purposes,
completely unfounded at this time:

Any link between Kolmogorov Complexity and Readability is a trivial one
(eg: a program has high Kolmogorov complexity and is unreadable because it
is nearly a random string of characters.)

The link between a “readable” story and “readable” code is probably
stronger, but the two are not equivalent.

The visualization of quantitative data has implications in readability of
code, and potentially the readability of programming languages.

Code produces two products: a solution to some problem to be consumed by
users, and the code itself, to be consumed by developers. Writing code
that only solves the problem is a partial victory.

Now that I’ve put a few thoughts out there, I’m mentally free to think about,
and write on other topics, while I collect data and look for evidence.