Tuesday, June 26, 2007

A floating-point number is neither a Real number nor an Interval --- it is an exact rational number. A floating-point number is not uncertain or imprecise, it exactly and precisely has the value of a rational number. However, not every rational number is represented. Floating-point numbers are drawn from a carefully selected set of rational numbers. The particulars vary among the different floating-point formats. The popular IEEE 754 standard uses these:
Single Precision
The significand is an integer in the range [8388608, 16777215) which is multiplied by a power of 2 in the range
2-149 = 1/713623846352979940529142984724747568191373312
2104 = 20282409603651670423947251286016
Double Precision
The significand is an integer in the range [4503599627370496, 9007199254740991) which is multiplied by a power of 2 in the range
2-1074 = 1/202402253307310618352495346718917307049556649764142118356901358027430339567995346891960383701437124495187077864316811911389808737385793476867013399940738509921517424276566361364466907742093216341239767678472745068562007483424692698618103355649159556340810056512358769552333414615230502532186327508646006263307707741093494784
2971 = 19958403095347198116563727130368385660674512604354575415025472424372118918689640657849579654926357010893424468441924952439724379883935936607391717982848314203200056729510856765175377214443629871826533567445439239933308104551208703888888552684480441575071209068757560416423584952303440099278848
Since the power of 2 can be negative, it is possible to represent fractions. For instance, the fraction 3/8 is represented as

-25 12582912
12582912 * 2 = ------------- = 3/8
33554432

in single precision floating point.
In double-precision, 3/8 is represented as

.
Not every fraction can be represented. For example, the fractions 1/10 and 2/3 cannot be represented. Only those fractions whose denominator is a power of 2 can be represented.
It is possible to represent integers. For instance, the integer 123456 is represented as

-7 15802368
15802368 * 2 = --------- = 123456
128

.
Not every integer can be represented. For example the integer 16777217 cannot be represented in single precision.
--------
When floating-point numbers appear in programs as literal values, or when they are read by a program from some data source, they are usually expressed as a decimal number. For example, someone might enter "38.21" into a form. Some decimal numbers, like 0.125, can be represented as a floating-point number, but most fractional decimal numbers have no representation.
If a literal number cannot be represented, then the system usually silently substitutes a nearby representable number.
For example, if the expression 0.1 appears as a literal in the code,the actual floating point value used may be

-27 13421773
13421773 * 2 = ------------- > .1
134217728

.
as another example, 38.21

-18 10016522
10016522 * 2 = ----------- < 38.21
262144

.
Most implementations substitute the nearest representable number,but this has not always been the case.
As it turns out, much of the error in floating-point code comes from the silent substitution performed on input.
--------
Floating-point numbers are usually printed as decimal numbers. All binary floating-point numbers have an exact decimal equivalent, but it usually has too many digits to be practical. For example, the single-precision floating-point number

.
doesn't share any digits beyond the first 7.
There is no standard way to print the decimal expansion of a floating-point number and this can lead to some bizarre behavior. However, recent floating point systems print decimal numbers with only as many digits as necessary to ensure that the number will be exactly reconstructed if it is read in. This is intended to work correctly even if the implementation must substitute a nearby representable number on input. This allows one to `round-trip' floating-point numbers through their decimal representation without error.
(NOTE: Not all floating point systems use this method of parsimonious printing. One example is Python.)
There is a curious side effect of this method of printing. A literal floating-point number will often print as itself (in decimal) even if it was substituted in binary. For example, if the expression 0.1 appears as a literal in the code, and is substituted by

-27 13421773
13421773 * 2 = -------------
134217728

.
this substitute number will be printed as "0.1" in decimal even though it is slightly larger than 0.1
As mentioned above, Python does not use parsimonious printing, so it is often the case in Python that a literal floating-point number will not print as itself. For example, typing .4 at the Python prompt will return "0.40000000000000002"
The round-trip property is nice to have, and it is nice that floating-point fractions don't seem to change, but it is important to know this fact:
The printed decimal representation of a floating-point number may be slightly larger or smaller than the internal binary representation.
--------
Arithmetic
Simple arithmetic (add, subtract, multiply, or divide) on floating-point numbers works like this:
1. Perform the operation exactly. Treat the inputs as exact rationals and calculate the exact rational result.
2a. If the exact rational result can be represented, use it.
2b. If the exact rational result cannot be represented, use a nearby exact rational that can be represented. (round)
This is important, and you may find this surprising:
Floating-point operations will produce exact results without rounding if the result can be represented.
There are a number of interesting cases where we can be assured that exact arithmetic will be used. One is with addition, multiplication,and subtraction of integers within the range of the significand. For single-precision floating point, this is the range [-16777215,16777215]. For double-precision,
[-9007199254740991,9007199254740991]. Even integer division (modulo and remainder) will produce exact results if the inputs are integers in the correct range. Multiplication or division by powers of two will also produce exact results in any range (provided they don't overflow or underflow).
In many cases floating-point arithmetic will be exact and the only source of error will be the substitution when converting the input from decimal to binary and how the result is printed.
However, in most cases other than integer arithmetic, the result of a computation will not be representable in the floating point format. In these case, rule 2b says that the floating-point system may substitute a nearby representable number. This number, like all floating-point numbers, is an exact rational, but it is wrong. It will differ from the correct answer by a tiny amount. For example, in single precision adding .1 to 0.375:

Sunday, June 24, 2007

To understand floating point arithmetic, we have to describe how floating-point numbers map to mathematical numbers and how floating point operations map to mathematical operations. Many people believe that floating point arithmetic is some variation of interval arithmetic. This view can be described in this picture:

Interval view
In this view, there is a mapping from floating point numbers to intervals. For each floating point operation, there is an operation on the intervals that the floats map to. The resulting intervals are then mapped to the corresponding floating point result.
An alternative view is this:

Rational View
In this view, each floating point number represents an exact rational number. Operations on floating point numbers are mapped to the corresponding exact rational operation. If necessary, the resulting exact rational number is mapped to an appropriate floating point number by rounding. Since some rationals correspond exactly to floating point values, rounding may not be necessary.
Are these views equivalent? For our purposes, an interval is a set of real numbers we can describe by two endpoints, the lower bound and the upper bound. Either of the endpoints may be part of the interval or not. Given an arbitrary real number X, any pair of real numbers, one of which is below the rational, one of which is above, will define an interval that contains X.
There is no one-to-one mapping from rational numbers to intervals: a rational number such as 5/8 is contained by any interval with a lower bound less than 5/8 and an upper bound greater than 5/8. For example, the intervals [0, 30), (1/2, 1], and (9/16, 11/16) all contain 5/8. Also, any non-degenerate interval (that is, where the endpoints are distinct), contains an infinite number of rationals.
These views cannot be semantically equivalent, but are they *computationally* equivalent? A computation that used the interval model may yield the same floating point result as one that performed under the rational model. (Obviously, we would reject a model which didn't give the result `1 + 1 = 2'.) But do all computations performed by one model yield the same answer as the other? If not, then we can at least distinguish which model an implementation uses if not decide which model is better.
To determine computational equivalence, we need to be precise about what operations are being performed. To aid in this, I have defined a tiny floating point format that has the essential characteristics of binary floating point as described by IEEE 754. This format has 2 bits of significand and 3 bits of mantissa. The significand can take on the values 0-3 and the exponent can take on the values 0-7. We can convert a tiny float to an exact rational number with this formula:
(significand + 5) * 2 (exponent - 5)
There are 32 distinct tiny floating-point values, the smallest is 5/32 and the largest is 32. I use the notation of writing the significand, the letter T, then the exponent. For example, the tiny float 2T3 is equal to (2 + 5) * (2 ^ (3 - 5)) = 1 3/4. Since there are so few of them, we can enumerate them all:

Table of real intervals
Arithmetic on intervals is relatively easy to define. For closed intervals, we can define addition and subtraction like this:
[a b] + [c d] = [a+c b+d]
[a b] - [c d] = [a-d b-c]
Multiplication and division are a bit trickier:
[a b] * [c d] = [min(ac, ad, bc, bd) max(ac, ad, bc, bd)]
[a b] / [c d] = [min(a/c, a/d, b/c, b/d) max(a/c, a/d, b/c, b/d)]
The formulae for open and semi-open intervals are similar.
A quick check:
1 is in the interval (15/16 9/8).
(15/16 9/8) + (15/16 9/8) = (15/8 9/4)
2 is in the interval (15/8 9/4), so 1 + 1 = 2.
So far, so good, but we will soon discover that interval arithmetic has problems. The first hint of a problem is that the intervals are not of uniform size. The intervals at the low end of the line are much smaller than the intervals at the high end. Operations that produce answers much smaller than the inputs will have answers that span several intervals. For example, 20 - 14
20 is in the interval [18 22]
14 is in the interval [13 15]
[18 22] - [13 15] = [3 9]
We have 7 intervals that fall in that range, but none are large enough to cover the result.
The second hint of a problem comes when multiply. It is true that 1+1=2, but does 1*2=2?
1 is in the interval (15/16 9/8)
2 is in the interval (15/8 9/4)
(15/16 9/8) * (15/8 9/4) = (225/128 81/32)
This is clearly not the same as (15/8 9/4). It seems that we have lost the multiplicative identity.
An adherent of the interval model will have to come up with a way to map these arbitrary result intervals back into the floating-point numbers.
In an upcoming post, I'll describe the rational model in more detail.

Monday, June 18, 2007

If you have studied information theory, you might have noticed the question-bits procedure in the last post. It is the binary case ofthe Shannon Entropy. This makes sense: the entropy is the amount of information we can get out of the answer to a yes-or-no question when we have already been told the cluster.
This also supplies us with a good idea of what makes a good cluster: the entropy after clustering should be as small as possible. That is, if the cluster is very homogeneous, then there aren't very many good yes-or-no questions that can narrow down your guess. Everything in the cluster has the same answer to almost all the questions. This would be a good clustering. If the cluster is very heterogeneous, then you'd expect a yes-or-no question to divide the cluster further. This would be a poor clustering.
As it turns out, other people have been down this path: Barbara, Couto, and Li wrote a paper describing `COOLCAT: An entropy-based algorithm for categorical clustering'.
So I guess I gotta write some code to try this out....

Friday, June 15, 2007

Anyone experienced with 20 Questions knows that you should ask questions that tend to cut the remaining possibilities in about half. Computer hackers know that you can distinguish a maximum of 2^20(1048576) distinct objects with twenty yes-or-no questions. Given a set of distinct objects, you need about log2 bits to enumerate them.
Intuitively, a yes-or-no question that divides your sample space in half gives you 1 bit of information.
But what if your question doesn't divide the space evenly? Again, it is intuitive that if your question has the answer `yes' for every object, then your question doesn't help narrow things down at all. This is true if your question has the answer `no' for every object as well. So if we were to plot the amount of information we get from asking a question based upon the percentage of `yes' answers, we'd have a curve that starts and ends at zero and has a peak of 1 bit at exactly the half-way point.
Ok, so suppose your yes-or-no question is `yes' for 1/4 of the objects. If the answer comes out yes, you have gotten the effect of two `binary' questions, which would be 2 bits of information. So the amount of information you get from a question depends on how much you trim the object space. This would be mathematically -log2(x) where x is the ratio of the trimmed space to the original space. For example,-log2(1/2) = 1 bit, 1 bit of information if you trim the space inhalf. -log2(1/4) = 2 bits if you trim the space to 1/4 its originalsize. As a check, -log2(1/1048576) = 20 bits --- if we go from 1048576 down to a single object we have gained 20 bits ofinformation.
But the reason we don't ask questions that trim several bits off the space of objects is because they are usually answered in the negative. If our yes-or-no question is 'yes' for 1/4 of the objects,and the answer comes out `no', we only gain -log2(3/4) = 0.415 bits.(ok, fractional bits are weird) So when we ask a yes-or-no question we need to take into account both possible answers.
Well, since the answer is 'no' 3/4 of the time, we expect to get .415 bits 3/4 of the time, and we expect to get 2 bits the other 1/4 of the time. The average is .81 bits. We can write a formula for computing this sort quantity:

Thursday, June 14, 2007

I don't know much about unsupervised machine learning, so I'm trying to figure it out for myself. This will probably be laughably naive to someone that knows this stuff. Go ahead and laugh.
I got on a Bayesian clustering kick yesterday and read a few papers. Using Unlabeled Data to Improve Text Classification by Kamal Paul Nigam seemed relevent. His thesis was that you could use a little bit of labeled data and a lot of unlabeled data to train a text classifier. I was wondering if this works in the limit of no labeled data at all.
This looked like a good starting point, and I started thinking of the generative model of email and spam. I spent a few hours pondering this when I realized that I'm approaching this incorrectly. I need to understand what a `cluster' is in the first place before I can figure out how to get the machine to recognize one.
There seem to be a number of different theories about how to cluster data. It is hard to decide what is the `best'. They all seem to have a certain amount of `magic', too, usually in the form of inpenetrable math. I thought I knew some math when I went to college, but some of these things look pretty bizarre.
It isn't obvious to me what constitutes good clustering. At one limit, we have one cluster for each and every email that holds exactly that email. The clusters are *very* precise, but not very helpful. At the other limit, we have one cluster for the entire corpus of email. It isn't very precise or helpful. Somewhere in the middle we have a set of clusters each of which most likely contain several messages. There has to be some measure that assigns a high value to this middle ground and a low value to the limiting case.
If you're like me, you're already thinking about drawing circles around groups of little dots on some graph paper, and how to decide the right size of circle, where to put it, etc.
But I'm getting ahead of myself. What exactly *is* a cluster? Why can't I just take every fifth email, put it in a bag and call that a cluster? The point of clustering is to capture some similarity between objects. If I am thinking of a particular email, and I ask you a question about it, you'd have no idea at all what the answer might be. But if I were to tell you that the email I'm thinking of is from my `spam' cluster, then you'd be able to answer a few questions about it. For instance, you might infer that the liklihood that the email solicits money is *much* higher than if you didn't know the clustering.
So if you know that an object belongs to a cluster, you now know something about the object itself. A cluster must have a certain amount of information associated with it.
This is starting to make a little sense. If I were to randomly assign email to clusters, then the fact that an email was in cluster 5 would be fairly meaningless because we'd expect cluster 5 to be more or less the same as cluster 4 and have the same statistical characteristics as the entire corpus of email. So the measure of a good clustering has to be related to the information content of that cluster.
To be continued....

Wednesday, June 13, 2007

I have assets exceeding twenty million dollars in a Nigerian Bank. I don't need to make a withdrawal because I can refinance my home with no money down and get cash back in the process. But this is nothing compared to my 37 inch, um, wallet. And I can stand at attention for hours, much to the delight of the lonely but attractive women who have decided to email me.
Oh. You, too, eh?
A few years back I decided I should be proactive and try to figure out what to do about spam. I read Paul Graham's `Plan for Spam' and decided to make my own spam filter in Emacs. It took a while, mainly because I really wanted to understand the principles behind the Bayesian approach.
Ultimately, I got my filter working, but it just didn't perform as well as I had hoped. I was considering more complex approaches, but the university installed SpamAssassin and it did a pretty good job, and I wasn't getting paid to design spam filters. But now and again I want to pursue an idea that sounded promising to me.
Nigerian 419 spam really doesn't have much in common with the discount drug ads, and neither share many features with the cheap mortgage spams. I did a bit of research and found that the bulk of spam falls into about 10 major catagories. A narrow Bayesian filter could easily pick out one of these catagories, but you need a fairly broad filter to handle them all. But a broad Bayesian filter is more prone to false positives.
It occurred to me that a set of Bayesian filters, each tuned to a particular catagory of spam, might perform much better than a single filter attempting to cover spam in general.
The problem is training such a filter. You need an already classified corpus of examples in order to compute the statistics for a Bayesian filter. But I *hate* this stuff! I'm not going to wade through thousands of spams and try to manually classify them into the various sordid catagories. And I'd have to retrain the filters each time a new scam comes along.
So rather than filtering my mail, I want to categorize it. I want a program to look over my incoming mail and simply group it into buckets based on similarity. The categorizer won't know that a set of emails are spam. It doesn't need to. It will simply place all the similar email into the same bucket. When I go to read my mail, I'll take a peek in each bucket. ``V1AGR@" -- nope, not interested. ``Continuation-passing-style'' --- ok, I'll read the email in this bucket. I'll have the added benefit of auto-generated folders for the email I *want* to read.
But I don't want to have a pre-defined set of categories. I want the machine to perform `unsupervised clustering'. This is a much harder task than simple Bayesian filtering.
I've been thinking about this recently, and I figured I'd share my confusion. My current line of inquiry is looking into Bayesian clustering. Some of the literature is promising, but a friend with more experience in the field said that nothing she saw looked very good.
More to come, I hope.

Thursday, June 7, 2007

On 6/7/07, Grant Rettke wrote:
>
> That said, when I think of a DSL think about letting folks write
> "programs" like:
>
> "trade 100 shares(x) when (time < 20:00) and timingisright()"
>
> When I think about syntax transformation in Lisp I think primarily
> about language features.
In order to talk about domain-specific languages you need a definition of what a language is. Semi-formally, a computer language is a system of syntax and semantics that let you describe a computational process. It is characterised by these features:
1. Primitive constructs, provided ab-initio in the language.
2. Means of combination to create complex elements from the primitives.
3. Means of abstraction to control the resulting complexity.
So a domain-specific language would have primitives that are specific to the domain in question, means of combination that may model the natural combinations in the domain, and means of abstraction that model the natural abstractions in the domain.
To bring this back down to earth, let's consider your `trading language':
"trade 100 shares(x) when (time < 20:00) and timingisright()"
One of the first things I notice about this is that it has some special syntax. There are three approaches to designing programming language syntax. The first is to develop a good understanding of programming language grammars and parsers and then carefully construct a grammar that can be parsed by an LALR(1), or LL(k) parser. The second approach is to `wing it' and make up some ad-hoc syntax involving curly braces, semicolons, and other random punctuation. The third approach is to `punt' and use the parser at hand.
I'm guessing that you took approach number two. That's fine because you aren't actually proposing a language, but rather you are creating a topic of discussion.
I am continually amazed at the fact that many so-called language designers opt for option 2. This leads to strange and bizarre syntax that can be very hard to parse and has ambiguities, `holes' (constructs that *ought* to be expressable but the logical syntax does something different), and `nonsense' (constructs that *are* parseable, but have no logical meaning). Languages such as C++ and Python have these sorts of problems.
Few people choose option 1 for a domain-specific language. It hardly seems worth the effort for a `tiny' language.
Unfortunately, few people choose option 3. Part of the reason is that the parser for the implementation language is not usually separable from the compiler.
For Lisp and Scheme hackers, though, option 3 is a no-brainer. You call `read' and you get back something very close to the AST for your domain specific language. The `drawback' is that your DSL will have a fully parenthesized prefix syntax. Of course Lisp and Scheme hackers don't consider this a drawback at all.
So let's change your DSL slightly to make life easier for us:

When implementing a DSL, you have several strategies: you could write and interpreter for it, you could compile it to machine code, or you could compile it to a different high-level language. Compiling to machine code is unattractive because it is hard to debug, you have to be concerned with linking, binary formats, stack layouts, etc. etc.
Interpretation is a popular choice, but there is an interesting drawback. Almost all DSLs have generic language features. You probably want integers and strings and vectors. You probably want subroutines and variables. A good chunk of your interpreter will be implementing these generic features and only a small part will be doing the special DSL stuff.
Compiling to a different high-level language has a number of advantages. It is easier to read and debug the compiled code, you can make the `primitives' in your DSL be rather high-level constructs in your target language.
Lisp has a leg up on this process. You can compile your DSL into Lisp and dynamically link it to the running Lisp image. You can `steal' the bulk of the generic part of your DSL from the existing Lisp: DSL variables become Lisp variables. DSL expressions become Lisp expressions where possible. Your `compiler' is mostly the identity function, and a handful of macros cover the rest.
You can use the means of combination and means of abstraction as provided by Lisp. This saves you a lot of work in designing the language, and it saves the user a lot of work in learning your language (*if* he already knows lisp, that is).
The real `lightweight' DSLs in Lisp look just like Lisp. They *are* Lisp. The `middleweight' DSLs do a bit more heavy processing in the macro expansion phase, but are *mostly* lisp. `Heavyweight' DSLs, where the language semantics are quite different from lisp, can nonetheless benefit from Lisp syntax. They'll *look* like Lisp, but act a bit differently (FrTime is a good example).
You might even say that nearly all Lisp programs are `flyweight' DSLs.

Wednesday, June 6, 2007

A while back I got hit by a spam worm (Sobig) that sent me tens of thousands of emails. I wanted to plot the amount of email I got as a function of time so I could see how the attack evolved over time. The problem with plotting discrete events (the arrival of a spam) is that the closer you zoom in on the problem, the flatter the plot looks. For instance, if I plotted the number of spams arriving with microsecond accuracy, you'd find that in any microsecond interval that you'd have one or zero emails. During the spam attack, I was getting several emails a second, but after the attack was over, I could go for hours without getting a single message.
I played around with a number of ways to visualize the rate of email arrival before I hit on this method: along the X axis you plot time, along the Y axis you plot the log of the amount of time since the last message (the log of the interarrival time). This makes a very nice plot which shows rapid-fire events as dark clusters near the X axis and periods of calm as sparse dots above the axis. The scale doesn't flatten out as you zoom in, either: the points plotten remain at the same height as you stretch the X axis. This allows you to `zoom in' on the rapid fire events without causing the graph to `flatten out'.
I haven't seen this particular kind of plot used very often, but I have stumbled across it every now and then. For instance, this graph uses the technique. I don't know what it is plotting, but you can see there is definitely more of it on the right-hand side.
I'd be interested in hearing if this kind of plot works for others.

Saturday, June 2, 2007

I've made these points before in different forums, but talking about LOOP has become popular again, so I'll just post a blog entry and point people at it.
Here are the reasons I hate LOOP:
1. It is hard to predict what the resulting code will look like after it is expanded. You may *think* you know what it is doing, but you will probably be fuzzy about the details. This leads to bugs.
2. Loop makes it extremely difficult to write meta-level code that can manipulate loop expressions. A MAP expression is just a function call, but LOOP requires a serious state-machine to parse. You can macro-expand the LOOP away, but the resulting expansion is almost as hard to understand as the original.
3. LOOP operates at a low level of abstraction. One very frequent use of iteration is when you wish to apply an operation to a collection of objects. You shouldn't have to specify how the collection is traversed, and you certainly shouldn't have to specify a serial traversal algorithm if it doesn't matter. LOOP intertwines the operation you wish to perform with the minutiae of traversal --- finding an initial element, stepping to the next element, stopping when done. I don't care about these details. I don't care if the program LOOPs, uses recursion, farms out the work to other processors, or whatever, so long as it produces the answer I want.
4. LOOP encourages users to think linearly, sequentially, and iteratively. They start trying to flatten everything into a single pass, first-to-last, state-bashing mess. Suddenly they forget about recursion. I've seen this happen over and over again. It literally damages the thought process. (which is why I'm suspicious of named-let style loops)