Chapter 2 of my free online book about “Neural Networks and Deep Learning” is now available. The chapter is an in-depth explanation of the backpropagation algorithm. Backpropagation is the workhorse of learning in neural networks, and a key component in modern deep learning systems.﻿ Enjoy!

Many thousands of articles have been written purporting to explain Bitcoin, the online, peer-to-peer currency. Most of those articles give a hand-wavy account of the underlying cryptographic protocol, omitting many details. Even those articles which delve deeper often gloss over crucial points. My aim in this post is to explain the major ideas behind the Bitcoin protocol in a clear, easily comprehensible way. We’ll start from first principles, build up to a broad theoretical understanding of how the protocol works, and then dig down into the nitty-gritty, examining the raw data in a Bitcoin transaction.

Understanding the protocol in this detailed way is hard work. It is tempting instead to take Bitcoin as given, and to engage in speculation about how to get rich with Bitcoin, whether Bitcoin is a bubble, whether Bitcoin might one day mean the end of taxation, and so on. That’s fun, but severely limits your understanding. Understanding the details of the Bitcoin protocol opens up otherwise inaccessible vistas. In particular, it’s the basis for understanding Bitcoin’s built-in scripting language, which makes it possible to use Bitcoin to create new types of financial instruments, such as smart contracts. New financial instruments can, in turn, be used to create new markets and to enable new forms of collective human behaviour. Talk about fun!

I’ll describe Bitcoin scripting and concepts such as smart contracts in future posts. This post concentrates on explaining the nuts-and-bolts of the Bitcoin protocol. To understand the post, you need to be comfortable with public key cryptography, and with the closely related idea of digital signatures. I’ll also assume you’re familiar with cryptographic hashing. None of this is especially difficult. The basic ideas can be taught in freshman university mathematics or computer science classes. The ideas are beautiful, so if you’re not familiar with them, I recommend taking a few hours to get familiar.

It may seem surprising that Bitcoin’s basis is cryptography. Isn’t Bitcoin a currency, not a way of sending secret messages? In fact, the problems Bitcoin needs to solve are largely about securing transactions — making sure people can’t steal from one another, or impersonate one another, and so on. In the world of atoms we achieve security with devices such as locks, safes, signatures, and bank vaults. In the world of bits we achieve this kind of security with cryptography. And that’s why Bitcoin is at heart a cryptographic protocol.

My strategy in the post is to build Bitcoin up in stages. I’ll begin by explaining a very simple digital currency, based on ideas that are almost obvious. We’ll call that currency Infocoin, to distinguish it from Bitcoin. Of course, our first version of Infocoin will have many deficiencies, and so we’ll go through several iterations of Infocoin, with each iteration introducing just one or two simple new ideas. After several such iterations, we’ll arrive at the full Bitcoin protocol. We will have reinvented Bitcoin!

This strategy is slower than if I explained the entire Bitcoin protocol in one shot. But while you can understand the mechanics of Bitcoin through such a one-shot explanation, it would be difficult to understand why Bitcoin is designed the way it is. The advantage of the slower iterative explanation is that it gives us a much sharper understanding of each element of Bitcoin.

Finally, I should mention that I’m a relative newcomer to Bitcoin. I’ve been following it loosely since 2011 (and cryptocurrencies since the late 1990s), but only got seriously into the details of the Bitcoin protocol earlier this year. So I’d certainly appreciate corrections of any misapprehensions on my part. Also in the post I’ve included a number of “problems for the author” – notes to myself about questions that came up during the writing. You may find these interesting, but you can also skip them entirely without losing track of the main text.

First steps: a signed letter of intent

So how can we design a digital currency?

On the face of it, a digital currency sounds impossible. Suppose some person – let’s call her Alice – has some digital money which she wants to spend. If Alice can use a string of bits as money, how can we prevent her from using the same bit string over and over, thus minting an infinite supply of money? Or, if we can somehow solve that problem, how can we prevent someone else forging such a string of bits, and using that to steal from Alice?

These are just two of the many problems that must be overcome in order to use information as money.

As a first version of Infocoin, let’s find a way that Alice can use a string of bits as a (very primitive and incomplete) form of money, in a way that gives her at least some protection against forgery. Suppose Alice wants to give another person, Bob, an infocoin. To do this, Alice writes down the message “I, Alice, am giving Bob one infocoin”. She then digitally signs the message using a private cryptographic key, and announces the signed string of bits to the entire world.

(By the way, I’m using capitalized “Infocoin” to refer to the protocol and general concept, and lowercase “infocoin” to refer to specific denominations of the currency. A similar useage is common, though not universal, in the Bitcoin world.)

This isn’t terribly impressive as a prototype digital currency! But it does have some virtues. Anyone in the world (including Bob) can use Alice’s public key to verify that Alice really was the person who signed the message “I, Alice, am giving Bob one infocoin”. No-one else could have created that bit string, and so Alice can’t turn around and say “No, I didn’t mean to give Bob an infocoin”. So the protocol establishes that Alice truly intends to give Bob one infocoin. The same fact – no-one else could compose such a signed message – also gives Alice some limited protection from forgery. Of course, after Alice has published her message it’s possible for other people to duplicate the message, so in that sense forgery is possible. But it’s not possible from scratch. These two properties – establishment of intent on Alice’s part, and the limited protection from forgery – are genuinely notable features of this protocol.

I haven’t (quite) said exactly what digital money is in this protocol. To make this explicit: it’s just the message itself, i.e., the string of bits representing the digitally signed message “I, Alice, am giving Bob one infocoin”. Later protocols will be similar, in that all our forms of digital money will be just more and more elaborate messages [1].

Using serial numbers to make coins uniquely identifiable

A problem with the first version of Infocoin is that Alice could keep sending Bob the same signed message over and over. Suppose Bob receives ten copies of the signed message “I, Alice, am giving Bob one infocoin”. Does that mean Alice sent Bob ten different infocoins? Was her message accidentally duplicated? Perhaps she was trying to trick Bob into believing that she had given him ten different infocoins, when the message only proves to the world that she intends to transfer one infocoin.

What we’d like is a way of making infocoins unique. They need a label or serial number. Alice would sign the message “I, Alice, am giving Bob one infocoin, with serial number 8740348″. Then, later, Alice could sign the message “I, Alice, am giving Bob one infocoin, with serial number 8770431″, and Bob (and everyone else) would know that a different infocoin was being transferred.

To make this scheme work we need a trusted source of serial numbers for the infocoins. One way to create such a source is to introduce a bank. This bank would provide serial numbers for infocoins, keep track of who has which infocoins, and verify that transactions really are legitimate,

In more detail, let’s suppose Alice goes into the bank, and says “I want to withdraw one infocoin from my account”. The bank reduces her account balance by one infocoin, and assigns her a new, never-before used serial number, let’s say 1234567. Then, when Alice wants to transfer her infocoin to Bob, she signs the message “I, Alice, am giving Bob one infocoin, with serial number 1234567″. But Bob doesn’t just accept the infocoin. Instead, he contacts the bank, and verifies that: (a) the infocoin with that serial number belongs to Alice; and (b) Alice hasn’t already spent the infocoin. If both those things are true, then Bob tells the bank he wants to accept the infocoin, and the bank updates their records to show that the infocoin with that serial number is now in Bob’s possession, and no longer belongs to Alice.

Making everyone collectively the bank

This last solution looks pretty promising. However, it turns out that we can do something much more ambitious. We can eliminate the bank entirely from the protocol. This changes the nature of the currency considerably. It means that there is no longer any single organization in charge of the currency. And when you think about the enormous power a central bank has – control over the money supply – that’s a pretty huge change.

The idea is to make it so everyone (collectively) is the bank. In particular, we’ll assume that everyone using Infocoin keeps a complete record of which infocoins belong to which person. You can think of this as a shared public ledger showing all Infocoin transactions. We’ll call this ledger the block chain, since that’s what the complete record will be called in Bitcoin, once we get to it.

Now, suppose Alice wants to transfer an infocoin to Bob. She signs the message “I, Alice, am giving Bob one infocoin, with serial number 1234567″, and gives the signed message to Bob. Bob can use his copy of the block chain to check that, indeed, the infocoin is Alice’s to give. If that checks out then he broadcasts both Alice’s message and his acceptance of the transaction to the entire network, and everyone updates their copy of the block chain.

We still have the “where do serial number come from” problem, but that turns out to be pretty easy to solve, and so I will defer it to later, in the discussion of Bitcoin. A more challenging problem is that this protocol allows Alice to cheat by double spending her infocoin. She sends the signed message “I, Alice, am giving Bob one infocoin, with serial number 1234567″ to Bob, and the message”I, Alice, am giving Charlie one infocoin, with [the same] serial number 1234567″ to Charlie. Both Bob and Charlie use their copy of the block chain to verify that the infocoin is Alice’s to spend. Provided they do this verification at nearly the same time (before they’ve had a chance to hear from one another), both will find that, yes, the block chain shows the coin belongs to Alice. And so they will both accept the transaction, and also broadcast their acceptance of the transaction. Now there’s a problem. How should other people update their block chains? There may be no easy way to achieve a consistent shared ledger of transactions. And even if everyone can agree on a consistent way to update their block chains, there is still the problem that either Bob or Charlie will be cheated.

At first glance double spending seems difficult for Alice to pull off. After all, if Alice sends the message first to Bob, then Bob can verify the message, and tell everyone else in the network (including Charlie) to update their block chain. Once that has happened, Charlie would no longer be fooled by Alice. So there is most likely only a brief period of time in which Alice can double spend. However, it’s obviously undesirable to have any such a period of time. Worse, there are techniques Alice could use to make that period longer. She could, for example, use network traffic analysis to find times when Bob and Charlie are likely to have a lot of latency in communication. Or perhaps she could do something to deliberately disrupt their communications. If she can slow communication even a little that makes her task of double spending much easier.

How can we address the problem of double spending? The obvious solution is that when Alice sends Bob an infocoin, Bob shouldn’t try to verify the transaction alone. Rather, he should broadcast the possible transaction to the entire network of Infocoin users, and ask them to help determine whether the transaction is legitimate. If they collectively decide that the transaction is okay, then Bob can accept the infocoin, and everyone will update their block chain. This type of protocol can help prevent double spending, since if Alice tries to spend her infocoin with both Bob and Charlie, other people on the network will notice, and network users will tell both Bob and Charlie that there is a problem with the transaction, and the transaction shouldn’t go through.

In more detail, let’s suppose Alice wants to give Bob an infocoin. As before, she signs the message “I, Alice, am giving Bob one infocoin, with serial number 1234567″, and gives the signed message to Bob. Also as before, Bob does a sanity check, using his copy of the block chain to check that, indeed, the coin currently belongs to Alice. But at that point the protocol is modified. Bob doesn’t just go ahead and accept the transaction. Instead, he broadcasts Alice’s message to the entire network. Other members of the network check to see whether Alice owns that infocoin. If so, they broadcast the message “Yes, Alice owns infocoin 1234567, it can now be transferred to Bob.” Once enough people have broadcast that message, everyone updates their block chain to show that infocoin 1234567 now belongs to Bob, and the transaction is complete.

This protocol has many imprecise elements at present. For instance, what does it mean to say “once enough people have broadcast that message”? What exactly does “enough” mean here? It can’t mean everyone in the network, since we don’t a priori know who is on the Infocoin network. For the same reason, it can’t mean some fixed fraction of users in the network. We won’t try to make these ideas precise right now. Instead, in the next section I’ll point out a serious problem with the approach as described. Fixing that problem will at the same time have the pleasant side effect of making the ideas above much more precise.

Proof-of-work

Suppose Alice wants to double spend in the network-based protocol I just described. She could do this by taking over the Infocoin network. Let’s suppose she uses an automated system to set up a large number of separate identities, let’s say a billion, on the Infocoin network. As before, she tries to double spend the same infocoin with both Bob and Charlie. But when Bob and Charlie ask the network to validate their respective transactions, Alice’s sock puppet identities swamp the network, announcing to Bob that they’ve validated his transaction, and to Charlie that they’ve validated his transaction, possibly fooling one or both into accepting the transaction.

There’s a clever way of avoiding this problem, using an idea known as proof-of-work. The idea is counterintuitive and involves a combination of two ideas: (1) to (artificially) make it computationally costly for network users to validate transactions; and (2) to reward them for trying to help validate transactions. The reward is used so that people on the network will try to help validate transactions, even though that’s now been made a computationally costly process. The benefit of making it costly to validate transactions is that validation can no longer be influenced by the number of network identities someone controls, but only by the total computational power they can bring to bear on validation. As we’ll see, with some clever design we can make it so a cheater would need enormous computational resources to cheat, making it impractical.

That’s the gist of proof-of-work. But to really understand proof-of-work, we need to go through the details.

Suppose Alice broadcasts to the network the news that “I, Alice, am giving Bob one infocoin, with serial number 1234567″.

As other people on the network hear that message, each adds it to a queue of pending transactions that they’ve been told about, but which haven’t yet been approved by the network. For instance, another network user named David might have the following queue of pending transactions:

David checks his copy of the block chain, and can see that each transaction is valid. He would like to help out by broadcasting news of that validity to the entire network.

However, before doing that, as part of the validation protocol David is required to solve a hard computational puzzle – the proof-of-work. Without the solution to that puzzle, the rest of the network won’t accept his validation of the transaction.

What puzzle does David need to solve? To explain that, let be a fixed hash function known by everyone in the network – it’s built into the protocol. Bitcoin uses the well-known SHA-256 hash function, but any cryptographically secure hash function will do. Let’s give David’s queue of pending transactions a label, , just so it’s got a name we can refer to. Suppose David appends a number (called the nonce) to and hashes the combination. For example, if we use “Hello, world!” (obviously this is not a list of transactions, just a string used for illustrative purposes) and the nonce then (output is in hexadecimal)

The puzzle David has to solve – the proof-of-work – is to find a nonce such that when we append to and hash the combination the output hash begins with a long run of zeroes. The puzzle can be made more or less difficult by varying the number of zeroes required to solve the puzzle. A relatively simple proof-of-work puzzle might require just three or four zeroes at the start of the hash, while a more difficult proof-of-work puzzle might require a much longer run of zeros, say 15 consecutive zeroes. In either case, the above attempt to find a suitable nonce, with , is a failure, since the output doesn’t begin with any zeroes at all. Trying doesn’t work either:

This nonce gives us a string of four zeroes at the beginning of the output of the hash. This will be enough to solve a simple proof-of-work puzzle, but not enough to solve a more difficult proof-of-work puzzle.

What makes this puzzle hard to solve is the fact that the output from a cryptographic hash function behaves like a random number: change the input even a tiny bit and the output from the hash function changes completely, in a way that’s hard to predict. So if we want the output hash value to begin with 10 zeroes, say, then David will need, on average, to try different values for before he finds a suitable nonce. That’s a pretty challenging task, requiring lots of computational power.

Obviously, it’s possible to make this puzzle more or less difficult to solve by requiring more or fewer zeroes in the output from the hash function. In fact, the Bitcoin protocol gets quite a fine level of control over the difficulty of the puzzle, by using a slight variation on the proof-of-work puzzle described above. Instead of requiring leading zeroes, the Bitcoin proof-of-work puzzle requires the hash of a block’s header to be lower than or equal to a number known as the target. This target is automatically adjusted to ensure that a Bitcoin block takes, on average, about ten minutes to validate.

(In practice there is a sizeable randomness in how long it takes to validate a block – sometimes a new block is validated in just a minute or two, other times it may take 20 minutes or even longer. It’s straightforward to modify the Bitcoin protocol so that the time to validation is much more sharply peaked around ten minutes. Instead of solving a single puzzle, we can require that multiple puzzles be solved; with some careful design it is possible to considerably reduce the variance in the time to validate a block of transactions.)

Alright, let’s suppose David is lucky and finds a suitable nonce, . Celebration! (He’ll be rewarded for finding the nonce, as described below). He broadcasts the block of transactions he’s approving to the network, together with the value for . Other participants in the Infocoin network can verify that is a valid solution to the proof-of-work puzzle. And they then update their block chains to include the new block of transactions.

For the proof-of-work idea to have any chance of succeeding, network users need an incentive to help validate transactions. Without such an incentive, they have no reason to expend valuable computational power, merely to help validate other people’s transactions. And if network users are not willing to expend that power, then the whole system won’t work. The solution to this problem is to reward people who help validate transactions. In particular, suppose we reward whoever successfully validates a block of transactions by crediting them with some infocoins. Provided the infocoin reward is large enough that will give them an incentive to participate in validation.

In the Bitcoin protocol, this validation process is called mining. For each block of transactions validated, the successful miner receives a bitcoin reward. Initially, this was set to be a 50 bitcoin reward. But for every 210,000 validated blocks (roughly, once every four years) the reward halves. This has happened just once, to date, and so the current reward for mining a block is 25 bitcoins. This halving in the rate will continue every four years until the year 2140 CE. At that point, the reward for mining will drop below bitcoins per block. bitcoins is actually the minimal unit of Bitcoin, and is known as a satoshi. So in 2140 CE the total supply of bitcoins will cease to increase. However, that won’t eliminate the incentive to help validate transactions. Bitcoin also makes it possible to set aside some currency in a transaction as a transaction fee, which goes to the miner who helps validate it. In the early days of Bitcoin transaction fees were mostly set to zero, but as Bitcoin has gained in popularity, transaction fees have gradually risen, and are now a substantial additional incentive on top of the 25 bitcoin reward for mining a block.

You can think of proof-of-work as a competition to approve transactions. Each entry in the competition costs a little bit of computing power. A miner’s chance of winning the competition is (roughly, and with some caveats) equal to the proportion of the total computing power that they control. So, for instance, if a miner controls one percent of the computing power being used to validate Bitcoin transactions, then they have roughly a one percent chance of winning the competition. So provided a lot of computing power is being brought to bear on the competition, a dishonest miner is likely to have only a relatively small chance to corrupt the validation process, unless they expend a huge amount of computing resources.

Of course, while it’s encouraging that a dishonest party has only a relatively small chance to corrupt the block chain, that’s not enough to give us confidence in the currency. In particular, we haven’t yet conclusively addressed the issue of double spending.

I’ll analyse double spending shortly. Before doing that, I want to fill in an important detail in the description of Infocoin. We’d ideally like the Infocoin network to agree upon the order in which transactions have occurred. If we don’t have such an ordering then at any given moment it may not be clear who owns which infocoins. To help do this we’ll require that new blocks always include a pointer to the last block validated in the chain, in addition to the list of transactions in the block. (The pointer is actually just a hash of the previous block). So typically the block chain is just a linear chain of blocks of transactions, one after the other, with later blocks each containing a pointer to the immediately prior block:

Occasionally, a fork will appear in the block chain. This can happen, for instance, if by chance two miners happen to validate a block of transactions near-simultaneously – both broadcast their newly-validated block out to the network, and some people update their block chain one way, and others update their block chain the other way:

This causes exactly the problem we’re trying to avoid – it’s no longer clear in what order transactions have occurred, and it may not be clear who owns which infocoins. Fortunately, there’s a simple idea that can be used to remove any forks. The rule is this: if a fork occurs, people on the network keep track of both forks. But at any given time, miners only work to extend whichever fork is longest in their copy of the block chain.

Suppose, for example, that we have a fork in which some miners receive block A first, and some miners receive block B first. Those miners who receive block A first will continue mining along that fork, while the others will mine along fork B. Let’s suppose that the miners working on fork B are the next to successfully mine a block:

After they receive news that this has happened, the miners working on fork A will notice that fork B is now longer, and will switch to working on that fork. Presto, in short order work on fork A will cease, and everyone will be working on the same linear chain, and block A can be ignored. Of course, any still-pending transactions in A will still be pending in the queues of the miners working on fork B, and so all transactions will eventually be validated.

Likewise, it may be that the miners working on fork A are the first to extend their fork. In that case work on fork B will quickly cease, and again we have a single linear chain.

No matter what the outcome, this process ensures that the block chain has an agreed-upon time ordering of the blocks. In Bitcoin proper, a transaction is not considered confirmed until: (1) it is part of a block in the longest fork, and (2) at least 5 blocks follow it in the longest fork. In this case we say that the transaction has “6 confirmations”. This gives the network time to come to an agreed-upon the ordering of the blocks. We’ll also use this strategy for Infocoin.

With the time-ordering now understood, let’s return to think about what happens if a dishonest party tries to double spend. Suppose Alice tries to double spend with Bob and Charlie. One possible approach is for her to try to validate a block that includes both transactions. Assuming she has one percent of the computing power, she will occasionally get lucky and validate the block by solving the proof-of-work. Unfortunately for Alice, the double spending will be immediately spotted by other people in the Infocoin network and rejected, despite solving the proof-of-work problem. So that’s not something we need to worry about.

A more serious problem occurs if she broadcasts two separate transactions in which she spends the same infocoin with Bob and Charlie, respectively. She might, for example, broadcast one transaction to a subset of the miners, and the other transaction to another set of miners, hoping to get both transactions validated in this way. Fortunately, in this case, as we’ve seen, the network will eventually confirm one of these transactions, but not both. So, for instance, Bob’s transaction might ultimately be confirmed, in which case Bob can go ahead confidently. Meanwhile, Charlie will see that his transaction has not been confirmed, and so will decline Alice’s offer. So this isn’t a problem either. In fact, knowing that this will be the case, there is little reason for Alice to try this in the first place.

An important variant on double spending is if Alice = Bob, i.e., Alice tries to spend a coin with Charlie which she is also “spending” with herself (i.e., giving back to herself). This sounds like it ought to be easy to detect and deal with, but, of course, it’s easy on a network to set up multiple identities associated with the same person or organization, so this possibility needs to be considered. In this case, Alice’s strategy is to wait until Charlie accepts the infocoin, which happens after the transaction has been confirmed 6 times in the longest chain. She will then attempt to fork the chain before the transaction with Charlie, adding a block which includes a transaction in which she pays herself:

Unfortunately for Alice, it’s now very difficult for her to catch up with the longer fork. Other miners won’t want to help her out, since they’ll be working on the longer fork. And unless Alice is able to solve the proof-of-work at least as fast as everyone else in the network combined – roughly, that means controlling more than fifty percent of the computing power – then she will just keep falling further and further behind. Of course, she might get lucky. We can, for example, imagine a scenario in which Alice controls one percent of the computing power, but happens to get lucky and finds six extra blocks in a row, before the rest of the network has found any extra blocks. In this case, she might be able to get ahead, and get control of the block chain. But this particular event will occur with probability . A more general analysis along these lines shows that Alice’s probability of ever catching up is infinitesimal, unless she is able to solve proof-of-work puzzles at a rate approaching all other miners combined.

Of course, this is not a rigorous security analysis showing that Alice cannot double spend. It’s merely an informal plausibility argument. The original paper introducing Bitcoin did not, in fact, contain a rigorous security analysis, only informal arguments along the lines I’ve presented here. The security community is still analysing Bitcoin, and trying to understand possible vulnerabilities. You can see some of this research listed here, and I mention a few related problems in the “Problems for the author” below. At this point I think it’s fair to say that the jury is still out on how secure Bitcoin is.

The proof-of-work and mining ideas give rise to many questions. How much reward is enough to persuade people to mine? How does the change in supply of infocoins affect the Infocoin economy? Will Infocoin mining end up concentrated in the hands of a few, or many? If it’s just a few, doesn’t that endanger the security of the system? Presumably transaction fees will eventually equilibriate – won’t this introduce an unwanted source of friction, and make small transactions less desirable? These are all great questions, but beyond the scope of this post. I may come back to the questions (in the context of Bitcoin) in a future post. For now, we’ll stick to our focus on understanding how the Bitcoin protocol works.

Problems for the author

I don’t understand why double spending can’t be prevented in a simpler manner using two-phase commit. Suppose Alice tries to double spend an infocoin with both Bob and Charlie. The idea is that Bob and Charlie would each broadcast their respective messages to the Infocoin network, along with a request: “Should I accept this?” They’d then wait some period – perhaps ten minutes – to hear any naysayers who could prove that Alice was trying to double spend. If no such nays are heard (and provided there are no signs of attempts to disrupt the network), they’d then accept the transaction. This protocol needs to be hardened against network attacks, but it seems to me to be the core of a good alternate idea. How well does this work? What drawbacks and advantages does it have compared to the full Bitcoin protocol?

Early in the section I mentioned that there is a natural way of reducing the variance in time required to validate a block of transactions. If that variance is reduced too much, then it creates an interesting attack possibility. Suppose Alice tries to fork the chain in such a way that: (a) one fork starts with a block in which Alice pays herself, while the other fork starts with a block in which Alice pays Bob; (b) both blocks are announced nearly simultaneously, so roughly half the miners will attempt to mine each fork; (c) Alice uses her mining power to try to keep the forks of roughly equal length, mining whichever fork is shorter – this is ordinarily hard to pull off, but becomes significantly easier if the standard deviation of the time-to-validation is much shorter than the network latency; (d) after 5 blocks have been mined on both forks, Alice throws her mining power into making it more likely that Charles’s transaction is confirmed; and (e) after confirmation of Charles’s transaction, she then throws her computational power into the other fork, and attempts to regain the lead. This balancing strategy will have only a small chance of success. But while the probability is small, it will certainly be much larger than in the standard protocol, with high variance in the time to validate a block. Is there a way of avoiding this problem?

Suppose Bitcoin mining software always explored nonces starting with , then . If this is done by all (or even just a substantial fraction) of Bitcoin miners then it creates a vulnerability. Namely, it’s possible for someone to improve their odds of solving the proof-of-work merely by starting with some other (much larger) nonce. More generally, it may be possible for attackers to exploit any systematic patterns in the way miners explore the space of nonces. More generally still, in the analysis of this section I have implicitly assumed a kind of symmetry between different miners. In practice, there will be asymmetries and a thorough security analysis will need to account for those asymmetries.

Bitcoin

Let’s move away from Infocoin, and describe the actual Bitcoin protocol. There are a few new ideas here, but with one exception (discussed below) they’re mostly obvious modifications to Infocoin.

To use Bitcoin in practice, you first install a wallet program on your computer. To give you a sense of what that means, here’s a screenshot of a wallet called Multbit. You can see the Bitcoin balance on the left — 0.06555555 Bitcoins, or about 70 dollars at the exchange rate on the day I took this screenshot — and on the right two recent transactions, which deposited those 0.06555555 Bitcoins:

Suppose you’re a merchant who has set up an online store, and you’ve decided to allow people to pay using Bitcoin. What you do is tell your wallet program to generate a Bitcoin address. In response, it will generate a public / private key pair, and then hash the public key to form your Bitcoin address:

You then send your Bitcoin address to the person who wants to buy from you. You could do this in email, or even put the address up publicly on a webpage. This is safe, since the address is merely a hash of your public key, which can safely be known by the world anyway. (I’ll return later to the question of why the Bitcoin address is a hash, and not just the public key.)

The person who is going to pay you then generates a transaction. Let’s take a look at the data from an actual transaction transferring bitcoins. What’s shown below is very nearly the raw data. It’s changed in three ways: (1) the data has been deserialized; (2) line numbers have been added, for ease of reference; and (3) I’ve abbreviated various hashes and public keys, just putting in the first six hexadecimal digits of each, when in reality they are much longer. Here’s the data:

Line 1 contains the hash of the remainder of the transaction, 7c4025..., expressed in hexadecimal. This is used as an identifier for the transaction.

Line 2 tells us that this is a transaction in version 1 of the Bitcoin protocol.

Lines 3 and 4 tell us that the transaction has one input and one output, respectively. I’ll talk below about transactions with more inputs and outputs, and why that’s useful.

Line 5 contains the value for lock_time, which can be used to control when a transaction is finalized. For most Bitcoin transactions being carried out today the lock_time is set to 0, which means the transaction is finalized immediately.

Line 6 tells us the size (in bytes) of the transaction. Note that it’s not the monetary amount being transferred! That comes later.

Lines 7 through 11 define the input to the transaction. In particular, lines 8 through 10 tell us that the input is to be taken from the output from an earlier transaction, with the given hash, which is expressed in hexadecimal as 2007ae.... The n=0 tells us it’s to be the first output from that transaction; we’ll see soon how multiple outputs (and inputs) from a transaction work, so don’t worry too much about this for now. Line 11 contains the signature of the person sending the money, 304502..., followed by a space, and then the corresponding public key, 04b2d.... Again, these are both in hexadecimal.

One thing to note about the input is that there’s nothing explicitly specifying how many bitcoins from the previous transaction should be spent in this transaction. In fact, all the bitcoins from the n=0th output of the previous transaction are spent. So, for example, if the n=0th output of the earlier transaction was 2 bitcoins, then 2 bitcoins will be spent in this transaction. This seems like an inconvenient restriction – like trying to buy bread with a 20 dollar note, and not being able to break the note down. The solution, of course, is to have a mechanism for providing change. This can be done using transactions with multiple inputs and outputs, which we’ll discuss in the next section.

Lines 12 through 14 define the output from the transaction. In particular, line 13 tells us the value of the output, 0.319 bitcoins. Line 14 is somewhat complicated. The main thing to note is that the string a7db6f... is the Bitcoin address of the intended recipient of the funds (written in hexadecimal). In fact, Line 14 is actually an expression in Bitcoin’s scripting language. I’m not going to describe that language in detail in this post, the important thing to take away now is just that a7db6f... is the Bitcoin address.

You can now see, by the way, how Bitcoin addresses the question I swept under the rug in the last section: where do Bitcoin serial numbers come from? In fact, the role of the serial number is played by transaction hashes. In the transaction above, for example, the recipient is receiving 0.319 Bitcoins, which come out of the first output of an earlier transaction with hash 2007ae... (line 9). If you go and look in the block chain for that transaction, you’d see that its output comes from a still earlier transaction. And so on.

There are two clever things about using transaction hashes instead of serial numbers. First, in Bitcoin there’s not really any separate, persistent “coins” at all, just a long series of transactions in the block chain. It’s a clever idea to realize that you don’t need persistent coins, and can just get by with a ledger of transactions. Second, by operating in this way we remove the need for any central authority issuing serial numbers. Instead, the serial numbers can be self-generated, merely by hashing the transaction.

In fact, it’s possible to keep following the chain of transactions further back in history. Ultimately, this process must terminate. This can happen in one of two ways. The first possibilitty is that you’ll arrive at the very first Bitcoin transaction, contained in the so-called Genesis block. This is a special transaction, having no inputs, but a 50 Bitcoin output. In other words, this transaction establishes an initial money supply. The Genesis block is treated separately by Bitcoin clients, and I won’t get into the details here, although it’s along similar lines to the transaction above. You can see the deserialized raw data here, and read about the Genesis block here.

The second possibility when you follow a chain of transactions back in time is that eventually you’ll arrive at a so-called coinbase transaction. With the exception of the Genesis block, every block of transactions in the block chain starts with a special coinbase transaction. This is the transaction rewarding the miner who validated that block of transactions. It uses a similar but not identical format to the transaction above. I won’t go through the format in detail, but if you want to see an example, see here. You can read a little more about coinbase transactions here.

Something I haven’t been precise about above is what exactly is being signed by the digital signature in line 11. The obvious thing to do is for the payer to sign the whole transaction (apart from the transaction hash, which, of course, must be generated later). Currently, this is not what is done – some pieces of the transaction are omitted. This makes some pieces of the transaction malleable, i.e., they can be changed later. However, this malleability does not include the amounts being paid out, senders and recipients, which can’t be changed later. I must admit I haven’t dug down into the details here. I gather that this malleability is under discussion in the Bitcoin developer community, and there are efforts afoot to reduce or eliminate this malleability.

Transactions with multiple inputs and outputs

In the last section I described how a transaction with a single input and a single output works. In practice, it’s often extremely convenient to create Bitcoin transactions with multiple inputs or multiple outputs. I’ll talk below about why this can be useful. But first let’s take a look at the data from an actual transaction:

Let’s go through the data, line by line. It’s very similar to the single-input-single-output transaction, so I’ll do this pretty quickly.

Line 1 contains the hash of the remainder of the transaction. This is used as an identifier for the transaction.

Line 2 tells us that this is a transaction in version 1 of the Bitcoin protocol.

Lines 3 and 4 tell us that the transaction has three inputs and two outputs, respectively.

Line 5 contains the lock_time. As in the single-input-single-output case this is set to 0, which means the transaction is finalized immediately.

Line 6 tells us the size of the transaction in bytes.

Lines 7 through 19 define a list of the inputs to the transaction. Each corresponds to an output from a previous Bitcoin transaction.

The first input is defined in lines 8 through 11.

In particular, lines 8 through 10 tell us that the input is to be taken from the n=0th output from the transaction with hash3beabc.... Line 11 contains the signature, followed by a space, and then the public key of the person sending the bitcoins.

Lines 12 through 15 define the second input, with a similar format to lines 8 through 11. And lines 16 through 19 define the third input.

Lines 20 through 24 define a list containing the two outputs from the transaction.

The first output is defined in lines 21 and 22. Line 21 tells us the value of the output, 0.01068000 bitcoins. As before, line 22 is an expression in Bitcoin’s scripting language. The main thing to take away here is that the string e8c30622... is the Bitcoin address of the intended recipient of the funds.

The second output is defined lines 23 and 24, with a similar format to the first output.

One apparent oddity in this description is that although each output has a Bitcoin value associated to it, the inputs do not. Of course, the values of the respective inputs can be found by consulting the corresponding outputs in earlier transactions. In a standard Bitcoin transaction, the sum of all the inputs in the transaction must be at least as much as the sum of all the outputs. (The only exception to this principle is the Genesis block, and in coinbase transactions, both of which add to the overall Bitcoin supply.) If the inputs sum up to more than the outputs, then the excess is used as a transaction fee. This is paid to whichever miner successfully validates the block which the current transaction is a part of.

That’s all there is to multiple-input-multiple-output transactions! They’re a pretty simple variation on single-input-single-output-transactions.

One nice application of multiple-input-multiple-output transactions is the idea of change. Suppose, for example, that I want to send you 0.15 bitcoins. I can do so by spending money from a previous transaction in which I received 0.2 bitcoins. Of course, I don’t want to send you the entire 0.2 bitcoins. The solution is to send you 0.15 bitcoins, and to send 0.05 bitcoins to a Bitcoin address which I own. Those 0.05 bitcoins are the change. Of course, it differs a little from the change you might receive in a store, since change in this case is what you pay yourself. But the broad idea is similar.

Conclusion

That completes a basic description of the main ideas behind Bitcoin. Of course, I’ve omitted many details – this isn’t a formal specification. But I have described the main ideas behind the most common use cases for Bitcoin.

While the rules of Bitcoin are simple and easy to understand, that doesn’t mean that it’s easy to understand all the consequences of the rules. There is vastly more that could be said about Bitcoin, and I’ll investigate some of these issues in future posts.

For now, though, I’ll wrap up by addressing a few loose ends.

How anonymous is Bitcoin? Many people claim that Bitcoin can be used anonymously. This claim has led to the formation of marketplaces such as Silk Road (and various successors), which specialize in illegal goods. However, the claim that Bitcoin is anonymous is a myth. The block chain is public, meaning that it’s possible for anyone to see every Bitcoin transaction ever. Although Bitcoin addresses aren’t immediately associated to real-world identities, computer scientists have done a great deal of work figuring out how to de-anonymize “anonymous” social networks. The block chain is a marvellous target for these techniques. I will be extremely surprised if the great majority of Bitcoin users are not identified with relatively high confidence and ease in the near future. The confidence won’t be high enough to achieve convictions, but will be high enough to identify likely targets. Furthermore, identification will be retrospective, meaning that someone who bought drugs on Silk Road in 2011 will still be identifiable on the basis of the block chain in, say, 2020. These de-anonymization techniques are well known to computer scientists, and, one presumes, therefore to the NSA. I would not be at all surprised if the NSA and other agencies have already de-anonymized many users. It is, in fact, ironic that Bitcoin is often touted as anonymous. It’s not. Bitcoin is, instead, perhaps the most open and transparent financial instrument the world has ever seen.

Can you get rich with Bitcoin? Well, maybe. Tim O’Reilly once said: “Money is like gas in the car – you need to pay attention or you’ll end up on the side of the road – but a well-lived life is not a tour of gas stations!” Much of the interest in Bitcoin comes from people whose life mission seems to be to find a really big gas station. I must admit I find this perplexing. What is, I believe, much more interesting and enjoyable is to think of Bitcoin and other cryptocurrencies as a way of enabling new forms of collective behaviour. That’s intellectually fascinating, offers marvellous creative possibilities, is socially valuable, and may just also put some money in the bank. But if money in the bank is your primary concern, then I believe that other strategies are much more likely to succeed.

Details I’ve omitted: Although this post has described the main ideas behind Bitcoin, there are many details I haven’t mentioned. One is a nice space-saving trick used by the protocol, based on a data structure known as a Merkle tree. It’s a detail, but a splendid detail, and worth checking out if fun data structures are your thing. You can get an overview in the original Bitcoin paper. Second, I’ve said little about the Bitcoin network – questions like how the network deals with denial of service attacks, how nodes join and leave the network, and so on. This is a fascinating topic, but it’s also something of a mess of details, and so I’ve omitted it. You can read more about it at some of the links above.

Bitcoin scripting: In this post I’ve explained Bitcoin as a form of digital, online money. But this is only a small part of a much bigger and more interesting story. As we’ve seen, every Bitcoin transaction is associated to a script in the Bitcoin programming language. The scripts we’ve seen in this post describe simple transactions like “Alice gave Bob 10 bitcoins”. But the scripting language can also be used to express far more complicated transactions. To put it another way, Bitcoin is programmable money. In later posts I will explain the scripting system, and how it is possible to use Bitcoin scripting as a platform to experiment with all sorts of amazing financial instruments.

Thanks for reading. Enjoy the essay? You can tip me with Bitcoin (!) at address: 17ukkKt1bNLAqdJ1QQv8v9Askr6vy3MzTZ. You may also enjoy the first chapter of my forthcoming book on neural networks and deep learning, and may wish to follow me on Twitter.

Footnote

[1] In the United States the question “Is money a form of speech?” is an important legal question, because of the protection afforded speech under the US Constitution. In my (legally uninformed) opinion digital money may make this issue more complicated. As we’ll see, the Bitcoin protocol is really a way of standing up before the rest of the world (or at least the rest of the Bitcoin network) and avowing “I’m going to give such-and-such a number of bitcoins to so-and-so a person” in a way that’s extremely difficult to repudiate. At least naively, it looks more like speech than exchanging copper coins, say.

I am delighted to announce that the first chapter of my book “Neural Networks and Deep Learning” is now freely available online here.

The chapter explains the basic ideas behind neural networks, including how they learn. I show how powerful these ideas are by writing a short program which uses neural networks to solve a hard problem — recognizing handwritten digits. The chapter also takes a brief look at how deep learning works.

Imagine you’re a programmer who is developing a new web browser. There are many malicious sites on the web, and you want your browser to warn users when they attempt to access dangerous sites. For example, suppose the user attempts to access http://domain/etc. You’d like a way of checking whether domain is known to be a malicious site. What’s a good way of doing this?

An obvious naive way is for your browser to maintain a list or set data structure containing all known malicious domains. A problem with this approach is that it may consume a considerable amount of memory. If you know of a million malicious domains, and domains need (say) an average of 20 bytes to store, then you need 20 megabytes of storage. That’s quite an overhead for a single feature in your web browser. Is there a better way?

In this post I’ll describe a data structure which provides an excellent way of solving this kind of problem. The data structure is known as a Bloom filter. Bloom filter are much more memory efficient than the naive “store-everything” approach, while remaining extremely fast. I’ll describe both how Bloom filters work, and also some extensions of Bloom filters to solve more general problems.

Most explanations of Bloom filters cut to the chase, quickly explaining the detailed mechanics of how Bloom filters work. Such explanations are informative, but I must admit that they made me uncomfortable when I was first learning about Bloom filters. In particular, I didn’t feel that they helped me understand why Bloom filters are put together the way they are. I couldn’t fathom the mindset that would lead someone to invent such a data structure. And that left me feeling that all I had was a superficial, surface-level understanding of Bloom filters.

In this post I take an unusual approach to explaining Bloom filters. We won’t begin with a full-blown explanation. Instead, I’ll gradually build up to the full data structure in stages. My goal is to tell a plausible story explaining how one could invent Bloom filters from scratch, with each step along the way more or less “obvious”. Of course, hindsight is 20-20, and such a story shouldn’t be taken too literally. Rather, the benefit of developing Bloom filters in this way is that it will deepen our understanding of why Bloom filters work in just the way they do. We’ll explore some alternative directions that plausibly could have been taken – and see why they don’t work as well as Bloom filters ultimately turn out to work. At the end we’ll understand much better why Bloom filters are constructed the way they are.

Of course, this means that if your goal is just to understand the mechanics of Bloom filters, then this post isn’t for you. Instead, I’d suggest looking at a more conventional introduction – the Wikipedia article, for example, perhaps in conjunction with an interactive demo, like the nice one here. But if your goal is to understand why Bloom filters work the way they do, then you may enjoy the post.

A stylistic note: Most of my posts are code-oriented. This post is much more focused on mathematical analysis and algebraic manipulation: the point isn’t code, but rather how one could come to invent a particular data structure. That is, it’s the story behind the code that implements Bloom filters, and as such it requires rather more attention to mathematical detail.

General description of the problem: Let’s begin by abstracting away from the “safe web browsing” problem that began this post. We want a data structure which represents a set of objects. That data structure should enable two operations: (1) the ability to add an extra object to the set; and (2) a test to determine whether a given object is a member of . Of course, there are many other operations we might imagine wanting – for example, maybe we’d also like to be able to delete objects from the set. But we’re going to start with just these two operations of adding and testing. Later we’ll come back and ask whether operations such as deleteing objects are also possible.

Idea: store a set of hashed objects: Okay, so how can we solve the problem of representing in a way that’s more memory efficient than just storing all the objects in ? One idea is to store hashed versions of the objects in , instead of the full objects. If the hash function is well chosen, then the hashed objects will take up much less memory, but there will be little danger of making errors when testing whether an object is an element of the set or not.

Let’s be a little more explicit about how this would work. We have a set of objects , where denotes the number of objects in . For each object we compute an -bit hash function – i.e., a hash function which takes an arbitrary object as input, and outputs bits – and the set is represented by the set . We can test whether is an element of by checking whether is in the set of hashes. This basic hashing approach requires roughly bits of memory.

(As an aside, in principle it’s possible to store the set of hashed objects more efficiently, using just bits, where is to base two. The saving is possible because the ordering of the objects in a set is redundant information, and so in principle can be eliminated using a suitable encoding. However, I haven’t thought through what encodings could be used to do this in practice. In any case, the saving is likely to be minimal, since and will usually be quite a bit bigger than – if that weren’t the case, then hash collisions would occur all the time. So I’ll ignore the terms for the rest of this post. In fact, in general I’ll be pretty cavalier in later analyses as well, omitting lower order terms without comment.)

A danger with this hash-based approach is that an object outside the set might have the same hash value as an object inside the set, i.e., for some . In this case, test will erroneously report that is in . That is, this data structure will give us a false positive. Fortunately, by choosing a suitable value for , the number of bits output by the hash function, we can reduce the probability of a false positive as much as we want. To understand how this works, notice first that the probability of test giving a false positive is 1 minus the probability of test correctly reporting that is not in . This occurs when for all . If the hash function is well chosen, then the probability that is for each , and these are independent events. Thus the probability of test failing is:

This expression involves three quantities: the probability of test giving a false positive, the number of bits output by the hash function, and the number of elements in the set, . It’s a nice expression, but it’s more enlightening when rewritten in a slightly different form. What we’d really like to understand is how many bits of memory are needed to represent a set of size , with probability of a test failing. To understand that we let be the number of bits of memory used, and aim to express as a function of and . Observe that , and so we can substitute for to obtain

This can be rearranged to express in term of and :

This expression answers the question we really want answered, telling us how many bits are required to store a set of size with a probability of a test failing. Of course, in practice we’d like to be small – say – and when this is the case the expression may be approximated by a more transparent expression:

This makes intuitive sense: test failure occurs when is not in , but is in the hashed version of . Because this happens with probability , it must be that occupies a fraction of the total space of possible hash outputs. And so the size of the space of all possible hash outputs must be about . As a consequence we need bits to represent each hashed object, in agreement with the expression above.

How memory efficient is this hash-based approach to representing ? It’s obviously likely to be quite a bit better than storing full representations of the objects in . But we’ll see later that Bloom filters can be far more memory efficient still.

The big drawback of this hash-based approach is the false positives. Still, for many applications it’s fine to have a small probability of a false positive. For example, false positives turn out to be okay for the safe web browsing problem. You might worry that false positives would cause some safe sites to erroneously be reported as unsafe, but the browser can avoid this by maintaining a (small!) list of safe sites which are false positives for test.

Idea: use a bit array: Suppose we want to represent some subset of the integers . As an alternative to hashing or to storing directly, we could represent using an array of bits, numbered through . We would set bits in the array to if the corresponding number is in , and otherwise set them to . It’s obviously trivial to add objects to , and to test whether a particular object is in or not.

The memory cost to store in this bit-array approach is bits, regardless of how big or small is. Suppose, for comparison, that we stored directly as a list of 32-bit integers. Then the cost would be bits. When is very small, this approach would be more memory efficient than using a bit array. But as gets larger, storing directly becomes much less memory efficient. We could ameliorate this somewhat by storing elements of using only 10 bits, instead of 32 bits. But even if we did this, it would still be more expensive to store the list once got beyond one hundred elements. So a bit array really would be better for modestly large subsets.

Idea: use a bit array where the indices are given by hashes: A problem with the bit array example described above is that we needed a way of numbering the possible elements of , . In general the elements of may be complicated objects, not numbers in a small, well-defined range.

Fortunately, we can use hashing to number the elements of . Suppose is an -bit hash function. We’re going to represent a set using a bit array containing elements. In particular, for each we set the th element in the bit array, where we regard as a number in the range . More explicitly, we can add an element to the set by setting bit number in the bit array. And we can test whether is an element of by checking whether bit number in the bit array is set.

This is a good scheme, but the test can fail to give the correct result, which occurs whenever is not an element of , yet for some . This is exactly the same failure condition as for the basic hashing scheme we described earlier. By exactly the same reasoning as used then, the failure probability is

As we did earlier, we’d like to re-express this in terms of the number of bits of memory used, . This works differently than for the basic hashing scheme, since the number of bits of memory consumed by the current approach is , as compared to for the earlier scheme. Using and substituting for in Equation [*], we have:

Rearranging this to express in term of and we obtain:

When is small this can be approximated by

This isn’t very memory efficient! We’d like the probability of failure to be small, and that makes the dependence bad news when compared to the dependence of the basic hashing scheme described earlier. The only time the current approach is better is when is very, very large. To get some idea for just how large, if we want , then is only better than when gets to be more than about . That’s quite a set! In practice, the basic hashing scheme will be much more memory efficient.

Intuitively, it’s not hard to see why this approach is so memory inefficient compared to the basic hashing scheme. The problem is that with an -bit hash function, the basic hashing scheme used bits of memory, while hashing into a bit array uses bits, but doesn’t change the probability of failure. That’s exponentially more memory!

At this point, hashing into bit arrays looks like a bad idea. But it turns out that by tweaking the idea just a little we can improve it a lot. To carry out this tweaking, it helps to name the data structure we’ve just described (where we hash into a bit array). We’ll call it a filter, anticipating the fact that it’s a precursor to the Bloom filter. I don’t know whether “filter” is a standard name, but in any case it’ll be a useful working name.

Idea: use multiple filters: How can we make the basic filter just described more memory efficient? One possibility is to try using multiple filters, based on independent hash functions. More precisely, the idea is to use filters, each based on an (independent) -bit hash function, . So our data structure will consist of separate bit arrays, each containing bits, for a grand total of bits. We can add an element by setting the th bit in the first bit array (i.e., the first filter), the th bit in the second filter, and so on. We can test whether a candidate element is in the set by simply checking whether all the appropriate bits are set in each filter. For this to fail, each individual filter must fail. Because the hash functions are independent of one another, the probability of this is the th power of any single filter failing:

The number of bits of memory used by this data structure is and so we can substitute and rearrange to get

Provided is much smaller than , this expression can be simplified to give

Good news! This repetition strategy is much more memory efficient than a single filter, at least for small values of . For instance, moving from repetitions to repititions changes the denominator from to – typically, a huge improvement, since is very small. And the only price paid is doubling the numerator. So this is a big win.

Intuitively, and in retrospect, this result is not so surprising. Putting multiple filters in a row, the probability of error drops exponentially with the number of filters. By contrast, in the single filter scheme, the drop in the probability of error is roughly linear with the number of bits. (This follows from considering Equation [*] in the limit where is small.) So using multiple filters is a good strategy.

Of course, a caveat to the last paragraph is that this analysis requires that , which means that can’t be too large before the analysis breaks down. For larger values of the analysis is somewhat more complicated. In order to find the optimal value of we’d need to figure out what value of minimizes the exact expression [**] for . We won’t bother – at best it’d be tedious, and, as we’ll see shortly, there is in any case a better approach.

Overlapping filters: This is a variation on the idea of repeating filters. Instead of having separate bit arrays, we use just a single array of bits. When adding an object , we simply set all the bits in the same bit array. To test whether an element is in the set, we simply check whether all the bits are set or not.

What’s the probability of the test failing? Suppose . Failure occurs when for some and , and also for some and , and so on for all the remaining hash functions, . These are independent events, and so the probability they all occur is just the product of the probabilities of the individual events. A little thought should convince you that each individual event will have the same probability, and so we can just focus on computing the probability that for some and . The overall probability of failure will then be the th power of that probability, i.e.,

The probability that for some and is one minus the probability that for all and . These are independent events for the different possible values of and , each with probability , and so

since there are different pairs of possible values . It follows that

Substituting we obtain

which can be rearranged to obtain

This is remarkably similar to the expression [**] derived above for repeating filters. In fact, provided is much smaller than , we get

which is exactly the same as [**] when is small. So this approach gives quite similar outcomes to the repeating filter strategy.

Which approach is better, repeating or overlapping filters? In fact, it can be shown that

and so the overlapping filter strategy is more memory efficient than the repeating filter strategy. I won’t prove the inequality here – it’s a straightforward (albeit tedious) exercise in calculus. The important takeaway is that overlapping filters are the more memory-efficient approach.

How do overlapping filters compare to our first approach, the basic hashing strategy? I’ll defer a full answer until later, but we can get some insight by choosing and . Then for the overlapping filter we get , while the basic hashing strategy gives . Basic hashing is worse whenever is more than about 100 million – a big number, but also a big improvement over the required by a single filter. Given that we haven’t yet made any attempt to optimize , this ought to encourage us that we’re onto something.

Problems for the author

I suspect that there’s a simple intuitive argument that would let us see upfront that overlapping filters will be more memory efficient than repeating filters. Can I find such an argument?

Bloom filters: We’re finally ready for Bloom filters. In fact, Bloom filters involve only a few small changes to overlapping filters. In describing overlapping filters we hashed into a bit array containing bits. We could, instead, have used hash functions with a range and hashed into a bit array of (instead of ) bits. The analysis goes through unchanged if we do this, and we end up with

and

exactly as before. The only reason I didn’t do this earlier is because in deriving Equation [*] above it was convenient to re-use the reasoning from the basic hashing scheme, where (not ) was the convenient parameter to use. But the exact same reasoning works.

What’s the best value of to choose? Put another way, what value of should we choose in order to minimize the number of bits, , given a particular value for the probability of error, , and a particular sizek ? Equivalently, what value of will minimize , given and ? I won’t go through the full analysis here, but with calculus and some algebra you can show that choosing

minimizes the probability . (Note that denotes the natural logarithm, not logarithms to base 2.) By choosing in this way we get:

This really is good news! Not only is it better than a bit array, it’s actually (usually) much better than the basic hashing scheme we began with. In particular, it will be better whenever

which is equivalent to requiring

If we want (say) this means that Bloom filter will be better whenever , which is obviously an extremely modest set size.

Another way of interpreting [***] is that a Bloom filter requires bits per element of the set being represented. In fact, it’s possible to prove that any data structure supporting the add and test operations will require at least bits per element in the set. This means that Bloom filters are near-optimal. Futher work has been done finding even more memory-efficient data structures that actually meet the bound. See, for example, the paper by Anna Pagh, Rasmus Pagh, and S. Srinivasa Rao.

Problems for the author

Are the more memory-efficient algorithms practical? Should we be using them?

In actual applications of Bloom filters, we won’t know in advance, nor . So the way we usually specify a Bloom filter is to specify the maximum size of set that we’d like to be able to represent, and the maximal probability of error, , that we’re willing to tolerate. Then we choose

and

This gives us a Bloom filter capable of representing any set up to size , with probability of error guaranteed to be at most . The size is called the capacity of the Bloom filter. Actually, these expressions are slight simplifications, since the terms on the right may not be integers – to be a little more pedantic, we choose

and

One thing that still bugs me about Bloom filters is the expression for the optimal value for . I don’t have a good intuition for it – why is it logarithmic in , and why does it not depend on ? There’s a tradeoff going on here that’s quite strange when you think about it: bit arrays on their own aren’t very good, but if you repeat or overlap them just the right number of times, then performance improves a lot. And so you can think of Bloom filters as a kind of compromise between an overlap strategy and a bit array strategy. But it’s really not at all obvious (a) why choosing a compromise strategy is the best; or (b) why the right point at which to compromise is where it is, i.e., why has the form it does. I can’t quite answer these questions at this point – I can’t see that far through Bloom filters. I suspect that understanding the case really well would help, but haven’t put in the work. Anyone with more insight is welcome to speak up!

Summing up Bloom filters: Let’s collect everything together. Suppose we want a Bloom filter with capacity , i.e., capable of representing any set containing up to elements, and such that test produces a false positive with probability at most . Then we choose

independent hash functions, . Each hash function has a range , where is the number of bits of memory our Bloom filter requires,

We number the bits in our Bloom filter from . To add an element to our set we set the bits in the filter. And to test whether a given element is in the set we simply check whether bits in the bit array are all set.

That’s all there is to the mechanics of how Bloom filters work! I won’t give any sample code – I usually provide code samples in Python, but the Python standard library lacks bit arrays, so nearly all of the code would be concerned with defining a bit array class. That didn’t seem like it’d be terribly illuminating. Of course, it’s not difficult to find libraries implementing Bloom filters. For example, Jason Davies has written a javascript Bloom filter which has a fun and informative online interactive visualisation. I recommend checking it out. And I’ve personally used Mike Axiak‘s fast C-based Python library pybloomfiltermmap – the documentation is clear, it took just a few minutes to get up and running, and I’ve generally had no problems.

Problems

Suppose we have two Bloom filters, corresponding to sets and . How can we construct the Bloom filters corresponding to the sets and ?

Applications of Bloom filters: Bloom filters have been used to solve many different problems. Here’s just a few examples to give the flavour of how they can be used. An early idea was Manber and Wu’s 1994 proposal to use Bloom filters to store lists of weak passwords. Google’s BigTable storage system uses Bloom filters to speed up queries, by avoiding disk accesses for rows or columns that don’t exist. Google Chrome uses Bloom filters to do safe web browsing – the opening example in this post was quite real! More generally, it’s useful to consider using Bloom filters whenever a large collection of objects needs to be stored. They’re not appropriate for all purposes, but at the least it’s worth thinking about whether or not a Bloom filter can be applied.

Extensions of Bloom filters: There’s many clever ways of extending Bloom filters. I’ll briefly describe one, just to give you the flavour, and provide links to several more.

A delete operation: It’s possible to modify Bloom filters so they support a delete operation that lets you remove an element from the set. You can’t do this with a standard Bloom filter: it would require unsetting one or more of the bits in the bit array. This could easily lead us to accidentally deleteother elements in the set as well.

Instead, the delete operation is implemented using an idea known as a counting Bloom filter. The basic idea is to take a standard Bloom filter, and replace each bit in the bit array by a bucket containing several bits (usually 3 or 4 bits). We’re going to treat those buckets as counters, initially set to . We add an element to the counting Bloom filter by incrementing each of the buckets numbered . We test whether is in the counting Bloom filter by looking to see whether each of the corresponding buckets are non-zero. And we delete by decrementing each bucket.

This strategy avoids the accidental deletion problem, because when two elements of the set and hash into the same bucket, the count in that bucket will be at least . deleteing one of the elements, say , will still leave the count for the bucket at least , so won’t be accidentally deleted. Of course, you could worry that this will lead us to erroneously conclude that is still in the set after it’s been deleted. But that can only happen if other elements in the set hash into every single bucket that hashes into. That will only happen if is very large.

Of course, that’s just the basic idea behind counting Bloom filters. A full analysis requires us to understand issues such as bucket overflow (when a counter gets incremented too many times), the optimal size for buckets, the probability of errors, and so on. I won’t get into that, but you there’s details in the further reading, below.

Other variations and further reading: There are many more variations on Bloom filters. Just to give you the flavour of a few applications: (1) they can be modified to be used as lookup dictionaries, associating a value with each element added to the filter; (2) they can be modified so that the capacity scales up dynamically; and (3) they can be used to quickly approximate the number of elements in a set. There are many more variations as well: Bloom filters have turned out to be a very generative idea! This is part of why it’s useful to understand them deeply, since even if a standard Bloom filter can’t solve the particular problem you’re considering, it may be possible to come up with a variation which does. You can get some idea of the scope of the known variations by looking at the Wikipedia article. I also like the survey article by Andrei Broder and Michael Mitzenmacher. It’s a little more dated (2004) than the Wikipedia article, but nicely written and a good introduction. For a shorter introduction to some variations, there’s also a recent blog post by Matthias Vallentin. You can get the flavour of current research by looking at some of the papers citing Bloom filters here. Finally, you may enjoy reading the original paper on Bloom filters, as well as the original paper on counting Bloom filters.

Understanding data structures: I wrote this post because I recently realized that I didn’t understand any complex data structure in any sort of depth. There are, of course, a huge number of striking data structures in computer science – just look at Wikipedia’s amazing list! And while I’m familiar with many of the simpler data structures, I’m ignorant of most complex data structures. There’s nothing wrong with that – unless one is a specialist in data structures there’s no need to master a long laundry list. But what bothered me is that I hadn’t thoroughly mastered even a single complex data structure. In some sense, I didn’t know what it means to understand a complex data structure, at least beyond surface mechanics. By trying to reinvent Bloom filters, I’ve found that I’ve deepened my own understanding and, I hope, written something of interest to others.

More precisely, I crawled 250,113,669 pages for just under 580 dollars in 39 hours and 25 minutes, using 20 Amazon EC2 machine instances.

I carried out this project because (among several other reasons) I wanted to understand what resources are required to crawl a small but non-trivial fraction of the web. In this post I describe some details of what I did. Of course, there’s nothing especially new: I wrote a vanilla (distributed) crawler, mostly to teach myself something about crawling and distributed computing. Still, I learned some lessons that may be of interest to a few others, and so in this post I describe what I did. The post also mixes in some personal working notes, for my own future reference.

What does it mean to crawl a non-trivial fraction of the web? In fact, the notion of a “non-trivial fraction of the web” isn’t well defined. Many websites generate pages dynamically, in response to user input – for example, Google’s search results pages are dynamically generated in response to the user’s search query. Because of this it doesn’t make much sense to say there are so-and-so many billion or trillion pages on the web. This, in turn, makes it difficult to say precisely what is meant by “a non-trivial fraction of the web”. However, as a reasonable proxy for the size of the web we can use the number of webpages indexed by large search engines. According to this presentation by Googler Jeff Dean, as of November 2010 Google was indexing “tens of billions of pages”. (Note that the number of urls is in the trillions, apparently because of duplicated page content, and multiple urls pointing to the same content.) The now-defunct search engine Cuil claimed to index 120 billion pages. By comparison, a quarter billion is, obviously, very small. Still, it seemed to me like an encouraging start.

Code: Originally I intended to make the crawler code available under an open source license at GitHub. However, as I better understood the cost that crawlers impose on websites, I began to have reservations. My crawler is designed to be polite and impose relatively little burden on any single website, but could (like many crawlers) easily be modified by thoughtless or malicious people to impose a heavy burden on sites. Because of this I’ve decided to postpone (possibly indefinitely) releasing the code.

There’s a more general issue here, which is this: who gets to crawl the web? Relatively few sites exclude crawlers from companies such as Google and Microsoft. But there are a lot of crawlers out there, many of them without much respect for the needs of individual siteowners. Quite reasonably, many siteowners take an aggressive approach to shutting down activity from less well-known crawlers. A possible side effect is that if this becomes too common at some point in the future, then it may impede the development of useful new services, which need to crawl the web. A possible long-term solution may be services like Common Crawl, which provide access to a common corpus of crawl data.

The master machine (my laptop) begins by downloading Alexa’s list of the top million domains. These were used both as a domain whitelist for the crawler, and to generate a starting list of seed urls.

The domain whitelist was partitioned across the 20 EC2 machine instances in the crawler. This was done by numbering the instances and then allocating the domain domain to instance number hash(domain) % 20, where hash is the standard Python hash function.

Deployment and management of the cluster was handled using Fabric, a well-documented and nicely designed Python library which streamlines the use of ssh over clusters of machines. I managed the connection to Amazon EC2 using a set of Python scripts I wrote, which wrap the boto library.

I used 20 Amazon EC2 extra large instances, running Ubuntu 11.04 (Natty Narwhal) under the ami-68ad5201 Amazon machine image provided by Canonical. I used the extra large instance after testing on several instance types; the extra large instances provided (marginally) more pages downloaded per dollar spent. I used the US East (North Virginia) region, because it’s the least expensive of Amazon’s regions (along with the US West, Oregon region).

Single instance architecture: Each instance further partitioned its domain whitelist into 141 separate blocks of domains, and launched 141 Python threads, with each thread responsible for crawling the domains in one block. Here’s how it worked (details below):

The reason for using threads is that the Python standard library uses blocking I/O to handle http network connections. This means that a single-threaded crawler would spend most of its time idling, usually waiting on the network connection of the remote machine being crawled. It’s much better to use a multi-threaded crawler, which can make fuller use of the resources available on an EC2 instance. I chose the number of crawler threads (141) empirically: I kept increasing the number of threads until the speed of the crawler started to saturate. With this number of threads the crawler was using a considerable fraction of the CPU capacity available on the EC2 instance. My informal testing suggested that it was CPU which was the limiting factor, but that I was not so far away from the network and disk speed becoming bottlenecks; in this sense, the EC2 extra large instance was a good compromise. Memory useage was never an issue. It’s possible that for this reason EC2’s high-CPU extra large instance type would have been a better choice; I only experimented with this instance type with early versions of the crawler, which were more memory-limited.

How domains were allocated across threads: The threads were numbered , and domains were allocated on the basis of the Python hash function, to thread number hash(domain) % 141 (similar to the allocation across machines in the cluster). Once the whitelisted domains / seed urls were allocated to threads, the crawl was done in a simple breadth-first fashion, i.e., for each seed url we download the corresponding web page, extract the linked urls, and check each url to see: (a) whether the extracted url is a fresh url which has not already been seen and added to the url frontier; and (b) whether the extracted url is in the same seed domain as the page which has just been crawled. If both these conditions are met, the url is added to the url frontier for the current thread, otherwise the url is discarded. With this architecture we are essentially carrying out a very large number of independent crawls of the whitelisted domains obtained from Alexa.

Note that this architecture also ensures that if, for example, we are crawling a page from TechCrunch, and extract from that page a link to the Huffington Post, then the latter link will be discarded, even though the Huffington Post is in our domain whitelist. The only links added to the url frontier will be those that point back to TechCrunch itself. The reason we avoid adding dealing with (whitelisted) external links is because: (a) it may require communication between different EC2 instances, which would substantially complicate the crawler; and, more importantly, (b) in practice, most sites have lots of internal links, and so it’s unlikely that this policy means the crawler is missing much.

One advantage of allocating all urls from the same domain to the same crawler thread is that it makes it much easier to crawl politely, since no more than one connection to a site will be open at any given time. In particular, this ensures that we won’t be hammering any given domain with many simultaneous connections from different threads (or different machines).

Problems for the author

For some very large and rapidly changing websites it may be necessary to open multiple simultaneous connections in order for the crawl to keep up with the changes on the site. How can we decide when that is appropriate?

How the url frontiers work: A separate url frontier file was maintained for each domain. This was simply a text file, with each line containing a single url to be crawled; initially, the file contains just a single line, with the seed url for the domain. I spoke above of the url frontier for a thread; that frontier can be thought of as the combination of all the url frontier files for domains being crawled by that thread.

Each thread maintained a connection to a redis server. For each domain being crawled by the thread a redis key-value pair was used to keep track of the current position in the url frontier file for that domain. I used redis (and the Python bindings) to store this information in a fashion that was both persistent and fast to look up. The persistence was important because it meant that the crawler could be stopped and started at will, without losing track of where it was in the url frontier.

Each thread also maintained a dictionary whose keys were the (hashed) domains for that thread. The corresponding values were the next time it would be polite to crawl that domain. This value was set to be 70 seconds after the last time the domain was crawled, to ensure that domains weren’t getting hit too often. The crawler thread simply iterated over the keys in this dictionary, looking for the next domain it was polite to crawl. Once it found such a domain it then extracted the next url from the url frontier for that domain, and went about downloading that page. If the url frontier was exhausted (some domains run out of pages to crawl) then the domain key was removed from the dictionary. One limitation of this design was that when restarting the crawler each thread had to identify again which domains had already been exhausted and should be deleted from the dictionary. This slowed down the restart a little, and is something I’d modify if I were to do further work with the crawler.

Use of a Bloom filter: I used a Bloom filter to keep track of which urls had already been seen and added to the url frontier. This enabled a very fast check of whether or not a new candidate url should be added to the url frontier, with only a low probability of erroneously adding a url that had already been added. This was done using Mike Axiak‘s very nice C-based pybloomfiltermmap.

Update: Jeremy McLain points out in comments that I’ve got this backward, and that with a Bloom filter there is a low probability “that you will never crawl certain URLs because your bloom filter is telling you they have already been crawled when in fact they have not.” A better (albeit slightly slower) solution would be to simply store all the URLs, and check directly.

Anticipated versus unanticipated errors: Because the crawler ingests input from external sources, it needs to deal with many potential errors. By design, there are two broad classes of error: anticipated errors and unanticipated errors.

Anticipated errors are things like a page failing to download, or timing out, or containing unparseable input, or a robots.txt file disallowing crawling of a page. When anticipated errors arise, the crawler writes the error to a (per-thread) informational log (the “info log” in the diagram above), and continues in whatever way is appropriate. For example, if the robots.txt file disallows crawling then we simply continue to the next url in the url frontier.

Unanticipated errors are errors which haven’t been anticipated and designed for. Rather than the crawler falling over, the crawler simply logs the error (to the “critical log” in the diagram above), and moves on to the next url in the url frontier. At the same time, the crawler tracks how many unanticipated errors have occurred in close succession. If many unanticipated errors occur in close succession it usually indicates that some key piece of infrastructure has failed. Because of this, if there are too many unanticipated errors in close succession, the crawler shuts down entirely.

As I was developing and testing the crawler, I closely followed the unanticipated errors logged in the critical log. This enabled me to understand many of the problems faced by the crawler. For example, early on in development I found that sometimes the html for a page would be so badly formed that the html parser would have little choice but to raise an exception. As I came to understand such errors I would rewrite the crawler code so such errors become anticipated errors that were handled as gracefully as possible. Thus, the natural tendency during development was for unanticipated errors to become anticipated errors.

Domain and subdomain handling: As mentioned above, the crawler works by doing lots of parallel intra-domain crawls. This works well, but a problem arises because of the widespread use of subdomains. For example, if we start at the seed url http://barclays.com and crawl only urls within the barclays.com domain, then we quickly run out of urls to crawl. The reason is that most of the internal links on the barclays.com site are actually to group.barclays.com, not barclays.com. Our crawler should also add urls from the latter domain to the url frontier for barclays.com.

We resolve this by stripping out all subdomains, and working with the stripped domains when deciding whether to add a url to the url frontier. Removing subdomains turns out to be a surprisingly hard problem, because of variations in the way domain names are formed. Fortunately, the problem seems to be well solved using John Kurkowski’stldextract library.

On the representation of the url frontier: I noted above that a separate url frontier file was maintained for each domain. In an early version of the code, each crawler thread had a url frontier maintained as a single flat text file. As a crawler thread read out lines in the file, it would crawl those urls, and append any new urls found to the end of the file.

This approach seemed natural to me, but organizing the url frontier files on a per-thread (rather than per-domain) basis caused a surprising number of problems. As the crawler thread moved through the file to find the next url to crawl, the crawler thread would encounter urls belonging to domains that were not yet polite to crawl because they’d been crawled too recently. My initial strategy was simply to append such urls to the end of the file, so they would be found again later. Unfortunately, there were often a lot of such urls in a row – consecutive urls often came from the same domain (since they’d been extracted from the same page). And so this strategy caused the file for the url frontier to grow very rapidly, eventually consuming most disk space.

Exacerbating this problem, this approach to the url frontier caused an unforseen “domain clumping problem”. To understand this problem, imagine that the crawler thread encountered (say) 20 consecutive urls from a single domain. It might crawl the first of these, extracting (say) 20 extra urls to append to the end of the url frontier. But the next 19 urls would all be skipped over, since it wouldn’t yet be polite to crawl them, and they’d also be appended to the end of the url frontier. Now we have 39 urls from the same domain at the end of the url frontier. But when the crawler thread gets to those, we may well have the same process repeat – leading to a clump of 58 urls from the same domain at the end of the file. And so on, leading to very long runs of urls from the same domain. This consumes lots of disk space, and also slows down the crawler, since the crawler thread may need to examine a large number of urls before it finds a new url it’s okay to crawl.

These problems could have been solved in various ways; moving to the per-domain url frontier file was how I chose to address the problems, and it seemed to work well.

Choice of number of threads: I mentioned above that the number of crawler threads (141) was chosen empirically. However, there is an important constraint on that number, and in particular its relationship to the number (20) of EC2 instances being used. Suppose that instead of 141 threads I’d used (say) 60 threads. This would create a problem. To see why, note that any domain allocated to instance number 7 (say) would necessarily satisfy hash(domain) % 20 = 7. This would imply that hash(domain) % 60 = 7 or 27 or 47, and as a consequence all the domains would be allocated to just one of three crawler threads (thread numbers 7, 27 and 47), while the other 57 crawler threads would lie idle, defeating the purpose of using multiple threads.

One way to solve this problem would be to use two independent hash functions to allocate domains to EC2 instances and crawler threads. However, an even simpler way of solving the problem is to choose the number of crawler threads to be coprime to the number of EC2 instances. This coprimality ensures that domains will be allocated reasonably evenly across both instance and threads. (I won’t prove this here, but it can be proved with a little effort). It is easily checked that 141 and 20 are coprime.

Note, incidentally, that Python’s hash is not a true hash function, in the sense that it doesn’t guarantee that the domains will be spread evenly across EC2 instances. It turns out that Python’s hash takes similar key strings to similar hash values. I talk more about this point (with examples) in the fifth paragraph of this post. However, I found empirically that hash seems to spread domains evenly enough across instances, and so I didn’t worry about using a better (but slower) hash function, like those available through Python’s hashlib library.

Use of Python: All my code was written in Python. Initially, I wondered if Python might be too slow, and create bottlenecks in the crawling. However, profiling the crawler showed that most time was spent either (a) managing network connections and downloading data; or (b) parsing the resulting webpages. The parsing of the webpages was being done using lxml, a Python binding to fast underlying C libraries. It didn’t seem likely to be easy to speed that up, and so I concluded that Python was likely not a particular bottleneck in the crawling.

Politeness: The crawler used Python’s robotparser library in order to observe the robots exclusion protocol. As noted above, I also imposed an absolute 70-second minimum time interval between accesses to any given domain. In practice, the mean time between accesses was more like 3-4 minutes.

In initial test runs of the crawler I got occasional emails from webmasters asking for an explanation of why I was crawling their site. Because of this, in the crawler’s User-agent I included a link to a webpage explaining the purpose of my crawler, how to exclude it from a site, and what steps I was taking to crawl politely. This was (I presume) both helpful to webmasters and also helpful to me, for it reduced the number of inquiries. A handful of people asked me to exclude their sites from the crawl, and I complied quickly.

Problems for the author

Because my crawl didn’t take too long, the robots.txt file was downloaded just once for each domain, at the beginning of the crawl. In a longer crawl, how should we decide how long to wait between downloads of robots.txt?

Truncation: The crawler truncates large webpages rather than downloading the full page. It does this in part because it’s necessary – it really wouldn’t surprise me if someone has a terabyte html file sitting on a server somewhere – and in part because for many applications it will be of more interest to focus on earlier parts of the page.

What’s a reasonable threshold for truncation? According to this report from Google, as of May 2010 the average network size of a webpage from a top site is 312.04 kb. However, that includes images, scripts and stylesheets, which the crawler ignores. If you ignore the images and so on, then the average network size drops to just 33.66 kb.

However, that number of 33.66 kb is for content which may be served compressed over the network. Our truncation will be based on the uncompressed size. Unfortunately, the Google report doesn’t tell us what the average size of the uncompressed content is. However, we can get an estimate of this, since Google reports that the average uncompressed size of the total page (including images and so on) is 477.26 kb, while the average network size is 312.04 kb.

Assuming that this compression ratio is typical, we estimate that the average uncompressed size of the content the crawler downloads is 51 kb. In the event, I experimented with several truncation settings, and found that a truncation threshold of 200 kilobytes enabled me to download the great majority of webpages in their entirety, while addressing the problem of very large html files mentioned above. (Unfortunately, I didn’t think to check what the actual average uncompressed size was, my mistake.)

Storage: I stored all the data using EC2’s built-in instance storage – 1.69 Terabytes for the extra-large instances I was using. This storage is non-persistent, and so any data stored on an instance will vanish when that instance is terminated. Now, for many kinds of streaming or short-term analysis of data this would be adequate – indeed, it might not even be necessary to store the data at all. But, of course, for many applications of a crawl this approach is not appropriate, and the instance storage should be supplemented with something more permanent, such as S3. For my purposes using the instance storage seemed fine.

Price: The price broke down into two components: (1) 512 dollars for the use of the 20 extra-large EC2 instances for 40 hours; and (2) about 65 dollars for a little over 500 gigabytes of outgoing bandwidth, used to make http requests. Note that Amazon does not charge for incoming bandwidth (a good thing, too!) It would be interesting to compare these costs to the (appropriately amortized) costs of using other cloud providers, or self-hosting.

Something I didn’t experiment with is the use of Amazon’s spot instances, where you can bid to use Amazon’s unused EC2 capacity. I didn’t think of doing this until just as I was about to launch the crawl. When I went to look at the spot instance pricing history, I discovered to my surprise that the spot instance prices are often a factor of 10 or so lower than the prices for on-demand instances! Factoring in the charges for outgoing bandwidth, this means it may be possible to use spot instances to do a similar crawl for 120 dollars or so, a factor of five savings. I considered switching, but ultimately decided against it, thinking that it might take 2 or 3 days work to properly understand the implications of switching, and to get things working exactly as I wanted. Admittedly, it’s possible that it would have taken much less time, in which case I missed an opportunity to trade some money for just a little extra time.

Improvements to the crawler architecture: Let me finish by noting a few ways it’d be interesting to improve the current crawler:

For many long-running applications the crawler would need a smart crawl policy so that it knows when and how to re-crawl a page. According to a presentation from Jeff Dean, Google’s mean time to index a new page is now just minutes. I don’t know how that works, but imagine that notification protocols such as pubsubhubbub play an important role. It’d be good to change the crawler so that it’s pubsubhubbub aware.

The crawler currently uses a threaded architecture. Another quite different approach is to use an evented architecture. What are the pros and cons of a multi-threaded versus an evented architecture?

The instances in the cluster are configured using fabric and shell scripts to install programs such as redis, pybloomfilter, and so on. This is slow and not completely reliable. Is there a better way of doing this? Creating my own EC2 AMI? Configuration management software such as Chef and Puppet? I considered using one of the latter, but deferred it because of the upfront cost of learning the systems.

Logging is currently done using Python’s logging module. Unfortunately, I’m finding this is not well-adapted to Python’s threading. Is there a better solution?

The crawler was initially designed for crawling in a batch environment, where it is run and then terminates. I’ve since modified it so that it can be stopped, modifications made, and restarted. It’d be good to add instrumentation so it can be modified more dynamically, in real time.

Many interesting research papers have been published about crawling. I read or skimmed quite a few while writing my crawler, but ultimately used only a few of the ideas; just getting the basics right proved challenging enough. In future iterations it’d be useful to look at this work again and to incorporate the best ideas. Good starting points include a chapter in the book by Manning, Raghavan and Sch\”utze, and a survey paper by Olston and Najork. Existing open source crawlers such as Heritrix and Nutch would also be interesting to look at in more depth.

It’s tempting to think IBM’s Jeopardy-playing machine Watson must have relied on some huge algorithmic advance or silver bullet idea in order to beat top human Jeopardy players. But the researchers behind Watson have written a very interesting paper about how Watson works, and a different picture emerges. It’s not that they found any super-algorithm for answering questions. Instead, Watson combined a large number of different algorithms, most of them variations on standard algorithms from natural language processing and machine learning. Individually, none of these algorithms was particularly good at solving Jeopardy puzzles, or the sub-problems that needed to be solved along the way. But by integrating a suite of not-very-good algorithms in just the right way, the Watson team got superior performance. As they write in the paper:

rapid integration and evaluation of new ideas and new components against end-to-end metrics [were] essential to our progress… [Question answering benefits from] a single extensible architecture that [allows] component results to be consistently evaluated in a common technical context against a growing variety of… “Challenge Problems.”… Our commitment to regularly evaluate the effects of specific techniques on end-to-end-performance, and to let that shape our research investment, was necessary for our rapid progress.

In other words, they built an extensive evaluation environment that gave detailed and clear metrics that let them see how well their system was doing. In turn, they used these metrics to determine where to invest time and money in improving their system and, equally important, to determine what ideas to abandon. We will call this style of development evaluation-driven development. It’s not a new idea, of course – anyone who’s ever run an A/B test is doing something similar – but the paper implies that Watson took the approach to a remarkable extreme, and that it was this approach which was responsible for much of Watson’s success.

In the last post I developed a very simple question-answering system based on the AskMSR system developed at Microsoft Research in 2001. I’ve since been experimenting – in a very small-scale way! – with the use of evaluation-driven development to improve the performance of that system. In this post I describe some of my experiments. Of course, my experiments don’t compare in sophistication to what the Watson team did, nor with what is done in many other machine learning systems. Still, I hope that the details are of interest to at least a few readers. The code for the experiments may be found on GitHub here. Note that you don’t need to have read the last post to understand this one, although of course it wouldn’t hurt!

First evaluation: The system described in my last post took questions like “Who is the world’s richest person?” and rewrote those questions as search queries for Google. E.g., for the last question it might use Google to find webpages matching “the world’s richest person is *”, and then look for name-like strings in the wildcard position. It would extract possible strings, and score them based on a number of factors, such as whether the strings were capitalized, how likely the search query was to give the correct results, and so on. Finally, it returned the highest-scoring strings as candidate answers.

To evaluate this system, I constructed a list of 100 questions, with each question having a list of one or more acceptable answers. Example questions and acceptable answers included:

(1) Who married the Prince of Wales in 2005?: Camilla Parker Bowles

(2) Who was the first woman to climb the mountain K2?: Wanda Rutkiewicz

(3) Who was the bass guitarist for the band Nirvana?: Krist Novoselic

(4) Who wrote ‘The Waste Land’?: T. S. Eliot or Thomas Stearns Eliot

(5) Who won the 1994 Formula One Grand Prix championship?: Michael Schumacher

The system described in my last post got 41 of the 100 questions exactly right, i.e., it returned an acceptable answer as its top-scoring answer. Furthemore, the system returned a correct answer as one of its top 20 scored responses for 75 of the 100 questions. Among these, the average rank was 3.48: i.e., when Google could find the answer at all, it tended to rank it very highly.

(Actually, to be precise, I used a slightly modified version of the system in the last post: I fixed a bug in how the program parsed strings. The bug is conceptually unimportant, but fixing it increased performance slightly.)

Comparison to Wolfram Alpha: As a comparison, I tried submitting the evaluation questions also to Wolfram Alpha, and determining if the answers returned by Wolfram were in the list of acceptable answers. Now, Wolfram doesn’t always return an answer. But for 27 of those questions it did return an answer. And 20 of those answers were correct.

Hybrid system: Two things are notable about Wolfram’s performance versus the Google-based system: (1) Wolfram gets answers correct much less often than the Google-based system; and (2) When Wolfram returns an answer, it is much more likely to be correct (20 / 27, or 74 percent) than the Google-based system. This suggests using a hybrid system which applies the following procedure:

Let’s see if we can guess how well this system will work. One possible assumption is that the correctness of the Google-based system and the Wolfram-based system are uncorrelated. If that’s the case, then we’d expect that the hybrid system would get 20 / 27 questions correct, for those questions answered by Wolfram, and the Google-based system would get 41 percent of the remaining 73 questions correct. That gives a total of 50 questions correct.

Now, in practice, when I ran the hybrid procedure it only gave 45 correct answers. That’s a considerable improvement (4 questions) over the Google-based system alone. On the other hand, it’s less than half the improvement we’d expect if the Google-based system and the Wolfram-based system were uncorrelated. What’s going on is that the evaluation questions which Wolfram Alpha is good at answering are, for the most part, questions which Google is also good at answering. In other words, what we learn from the evaluation is that there is considerable overlap (or correlation) in the type of questions these two systems are good at answering, and so we get less improvement than we might have thought.

Problems for the author

In developing the hybrid algorithm we relied on the fact that when Wolfram Alpha returned an answer it was highly likely to be correct – much more likely than the Google-based system. But suppose the Google-based system says that there is a very big gap in score between the top-ranked and second-ranked answer. Might that signal that we can have high confidence in the Google-based answer? Determine where it is possible to gain better performance by returning the Google-based answer preferentially when the ratio of the top-ranked and second-ranked score is sufficiently large.

Are there other problem features that would enable us to determine whether a question is more or less likely to be answered correctly by the Google-based system or by the Wolfram-based system? Maybe, for example, long questions are more likely to be answered correctly by Google. Modify the code to determine whether this is the case.

Improving the scoring of proper nouns in the Google-based system: In the Google-based system, the scoring mechanism increase the scores of candidate answers by a factor for each word in the answer which is capitalized. The idea is to score more highly candidate answers which are likely to be proper nouns.

Originally, I chose the capitalization factor to be 3.0, based on nothing more than an intuitive guess. Let’s use the evaluation system to see if we can improve this guess. The results are below; you may wish to skip straight to the list of lessons learned, however, as an entree into the results.

Factor: 4.0
Top-ranking answer is correct: 33
A correct answer in the top 20: 74
Average rank for correct answers in the top 20: 4.46
Factor: 3.0 (original parameter choice)
Top-ranking answer is correct: 41
A correct answer in the top 20: 75
Average rank for correct answers in the top 20: 3.48
Factor: 2.5
Top-ranking answer is correct: 42
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.41
Factor: 2.4
Top-ranking answer is correct: 47
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.38
Factor: 2.3
Top-ranking answer is correct: 47
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.25
Factor: 2.2
Top-ranking answer is correct: 47
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.21
Factor: 2.1
Top-ranking answer is correct: 47
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.24
Factor: 2.0
Top-ranking answer is correct: 42
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.13
Factor: 1.9
Top-ranking answer is correct: 44
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.13
Factor: 1.8
Top-ranking answer is correct: 43
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.12
Factor: 1.7
Top-ranking answer is correct: 38
A correct answer in the top 20: 76
Average rank for correct answers in the top 20: 3.39
Factor: 1.6
Top-ranking answer is correct: 34
A correct answer in the top 20: 74
Average rank for correct answers in the top 20: 3.22
Factor: 1.5
Top-ranking answer is correct: 29
A correct answer in the top 20: 73
Average rank for correct answers in the top 20: 3.38
Factor: 1.0
Top-ranking answer is correct: 1
A correct answer in the top 20: 65
Average rank for correct answers in the top 20: 7.74
Factor: 0.8
Top-ranking answer is correct: 1
A correct answer in the top 20: 52
Average rank for correct answers in the top 20: 10.25

Looking at the above, we learn a great deal:

(1) Increasing the capitalization factor to much above my original guess at a value (3.0) starts to significantly degrade the system.

(2) Removing the capitalization factor (i.e., setting it to 1.0) makes the system much, much worse. It still does okay at getting things approximately correct — for 65 of the questions a correct answer is in the top 20 returned responses. But for only 1 of the 100 questions is the top-ranked answer actually correct. In other words, we learn that paying attention to capitalization makes a huge difference to how likely our algorithm is to return a correct answer as the top result. It’s also interesting to note that even ignoring capitalization the algorithm does pretty well at returning a high-ranked answer that is correct.

(3) Actually penalizing capitalized answers (by choosing a capitalization factor of 0.8) makes things even worse. This is not surprising, but it’s useful to see explicitly, and further confirmation that paying attention to capitalization matters.

(4) The optimal parameter range for the capitalization factor depends on whether you want to maximize the number of perfect answers, in which case the capitalization factor should be 2.1-2.4, or to get the best possible rank, in which case the capitalization factor should be 1.8-2.0.

(5) Performance only varies a small amount over the range of capitalization factors 1.8-2.3. Even taking the capitalization factor out to its original value (3.0) decreases performance only a small amount.

Improving the scoring of proper nouns in the hybrid system: We’ve just used our evaluation suite to improve the value of the capitalization factor for the Google-based question-answering system. What happens when go through the same procedure for the hybrid question-answering system? A priori the optimal value for the capitalization factor may be somewhat different. I ran the evaluation algorithm for many different values of the capitaliation factor. Here are the results:

The pattern is quite similar to the Google-based system, with the highest scores being obtained for capitalization factors in the range 2.1-2.4. As a result, I’ve change the value for the capitalization factor from its original value of 3.0 to a somewhat lower value, 2.2.

Varying the weights of the rewrite rules: Recall from the last post that questions of the form “Who VERB w1 w2 w3…” got rewritten as Google queries:

"VERB w1 w2 w3 ... "
"w1 VERB w2 w3 ... "
"w1 w2 VERB w3 ... "
...

The candidate answers extracted from the search result for these rewritten queries were given a score of 5, following the scoring used in the original AskMSR research paper. Another rewrite rule formulated a single Google query which omitted both the quotes and VERB, i.e., it was just:

w1 w2 w3 ...

The candidate answers extracted from the search results for these rewritten queries were given a score of 2.

The scores 5 and 2 seem pretty arbitrary. Can we improve our system by changing the scores? Let’s experiment, and find out. To do the experiment, let me first mention that all that matters to the final ranking is the ratio of the two scores (initially, 5/2 = 2.5). So I’ll give the results of running our evaluation suite for different choices of that ratio:

(3) Dropping the unquoted rewritten queries has almost no effect at all, dropping the number of correct answers from 51 to 50.

(4) Choosing the ratio to be 2.5, as was done in the original AskMSR paper, seems to provide good performance.

Two problems in the way we’ve used evaluation questions to improve the system

(1) It’s not clear that the evaluation questions are truly a random sample of well-posed “Who…?” questions. They’re simply a set of questions that happened to occur to me while I was working on the program, and they naturally reflect my biases and interests. In fact, the problem is worse than that: it’s not even clear what it would mean for the evaluation questions to be a random sample. A random sample from what natural population?

(2) There’s something fishy about using the same data repeatedly to improve the parameters in our system. It can lead to the problem of overfitting, where the parameters we end up choosing reflect accidental features of the evaluation set. A standard solution to this problem is cross-validation, dividing evaluation questions up into training data (which are used to estimate parameters) and validation data (which are used to check that overfitting hasn’t occurred).

Problem (2) could be solved with a much larger set of evaluation questions, and that’s a good goal for the future. If anyone knows of a good source of evaluation questions, I’d like to hear about it! It’s not at all clear to me how to solve problem (1), and I’d be interested to hear people’s ideas.

Automating: What we’ve done here is very similar to what people routinely do with statistical models or machine learning, where you use training data to find the best possible parameters for your model. What’s interesting, though, is that we’re not just varying parameters, we’re changing the algorithm. It’s interesting to learn that Wolfram Alpha and Google overlap more than you might a priori expect in the types of questions they’re good (and bad) at answering. And it tells you that effort invested in improving one or the other approach might have less impact than you expect, unless you’re careful to aim your improvements at questions not already well-answered by the other system.

Problems

Find three ways of modifying the hybrid algorithm, and use the evaluation algorithm to see how well the modifications work. Can you improve the algorithm so it gets more than 51 of the evaluation questions exactly right?

Acknowledgements: Thanks to Nathan Hoffman for gently suggesting that it might be a good idea to cache results from my screen-scraping!

In 2011, IBM got a lot of publicity when it demonstrated a computer named Watson, which was designed to answer questions on the game show Jeopardy. Watson was good enough to be competitive with (and ultimately better than) some of the best ever Jeopardy players.

Playing Jeopardy perhaps seems like a frivolous task, and it’s tempting to think that Watson was primarily a public relations coup for IBM. In fact, Watson is a remarkable technological achievement. Jeopardy questions use all sorts of subtle hints, humour and obscure references. For Watson to compete with the best humans it needed to integrate an enormous range of knowledge, as well as many of the best existing ideas in natural language processing and artificial intelligence.

In this post I describe one of Watson’s ancestors, a question-answering system called AskMSR built in 2001 by researchers at Microsoft Research. AskMSR is much simpler than Watson, and doesn’t work nearly as well, but does teach some striking lessons about question-answering systems. My account of AskMSR is based on a paper by Brill, Lin, Banko, Dumais and Ng, and a followup paper by Brill, Dumais and Banko.

The AskMSR system was developed for a US-Government-run workshop called TREC. TREC is an annual multi-track workshop, with each track concentrating on a different information retrieval task. For each TREC track the organizers pose a challenge to participants in the run-up to the workshop. Participants build systems to solve those challenges, the systems are evaluated by the workshop organizers, and the results of the evaluation are discussed at the workshop.

For many years one of the TREC tracks was question answering (QA), and the AskMSR system was developed for the QA track at the 2002 TREC workshop. At the time, many of the systems submitted to TREC’s QA track relied on complex linguistic analysis to understand as much as possible about the meaning of the questions being asked. The researchers behind AskMSR had a different idea. Instead of doing sophisticated linguistic analysis of questions they decided to do a much simpler analysis, but to draw on a rich database of knowledge to find answers – the web. As they put it:

In contrast to many question answering systems that begin with rich linguistic resources (e.g., parsers, dictionaries, WordNet), we begin with data and use that to drive the design of our system. To do this, we first use simple techniques to look for answers to questions on the Web.

As I describe in detail below, their approach was to take the question asked, to rewrite it in the form of a search engine query, or perhaps several queries, and then extract the answer by analysing the Google results for those queries.

The insight they had was that a large and diverse document collection (such as the web) would contain clear, easily extractable answers to a very large number of questions. Suppose, for example, that your QA system is trying to answer the question “Who killed Abraham Lincoln?” Suppose also that the QA system only received limited training data, including a text which read “Abraham Lincoln’s life was ended by John Wilkes Booth”. It requires a sophisticated analysis to understand that this means Booth killed Lincoln. If that were the only training text related to Lincoln and Booth, the system might not make the correct inference. On the other hand, with a much large document collection it would be likely that somewhere in the documents it would plainly say “John Wilkes Booth killed Abraham Lincoln”. Indeed, if the document collection were large enough this phrase (and close variants) might well be repeated many times. At that point it takes much less sophisticated analysis to figure out that “John Wilkes Booth” is a good candidate answer. As Brill et al put it:

[T]he greater the answer redundancy in the source, the more likely it is that we can find an answer that occurs in a simple relation to the question, and therefore, the less likely it is that we will need to resort to solving the aforementioned difficulties facing natural language processing systems.

To sum it up even more succinctly, the idea behind AskMSR was that: unsophisticated linguistic algorithms + large amounts of data sophisticated linguistic algorithms + only a small amount of data.

I’ll now describe how AskMSR worked. I limit my discussion to just a single type of question, questions of the form “Who…?” The AskMSR system could deal with several different question types (“When…?”, etc), but the ideas used for the other question types were similar. More generally, my description is an abridgement and simplification of the original AskMSR system, and you should refer to the original paper for more details. The benefit of my simplified approach is that we’ll be able to create a short Python implementation of the ideas.

How AskMSR worked

Rewriting: Suppose we’re asked a question like “Who is the richest person in the world?” If we’re trying to use Google ourselves to answer the question, then we might simply type the question straight in. But Google isn’t (so far as I know) designed with question-answering in mind. So if we’re a bit more sophisticated we’ll rewrite the text of the question to turn it into a query more likely to turn up answers, queries like richest person world or world's richest person. What makes these queries better is that they omit terms like “Who” which are unlikely to appear in answers to the question we’re interested in.

AskMSR used a similar idea of rewriting, based on very simple rules. Consider “Who is the richest person in the world?” Chances are pretty good that we’re looking for a web page with text of the form “the richest person in the world is *”, where * denotes the (unknown) name of that person. So a good strategy is to search for pages matching the text “the richest person in the world is” exactly, and then to extract whatever name comes to the right.

There are a few things to note about how we rewrote the question to get the search query.

Most obviously, we eliminated “Who” and the question mark.

A little less obviously, we moved the verb “is” to the end of the phrase. Moving the verb in this way is common when answering questions. Unfortunately, we don’t always move the verb in the same way. Consider a question such as “Who is the world’s richest person married to?” The best way to move the verb is not to the end of the sentence, but rather to move it between “the world’s richest person” and “married to”, so we are searching for pages matching the text “the world’s richest person is married to *”.

More generally, we’ll rewrite questions of the form Who w0 w1 w2 w3 ... (where w0 w1 w2 ... are words) as the following search queries (note that the quotes matter, and are part of the query):

"w0 w1 w2 w3 ... "
"w1 w0 w2 w3 ... "
"w1 w2 w0 w3 ... "
...

It’s a carpet-bombing strategy, trying all possible options for the correct place for the verb in the text. Of course, we end up with many ridiculous options, like "the world's is richest person married to". However, according to Brill et al, “While such an approach results in many nonsensical rewrites… these very rarely result in the retrieval of bad pages, and the proper movement position is guaranteed to be found via exhaustive search.”

Why does this very rarely result in the retrieval of bad pages? Ultimately, the justification must be empirical. However, a plausible story is that rewrites such as "the world's is richest person married to" don’t make much sense, and so are likely to match only a very small collection of webpages compared to (say) "world's richest person is married to". Because of this, we will see few repetitions in any erroneous answers extracted from the search results for the nonsensical query, and so the nonsensical query is unlikely to result in incorrect answers.

Problems for the author

Contrary to the last paragraph, I think it’s likely that sometimes the rewritten questions do make sense as search queries, but not as ways of searching for answers to the original question. What are some examples of such questions?

Another feature of the queries above is that they are all quoted. Google uses this syntax to specify that an exact phrase match is required. Below we’ll introduce an unquoted rewrite rule, meaning that an exact phrase match isn’t required.

With the rules I’ve described above we rewrite “Who is the world’s richest person?” as both the query “is the world’s richest person” and the query “the world’s richest person is”. There are several other variations, but for now we’ll focus on just these two. For the first query, “is the world’s richest person”, notice that if we find this text in a page then it’s likely to read something like “Bill Gates is the world’s richest person” (or perhaps “Carlos Slim is…”, since Slim has recently overtaken Gates in wealth). If we search on the second query, then returned pages are more likely to read “the world’s richest person is Bill Gates”.

What this suggests is that accompanying the rewritten query, we should also specify whether we expect the answer to appear on the left (L) or the right (R). Brill et al adopted the rule to expect the answer on the left when the verb starts the query, and on the right when the verb appears anywhere else. With this change the specification of the rewrite rules becomes:

Brill et al don’t justify this choice of locations. I haven’t thought hard about it, but the few examples I’ve tried suggest that it’s not a bad convention, although I’ll bet you could come up with counterexamples.

For some rewritten queries Brill allow the answer to appear on either (E) side. In particular, they append an extra rewrite rule to the above list which is:

[w1 w2 w3..., E]

This differs in two ways to the earlier rewrite rules. First, the search query is unquoted, so it doesn’t require an exact match, or for the order of the words to be exactly as specified. Second, the verb is entirely omitted. For these reasons it makes sense to look for answers on either side of the search text.

Of course, if we extract a candidate answer from this less precise search then we won’t be as confident in the answer. For that reason we also assign a score to each rewrite rule. Here’s the complete set of rewrite rules we use, including scores. (I’ll explain how we use the scores a little later: for now all that matters is that higher scores are better.)

Problems for the author

At the moment we’ve implicitly assumed that the verb is a single word after “Who”. However, sometimes the verb will be more complex. For example, in the sentence “Who shot and killed President Kennedy?” the verb is “shot and killed”. How can we identify such complex compound verbs?

Extracting candidate answers: We submit each rewritten query to Google, and extract Google’s document summaries for the top 10 search results. We then split the summaries into sentences, and each sentence is assigned a score, which is just the score of the rewrite rule the sentence originated from.

How should we extract candidate answers from these sentences? First, we remove text which overlaps the query itself – we’re looking for terms near the query, not the same as the query! This ensures that we don’t answer questions such as “Who killed John Fitzgerald Kennedy?” with “John Fitzgerald Kennedy”. We’ll call the text that remains after deletion a truncated sentence.

With the sentence truncated, the idea then is to look for 1-, 2-, and 3-word strings (n-grams) which recur often in the truncated sentences. To do this we start by listing every n-gram that appears in at least one truncated sentence. This is our list of candidate answers. For each such n-gram we’ll assign a total score. Roughly speaking, the total score is the sum of the scores of all the truncated sentences that the n-gram appears in.

In fact, we can improve the performance of the system by modifying this scoring procedure. We do this by boosting the score of a sentence if one or more words in the n-gram are capitalized. We do this because capitalization likely indicates a proper noun, which means the word is an especially good candidate to be part of the correct answer. The system outputs the n-gram which has the highest total score as its preferred answer.

Toy program: Below is a short Python 2.7 program which implements the algorithm described above. (It omits the left-versus-right-versus-either distinction, though.) The code is also available on GitHub, together with a small third-party library (via Mario Vilas) that’s used to access Google search results. Here’s the code:

#### mini_qa.py
#
# A toy question-answering system, which uses Google to attempt to
# answer questions of the form "Who... ?" An example is: "Who
# discovered relativity?"
#
# The design is a simplified version of the AskMSR system developed by
# researchers at Microsoft Research. The original paper is:
#
# Brill, Lin, Banko, Dumais and Ng, "Data-Intensive Question
# Answering" (2001).
#
# I've described background to this program here:
#
# http://michaelnielsen.org/ddi/how-to-answer-a-question-v1/
#### Copyright and licensing
#
# MIT License - see GitHub repo for details:
#
# https://github.com/mnielsen/mini_qa/blob/blog-v1/mini_qa.py
#
# Copyright (c) 2012 Michael Nielsen
#### Library imports
# standard library
from collections import defaultdict
import re
# third-party libraries
from google import search
def pretty_qa(question, num=10):
"""
Wrapper for the `qa` function. `pretty_qa` prints the `num`
highest scoring answers to `question`, with the scores in
parentheses.
"""
print "\nQ: "+question
for (j, (answer, score)) in enumerate(qa(question)[:num]):
print "
def qa(question):
"""
Return a list of tuples whose first entry is a candidate answer to
`question`, and whose second entry is the score for that answer.
The tuples are ordered in decreasing order of score. Note that
the answers themselves are tuples, with each entry being a word.
"""
answer_scores = defaultdict(int)
for query in rewritten_queries(question):
for summary in get_google_summaries(query.query):
for sentence in sentences(summary):
for ngram in candidate_answers(sentence, query.query):
answer_scores[ngram] += ngram_score(ngram,
query.score)
return sorted(answer_scores.iteritems(),
key=lambda x: x[1],
reverse=True)
def rewritten_queries(question):
"""
Return a list of RewrittenQuery objects, containing the search
queries (and corresponding weighting score) generated from
`question`.
"""
rewrites = []
tq = tokenize(question)
verb = tq[1] # the simplest assumption, something to improve
rewrites.append(
RewrittenQuery("\"
for j in range(2, len(tq)):
rewrites.append(
RewrittenQuery(
"\"
" ".join(tq[2:j+1]), verb, " ".join(tq[j+1:])),
5))
rewrites.append(RewrittenQuery(" ".join(tq[2:]), 2))
return rewrites
def tokenize(question):
"""
Return a list containing a tokenized form of `question`. Works by
lowercasing, splitting around whitespace, and stripping all
non-alphanumeric characters.
"""
return [re.sub(r"\W", "", x) for x in question.lower().split()]
class RewrittenQuery():
"""
Given a question we rewrite it as a query to send to Google.
Instances of the RewrittenQuery class are used to store these
rewritten queries. Instances have two attributes: the text of the
rewritten query, which is sent to Google; and a score, indicating
how much weight to give to the answers. The score is used because
some queries are much more likely to give highly relevant answers
than others.
"""
def __init__(self, query, score):
self.query = query
self.score = score
def get_google_summaries(query):
"""
Return a list of the top 10 summaries associated to the Google
results for `query`. Returns all available summaries if there are
fewer than 10 summaries available. Note that these summaries are
returned as BeautifulSoup.BeautifulSoup objects, and may need to
be manipulated further to extract text, links, etc.
"""
return search(query)
def sentences(summary):
"""
Return a list whose entries are the sentences in the
BeautifulSoup.BeautifulSoup object `summary` returned from Google.
Note that the sentences contain alphabetical and space characters
only, and all punctuation, numbers and other special characters
have been removed.
"""
text = remove_spurious_words(text_of(summary))
sentences = [sentence for sentence in text.split(".") if sentence]
return [re.sub(r"[^a-zA-Z ]", "", sentence) for sentence in sentences]
def text_of(soup):
"""
Return the text associated to the BeautifulSoup.BeautifulSoup
object `soup`.
"""
return ".join(str(soup.findAll(text=True)))
def remove_spurious_words(text):
"""
Return `text` with spurious words stripped. For example, Google
includes the word "Cached" in many search summaries, and this word
should therefore mostly be ignored.
"""
spurious_words = ["Cached", "Similar"]
for word in spurious_words:
text = text.replace(word, "")
return text
def candidate_answers(sentence, query):
"""
Return all the 1-, 2-, and 3-grams in `sentence`. Terms appearing
in `query` are filtered out. Note that the n-grams are returned
as a list of tuples. So a 1-gram is a tuple with 1 element, a
2-gram is a tuple with 2 elements, and so on.
"""
filtered_sentence = [word for word in sentence.split()
if word.lower() not in query]
return sum([ngrams(filtered_sentence, j) for j in range(1,4)], [])
def ngrams(words, n=1):
"""
Return all the `n`-grams in the list `words`. The n-grams are
returned as a list of tuples, each tuple containing an n-gram, as
per the description in `candidate_answers`.
"""
return [tuple(words[j:j+n]) for j in xrange(len(words)-n+1)]
def ngram_score(ngram, score):
"""
Return the score associated to `ngram`. The base score is
`score`, but it's modified by a factor which is 3 to the power of
the number of capitalized words. This biases answers toward
proper nouns.
"""
num_capitalized_words = sum(
1 for word in ngram if is_capitalized(word))
return score * (3**num_capitalized_words)
def is_capitalized(word):
"""
Return True or False according to whether `word` is capitalized.
"""
return word == word.capitalize()
if __name__ == "__main__":
pretty_qa("Who ran the first four-minute mile?")
pretty_qa("Who makes the best pizza in New York?")
pretty_qa("Who invented the C programming language?")
pretty_qa("Who wrote the Iliad?")
pretty_qa("Who caused the financial crash of 2008?")
pretty_qa("Who caused the Great Depression?")
pretty_qa("Who is the most evil person in the world?")
pretty_qa("Who wrote the plays of Wiliam Shakespeare?")
pretty_qa("Who is the world's best tennis player?")
pretty_qa("Who is the richest person in the world?")

If you run the program you’ll see that the results are a mixed bag. When I tested it, it knew that Roger Bannister ran the first four-minute mile, that Dennis Ritchie invented the C programming language, and that Homer wrote the Iliad. On the other hand, the program sometimes gives answers that are either wrong or downright nonsensical. For example, it thinks that Occupy Wall Street caused the financial crash of 2008 (Ben Bernanke also scores highly). And it replies “Laureus Sports Awards” when asked who is the world’s best tennis player. So it’s quite a mix of good and bad results.

While developing the program I used some questions repeatedly to figure out how to improve perfomance. For example, I often asked it the questions “Who invented relativity?” and “Who killed Abraham Lincoln?” Unsuprisingly, the program now answers both questions correctly! So to make things fairer the questions used as examples in the code aren’t ones I tested while developing the program. They’re still far from a random sample, but at least the most obvious form of bias has been removed.

Much can be done to improve this program. Here are a few ideas:

Problems

To solve many of the problems I describe below it would help to have a systematic procedure to evaluate the performance of the system, and, in particular, to compare the performance of different versions of the system. How can we build such a systematic evaluation procedure?

The program does no sanity-checking of questions. For example, it simply drops the first word of the question. As a result, the question “Foo killed Abraham Lincoln?” is treated identically to “Who killed Abraham Lincoln?” Add some basic sanity checks to ensure the question satisfies standard constraints on formatting, etc.

More generally, the program makes no distinction between sensible and nonsensical questions. You can ask it “Who coloured green ideas sleeping furiously?” and it will happily answer. (Perhaps appropriately, “Chomsky” is one of the top answers.) How might the program figure out that the question doesn’t make sense? (This is a problem shared by many other systems. Here is to the question, and also Google’s response.)

The program boosts the score of n-grams containing capitalized words. This makes sense most of the time, since a capitalized word is likely to be a proper noun, and is thus a good candidate to be the answer to the question. But it makes less sense when the word is the first word of the sentence. What should be done then?

In the sentences extracted from Google we removed terms which appeared in the original query. This means that, for example, we won’t answer “Who killed John Fitzgerald Kennedy?” with “John Fitzgerald Kennedy”. It’d be good to take this further by also eliminating synonyms, so that we don’t answer “JFK”. How can this be done?

There are occasional exceptions to the rule that answers don’t repeat terms in the question. For example, the correct answer to the question “Who wrote the plays of William Shakespeare?” is, most likely, “William Shakespeare”. How can we identify questions where this is likely to be the case?

How does it change the results to use more than 10 search summaries? Do the results get better or worse?

An alternative approach to using Google would be to use Bing, Yahoo! BOSS, Wolfram Alpha, or Cyc as sources. Or we could build our own source using tools such as Nutch and Solr. How do the resulting systems compare to one another? Is it possible to combine the sources to do better than any single source alone?

Some n-grams are much more common than others: the phrase “he was going” occurs far more often in English text than the phrase “Lee Harvey Oswald”, for example. Because of this, chances are that repetitions of the phrase “Lee Harvey Oswald” in search results are more meaningful than repetitions of more common phrases, such as “he was going”. It’d be natural to modify the program to give less common phrases a higher score. What’s the right way of doing this? Does using inverse document frequency give better performance, for example?

Question-answering is really three problems: (1) understanding the question; (2) figuring out the answer; and (3) explaining the answer. In this post I’ve concentrated on (2) and (to some extent) (1), but not (3). How might we go about explaining the answers generated by our system?

Discussion

Let’s come back to the idea I described in the opening: unsophisticated algorithms + large amounts of data sophisticated linguistic algorithms + only a small amount of data.

[I]nvariably, simple models and a lot of data trump more elaborate models based on less data.

They were writing in the context of machine translation and speech recognition, but this point of view has become commonplace well beyond machine translation and speech recognition. For example, at the 2012 O’Reilly Strata California conference on big data, there was a debate on the idea that “[i]n data science, domain expertise is more important than machine learning skill.” The people favouring the machine learning side won the debate, at least in the eyes of the audience. Admittedly, a priori you’d expect this audience to strongly favour machine learning. Still, I expect that just a few years earlier even a highly sympathetic audience would have tilted the other way.

An interesting way of thinking about this idea that data trumps the quality of your models and algorithms is in terms of a tradeoff curve:

The curve shown is one of constant performance: as you increase the amount of training data, you decrease the quality of the algorithm required to get the same performance. The tradeoff between the two is determined by the slope of the curve.

A difficulty with the “more data is better” point of view is that it’s not clear how to determine what the tradeoffs are in practice: is the slope of the curve very shallow (more data helps more than better algorithms), or very steep (better algorithms help more than more data). To put it another way, it’s not obvious whether to focus on acquiring more data, or on improving your algorithms. Perhaps the correct moral to draw is that this is a key tradeoff to think about when deciding how to allocate effort. At least in the case of the AskMSR system, taking the more data idea seriously enabled the team to very quickly build a system that was competitive with other systems which had taken much longer to develop.

A second difficulty with the more data is better point of view is, of course, that sometimes more data is exremely expensive to generate. IBM has said that it intends to use Watson to help doctors answer medical questions. Yet for such highly specialized questions it’s not so clear that it will be easy to find data sources which meet the criteria outlined by Brill et al:

[T]he greater the answer redundancy in the source, the more likely it is that we can find an answer that occurs in a simple relation to the question, and therefore, the less likely it is that we will need to resort to solving the aforementioned difficulties facing natural language processing systems.

Data versus understanding: I’ll finish by briefly discussing one possible criticism of the data-driven approach to question-answering. That criticism is that it rejects detailed understanding in favour of a simplistic data-driven associative model. It’s tempting to think that this must therefore be the wrong track, a blind alley that may pay off in the short term, but which will ultimately be unfruitful. I think that point of view is shortsighted. We don’t yet understand how human cognition works, but we do know that we often use after-the-fact rationalization and confabulation, suggesting that the basis for many of our decisions isn’t a detailed understanding, but something more primitive. This isn’t to say that rationalization and confabulation are desirable. But it does mean that it’s worth pushing beyond narrow conceptions of understanding, and trying to determine how much of our intelligence can be mimicked using simple data-driven associative techniques, and by combining such techniques with other ideas.

On my first day of physics graduate school, the professor in my class on electromagnetism began by stepping to the board, and wordlessly writing four equations:

He stepped back, turned around, and said something like [1]: “These are Maxwell’s equations. Just four compact equations. With a little work it’s easy to understand the basic elements of the equations – what all the symbols mean, how we can compute all the relevant quantities, and so on. But while it’s easy to understand the elements of the equations, understanding all their consequences is another matter. Inside these equations is all of electromagnetism – everything from antennas to motors to circuits. If you think you understand the consequences of these four equations, then you may leave the room now, and you can come back and ace the exam at the end of semester.”

Alan Kay has famously described Lisp as the “Maxwell’s equations of software”. He describes the revelation he experienced when, as a graduate student, he was studying the LISP 1.5 Programmer’s Manual and realized that “the half page of code on the bottom of page 13… was Lisp in itself. These were “Maxwell’s Equations of Software!” This is the whole world of programming in a few lines that I can put my hand over.”

Here’s the half page of code that Kay saw in that manual:

What we’re going to do in this essay is understand what that half page of code means, and what it means that Lisp is the Maxwell’s equations of software. However, we won’t literally work through the half page of code above. Instead, we’ll do something much more informative: we’ll create a modern, fully executable equivalent of the code above. Furthermore, to make this essay accessible, I won’t assume that you know Lisp. Instead, I’ll teach you the basic elements of Lisp.

That perhaps sounds over-ambitious, but the good news is that it’s easy to learn the basic elements of Lisp. Provided you have a little facility with computer programming and comfort with mathematics, you can learn how Lisp works in just a few minutes. Frankly, it’s much easier than understanding the elements of Maxwell’s equations! And so I’ll start by explaining a subset of the Lisp programming language, and getting you to write some Lisp code.

But I won’t stop with just showing you how to write some Lisp. Once we’ve done that we’re going to write an interpreter for Lisp code. In particular, we’ll create a interpreter based on a beautiful Lisp interpreter written by Peter Norvig, which contains just 90 lines of Python code. Our interpreter will be a little more complex, due mostly to the addition of a few conveniences absent from Norvig’s interpreter. The code is still simple and easy to understand, provided you’re comfortable reading Python code. As we’ll see, the benefit of writing the interpreter is not just that it gives us a running interpreter (although that’s no small thing). It’s that writing an interpreter also deepens our understanding of Lisp. It does that by taking what would otherwise be some rather abstract concepts in our description of Lisp, and giving them concrete, tangible representations in terms of Python code and data structures. By making concrete what was formerly abstract, the code for our Lisp interpreter gives us a new way of understanding how Lisp works.

With our Python Lisp interpreter up and running, we’ll then write a modern equivalent to the code on the bottom of page 13 of the LISP 1.5 Programmer’s Manual. But while our code will be essentially the same as the code from page 13, it will have the considerable advantage that it’s also executable. We can, if we wish, play with the code, modify it, and improve it. In other words, it’s a living version of the Maxwell’s equations of software! Furthermore, with our new understanding it becomes an easy and fun exercise to understand all the details on page 13 of the LISP Manual.

This second part of the essay is based primarily on two sources: the first chapter of the LISP 1.5 Manual, of course, but also an essay by Paul Graham (postscript) in which he explains some of the early ideas behind Lisp. Incidentally, “LISP” is the capitalization used in the LISP Manual, but otherwise I’ll use the modern capitalization convention, and write “Lisp”.

The great Norwegian mathematician Niels Henrik Abel was once asked how he had become so good at mathematics. He replied that it was “by studying the masters, not their pupils”. The current essay is motivated by Abel’s admonishment. As a programmer, I’m a beginner (and almost completely new to Lisp), and so this essay is a way for me to work in detail through ideas from masters such as Alan Kay, Peter Norvig, and Paul Graham. Of course, if one takes Abel at his word, then you should stop reading this essay, and instead go study the works of Kay, Norvig, Graham, and the like! I certainly recommend taking the time to study their work, and at the end of the essay I make some recommendations for further reading. However, I hope that this essay has a distinct enough point of view to be of interest in its own right. Above all, I hope the essay makes thinking about Lisp (and programming) fun, and that it raises some interesting fundamental questions. Of course, as a beginner, the essay may contain some misunderstandings or errors (perhaps significant ones), and I’d welcome corrections, pointers, and discussion.

Some elements of Lisp

In this and the next two sections we’ll learn the basic elements of Lisp – enough to develop our own executable version of what Alan Kay saw on page 13 of the LISP Manual. We’ll call the dialect of Lisp we develop “tiddly Lisp”, or just tiddlylisp for short. Tiddlylisp is based on a subset of the programming language Scheme, which is one of the most popular modern dialects of Lisp.

While I can show you a bunch of examples of Lisp, you’ll learn much more if you type in the examples yourself, and then play around with them, modifying them and trying your own ideas. So I want you to download the file tiddlylisp.py to your local machine. Alternately, if you’re using git, you can just clone the entire code repository associated to this essay. The file tiddlylisp.py is the Lisp interpreter whose design and code we’ll work through later in the essay. On Linux and Mac you can start the tiddlylisp interpreter by typing python tiddlylisp.py from the command line. You should see a prompt:

tiddlylisp>

That’s where you’re going to type in the code from the examples – it’s an interactive Lisp interpreter. You can exit the interpreter at any time by hitting Ctrl-C. Note that the interpreter isn’t terribly complete – as we’ll see, it’s only 153 lines! – and one way it’s incomplete is that the error messages aren’t very informative. Don’t spend a lot of time worrying about the errors, just try again.

If you’re on Windows, and don’t already have Python installed, you’ll need to download it (I suggest Python 2.7 for this code), before starting the interpreter by running python tiddlylisp.py. Note that I’ve only tested the interpreter on Ubuntu Linux, not on Windows or the Mac.

As our first example, type the following code into the tiddlylisp interpreter:

tiddlylisp> (+ 2 3)
5

In the expression you typed in, + is a built-in procedure, which represents the addition operation. It’s being applied to two arguments, in this case the numbers 2 and 3. The interpreter evaluates the result of applying + to 2 and 3, i.e., of adding 2 and 3 together, returning the value5, which it then prints.

That first example was extremely simple, but it contains many of the concepts we need to understand Lisp: expressions, procedures, arguments, evaluation, returning a value, and printing. We’ll see many more illustrations of these ideas below.

Here’s a second example, illustrating another built-in procedure, this time the multiplication procedure, *:

tiddlylisp> (* 3 4)
12

It’s the same basic story: * is a built-in procedure, this time representing multiplication, and is here being applied to the numbers 3 and 4. The interpreter evaluates the expression, and prints the value returned, which is 12.

One potentially confusing thing about the first two examples is that I’ve called + and * “procedures”, yet in many programming languages procedures don’t return a value, only functions do. I’m using this terminology because it’s the standard terminology in the programming language Scheme, which is the dialect of Lisp that tiddlylisp is based on. In fact, in some modern dialects of Lisp – such as Common Lisp – operations such as + and * would be called “functions”. But we’ll stick with Scheme’s useage, and talk only of procedures.

Another example, illustrating a slightly different type of procedure:

tiddlylisp> (< 10 20)
True

Here, the built-in procedure < represents the comparison operator. And so tiddlylisp prints the result of evaluating an expression which compares the constants 10 and 20. The result is True, since 10 is less than 20. By contrast, we have

tiddlylisp> (< 17 12)
False

since 17 is not less than 12.

Many Lisp beginners are initially confused by this way of writing basic numerical operations. We're so familiar with expressions such as 2 + 3 that the changed order in (+ 2 3) appears strange and unfamiliar. And yet our preference for the infix notation 2 + 3 over the prefix notation (+ 2 3) is more a historical accident than a consequence of anything fundamental about arithmetic. Unfortunately, this has the consequence that some people back away from Lisp, simply because they dislike thinking in this unfamiliar way.

In this essay we won't get deep enough into Lisp to see in a lot of concrete detail why the prefix notation is a good idea. However, we will get one hint: our tiddlylisp interpreter will be much simpler for using the same (prefix) style for all procedures. If you really strongly dislike the prefix notation, then I challenge you to rewrite tiddlylisp so that it uses infix notation for some operations, and prefix notation for others. What you'll find is that the interpreter becomes quite a bit more complex. And so there is a sense in which using the same prefix style everywhere makes Lisp simpler.

Another useful thing Lisp allows us to do is to nest expressions:

tiddlylisp> (< (* 11 19) (* 9 23))
False

To compute the value of this expression, tiddlylisp evaluates the nested expressions, returning 209 from (* 11 19), and 207 from (* 9 23). The output from the outer expression is then just the result of evaluating (< 209 207), which of course is False.

We can use tiddlylisp to define variables:

tiddlylisp> (define x 3)
tiddlylisp> (* x 2)
6

define is used to define new variables; we can also assign a value to a variable which has previously been defined:

tiddlylisp> (set! x 4)
tiddlylisp> x
4

One question you may have here is about the slightly unusual syntax: set!. You might wonder whether the exclamation mark means something special - maybe set! is some type of hybrid procedure or something. Actually, there's no such complexity: set! is just a single keyword, just like define, nothing special about it at all. It's just that tiddlylisp allows exclamation marks in keyword names. So there's nothing special going on indicated by the exclamation mark.

A deeper question is why we don't simply eleminate define, and make it so set! checks whether a variable has been defined already, and if not, does so. I'll explain a little bit later why we don't do this; for now, you should just note the distinction.

We can use define to define new procedures in a way similar to how we use it to define variables. Here's an example showing how to define a procedure named square. I'll unpack what's going on below, but first, here's the code,

Ignore the first of the lines above for now. You can see from the second line what the procedure square does: it takes a single number as input, and returns the square of the number.

What about the first line of code above? We already know enough to guess quite a bit about how this line works: a procedure named square is being defined, and is assigned the value of the expression (lambda (x) (* x x)), whatever that value might be. What's new, and what we need to understand, is what the value of the expression (lambda (x) (* x x)) is. To understand this, let's break the expression into three pieces: lambda, (x), and (* x x). We'll understand the three pieces separately, and then put them back together.

The first piece of the expression - the lambda - simply tells the tiddlylisp interpreter that this expression is defining a procedure. I must admit that when I first encountered the lambda notation I found this pretty confusing [2] - I thought that lambda must be a variable, or an argument, or something like that. But no, it's just a big fat red flag to the tiddlylisp interpreter saying "Hey, this is a procedure definition". That's all lambda is.

The second part of the expression - the (x) - tells tiddlylisp that this is a procedure with a single argument, and that we're going to use the temporary name x for that argument, for the purposes of defining the procedure. If the procedure definition had started instead with (lambda (x y) ...) that would have meant that the procedure had two arguments, temporarily labelled x and y for the purposes of defining the procedure.

The third part of the expression - the (* x x) - is the meat of the procedure definition. It's what we evaluate and return when the procedure is called, with the actual values for the arguments of the procedure substituted in place of x.

Taking it all together, then, the value of the expresion (lambda (x) (* x x)) is just a procedure with a single input, and returning the square of that input. This procedure is anonymous, i.e., it doesn't have a name. But we can give it a name by using define, and so the line

tiddlylisp> (define square (lambda (x) (* x x)))

tells tiddlylisp to define something called square, whose value is a procedure (because of the lambda) with a single argument (because of the (x)), and what that procedure returns is the square of its argument (because of the (* x x)).

An important point about the variables used in defining procedures is that they're dummy variables. Suppose we wanted to define a procedure area which would return the area of a triangle, with arguments the base of the triangle and the height. We could do this using the following procedure definition:

It probably won't surprise you to learn that inside the procedure definition, i.e., immediately after the lambda (b h), the h is treated as a dummy variable, and is entirely different from the h outside the procedure definition. It has what's called a different scope. And so we can continue the above with the following

tiddlylisp> (area 3 h)
16.5

that is, the area is just 0.5 times 3 times the value of h set earlier, outside the procedure definition. At that point the value of h was 11, and so (area 3 h) returns 16.5.

There's a variation on the above that you might wonder about, which is what happens when you use variables defined outside a procedure inside that procedure definition? For instance, suppose we have:

What happens now if we evaluate foo? Well, tiddlylisp does the sensible thing, and interprets x as it was defined outside the procedure definition, so we have:

tiddlylisp> (foo 4)
12

What happens to our procedure if we next change the value of x? In fact, this changes foo:

tiddlylisp> (set! x 5)
tiddlylisp> (foo 4)
20

In other words, in the procedure definition lambda (y) (* x y) the x really does refer to the variable x, and not to the particular value x might have at any given point in time.

Let's dig down a bit more into how tiddlylisp handles scope and dummy variables. Let's look at what happens when we define a variable:

tiddlylisp> (define x 5)
tiddlylisp> x
5

The way the interpreter handles this internally is that it maintains what's called an environment: a dictionary whose keys are the variable names, and whose values are the corresponding variable values. So what the interpeter does when it sees the first line above is add a new key to the environment, "x", with value 5. When the interpreter sees the second line, it consults the environment, looks up the key x, and returns the corresponding value. You can, if you like, think of the environment as the interpreter's memory or data store, in which it stores all the details of the variables defined to date.

All this is pretty simple. We can go along defining and changing variables, and the interpreter just keeps consulting and modifying the environment as necessary. But when you define a new procedure using lambda the interpreter treats the variables used in the definition slightly differently. Let's look at an example:

What I want to concentrate on here is the procedure definition (lambda (b) (* b h)). Up to this point the interpreter had been chugging along, modifying its environment as appropriate. What happens when it sees the (lambda ...), though, is that the interpreter creates a new environment, an environment called an inner environment, as distinguished from the outer environment, which is what the interpreter has been operating in until it reached the (lambda ...) statement. The inner environment is a new dictionary, whose keys initially are just the arguments to the procedure - in this case a single key, b - and whose values will be supplied when the procedure is called.

To recap, what the interpreter does when it sees (lambda (b) (* b h)) is create a new inner environment, with a key b whose value will be set when the procedure is called. When evaluating the expression (* b h) (which defines the result returned from the procedure) what the interpreter does is first consult the inner environment, where it finds the key b, but not the key h. When it fails to find h, it looks instead for h in the outer environment, where the key h has indeed been defined, and retrieves the appropriate value.

I've described a simple example showing how environments work, but tiddlylisp also allows us to have procedure definitions nested inside procedure definitions, nested inside procedure definitions (and so on). To deal with this, the top-level tiddlylisp interpreter operates inside a global environment, and each procedure definition creates a new inner environment, perhaps nested inside a previously created inner environment, if that's appropriate.

If all this talk about inner and outer environments has left you confused, fear not. At this stage it really is important to have gotten the gist of how environments work, but you shouldn't worry if the details still seem elusive. In my opinion, the best way to understand those details is not through abstract discussion, but instead by looking at the working code for tiddlylisp. We'll get to that shortly, but for now can move onwards armed with a general impression of how environments work.

The procedure definitions I've described so far evaluate and return the value from just a single expression. We can use such expressions to achieve suprisingly complex things, because of the ability to nest expressions. Still, it would be convenient to have a way of chaining together expressions that doesn't involve nesting. A way of doing this is to use the begin keyword. begin is especially helpful in defining complex procedures, and so I'll give an example in that context:

This line is defining a procedure called area, with a single argument r, and where the value of (area r) is just the value of the expression

(begin (define pi 3.14) (* pi (* r r)))

with the appropriate value for r substituted. The way tiddlylisp evaluates the begin expression above is that it evaluates all the separate sub-expressions, consecutively in the order they appear, and then returns the value of the final sub-expression, in this case the sub-expression (* pi (* r r)). So, for example, we get

tiddlylisp> (area 3)
28.26

which is just the area of a circle with radius 3.

Now, in a simple example such as this you might argue that it makes more sense to avoid defining pi, and just put 3.14 straight into the later expression. However, doing so will make the intent of the code less clear, and I trust you will find it easy to imagine more complex cases where it makes even more sense to use multi-part begin expressions. This is especially the case since it is permissible to split Lisp expressions over multiple lines, so the above procedure definition could have been written:

(define area (lambda (r)
(begin (define pi 3.14)
(* pi (* r r)))))

Now, I should mention that if you enter the above into the tiddlylisp interpreter, you'll get errors. This is because the tiddlylisp interpreter is so stripped down that it doesn't allow multi-line inputs. However, tiddlylisp will allow you to load multi-line expressions from a file - try saving the above three lines in a file named area.tl, and then running that file with python tiddlylisp.py area.tl. Tiddlylisp will execute the code, defining the area procedure, and then leave you in the interpreter, where you can type things like:

tiddlylisp> (area 3)
28.26

A second advantage of using begin in the above code is that the variable pi is only defined in the inner environment associated to the lambda expression. If, for some reason, you want to define pi differently outside that scope, you can do so, and it won't be affected by the definition of area. Consider, for example, the following sequence of expressions:

As you can see by comparing the final line of this example to our earlier example, there really is a significant difference between define and set!. In particular, when we define area in this example, what set! does is check to see whether the inner environment contains a variable named pi. Since it doesn't, it checks the outer environment, where it does find such a variable, and that's what set! updates. If we'd used define instead it would have created a completely new variable named pi in the inner environment, while leaving pi in the outer environment untouched, so the final line would have returned 3 instead. So having both define and set! gives us quite a bit of flexibility and control over which environment is being used, at the expense of a complication in syntax.

As our final piece of Lisp before putting what we've learnt together to construct a nontrivial example, tiddlylisp includes a keyword if that can be used to test conditions, and conditionally return the value of different expressions. Here's an example showing the use of if to evaluate the absolute value of a variable:

tiddlylisp> (define x -2)
tiddlylisp> (if (> x 0) x (- 0 x))
2

We can use the same idea to define a procedure abs which returns the absolute value of a number:

The general form for if expressions is (if cond result alt), where we evaluate the expression cond, and if the value is True, return the value of result, otherwise return the value of alt. Note that in the expression (if cond result alt), the cond, result and alt of course shouldn't be read literally. Rather, they are placeholders for other expressions, as we saw in the absolute value example above. Through the remainder of this essay I will use use italics to indicate such placeholder expressions.

Let me conclude this section by briefly introducing two important Lisp concepts. The first is the concept of a special form. In this section we discussed several built-in procedures, such as + and *, and also some user-defined procedures. Calls to such procedures always have the form (proc exp1 exp2...), where proc is the procedure name, and the other arguments are expressions. Such expressions are evaluated by evaluating each individual expression exp1, exp2..., and then passing those values to the procedure. However, not all Lisp expressions are procedure calls. Consider the expression (define pi 3.14). This isn't evaluated by calling some procedure define with arguments given by the value of pi and the value of 3.14. It can't be evaluated this way because pi doesn't have a value yet! So define isn't a procedure. Instead, define is an example of what is known as a special form. Other examples of special forms include lambda, begin, and if, and a few more which we'll meet later. Like define, none of these is a procedure, but instead each special forms has its own special rule for evaluation.

The second concept I want to briefly introduce is that of a list. The list is one of the basic data structures used in Lisp, and even gives the language its name - Lisp is short for "list processing". In fact, most of the Lisp expressions we've seen up to now are lists: an expression such as (abs 2) is a two-element list, with elements abs and 2, delimited by spaces and parentheses. A more complex expression such as (define abs (lambda (x) (if (> x 0) x (- 0 x)))) is also a list, in this case with the first two elements being define and abs. The third element, (lambda (x) (if (> x 0) x (- 0 x))), is a sublist which in turn has sublists (and subsublists) of its own. Later we'll see how to use Lisp to do manipulations with such lists.

A nontrivial example: square roots using only elementary arithmetic

Let's put together the ideas above to do something nontrivial. We'll write a short tiddlylisp program to compute square roots, using only the elementary arithmetical operations (addition, subtraction, multiplication, and division). The idea behind the program is to use Newton's method. Although Newton's method is interesting in its own right, I'm not including this example because of its algorithmic elegance. Instead, I'm including it as a simple and beautiful example of a Lisp program. The example comes from Abelson and Sussman's book on the Structure and Interpretation of Computer Programs.

Here's how Newton's method for computing square roots works. Suppose we have a (positive) number x whose square root we wish to compute. Then we start by making a guess at the square root, guess. We're going to start by arbitrarily choosing 1.0 as our initial value for guess; in principle, any positive number will do. According to Newton's method, we'll get an improved guess by computing (guess+x/guess)/2, i.e., by taking the average of guess and x/guess. If we repeat this averaging process enough times, we'll converge to a good estimate of the square root.

Let's express these ideas in Lisp code. We'll do it from the top down, starting at a high level, and gradually filling in the details of all the procedures we need. We start at the absolute top level,

(define sqrt (lambda (x) (sqrt-iter 1.0 x)))

Here, (sqrt-iter guess x) is the value of a to-be-defined procedure sqrt-iter that takes a guess guess at the square root of x, and keeps improving that guess over and over until it's a good enough estimate of the square root. As I mentioned above, we start by arbitrarily choosing guess = 1.0.

What sqrt-iter does is takes the guess and checks to see whether it's yet a good-enough? approximation to the square root of x, in a sense to be made precise below. If it is, then sqrt-iter is done, and returns guess, otherwise it applies sqrt-iter to the improved guess (improve guess x) supplied by applying Newton's method.

Incidentally, the way I've written sqrt-iter above it's another example of a multi-line tiddlylisp expression, and so it's not possible to enter in the tiddlylisp interpreter. However, you can enter it as part of a file, sqrt.tl, along with the definition of sqrt, and other procedures which we'll define below. We'll later use tiddlylisp to execute the file sqrt.tl.

To fill out the details of the above, we need to understand how good-enough? and improve work. Let's start with good-enough?, which simply checks whether the absolute value of the square of guess is within 0.00001 of x:

Save these lines to the file sqrt.tl, and then run it using python tiddlylisp.py sqrt.tl. Tiddlylisp will execute the code, defining all the above procedures, and then leave you in an interpreter, where you can type:

tiddlylisp> (sqrt 2.0)
1.41421568627

This is, indeed, a pretty good approximation to the true square root: 1.4142135....

In writing out the overall program sqrt.tl, I've reversed the order of the lines, compared to my initial explanation. The reason I've done this is worth discussing. You see, the program sqrt.tl was actually the first piece of Lisp code I ever worked though in detail. The natural way to think through the problem of computing square roots with Newton's method is in the order I explained, working in a top-down fashion, starting from the broad problem, and gradually breaking our attack on that problem down into parts. But the first time I wrote the code out I assumed that I needed to explain these ideas to Lisp in a bottom-up fashion, defining procedures like average before improve and so on, so that procedure definitions only include previously defined procedures. That meant reversing the order of the code, as I've shown above.

Taking this approach bugged me. It doesn't seem like the most natural way to think about implementing Newton's method. At least for this problem a top-down approach seems more natural, and if you were just doing exploratory programming you'd start by defining sqrt, then sqrt-iter, and so on. As an experiment, I decided to reverse the order of the progam, so it reflects the natural "thinking order":

Somewhat to my surprise, this ran just fine ! (Actually, my code was slightly different, since I was using Common Lisp at that point, not tiddlylisp, but tiddlylisp works as well.) It's apparently okay to define a procedure in terms of some other procedure which isn't defined until later. Now, upon close examination of the code for tiddlylisp it turns out that this makes perfect sense. But it came as a surprise to me. Later, when we examine the code for tiddlylisp, I'll set a problem asking you to explain why it's okay to re-order the code above.

I think this program for the square root is a very beautiful program. Of course, in most programming languages the square root is built in, and so it's not exactly notable to have an elegant program for it! Indeed, the square root is built in to most version of Lisp, and it's trivially possible to add the square root to tiddlylisp. But what I like about the above is that it's such a simple and natural expression of Newton's method. As Peter Deutsch has said, Lisp programs come close to being "executable mathematics".

Problems for the author

As a general point about programming language design it seems like it would often be helpful to be able to define procedures in terms of other procedures which have not yet been defined. Which languages make this possible, and which do not? What advantages does it bring for a programming language to be able to do this? Are there any disadvantages?

A few more elements of Lisp

I began my introduction to Lisp by focusing on elementary arithmetical operations, and how to put them together. I did that so I could give you a concrete example of Lisp in action - the square root example in the last section - which is a good way of getting a feel for how Lisp works. But if we're going to understand Lisp as "the Maxwell's equations of software" then we also need to understand more about how Lisp deals with expressions and with lists. I'll describe those operations in this section.

Before I describe those operations, I want to discuss a distinction that I haven't drawn much attention to until now. That's the distinction between an expression such as (+ 1 2) and the value of that expression, which in this case is 3. When we feed a (valid) tiddlylisp expression to the tiddlylisp interpreter, it evaluates the expression, and returns its value.

When I was first learning Lisp I'd often get myself in trouble because I failed to keep myself clear on the distinction between an expression and its value. Let me give you an example of the kind of confusion I'd get myself into. There is a built-in Lisp procedure called atom?, defined so that (atom? exp) returns True if the value of the expression exp is atomic - a number, for example, or anything which isn't a list - and returns False otherwise. (Note that exp is a placeholder expression, as we discussed earlier, and shouldn't be read literally as the identifier exp.) Now, I'd look at an example like (atom? 5) and have no trouble figuring out that it evaluated to True. But where I'd get into trouble is with an expression like (atom? (+ 1 2)). I'd look at it and think it must return False, because the expression (+ 1 2) is not atomic, it's a list. Unfortunately, while it's true that the expression (+ 1 2) is not atomic, it's irrelevant, because the value of (atom? exp) is determined by whether the value of exp is atomic - which it is, since the value of (+ 1 2)) is 3. You'll be a much happier Lisper if you always keep very clear on the distinction between expressions and the value of expressions.

Now, of course, this distinction between expressions and their values appears in most other programming languages. For this reason, I felt foolish when I understood what was causing my confusion. But what makes it so easy to make this kind of mistake is that in Lisp both expressions and the value of those expressions can be lists. And that's why when I write something like (atom? (+ 1 2)) there's a real question: is atom? checking whether the expression (+ 1 2) is atomic (no, it's not, it's a list), or whether the value of the expression - 3 - is atomic (yes, it is). So you need to be pretty careful to keep clear on which is meant. To help with this, I'll be pretty pedantic about the distinction in what follows, writing out "the value of exp" explicitly, to distinguish the value from the expression itself. Note, however, that it's quite common for people to be less pedantic and more informal, so that something like (+ x 2) may well be described as "adding the variable x to the variable y", not the more explicit "adding the value of the variable x to the value of the variable y."

Alright, with that admonishment out of the way, let's turn our attention to defining the final set of elementary operations we'll need in tiddlylisp.

Returning an expression as a value:(quote exp) returns the expression exp as its value, without evaluating exp. This is most clearly demonstrated by an example,

tiddlylisp> (quote (+ 1 2))
(+ 1 2)

as opposed to the value of (+ 1 2), which of course is 3. Another example, just to reinforce the point that quote returns an expression literally, without evaluating it,

tiddlylisp> (define x 3)
tiddlylisp> (quote x)
x

not the value of x, which of course is 3. Note also that quoting doesn't evaluate nested subexpressions, e.g.,

tiddlylisp> (quote ((+ 1 2) 3))
((+ 1 2) 3)

quote can also be used to return lists which aren't valid Lisp expressions at all, e.g.,

tiddlylisp> (quote (1 2 3))
(1 2 3)

It's worth emphasizing that (1 2 3) isn't a valid Lisp expression, since 1 is neither a procedure nor a special form - if you enter (1 2 3) into the tiddlylisp interpreter, it will give an error. But what quote lets us do is use the list (1 2 3) as data. For example, we could use it to store the (1 2 3) in a variable,

tiddlylisp> (define x (quote (1 2 3)))
tiddlylisp> x
(1 2 3)

In this way, quote lets us work with lists as data structures.

Why does Lisp have quote? Most other computer programming languages don't have anything like it. The reason Lisp has it is because Lisp allows you to treat code as data, and data as code. So, for example, an object such as (+ 1 2) can be treated as code, i.e., as a Lisp expression to be evaluated, which is what we've been doing up until the current discussion. But (+1 2) can also potentially be treated as data, i.e., as a list of three objects. This ability to treat code and data on the same footing is a wonderful thing, because it means you can write programs to manipulate programs. But it also creates a problem, which is that we need to be able to distinguish when an expression should be treated as data, and when it should be treated as code. quote is a way of distinguishing between the two. It's similar to the way most languages use escape characters to deal with special strings such as ". quote is a way of saying "Hey, what follows should be treated as data, not evaluated as Lisp code". And so quote lets us escape Lisp code, so we can take an expression such as (+ 1 2) and turn it into data: (quote (+ 1 2). In this way, quote allows us to define Lisp expressions whose values are arbitrary Lisp expressions.

Up until now we've been focused on using Lisp to do simple arithmetic operations, and as a result we haven't needed quote. But when we get to Lisp-as-Maxwell's-equations we're going to be increasingly focused on using Lisp to manipulate Lisp code, and as a result we'll be making frequent use of quote. For this reason it's helpful to introduce shorthand for quote. In most Lisps, the conventional shorthand is to use 'exp to denote (quote exp), i.e. if exp is some Lisp expression, then the value of 'exp is just the expression exp itself. Unfortunately, using the shorthand 'exp complicated the parsing in the tiddlylisp interpreter more than I wanted. So I decided to instead use a different (and, I emphasize, unconventional) shorthand for (quote exp), namely (q exp). So our examples of quote above can be shortened to,

Testing whether the value of an expression is atomic: As I noted above, (atom? exp) returns True if the value of the expression exp is atomic, and otherwise returns False. We've already discussed the following example,

tiddlylisp> (atom? (+ 1 2))
True

but it's illuminating to see it in tandem with the following example, which illustrates also the use of quote,

tiddlylisp> (atom? (q (+ 1 2)))
False

As above, the first example returns True because while (+ 1 2) is not atomic, its value, 3, is. But the second example returns False because the value of (q (+ 1 2)) is just (+ 1 2), which is a list, and thus not atomic.

By the way, I should mention that atom? is not a built-in procedure in the dialect of Lisp that tiddlylisp based on, Scheme. I've built atom? into tiddlylisp because, as we'll see, the analogous operation is used many times in the code on page 13 of the LISP 1.5 Programmer's Manual. Of course, an atom? procedure is easily defined in Scheme, but for the purposes of understanding the code on page 13 it seemed most direct to simply include atom? as a built-in in tiddlylisp.

Testing whether two expressions both evaluate to the same atom (or the empty list):(eq? exp1 exp2) returns True if the values of exp1 and exp2 are both the same atom, or both the empty list, (). (eq? exp1 exp2) returns False otherwise. Note that if exp1 and exp2 have the same value, but are not atomic or the empty list, then (eq? exp1 exp2) returns False. For example,

As for atom? my explanation of eq? is not quite the same as in standard Scheme, but instead more closely matches the function eq defined in the LISP 1.5 Programmer's Manual.

Getting the first item of a list:(car exp) returns the first element of the value of exp, provided the value of exp is a list. Otherwise (car exp) is undefined. For example,

tiddlylisp> (car (q (+ 2 3)))
+
tiddlylisp> (car (+ 2 3))

The first of these two behaves as expected: the value of (q (+ 2 3) is just the list (+ 2 3), and so car returns the first element, which is +. In the second, though, car is not defined, and tiddlylisp returns an error message, which I've elided. The reason it returns an error message is because the value of (+ 2 3) is 5, which is not a list, and so car is undefined. In a similar way, suppose we try

tiddlylisp> (car (1 2 3))

Again, we get an error message. The reason is that car is being applied to the value of (1 2 3), considered as a Lisp expression. And, of course, that value is undefined, since 1 is neither a procedure nor a special form. The right way to do the above is to quote the list first,

tiddlylisp> (car (q (1 2 3)))
1

Once again, we see how quote is used to make it clear to tiddlylisp that we're dealing with data, not code to be evaluated.

Getting the rest of a list:(cdr exp) returns a list containing all but the first element of the value of exp. Of course, the value of exp must be a list, otherwise (cdr exp) is undefined. For example,

tiddlylisp> (cdr (q (1 2 3)))
(2 3)

According to Wikipedia the names car and cdr have their origin in some pretty esoteric facts about the early implementations of Lisp. The details don't much matter here - car stands for "contents of address part of register", while cdr stands for "contents of decrement part of register" - I just wanted to make the point that you could reasonably be wondering "Where on Earth did those names come from?!" A mnemonic I find useful in distinguishing the two is to focus on the difference between the names of the two procedures, which of course is just the middle letter - a or d - and to keep in mind that a comes befored in the alphabet, just as car extracts the element of a list that comes before the remainder of the list, as given to us by cdr. Your taste in mnemonics may vary - if you don't like mine, it's still worth taking a minute or two to come up with some trick for remembering and distinguishing car and cdr. Of course, after a little practice you get used to them, and you won't need the mnemonic any more, but at first it's helpful.

Appending an item at the start of a list: Provided the value of exp2 is a list, then (cons exp1 exp2) returns a list containing the value of exp1 as its first element, followed by all the elements of the value of exp2. For example,

tiddlylisp> (cons 1 (q (2 3)))
(1 2 3)

Testing whether an expression evaluates to the empty list:(null? exp) returns True if exp evaluates to the empty list, (), and otherwise returns False. For example,

tiddlylisp> (null? (cdr (q (1))))
True

since (cdr (q (1))) returns the empty list.

Conditionals:(cond (p1 e1)...(pn en)): This starts by evaluating the expression p1. If p1 evaluates to True, then evaluate the expression e1, and return that value. If not, evaluate p2, and if it is True, return the value of e2, and so on. If none of the pj evaluates to True, then the (cond ...) expression is undefined.

That's all the Lisp we're going to need to write our version of Lisp-as-Maxwell's-equations, i.e., our version of the code on page 13 of the LISP Manual! In fact, as we'll see shortly, it's more than we need - I included a few extra features so that we could work through examples like sqrt, and also simply for fun, to make tiddlylisp a richer and more interesting language to play with. Of course, what I've described is merely a subset (with some variations) of our chosen dialect of Lisp (Scheme), and there are important missing ideas. To learn more about Lisp or Scheme, please consult the suggestions for further reading at the end of this essay.

An interpreter for Lisp

Now that we've worked through the basic elements of Lisp, let's write a simple Lisp interpreter, using Python. The interpreter we'll write is based on Peter Norvig's lispy interpreter, and I highly recommend you read through his explanation. I've given the program we'll discuss a separate name - tiddlylisp - so as to make it easy to refer to separately from Norvig's interpreter, but please keep in mind that most of the code we're discussing is Norvig's. However, we're going to examine the code from a different angle than Norvig: we're going to take a bit more of a bottom-up computer's-eye view, looking at how the code executes.

We'll look at the code piece by piece, before putting it all together. Let's start with the interactive interpreter, which is run when you start up. We implement this with a Python procedure called repl, meaning to read some input, evaluate the expression, print the result of the evaluation, and then loop back to the beginning. This is also know as the read-eval-print loop, or REPL. Here's the repl procedure, together with a procedure for handling errors when they occur:

The core of repl is in the try clause, and we'll get to how that works shortly. Before we look at that, note that if the user presses Ctrl-C, it raises the KeyboardInterrupt exception, which causes the program to exit. If an error occurs during the try block -- say, due to a syntax error in the Lisp expression being parsed, or due to a bug in tiddlylisp itself - then some other exception will be raised. Tiddlylisp doesn't deal very well with errors - it simply announces that an error has occurred, and prints the Python stack trace, which may give you a few hints about what's gone wrong, but which is obviously a long way short of truly informative error handling! After printing the stack trace, tiddlylisp simply returns to the prompt. This type of error handling could easily be improved, but we're not going to invest any effort in this direction.

Let's look at the try clause. It begins by taking input at the prompt, and then passing it to the function parse, which converts the string entered by the user into an internal representation, i.e., a data structure that's more convenient for our Python program to work with than a string. Here's an example which shows how it works:

parse("(* (+ 7 12) (- 8 6))") = ["*", ["+", 7, 12], ["-", 8, 6]]

In other words, nested Lisp lists become Python sublists, and things like procedures and numbers become elements in a Python list.

We'll look shortly at how parse is implemented. For now, let's finish understanding how repl works. The output of parse is passed to the function eval, which evaluates the internal representation of the expression entered by the user. Provided no error occurs, the result is returned in val. However, val is in the format of the internal representation, and so we need to convert it from that internal representation back into a printable Lisp expression, using to_string.

At this point, we've got three functions to understand the details of: parse, eval and to_string. I'll explain them out of order, starting with parse and to_string, since they're extremely similar. Then we'll get to eval.

Alright, let's understand how parse works. Without further ado, here's the code; for the explanation, see below:

The parsing is easy to understand. tokenize first inserts spaces on either side of any parentheses, and then splits the string around spaces, returning a list of tokens, i.e., the non-space substrings. read_from takes that list and removes the parentheses, instead nesting sublists as indicated by the original parenthesis structure. And, finally, atom turns tokens into Python ints, floats or Symbols (strings, by definition), as appropriate.

That's all there is to parse. If tiddlylisp were a little more powerful then parse would need to be more complex. For example, if we allowed strings as first-class objects in the language, then it would not work to tokenize by splitting around spaces, since that would risk splitting a string into separate tokens. This is the kind of thing that'd be fun to include in an extended version of tiddlylisp (and I've included it as a problem later in this section), but which we don't need in a first version.

Let's look now at how to_string works. It's much simpler, quickly undoing the steps taken in parsing:

In other words, if the internal representation of the expression is not a list, then return an appropriate stringified version (this takes care of the fact that we may have ints, floats and, as we shall see, Booleans in the internal representation). If it is a list, then return whatever we get by applying to_string to all the elements of that list, with appropriate delimiting by whitespace and parentheses.

At this point, the main thing we need to complete tiddlylisp is the eval function. Actually, that's not quite true: as we discussed earlier, tiddlylisp also keeps track of a global environment (and possibly one or more inner environments), to store variable and procedure names, and their values. eval is going to make heavy use of the environments, and so it helps to look first at how environments are defined. They're pretty simple: an environment has a bunch of keys, representing the names of the variables and procedures in that environment, and corresponding values for those keys, which are just the values for the variables or procedures. An environment also keeps track of its outer environment, with the caveat that the global environment has an outer environment set to Python's None. Such environments are easily implemented as a subclass of Python dictionaries:

As you can see, the only modifications of the dictionary class are that: (1) an environment also keeps track of its own outer environment; and (2) an environment can determine whether a variable or procedure name appears in its list of keys, and if it doesn't, then it looks to see if it's in its outer environment. As a result, the find method returns the innermost environment where the variable or procedure name appears.

Note, incidentally, that the environment doesn't distinguish between variable and procedure names. Indeed, as we'll see, tiddlylisp treats user-defined procedures and variables in the same way: a procedure is a variable which just happens to take the value of a lambda expression as its value.

Tiddlylisp starts off operating in a particular global environment, and this too must be defined by our program. We'll do this by creating an instance of the class Env, and calling a function to add some built-in procedure definitions and variables:

Incidentally, in tiddlylisp's version of add_globals I decided to strip out many of the built-in procedures which Norvig includes in lispy's global environment - it's instructive to look at Norvig's code for add_globals to see just how easy it is to add more built-in procedures to tiddlylisp. If you want to do some exploratory programming with tiddlylisp then you should probably copy some of Norvig's additional built-in procedures (and perhaps add some of your own). For us, though, the above procedures are enough.

One notable feature of the global environment is the variables named True and False, which evaluate to Python's Boolean True and False, respectively. This isn't standard in Scheme (or most other Lisps), but I've done it because it ensures that we can use the strings True and False, and get the appropriate internal representation.

With the global environment set up, we can now understand how eval works. The code is extremely simple, simply enumerating the different types of expressions we might be evaluating, and reading from or modifying the environment, as appropriate. It's worth reading (and rereading) the code in detail, until you understand exactly how eval works. I also have a few comments at the end. Here's the code:

Mostly this is self-explanatory. But allow me to draw your attention to how Norvig deals with anonymous procedure definitions using lambda. When I first examined his code I wondered how he'd cope with this, and expected it would be quite complex. But as you can see, it is extremely simple: lambda expressions evaluate to the appropriate anonymous Python function, with a new environment modified by the addition of the appropriate variable keys, and their values. Beautiful!

Tiddlylisp is essentially complete at this point. It's convenient to finish off the program by providing two ways of running tiddlylisp: either in an interactive interpreter mode, i.e., the REPL, or by loading a tiddlylisp program stored in a separate file. To start the REPL, we'll simply run python tiddlylisp.py. To load and execute a file, we'll run python tiddlylisp.py filename. After execution, we'd like to be dropped into the REPL so we can inspect results and do further experiments. The main complication in doing this is the need to load tiddlylisp code which is split over multiple lines. We do this by merging lines until the number of opening and closing parentheses match. Here's the code - it's best to start at the bottom, with the code immediately after if __name__ == "__main__":

import sys
def load(filename):
"""
Load the tiddlylisp program in filename, execute it, and start the
repl. If an error occurs, execution stops, and we are left in the
repl. Note that load copes with multi-line tiddlylisp code by
merging lines until the number of opening and closing parentheses
match.
"""
print "Loading and executing
f = open(filename, "r")
program = f.readlines()
f.close()
rps = running_paren_sums(program)
full_line = ""
for (paren_sum, program_line) in zip(rps, program):
program_line = program_line.strip()
full_line += program_line+" "
if paren_sum == 0 and full_line.strip() != "":
try:
val = eval(parse(full_line))
if val is not None: print to_string(val)
except:
handle_error()
print "\nThe line in which the error occurred:\n
break
full_line = ""
repl()
def running_paren_sums(program):
"""
Map the lines in the list program to a list whose entries contain
a running sum of the per-line difference between the number of '('
parentheses and the number of ')' parentheses.
"""
count_open_parens = lambda line: line.count("(")-line.count(")")
paren_counts = map(count_open_parens, program)
rps = []
total = 0
for paren_count in paren_counts:
total += paren_count
rps.append(total)
return rps
if __name__ == "__main__":
if len(sys.argv) > 1:
load(sys.argv[1])
else:
repl()

That completes the code for tiddlylisp! A grand total of 153 lines of non-comment, non-whitespace code. Here it all is, in one big block (commented and slightly reordered), so you can see how the pieces fit together:

Problems

Modify tiddlylisp so that the + procedure can be applied to any number of arguments, e.g., so that (+ 1 2 3) evaluates to 6.

Earlier we implemented a square root procedure in tiddlylisp. Can you add it directly to tiddlylisp, using the Python math module's sqrt function?

In our earlier implementation of the sqrt procedure we discussed the ordering of the lines of code, and whether it's okay to define a procedure in terms of some other yet-to-be-defined procedure. Examine the code for tiddlylisp, and explain why it's okay for a procedure such as sqrt to be defined in terms of a procedure such as sqrt-iter which isn't defined until later. Try doing the same thing with variables, e.g., try running (define x y) followed by (define y 1). Does this work? If so, why? If not, why not?

Modify tiddlylisp so that when applied to one argument the - procedure simply negates it, e.g., (- 2) returns -2, while - still computes differences when applied to two arguments.

Is it possible to write a pure tiddlylisp procedure minus so that (minus x) returns -x, while (minus x y) returns x-y?

In the discussion where we introduced cond I stated that (cond (p1 e1)...(pn en)) is undefined when none of the expressions p1...pn evaluate to True. What does Tiddlylisp return in this case? Can you think of a better way of dealing with this situation?

Can you add support for strings to tiddlylisp?

When I first examined Norvig's code for lispy, I was surprised by just how much I learned from his code. Of course, I expected to learn quite a bit - I am just a beginner at Lisp - but what I learned greatly exceeded my expectations. Why might writing an interpreter deepen our understanding of a programming language? I think the answer has to do with how we understand abstractions. Consider the way I first explained the concept of Lisp environments, early in this essay: I gave a general discussion of the concept, and then related it to several of the examples we were working through. This is the usual way we cope with abstractions when learning (or teaching) a language: we make those abstractions concrete by working through code examples that illustrates the consequences of those abstractions. The problem is that although I can show you examples, the abstraction itself remains ephemeral.

Writing an interpreter is a way of making a programming language's abstractions concrete. I can show you a million examples illustrating consequences of the Lisp environment, but none will have quite the same concrete flavour as the code for our Python Lisp interpreter. That code shows explicitly how the environment can be represented as a data structure, how it is manipulated by commands such as define, and so on. And so writing an interpreter is a way of reifying abstractions in the programming language being interpreted.

Problems

I gave the example of the environment as a Lisp abstraction which is made more concrete when you understand the code for the interpreter. Another example of an abstraction is errors in code. Can you improve tiddlylisp's error handling so that we get something more informative than a Python stack trace when something goes wrong? One suggestion for how to do this is to identify two (or more) classes of error that may occur in tiddlylisp programs, and to modify the interpreter so it catches and gracefully handles those error classes, printing informative error messages.

On Peter Norvig's webpage describing his interpreter, a few commenters take him to task for writing his interpreter in Python. Here's an example to give you the flavour of these comments:

This code looks very nice, but i think that implementing a Lisp Interpreter in Python is some kind of cheating. Python is a high-level language, so you get very much for free.

You are right -- we are relying on many features of Python: call stack, data types, garbage collection, etc. The next step would be to show a compiler to some sort of assembly language. I think either the Java JVM or the Python byte code would be good targets. We'd also need a runtime system with GC. I show the compiler in my PAIP [Paradigms of Artificial Intelligence] book.

The commenter and Norvig are right, in some sense. But there's also a sense in which the Python interpreter achieves something that would not be achieved by a program that compiled Lisp to the Java JVM, or to assembler, or some other target closer to the bare metal. That's because of all programming languages, Python is one of the closest to an ordinary human language. And so writing a Lisp interpreter in Python is an exceptionally clear way of explaining how Lisp works to a human who doesn't yet understand the core concepts of Lisp. Insofar as I can guess at Norvig's intention, I believe the code for his interpreter is primarily intended to be read by humans, and the fact that it can also be read by a computer is a design constraint, not the fundamental purpose [3].

It seems to me that the kind of comment above arises because there are really three natural variations on the question "How to explain Lisp?". All three variations are interesting, and worth answering; all three have different answers. The first variation is how to explain Lisp to a person who doesn't yet know Lisp. As I've just argued, a good answer to this question is to work through some examples, and then to write a simple Python interpreter. The second variation is how to explain Lisp to a machine. That's the question the commenter on Norvig's blog is asking, and to answer that question nothing beats writing a Lisp interpreter (or compiler) that works close to the bare metal, say in assembler, requiring you to deal explicitly with memory allocation, garbage collection, and so on.

But there's also a third variation on the question. And that's how best to explain Lisp to someone who already understands the core concepts of Lisp. That sounds paradoxical: doesn't such a person, by definition, already understand Lisp? But it's not paradoxical at all. Consider the following experience which many people have when learning (or teaching) mathematics. The best way to explain a mathematical idea to someone new to the idea is using their old language and their old way of looking at the world. This is like explaining Lisp by writing a Python interpreter. But once the person has grasped a transformative new mathematical idea, they can often deepen their understanding by re-examining that idea within their new way of looking at the world. That re-examination can help crystallize a deeper understanding. In a similar way, while writing a Lisp interpreter in Python may be a good way of explaining Lisp to a person who doesn't yet understand Lisp, someone who grasps the core ideas of Lisp may find the Python interpreter a little clunky. How should we explain Lisp within the framework of Lisp itself? One answer to that question is to use Lisp to write a Lisp interpreter. It's to that task that we now turn.

Lisp in Lisp

How should we write a Lisp interpreter in Lisp? Let's think back to what Alan Kay saw at the bottom of page 13 of the LISP Manual:

Although it's written in a different notation than we've used, this is Lisp code. In fact, it's the core of a Lisp interpreter written in Lisp: the procedure evalquote takes a Lisp expression as input, and then returns the value of that expression. In this section we're going to use tiddlylisp to write an analogue to evalquote (we'll change the name to eval). Of course, such a procedure is not really a full interpreter - we won't have a read-eval-print loop, for one thing - but it's not difficult to extend our code to a full interpreter (it requires a few additions to tiddlylisp, too). For this reason, in what follows I'll refer to our eval procedure as an "interpreter", even though it's more accurate to say that it's the core of an interpreter. I haven't made the extension to a full interpreter here, partly because I don't want to lengthen an already long essay, but mostly because I want to stick to the theme of the "Maxwell's equations of software". For the same reasons, I've also limited our eval to interpreting only a subset of tiddlylisp, omitting the arithmetical operations and concentrating instead on procedures for manipulating lists.

My treatment in this section is based on a beautiful essay (postscript) by Paul Graham, in which he explains what the original designer of Lisp, John McCarthy, was up to in the paper where he introduced Lisp. In his essay, Graham writes a fully executable Lisp interpreter in one of the modern dialects of Lisp, Common Lisp, and I've based much of my code on Graham's. Perhaps the main difference in my treatment is that while Graham's eval is written to be run under Common Lisp, our eval is executable in tiddlylisp, an interpreter for Lisp that we've written ourselves (with lots of help from Peter Norvig!) So even though the code is very similar, the perspective is quite diferent, and I think we gain something from this different perspective.

The code we'll write is longer than what you see on page 13 of the LISP Manual. The reason is that the code on page 13 was not actually self-contained, but made use of several procedures defined earlier in the LISP Manual, and we need to include those procedures. The final result is still only a little over a page of code. Let's start by defining a few of those helper procedures.

(pair exp1 exp2) returns a two-element list whose elements are the value of exp1 and the value of exp2:

tiddlylisp> (pair 1 2)
(1 2)
tiddlylisp> (pair (+ 1 2) 1)
(3 1)

Here's the tiddlylisp code for pair:

(define pair (lambda (x y) (cons x (cons y (q ()) ))))

Note that my use of pair is somewhat unconventional - the more usual approach in Lisp is to use (list exp1 exp2 exp3...) to construct a list whose values are just the values of the respective expressions. The reason I haven't done this is because tiddlylisp doesn't allow us to define Lisp procedures with a variable number of arguments. Note also that the procedure pair that I've defined should not be confused with one of Scheme's standard procedures, pair?, which has a different purpose, and which we won't use in the current essay.

Problems

Can you modify tiddlylisp so that (list exp1 exp2 exp3...) does indeed return a list whose values are just the values of the respective expressions?

I'll now introduce a class of helper procedures which are concatenations of two or more applications of car or cdr. An example is the procedure cdar, which applies car first, followed by cdr, that is, (cdar exp) has the same value as (cdr (car exp)). The notation cdar is a mnemonic, whose key elements are the middle two letters, d and a, indicating that cdar is what you get when you apply (in reverse order) cdr and car. You might wonder why it's reverse order - the answer is that reverse order corresponds to the visual syntactic order, that is, the order from left-to-right that the procedures appear in the expression (cdr (car exp)).

As another example, the procedure caar is defined so that (caar exp) has the same value as (car (car exp)). In our eval it'll be helpful to use several such procedures:

Our next helper procedure is called pairlis. (pairlis exp1 exp2) takes expressions exp1 and exp2 whose values are lists of the same length, and returns a list which is formed by pairing the values of corresponding elements. For example,

We'll call a list of pairs such as that produced by pairlis an association list. It gets this name from our final helper procedure, the assoc procedure, which takes an association list and treats it as a lookup dictionary. The easiest way to explain what this means is through an example,

In other words, assoc looks for the key 2 as the first entry in one of the pairs in the list which is the value of a. Once it finds such a pair, it returns the second element in the pair.

Stated more abstractly, suppose the expression exp1 has a value which appears as the first entry in one of the pairs in the association list which is the value of exp2. Then (assoc exp1 exp2) returns the second entry of that pair.

After all that explanation, the code for assoc is extremely simple, simpler even than pairlis:

I won't explain how assoc works, but if you're looking for a good exercise in applying caar and similar procedures, then it's worth spending some time to carefully understand how assoc works.

With all these helper procedures in place, we can now write our equivalent to the code on page 13 of the LISP Manual. This includes both the core procedure, eval, together with a couple of extra helper procedures, evcon and evlis. Here's the code:

Before we examine how eval works, I want to give you some examples of eval in action. If you want, you can follow along with the examples by first loading the program defining eval into tiddlylisp (the full source is below), and then typing the examples into the interpreter.

To understand how to use eval in examples, we need to be clear about the meaning of its arguments. e is a Lisp expression whose value is the Lisp expression that we want to evaluate with eval. And a is a Lisp expression whose value is an association list, representing the environment. In particular, the first element of each pair in a is the name of a variable or procedure, and the second element is the value of that variable or procedure. I'll often refer to a just as the environment.

Suppose, for example, that we wanted to use eval to evaluate the expression (car (q (1 2))). We'll assume that we're evaluating it in the empty environment, that is, no variables or extra procedures have been defined. Then we'd need to pass eval expressions with values (car (q (1 2))) and (). We can do this by quoting those values:

tiddlylisp> (eval (q (car (q (1 2)))) (q ()))
1

As you can see, we get the right result: 1.

I explained in detail how to build up the expression (eval (q (car...) evaluated above. But if we hadn't gone through that explanation, then the expression would have appeared quite of complicated, with lots of quoting going on. The reason is that eval is evaluating an expression which is itself the value of another expression. With so much evaluation going on it's no wonder there's many q's floating around! But after working carefully through a few examples it all becomes transparent.

Here's an example showing how to use variables in the environment:

tiddlylisp> (eval (q (cdr x)) (q ((x (1 2 3)))))
(2 3)

Unpacking the quoting, we see that it's evaluating the expression (cdr x) in an environment with a variable x whose value is (1 2 3). The result is, of course, (2 3).

Here's an example showing how to use a procedure which has been defined in the environment:

In other words, the environment stores a procedure cddr whose value is (lambda (x) (cdr (cdr x))), and eval returns the result of applying cddr to an expression whose value is (1 2 3 4 5). Of course, this is just (3 4 5).

We can also use eval to define and evaluate an anonymous procedure, in this case one that has the same effect as cadr:

A significant drawback of eval is that it has a pretty limited Lisp vocabulary. You can see this by running:

tiddlylisp> (eval (q (eq? 1 1)) (q (())))

The first line looks like perfectly valid Lisp - in fact, it is perfectly valid Lisp. The problem is that eval doesn't recognize 1 - at the level of sophistication we're working it really only understands lists, variables, and procedures. So what it tries to do is treat 1 as a variable or procedure to look up in the environment, a. But 1 isn't in the environment, which is why there's an error message.

Fixing this problem by modifying eval isn't terribly difficult [4]. However, to stay close to the LISP Manual, I'll leave this as is. A kludge to get around this issue is to add 1 as a key in the environment. For example, we can use:

This is exactly as expected. We didn't see this problem in our earlier examples of eval, simply because they involved list manipulations which didn't require us to evaluate numbers such as 1. Incidentally, here's an amusing variation on the above kludge:

tiddlylisp> (eval (q (eq? 1 2)) (q ((1 1) (2 1))))
True

In other words, if we tell our interpreter emphatically enough that 1 = 2 then it will start to believe it!

Just to put eval through its paces, let's add a bundle of tests of basic functionality. It's not an exhaustive test suite, but at least checks that the basic procedures are working as we expect. You don't need to read through the following test code in exhaustive detail, although you should read at least the first few lines, to get a feeling for what's going on. Note that in a few of the lines we need to add something like 1 or 2 to the environment, in order that eval be able to evaluate it, as occurred in the example just above.

In tiddlylisp, perhaps the easiest way to use this test code is to append it at the bottom of the file where we define eval. Then, when we load that file into memory, the tests run automatically. If everything is working properly, then all the tests should evaluate to True.

How does eval work? Looking back at the code, we see that it's just a big cond statement, whose value is determined by which of various conditions evaluate to True. The cond statement starts off:

(cond
((atom? e) (assoc e a))
...

To understand what this accomplishes, it is helpful to remember that what we're most interested in is the value of e, not e itself. Let's use e' to denote the value of e, i.e., e' is the Lisp expression that we actually want to evaluate using eval. Then what the condition above does is check whether e' is atomic, and if so it returns the value of the corresponding variable or procedure in the environment, exactly as we'd expect.

Let's look at the next line in the big outer conditional statement:

((atom? (car e))

At this stage, we know that e' isn't atomic, since we already checked for that, and so e' must be a list. This line checks to see whether the first element of e' is itself an atom. If it is, then there are multiple possibilities: it could be a special form, such as quote, or a built-in procedure, such as car, or else a procedure that's defined in the environment. To check which of these possibilities is the case, we evaluate another (nested) conditional statement. This just checks off the different cases, for instance the first line of the nested conditional checks to see if we're applying the procedure car, and if so proceeds appropriately,

((eq? (car e) (q car)) (car (eval (cadr e) a)))

In other words, if the first symbol in e' is car, then extract whatever expression is being passed to car, using (cadr e), then evaluate that expression using (eval (cadr e) a), and finally extract the first element, using (car (eval...)). That's exactly what we'd expect car to do. Most of the rest of this nested conditional statement works along similar lines, as you can check yourself. The final line is interesting, and deserves comment:

(True (eval (cons (assoc (car e) a) (cdr e)) a))))

This line is evaluated when the expression e' does not start with a special form or built-in procedure, but instead starts with the name of a procedure defined in the environment. To understand what is returned, note that (car e) retrieves the name of the procedure, so (assoc (car e) a) can retrieve the procedure from the environment, and then (cons (assoc (car e) a) (cdr e)) appends the arguments to the procedure. The whole thing is then evaluated. It's all quite simple and elegant!

Moving back into the outer cond statement, the final condition is as follows:

This occurs when evaluating a quoted expression of the form ((lambda (x...|) exp). The first line simply checks that we are, indeed, seeing a lambda expression. The caddar e extracts the expression exp from the body of the lambda expression. We evaluate this in the context of an environment which has modified by appending some new variable names (extracted with cadar e), using pairlis to pair them with their values, which are evaluated using evlis (which you can work through yourself). Once again, it's all quite simple and neat - a fact which speaks to the marvellous elegance of the design presented in the LISP Manual (and, ultimately, due to John McCarthy).

It won't have escaped your attention that our Lisp eval is very similar to the eval we wrote earlier in Python. Tiddlylisp is somewhat different to the dialect of Lisp our eval interprets, but the implementation is recognizably similar. It is a matter of taste, but I think the Lisp implementation is more elegant. It's true that the Lisp code is superficially a little more complex - it relies more on concepts outside our everyday experience, such as the procedures caar, cadar, and so on. But it makes up for that by possessing a greater conceptual economy, in that we are using concepts such as car, cdr and cond to write an interpreter which understands those very same concepts.

Here's the full code for our Lisp interpreter in tiddlylisp. You should append the test code given above, and save it all as a single file, eval.tl.

It's instructive to compare our eval to what Kay saw on page 13 of the LISP 1.5 Programmer's Manual:

Obviously, what we've written is longer than that half-page! However, as I mentioned earlier, that half-page omitted the code for helper procedures such as caar, append, and so on, which were defined earlier in the LISP Manual. A more direct comparison is to our code for the eval, evcon and evlis procedures.

If you compare our code to the LISP Manual, a few differences jump out. The most obvious is that the LISP Manual's evalquote, apply and eval have all been combined into one procedure. This is a form of organization I adopted from Paul Graham's eval, and it makes it much easier to see what is going on, in the outer cond. In particular, the outer cond has a very simple structure: (1) if the expression we're tring to evaluate is an atom, return its value; otherwise (2) the expression must be a list, so check to see if the first element is an atom, in which case it must be a special form or procedure, and should be evaluated appropriately (this is the inner cond); and otherwise (3) we must be dealing with a lambda expression.

Condition (3) is interesting. With the syntax we're using, the condition in step (3) could simply be expressed as True, not (eq? (caar e) (q lambda), since it's the only remaining possibility. This would, in some sense, simplify (and speed up) the code. However, it would also make it harder to understand the intent of the code.

Something that you may note was present in the LISP Manual but which is missing from our eval is the special form label. label was used in the LISP Manual to give names to procedures, and so that procedure definitions could refer recursively to themselves. It's only a couple of lines to add back in, but I haven't done so. If you'd like, it's a fun challenge to add this functionality back in, and so I've given this as a problem below.

Problems

How could you add a facility to eval so that procedure definitions can refer to themselves? If you're having trouble with this problem, you can get a hint by looking at the code from page 13 of the LISP Manual. A complete solution to the problem may be found in Paul Graham's essay about the roots of Lisp.

This is a nice little interpreter. However, it has many limitations, even when compared to tiddlylisp. It can't do basic arithmetic, doesn't cope with integers, much less more complicated data types, it doesn't even have a way of defineing variables (after all, it doesn't return a modified environment). Still, it already contains many of the core concepts of Lisp, and it really is an executable counterpart to what Alan Kay saw on page 13 of the LISP 1.5 Manual.

What can we learn from this interpreter for Lisp in Lisp? As an intellectual exercise, it's cute, but beyond that, so what? Let's think about the analogous question for Python, i.e., writing a Python function that can return the value of Python expressions. In some sense, solving this problem is a trivial one-liner, since Python has a built-in function called eval that is capable of evaluating Python expressions.

What if, however, we eliminated eval (and similar functions) from Python? What then? Well, a Python version of eval would be much more complicated than our Lisp eval. Python is much less regular language than Lisp, and this makes it much more complicated for it to deal with Python code. By contrast, Lisp has an extremely simple syntax, and is designed to manipulate its own code as data. This is all reflected in the simplicity of the interpreter above.

Beyond this, the code for eval is a beautiful expression of the core ideas of Lisp, written in Lisp. It's true that our eval implements a very incomplete version of Lisp, but with just a little elaboration we can add support for arithmetic, more advanced control structures, and so on - everything needed to make this an essentially complete basic Lisp. And so we need only a little poetic license to say that, just as with Maxwell's equations and electromagnetism, there is a sense in which if you can look at this compact little program and understand all its consequences, then you understand all that Lisp can do. And because Lisp is universal, that means that inside these few lines of code is all a computer can do - everything from Space Invaders to computer models of climate to the Google search engine. In that sense this elegant little program really is the Maxwell's equations of software.

Problems

Outline a proof that (eval (q (exp)) a) returns the value of exp in the environment a for all expressions exp and environments a if and only if the underlying Lisp interpreter is correct. This little theorem can be considered a formal way of stating that eval contains all of Lisp. The reason I ask for an outline proof only is that various elements in the statement aren't defined as well as they need to be to make this a rigorous result; still, a compelling outline proof is possible.

Extend the code given for eval so that you can implement a full read-eval-print loop. This will require you to extend tiddlylisp so that it can cope with input and output, and (perhaps) some sort of looping.

Having worked through eval.tl, it should now be easy to work through the first chapter of the LISP 1.5 Programmer's Manual. Download the LISP manual and work through the first chapter, including the code on page 13.

Problems for the author

Is it possible to modify the above Lisp-in-Lisp so that it interprets all of tiddlylisp? Note that this will require modification of tiddlylisp.

Acknowledgements

Footnote

[1] I'm paraphrasing, since this was 17 years ago, but I believe I've reported the essence of the comments correctly. I've taken one liberty, which is in supplying my own set of examples (antennas, motors, and circuits), since I don't recall the examples he gave. Incidentally, his comments contain a common error that took me several years to sort out in my own thinking: Maxwell's equations actually don't completely specify electromagnetism. For that, you need to augment them with one extra equation, the Lorentz force law. It is, perhaps, unfair to characterize this as an error, since it's a common useage to equate Maxwell's equations with electromagnetism, and I've no doubt my professor was aware of the nuance. Nonetheless, while the useage is common, it's not correct, and you really do need the Lorentz force law as well. This nuance creates a slight quandary for me in this essay. As the title of the essay suggests, it explores a famous remark made by Alan Kay about what he called "the Maxwell's equations of software", and I presume that in making this remark Kay was following the common useage of equating the consequences of Maxwell's equations with the whole of electromagnetism. My quandary is this: on the one hand I don't wish to perpetuate this useage; on the other hand I think Kay's formulation is stimulating and elegant. So I'll adopt the same useage, but with the admonishment that you should read "Maxwell's equations" as synonomous with "Maxwell's equations plus the Lorentz force law". That set of five equations really does specify the whole of (classical) electromagnetism.

[2] I first encountered lambda in Python, not Lisp, but I believe I would have been perplexed for the same reason even if I'd first encountered it in Lisp.

[3] With a hat tip to Abelson and Sussman, who famously wrote "programs must be written for people to read, and only incidentally for machines to execute".

[4] The simplest solutions I can think of are: (1) to give eval the ability to determine when some key is not in the environment; or (2) to give eval the ability to recognize numbers. Both approaches seem to also require making some modifications to tiddlylisp.

Further reading

Lisp enjoys a plethora of insightful and beautifully written books and essays, many of them freely available online. This essay is, of course, based principally on Peter Norvig's essay on his basic Lisp interpreter, lispy. I've also drawn a few ideas from a followup essay of Norvig's which describes a more sophisticated Lisp interpreter. Both essays (and the accompanying code) are marvellously elegant, and well worth working through. Norvig's other works are also worth your time. The first three chapters of Norvig's book Paradigms of Artificial Intelligence Programming are an excellent introduction to Common Lisp.

The other principal inspiration for the current essay is Paul Graham's essay The Roots of Lisp (postscript file), where he explains John McCarthy's early ideas about Lisp. My essay may be viewed as an attempt to remix the ideas in Norvig's and Graham's essays, in an attempt to better understand Alan Kay's remark about Lisp-as-Maxwell's-Equations. I also recommend Graham's book "On Lisp", which contains an excellent discussion of Lisp macros and many other subjects. The book seems to be out of print, but thanks to Graham and the publisher Prentice Hall the text of the entire book is freely available online. Note that I am still working through "On Lisp", I have not yet read it to completion. The same is true of the books I mention below, by Seibel, and by Abelson and Sussmann.

Although "On Lisp" is a marvellous book, it's not written for people new to Lisp. To gain familiarity, I suggest working through the first three chapters of Norvig's book, mentioned above. If that's not available, then you should take a look at Peter Seibel's book Practical Common Lisp. It's freely available at the link, and gives an easily readable introduction to Common Lisp.

Finally, I must recommend the wonderful book by Abelson and Sussman on the Structure and Interpretation of Computer Programs. Among other things, it's an introduction to the Scheme dialect of Lisp, but it's about much more than that; it's about how to think about programming. It's a famous book, but for a long time I avoided looking at it, because I'd somehow picked up the impression that it was a little dry. I started reading, and found this impression was completely false: I was utterly gripped. Much of "On Lisp" and "Paradigms of Artificial Intelligence Programming" also have this quality. Abelson and Sussmann's book is freely available online.

Suppose I add a hyperlink from a webpage to a webpage . In principle, adding that single hyperlink changes the PageRank not just of those two pages, but potentially of nearly every other page on the web. For instance, if the CNN homepage http://cnn.com adds a link to the homepage of my site, http://michaelnielsen.org, then not only will that increase the PageRank of my homepage, it will also increase (though in a smaller way) the PageRank of pages my homepage links to, and then the pages they link to, and so on, a cascading network of ever-smaller changes to PageRank.

In this post I investigate how PageRank changes when the structure of the web changes. We’ll look at the impact of adding and removing multiple links from the web, and also at what happens when entirely new webpages are created. In each case we’ll derive quantitative bounds on the way in which PageRank can change.

I’ve no doubt that most or all of the results I describe here are already known – PageRank has been studied in great detail in the academic literature, and presumably the bounds I describe (or better) have been obtained by others. This post is for my own pleasure (and to improve my own understanding) in attacking these questions from first principles; it is, essentially, a cleaned up version of my notes on the problem, with the hope that the notes may also be of interest to others. For an entree into the academic literature on these questions, see here.

The post requires a little undergraduate-level linear algebra to follow, as well as some basic familiarity with how PageRank is defined and calculated. I’ve written an introduction to PageRank here (see also here, here, here and here). However, I’ve included a short summary below that should be enough to follow this post.

One final note before we get into the analysis: I understand that there are some people reading my blog who are interested in search engine optimization (SEO). If you’re from that community, then I should warn you in advance that this post doesn’t address questions like how to link in such a way as to boost the PageRank of a particular page. Rather, the results I obtain are general bounds on global changes in PageRank as a result of local changes to the structure of the web. Understand such global changes is of great interest if you’re trying to understand how a search engine using PageRank might work (which is my concern), but probably of less interest if your goal is to boost the PageRank of individual pages (SEO’s concern).

PageRank summary

PageRank is defined by imagining a crazy websurfer, who surfs the web by randomly following links on any given page. They interrupt this random websurfing occasionally to “teleport” to a page chosen completely at random from the web. The probability of teleporting in any given step is assumed to be around , in line with the original PageRank paper. The PageRank of a given webpage is defined to be the long-run probability of the surfer being on the page. A high PageRank means the page is important; a low PageRank means the page is not important.

One minor wrinkle in this definition of PageRank is what our websurfer should do when they arrive at a dangling webpage, i.e., a webpage with no outgoing links. Obviously, they can’t just select a link to follow at random – there’s nothing to select! So, instead, in this situation what they do is always teleport to a randomly chosen webpage. Another, equivalent way of thinking about this is to imagine that we add outgoing links from the dangling page to every single other page. In this imagined world, our crazy websurfer simply chooses a webpage at random, regardless of whether they are teleporting or not.

With the above definition of the PageRank of a page in mind, the PageRank vector is defined to be the vector whose components are the PageRanks of all the different webpages. If we number the pages then the component is the PageRank of page number 0, is the PageRank of page number 1, and so on. The PageRank vector satisfies the PageRank equation, , where is an matrix which we’ll call the PageRank matrix. The element of is just the probability that a crazy websurfer on page will go to page . This probability is if there is no link between pages and , i.e., it’s just the teleporation probability. The probability is if there is a link between page and , where is the number of outbound links from page . Note that in writing these formulae I’ve assumed that is not a dangling page. If it is, then we will use the convention that , i.e., the page is linked to every other page, and so .

In this post I’ll consider several scenarios where we imagine altering the structure of the web in some way – by adding a link, deleting a link, adding a page, and so on. I’ll use to denote the PageRank vector before the alteration, and to denote the PageRank vector after the alteration. What we’re interested in is understanding the change from to . The main quantity we’ll study is the total change in PageRank across all webpages. This quantity is derived from the norm for vectors, . We’ll drop the subscript in the norm notation , since we’ll only be using the norm, never the Euclidean norm, which is more conventionally denoted .

A key fact about the norm and the PageRank matrix is that when we apply the PageRank matrix to any two probability vectors, and , they are guaranteed to get closer by a factor of :

Intuitively, applying the PageRank matrix is like our crazy websurfer doing a single step. And so if we think of and as possible starting distributions for the crazy websurfer, then this inequality shows that these distributions gradually get closer and closer, before finally converging to PageRank. We’ll call this inequality the contractivity property for PageRank. I discuss the contractivity property and its consequences in much more detail in this post.

What happens when we add a single link to the web?

Suppose we add a single new link to the web, from a page to a page . How does the PageRank vector change? In this section we’ll obtain a bound on the total change, . We’ll obtain this bound under the assumption that isn’t a dangling page, i.e., a page with no outgoing links. As we’ll see, dangling pages create some complications, which we’ll deal with in a later section.

To obtain the bound on , let be the PageRank matrix before the new link is inserted, so the PageRank equation is , and let be the PageRank matrix after the new link is inserted, when the PageRank equation becomes . Note that we have . If we introduce a new matrix , then this equation may be rewritten:

Taking norms on both sides, and using the triangle inequality:

Using the contractivity property, , and so:

Rearranging, we obtain:

Up to this point we haven’t used in any way the fact that we’re adding merely a single link to the web. The inequality [*] holds no matter how we change the structure of the web, and so in later sections we’ll use [*] (or an analagous inequality) to analyse more complex situations.

To analyse what happens in the specific case when we add a single new link, we will compute . Recall that is the change between the two PageRank matrices. Suppose that starts with outgoing hyperlinks (where , reflecting the fact that is not a dangling page). By numbering pages appropriately, we can assume that is page number , that the new link is from page to page , and that before the link was inserted, the page was linked to pages . Under this numbering of pages, the only column of which changes in is the first column, corresponding to the new link from page to page . Before the link was added, this column had entries for the rows corresponding to pages through , i.e., for the outgoing links, and everywhere else. After the link is added, this column has entries for the rows corresponding to pages through , and everywhere else. Combining these facts and doing a little algebra we find that the entries in the first column of are:

The other columns of are all zero, because and don’t change in those other columns. Substituting for in [*] and simplifying, we see that [*] becomes:

This is a strong result. The inequality [**] tells us that the total PageRank vector changes at most in proportion to the PageRank of the page to which the outbound link is being added. Ordinarily, will be tiny – it’s a probability in an -element probability distribution, and typically such probabilities are of size . As a result, we learn that the total variation in PageRank will be miniscule in the typical case. [**] also tells us that the total variation in PageRank scales inversely with the number of outbound links. So adding an extra outbound link to a page which already has a large number of links will have little effect on the overall PageRank.

Incidentally, going carefully over the derivation of [**] we can see why we needed to assume that the page was not a dangling page. We get a hint that something must be wrong from the final result: the right-hand side of [**] diverges for a dangling page, since . In fact, during the derivation of [**] we assumed that ‘s first column had entries for the rows corresponding to pages through , and everywhere else. In fact, ‘s first column has entries everywhere in the case of a dangling page. We could fix this problem right now by redoing the analysis for the case, but I’ll skip it, because we’ll get the result for free as part of a later analysis we need to do anyway.

One final note: in the derivation of [**] I assumed that the existing links from page are to pages , and that the new link is to page . This presumes that all the links are to pages other than the original page, i.e., it’s not self-linking. Of course, in practice many pages are self-linking, and there’s no reason that couldn’t be the case here. However, it’s easy to redo the analysis for the case when the page is self-linking, and if we do so it turns out that we arrive at the same result, [**].

Problems

Verify the assertion in the last paragraph.

The inequality [**] bounds the change in terms of the initial PageRank of the linking page, . A very similar derivation can be used to bound it in terms of the final PageRank, . Prove that .

Problems for the author

Is it possible to saturate the bound [**]? My guess is yes (or close to, maybe within a constant factor), but I haven’t explicitly proved this. What is the best possible bound of this form?

We’ve obtained a bound on the variation in norm produced by adding a link to the web. We might also wonder whether we can say anything about the change in the PageRank of individual pages, i.e., whether we can bound ? This is the type of question that is likely to be of interest to people who run webpages, or to people interested in search engine optimization. Note that it’s really three separate questions: we’d like bounds on when (1) is the source page for the new link; (2) is the target page for the new link; and (3) is neither the source nor the target. Unfortunately, I don’t have any such bounds to report, beyond the obvious observation that in all three cases, , and so at least the bound on the right-hand side of [**] always applies. But it’d obviously be nice to have a stronger result!

Problems for the author

Can I find bounds on for the three scenarios described above? (Or perhaps for some different way of specifying the properties of ?, e.g., when is another page linked to by the source page.)

What happens when we add and remove multiple links from the same page?

Suppose now that instead of adding a single link from a page, we remove existing links outbound from that page, and add new links outbound from the page, for a total of links. How does the PageRank change? Exactly as in our earlier analysis, we can show:

where is the change in the PageRank matrix caused by the modified link structure. Assuming as before that the page is not a dangling page, we number the pages so that: (1) the links are outbound from page ; (2) the links to pages are preserved, before and after; (3) the links to pages are removed; and (4) the links to pages are new links. Then all the columns of are zero, except the first column, which has entries:

As a result, we have

To simplify the quantity on the right we analyse two cases separately: what happens when , and what happens when . In the case we get after some algebra

which generalizes our earlier inequality [**]. In the case we get

Observing that is equivalent to , we may combine these two inequalities into a single unified inequality, generalizing our earlier result [**]:

What about adding links to a dangling page?

Let’s come back to the question we ducked earlier, namely, how PageRank changes when we add a single link to a dangling page. Recall that in this case the crazy websurfer model of PageRank is modified so it’s as though there are really outgoing links from the dangling page. Because of this, the addition of a single link to a dangling page is equivalent to the deletion of links from a page that started with outgoing links. From the inequality [***], we see that:

To generalize further, suppose instead that we’d added links to a dangling page. Then the same method of analysis shows:

In general, we can always deal with the addition of links to a dangling page as being equivalent to the deletion of links from a page with outgoing links. Because of this, in the remainder of this post we will use the convention of always treating dangling pages as though they have outgoing links.

What happens when we add and remove links from different pages?

In the last section we assumed that the links were all being added or removed from the same page. In this section we generalize these results to the case where the links aren’t all from the same page. To do this, we’ll start by considering the case where just two pages are involved, which we label and . We suppose outbound links are added to page ( or ), and outbound links are removed. Observe that , where and are the changes in the PageRank matrix due to the modifications to page and page , respectively. Then a similar argument to before shows that:

Applying the triangle inequality we get:

And so we can reuse our earlier analysis to conclude that:

where and are just the number of links outbound from page , before and after the modifications to the link structure, respectively. This expression is easily generalized to the case where we are modifying more than two pages, giving:

This inequality is a quite general bound which can be applied even when very extensive modifications are made to the structure of the web. Note that, as discussed in the last section, if any of the pages are dangling pages then we treat the addition of links to that page as really being deleting links from a page with outgoing links, and so the corresponding term in the sum is . Indeed, we can rewrite the last inequality, splitting up the right-hand side into a sum over pages which are not dangling, and a sum over dangling pages to obtain:

Modified teleportation step

The original PageRank paper describes a way of modifying PageRank to use a teleportation step which doesn’t take us to a page chosen uniformly at random. Instead, some non-uniform probability distribution is used for teleportation. I don’t know for sure whether Google uses this idea, but I wouldn’t be suprised if they use it as a way of decreasing the PageRank of suspected spam and link-farm pages, by making them less likely to be the target of teleportation. It’s possible to redo all the analyses to date, and to prove that the bounds we’ve obtained hold even with this modified teleportation step.

Problems

Prove the assertion in the last paragraph.

What happens when we add a new page to the web?

So far, our analysis has involved only links added between existing pages. How about when a new page is added to the web? How does the PageRank change then? To analyse this question we’ll begin by focusing on understanding how PageRank changes when a single page is created, with no incoming or outgoing links. Obviously, this is rather unrealistic, but by understanding this simple case first we’ll be in better position to understand more realistic cases.

It’s perhaps worth commenting on why we’d expect the PageRank to change at all when we create a new page with no incoming or outgoing links. After all, it seems as though the random walk taken by our crazy websurfer won’t be changed by the addition of this new and completely isolated page. In fact, things do change. The reason is because the teleportation step is modified. Instead of going to a page chosen uniformly at random from pages, each with probability , teleportation now takes us to a page chosen uniformly at random from the full set of pages, each with probability . It is this change which causes a change in the PageRank.

A slight quirk in analysing the change in the PageRank vector is that while the initial PageRank vector is an -dimensional vector, the new PageRank vector is an -dimensional vector. Because of the different dimensionalities, the quantity is not even defined! We resolve this problem by extending into the -dimensional space, adding a PageRank of for the new page, page number . We’ll denote this extended version of by . We also need to extend so that it becomes an by matrix satisfying the PageRank equation for the extended . One way of doing this is to extend by adding a row of zeros at the bottom, and everywhere in the final column, except the final entry:

With the extended and it is easy to verify that the PageRank equation is satisfied. As in our earlier discussions, we have the inequality

where now . Computing , we have:

Substituting and simplifying, we obtain the bound:

Not surprisingly, this bound shows that creating a new and completely isolated page makes only a very small difference to PageRank. Still, it’s good to have this intuition confirmed and precisely quantified.

Exercises

Suppose we turn the question asked in this section around. Suppose we have a web of pages, and one of those pages is an isolated page which is deleted. Can you prove an analogous bound to [****] in this situation?

Problems for the author

How does the bound [****] change if we use a non-uniform probability distribution to teleport? The next problem may assist in addressing this question.

An alternative approach to analysing the problem in this section is to use the formula, where is the vector whose entries are the uniform probability distribution (or, more generally, the vector of teleportation probabilities), and is the matrix whose entry is if links to , and otherwise. This formula is well-suited to analysing the problem considered in the current section; without going into details, the essential reason is that the modified matrix has a natural block-triangular structure, and such block-triangular matrices are easy to invert. What is the outcome of doing such an analysis?

Adding multiple pages to the web

Suppose now that instead of adding a single page to the web we add multiple pages to the web. How does the PageRank change? Again, we’ll analyse the case where no new links are created, on the understanding that by analysing this simple case first we’ll be in better position to understand more realistic cases.

We’ll consider first the case of just two new pages. Suppose is the original -dimensional PageRank vector. Then we’ll use as before to denote the extension of by an extra entry, and to denote the extension of by two entries. We’ll use to denote the -dimensional PageRank vector after adding a single webpage, and use to denote the extension of by appending an extra entry. Finally, we’ll use to denote the -dimensional PageRank vector after adding two new webpages.

Our interest is in analysing . To do this, note that

Applying the triangle inequality gives

Observe that , since the extra appended to the end of both vectors makes no difference to the norm. And so the previous inequality may be rewritten

Now we can apply the results of last section twice on the right-hand side to obtain

More generally, suppose we add new pages to the web. Denote the final -dimensional PageRank vector by . Let denote the initial -dimensional PageRank vector, and let denote the -dimensional extension obtained by appending entries to . Then applying an argument similar to that above, we obtain:

Adding multiple pages, adding and removing multiple links

Suppose now that we add many new pages to the web, containing links to existing webpages. How does the PageRank change? Not surprisingly, answering this kind of question can be done in a straightforward way using the techniques described already in this post. To see this, suppose we add a single new webpage, which contains new links back to existing webpages. As before, we use to denote the new PageRank vector, to denote the old PageRank vector, and to denote the extension of obtained by appending a entry. We have

where is the matrix obtained by adding a row of s to the bottom of , and a column (except the bottom right entry) of values to the right of . We can write , where is the change in due to the added page, and is the change in due to the added links. Applying a similar analysis to before we obtain

We can bound the first term on the right-hand side by , by our earlier argument. The second term vanishes since is zero everywhere except in the last column, and ‘s final entry is zero. As a result, we have

i.e., the bound is exactly as if we had added a new page with no links.

Exercises

There is some ambiguity in the specification of and in the above analysis. Resolve this ambiguity by writing out explicit forms for and , and then verifying that the remainder of the proof of the bound goes through.

Suppose, instead, that the new link had been added from an existing page. Then the term above would no longer vanish, and the bound would become

where was the initial PageRank for the page to which the link was added, and was the initial number of links from that page.

Generalizing this analysis, it’s possible to write a single inequality that unites all our results to date. In particular, suppose we add new pages to the web. Denote the final -dimensional PageRank vector by . Let denote the initial -dimensional PageRank vector, and let denote the -dimensional extension obtained by appending entries to . Suppose we add outbound links to page , and remove outbound links from page . Then an analysis similar to that above shows that:

where, as before, the sum is over those pages which are non-dangling, and the sum is over dangling pages . Note that we can omit all the new pages from this latter sum, for the same reason the quantity vanished earlier in this section. This inequality generalizes all our earlier results.

Problems

Fill in the details of the above proof. To do so it helps to consider the change in PageRank in two steps: (1) the change due to the addition of new links to existing pages; and (2) the change due to the addition of new webpages, and links on those webpages.

Recomputing PageRank on the fly

Part of my reason for being interested in the questions discussed in this post is that I’m interested in how to quickly update PageRank during a web crawl. The most obvious way to compute the PageRank of a set of crawled webpages is as a batch job, doing the computation maybe once a week or once a month. But if you’re operating a web crawler, and constantly adding new pages to an index, you don’t want to have to wait a week or a month before computing the PageRank (or similar measure) of a newly crawled page. You’d like to compute – or at least estimate – the PageRank immediately upon adding the page to the index, without having to do an enormous matrix computation. Is there a way this can be done?

Unfortunately, I don’t know how to quickly update the PageRank. However, the bounds in this post do help at least a little in figuring out how to update the PageRank as a batch job. Suppose we had an extant multi-billion page crawl, and knew the corresponding values for PageRank. Suppose then that we did a mini-crawl to update the index, perhaps with a few million new pages and links. Suppose was the (known value of) the PageRank vector before the mini-crawl, and is the PageRank after the mini-crawl, which we are now trying to compute. Suppose also that is the new PageRank matrix. One natural way of updating our estimate for PageRank would be to apply repeatedly to (more precisely, its extension, ), in the hopes that it will quickly converge to . The bounds in this post can help establish how quickly this convergence will happen. To see this, observe that

where we used the PageRank equation to get the first equality, and the contractivity of PageRank to get the second inequality. This equation tells us that we can compute an estimate for by applying repeatedly to , and it will converge exponentially quickly. This is all standard stuff about PageRank. However, the good news is that the results in this post tell us something about how to bound , and frequently this quantity will be very small to start with, and so the convergence will occur much more quickly than we would a priori expect.

Problems for the author

Is there a better way of updating PageRank on the fly, rather than applying powers of ?

More in the Moorcock vein. It's easy to imagine the reaction of the critic, holding their nose at writing to formula. But you can turn that around, regarding Dent (and, more plausibly, Moorcock) as a student and theoretician of structure. And that's a pretty powerful point of view. Of course, word-by-word Dent is a poor […]

Fascinating both intrinsically, and for the commentary. The commentary first: part of the interest is from people who desire an easy way to write (or, more accurately, to have written). But there is also clearly a genuine interest on the part of many: what does this guy know that I don't about storytelling? You may […]

Kevin Kelly interviews Brian Eno. Slow to get going, but fascinating. Eno proposes "process, not product", says that it's his "ease of seduction" that means he often gets things first, talks about putting more "Africa" into computers, and generally makes many interesting comments.

Documentary of Wolfgang Steiner, one of the world's top ski-jumpers in the 1970s. The spine of the documentary is a sequence of extraordinary shots of Steiner's jumps, taken with a pair of high-speed cameras.

Remarkable survey of the cutting edge of surfing. We see the origins of tow-rope surfing (where surfers are pulled by jet skis into waves that are too big to paddle out to), the use of hydrofoil designs that put the board a foot or two _above_ the wave, and even the use of weather stations […]