Wednesday, August 31, 2005

NAME

lab 42 - channels in shell

NOTES

This is a quicky. I ripped this code off from Rob Pike's squint, an interpreter for his squeak language, a predecessor of limbo. It's an implementation of Eratosthenes's Sieve using communicating processes and channels, which I translated to Inferno shell. This was just to try out the channels in shell, since somehow I've overlooked them till now.

Monday, August 29, 2005

NAME

lab 41 - venti lite

NOTES

I've taken another look recently at venti and the ideas from the venti paper.

Venti is a data log and index. The index maps a sha1 hash of a clump of data to the offset of the clump in the log. Clumps are appended to the log after being compressed and wrapped with some meta-data. A sequence of clumps make an arena, and all the arenas, possibly spread over several disks, make up the whole data log. There is enough information in the data log to recreate the index, if neccessary.

The above is my understanding of Venti so far after reading the code and paper. There is a lot more complexity in it's implementation. There are details about the caches, the index, the compression scheme, the blocking and partitioning of disks, and so on. I will ignore these details for now. Although the whole of venti could be ported to Inferno, I want to look at it without getting bogged down in too many details too early.

Reasoning about Venti in the context of Inferno I tried to do some simple analog of Venti using Inferno sh(1). The two basic functions of Venti are the read of a clump using the hash, called a score, to locate it, and writing a clump getting a score in return. I created two sh functions, putclump and getclump.

It is easier to reuse than to reinvent. I use gzip for compression, puttar for packaging a clump, sha1sum to hash the clump, and dbm as the index.
Here's the code for putclump.

A file must be less than the Maxclumpsize. If I store a set of files I get a list of scores back. I can write this list to a file and write the list back with a different clump type: Pointertype0. Then I store the final score as a file with one score entry and call this the Roottype.

Sunday, August 28, 2005

first year anniversary

A year ago today I started keeping this notebook. The original idea was for it to be a social thing. Instead of keeping a private notebook of my computing experiments, I published everything in a blog to make it part of a conversation.

Who were the other voices?

A pattern emerged in how I went about finding ideas and exploring them. It seems I hit upon an interesting paper--often from a regular source like Google Labs, Bell Labs or the MIT PDOS group--and I'd then try and reason about the paper's subject in the context of Inferno. So the other voices I started listening to were from the authors of these great papers.

Sketching out a program from something I'd read in a paper helped me to understand the paper better and generate ideas that went beyond the paper. This was not just useful, but also a lot of fun. Much as writing is more than communication but an extension of the process of thinking, programming is not merely a corporate activity, or an economic activity, but also a philosophical activity. By which I mean, it's part of the process of reasoning about something.

The code was as much part of the conversation as the prose. And where the prose was english, the chosen platform for expressing the code was Inferno. By including the code an active element was added to the notebook. Everything I talked about could be pulled apart, put back together and run on an Inferno system.

Was Inferno the right choice of system?
How does Inferno help or hinder me to reason about computing problems? Alan Kay has said, "Point of view is worth 80 IQ points" when talking about the power of smalltalk. Paul Graham has expressed similar ideas about the power of Lisp. Why not do the Notebook in Squeak or Lisp? I don't have a good answer to this. It might be just accident that I started my career learning UNIX, moved on to Plan 9 and finally landed on Inferno. In a way it was part of my computing culture, and culture is king of all. I strongly believe Inferno is the most powerful environment I can most easily manage in which to explore ideas. It has broad scope, it is small and easy to understand, and the whole system is under my control. The only other environment that comes close is Squeak. But with my UNIX background I have much stronger connection to the UNIX philosophy of programming as opposed to an Object Oriented one.

Looking ahead it's hard to see what I'll be writing about next. The more papers I read and try out, the more it leads me onto other things that I know nothing of. I was interested for a while in symbolic programming, CA and Alife. I would like to try some DHT applications (from papers in MIT PDOS), but I have lost interest at the moment. No doubt my interest will come round again. However, notice the pattern again, the reason for my interest in these things is often caused by a good source of papers on these topics. In other words, who's talking about what. It's likely I'll find another source of papers on a subject I haven't considered before and will put a lot of effort in writing code to help me think through those new ideas.

I'd like to see the Inferno community grow with more programmers with whom I can talk more directly. I'd like to see wild new ideas about computing with Inferno as part of the language.

Sunday, August 21, 2005

NAME

NOTES

The first problem I had to deal with once extracting the data from the reality mining data set was how to distribute it across all the disks. So this lab describes how I did it.

The Geyron grid I have at home is a simulation grid. I'm just pretending that I have 16 real disks, when I actually only have two. [1] The sizes of things are also scaled down, so each disk is only 64MB instead of a more typical 40GB.

I divided each disk into 1MB chunks for this simulation. On each disk I created 60 files numbered 0.ckfree ... 59.ckfree. Each empty file is a record that an available chunk is on this disk. I don't use 64 because I leave room for slop. Each chunk is going to be slightly over 1MB. I do all this easily enough using my friend, the registry:

The idea behind this, as suggested in the google fs paper, is that each disk is the master record of used and free chunks that it contains. I scan all disks to update a cache for fast querying.
(Seq in the code above is a sequence generator as in plan 9 and included in the files for this lab.)

I created a file2chan server, chunkqueue, that can allocate a free chunk from all the available chunks across all disks. Writes to the file add free chunks to memory. Reads from the file return the next chunk. The chunk is chosen by randomly selecting a disk, then taking the head of a chunk list. The aim is to uniformly distribute the repository across all disks. But this will only work on average. With the few disks I have I'll still get many chunks on the same disk.

A important goal is to get very high parallelism in my distributed file processing. Two truly concurrent processes are ones that are running on independent CPUs and reading from separate disks with no communication between the two. If I have two disks I want to split the file between the disks and process both chunks at the same time on different CPUs. Anytime I get several chunks for the same repository on the same disk, and there is no other replica of those chunks on other disks, I am loosing some amount of parallelism.

I announce chunkqueue to the registry. I can then scan all disks, just like I did above, and write all the free chunks to the chunkqueue.

I assign one disk as the master file system that will keep a permanent record of chunks used by a repository file. The master is assigned in /lib/ndb/cluster and after a refresh shows up in the registry. A file on the master represents on repository file (BigFile) and contains a list of chunk handles. Also stored on the master is the mapping of chunk handle to disks, but this can be regenerated by periodically scanning all the disks. The mapping of chunks to disks is kept distinct from the repository-to-chunks mapping to allow for the possibility of chunks stored on multiple disks and chunk movement among disks. To locate a chunk I just grep the /n/master/locator file (see locatechunk in gridlib).

Now to tie it together with a command to write a file to the repository file.

I append files until I go over the CHUNKSIZE threshold. There are no files that span chunks. Each chunk is self contained, which I hope will make applying mapreduce easier.

When I extracted the data from the reality mining mysql dump I got over 200 1MB files. A database table, like callspan, would be broken up into about 20 files. To distribute the callspan table as one repository file across Geryon I ran the following,

It should be possible to see how a rudimentary inferno cluster file system could be built using similar techniques, hiding much of the above functionality behind a clean file system interface. I'll likely try this at some point. The problem that always gets me is that once I allow files to span chunks, how do I cleanly handle breaking data up for mapreduce?

I still haven't addressed fault tolerance and replication of chunks yet. If anyone cares to have a stab at it ...

Monday, August 08, 2005

NAME

lab 39 - tar and gzip

NOTES

In lab 38 I looked at file formats I could use to store a document repository. I kind of decided I needed an archive file of individually gzipped files. After a little further digging I find that the gzip file format (rfc) supports multiple gzip files concatenated together as a valid gzip file. A quick test on unix shows this to be true.

No. It stops reading after the first file. I looked at the code /appl/cmd/puttar.b and find it outputs zeroed blocks as a sort of null terminator for the file. I'm not sure if that's required to be a valid tar format file. The Inferno commands that read tar files don't seem to care since EOF works just as well. So I edited the file to not output zeroed blocks, and I renamed the command to putwar so not to confuse myself. Now I can append to a tar (war) file. What's more, I can put the gzip and tar together.

This is very close to what I want. I can process the whole archive in a stream, it is compressed, and if I know the offsets of each file I can jump directly to it and start the stream from there.

The remaining problems are that I don't know what meta information to store with the file, so I'm going with tar's information by default. Also, the tar format isn't handled by the sh-alphabet, which is a pity. But that doesn't matter because now I've got something concrete to play with which is good enough.

FILES

Sunday, August 07, 2005

NAME

NOTES

I need a large data set to work with so I can try out more ideas using Geyron.
I want to use real data; one that can not be analyzed trivially using, say, a relational database.

Examples I considered,

crawl the web - a web page repository

an rss feed repository

web server query logs

click logs for a site

aggregate data input by users

system health records

sousveillance logs

Some of these are more difficult to collect than others. Some may contain greater possibility for surprises, and the surprises are what I hope to get by working with real data.
Also, a data set where others can collect and duplicate my results would help to keep the conversation concrete.

But I need something right now, and there are two possibilites at hand. I've got the data set from the MIT Reality Mining project. This is about 200+MB uncompressed. This is big enough to test out some tools. But for this size data, Geyron is not likely to offer anything that can't be done in a relational database. However, in principle this data set could grow very large, and developing grid tools to process it might come in handy. For something larger I could use the Internet Archive's crawler to collect around 1 to 50GB of web pages. I'll go with the first option until I get a better idea of what I'm doing.

Before even collecting the data, I need to consider how to organize it. How it will be stored and processed. What file formats to use, etc.
So, what are the properties of a large data set?
Here's what Rob Pike et al say in their Sawzall paper about Google's repository,
"A document repository holding a few billion HTML pages might be stored
as several thousand files [each about 1GB in size] each storing a million
or so documents of a few kilobytes each, compressed."
A 1GB "file" is divided into 64MB chunks which is distributed among
many disks, and each chunk has 3 copies. Here's an image of the repository format from the Backrub paper.

I'll try and put that in context of Inferno. A close analog seems to be a tar file of gzipped files. I'll make it easier on myself just for the moment and turn that around into a gzipped tar file. How would I process a .tgz file in one pass?

I wrote a shell filter gettarentry,

gunzip < file.tgz |gettarentry { getlines {} | etc. }

where the command block parameter is applied to each file.

Gettarentry spawns a shell process for each
file and writes the data down a pipe between the two.
The environment variable file is set in the context for each command block.

After playing with this and thinking that this should be a shell builtin along the lines of sh's alphabet, I find it already is!

The fs command has bundle and unbundle, an equivalent to tar.

% fs bundle in | gzip > fs.bundle.gz

I hadn't considered the sh-alphabet yet in the context of Geryon. This simple use makes it already worthwhile, but there's much more to it. The fs command is a powerful tree walker, with gates and pattern matching and the ability to apply a command block to each file. The commands below show a simple comparison of fs walking the same file hierachy, one bundled and the other already extracted onto a kfs disk.

To read one small file within the bundle is not very efficient.
But reading and processing the whole fs
is faster if bundled and gzipped.
Time improvement is gained by considerably less disk reads (1/3),
and considerably less interaction with the fs for walking the hierarchy.

This format does not allow me to jump directly to a file within the archive. But if I switch back to the original suggestion, an archive of gzipped files, I get direct access and a few other things. It should be straight forward to append to the archive and know the total size of the archive as I add files.

I'll need to write another module for sh-alphabet to directly handle a repository format where each file is individually gzipped. But the fs/alphabet framework seems the way to go.

Another problem to handle is splitting the repository file among many kfs disks. If each 64MB chunk can stand alone as a bundle file, I could consider a 1GB file as just the concatenation of all the archive chunks. [1] It shouldn't matter in what order we process the chunks. If I build an index of each files location within the repository I need to track the chunk, or keep track of the order of chunks and just keep a single offset within the logical 1GB repository file.

The Internet Archive's crawler, Heritrix, stores web pages in the ARC file format.
I could add this format to sh-alphabet so it can be processed by fs. The crawler splits the archive into 100MB files. So by using this I've already got a lot going for me.

FILES

The files for this lab include the gettarentry command and a command to extract the data from the Reality Mining MySQL dump into a format better suited to Inferno.
caerwyn.com/lab/38

Footnotes

[1] Because I am layering file systems my terminology is getting confusing.
I have chunks and files and disks at several layers.