Trulia Hindsight launched today. I can't claim credit for the gorgeous time interface, but I can say that this is the second reasonably high-profile project that uses Modest Maps as its tile display engine.

Also I promise that next time, there'll actually be something to look at beyond me noodling with computer pseudo-science.

When I first opened up the Oakland crime index, I published data in two forms: data about crime was stored in day/type resources, e.g. May 3rd murders or Jan 1st robberies, while binary-search indexes on the case number, latitude, and longitude were published with pointers to the day/type resources. As I've experimented with code to consume the data and kicked these ideas around with others, a few obvious changes had to be made:

First, the separate b-trees on latitude and longitude had to go. Location is 2-dimensional, and requires an appropriate index to fit. I had initially expected to use r-trees but found that quadtrees, a special case, made the most sense. These are closest in spirit to the b-tree, and unlike the r-tree each sub-index does not overlap with any other.
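
To make the non-overlapping property concrete, here is a minimal in-memory quadtree sketch in Python. It is not the code behind the published index: the class name, the bucket `capacity`, the `max_depth` limit, and the rule that a point on a dividing line goes north or east are all assumptions made for illustration. It also assumes every inserted point falls inside the root bounds.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


@dataclass
class QuadTree:
    south: float
    west: float
    north: float
    east: float
    capacity: int = 4    # assumed bucket size before a node splits
    max_depth: int = 16  # assumed limit, so duplicate points cannot split forever
    depth: int = 0
    points: List[Point] = field(default_factory=list)
    children: Optional[List["QuadTree"]] = None

    def _middle(self) -> Point:
        return (self.south + self.north) / 2, (self.west + self.east) / 2

    def insert(self, lat: float, lon: float) -> None:
        if self.children is not None:
            self._child_for(lat, lon).insert(lat, lon)
            return
        self.points.append((lat, lon))
        if len(self.points) > self.capacity and self.depth < self.max_depth:
            self._split()

    def _split(self) -> None:
        mid_lat, mid_lon = self._middle()

        def make(s: float, w: float, n: float, e: float) -> "QuadTree":
            return QuadTree(s, w, n, e, self.capacity, self.max_depth, self.depth + 1)

        # Four quadrants that tile the parent: southwest, southeast, northwest, northeast.
        self.children = [
            make(self.south, self.west, mid_lat, mid_lon),
            make(self.south, mid_lon, mid_lat, self.east),
            make(mid_lat, self.west, self.north, mid_lon),
            make(mid_lat, mid_lon, self.north, self.east),
        ]
        pending, self.points = self.points, []
        for lat, lon in pending:
            self._child_for(lat, lon).insert(lat, lon)

    def _child_for(self, lat: float, lon: float) -> "QuadTree":
        # Every point maps to exactly one quadrant, so sub-indexes never share a point.
        mid_lat, mid_lon = self._middle()
        return self.children[(2 if lat >= mid_lat else 0) + (1 if lon >= mid_lon else 0)]

    def query(self, south: float, west: float, north: float, east: float) -> List[Point]:
        # Skip any quadrant whose bounds do not touch the query box.
        if north < self.south or south > self.north or east < self.west or west > self.east:
            return []
        if self.children is None:
            return [(lat, lon) for lat, lon in self.points
                    if south <= lat <= north and west <= lon <= east]
        found: List[Point] = []
        for child in self.children:
            found.extend(child.query(south, west, north, east))
        return found
```

Because each point lives in exactly one quadrant, a box query never has to de-duplicate results, which is the practical payoff over an r-tree's overlapping rectangles.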

Second, space and time are intricately related, so a spatiotemporal index was an obvious next step. I chose an oct-tree of latitude, longitude, and time. Again, this is a simple extension of the b-tree, and provides simple answers to questions like "show all crimes that are within a mile of a given point, for the following dates..."
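
For reference, the quoted question can be spelled out as a brute-force filter. This sketch only states what a correct answer looks like; the point of the tree is to let a client skip most of these comparisons. The record layout (dictionaries with `lat`, `lon`, and `date` keys) and both function names are assumptions for illustration, and the distance uses the standard haversine formula with a mean Earth radius.

```python
import math
from datetime import date
from typing import Dict, Iterable, List

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def miles_between(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def crimes_near(crimes: Iterable[Dict], lat: float, lon: float,
                radius_miles: float, dates: Iterable[date]) -> List[Dict]:
    """Keep records on one of the given dates and within the radius of (lat, lon)."""
    wanted = set(dates)
    return [
        crime for crime in crimes
        if crime["date"] in wanted
        and miles_between(lat, lon, crime["lat"], crime["lon"]) <= radius_miles
    ]
```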

Third, I was being too literal with the indexes, insisting that traversing the trees should ultimately lead back to a link to a specific day/type listing. Although this is how a real database index might work, in the context of an index served over HTTP, a large number of transactions can be avoided by just dropping the actual data right into the index. To understand what this means, compare the CSS-styled output of the various indexes to the HTML source: the complete data for each crime is stashed in a display: none block right in the appropriate node.

Finally, my initial implementation used the binary tree lingo "left" and "right" to mark the branches in each index. I've replaced this with more obvious "before", "after", "north", "south", "east", and "west" for greater ease of human-readability and consumption.
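
As a rough illustration of how the six labels combine, each node of the latitude/longitude/time tree splits into eight children, and each child can be named by one word per axis. The function below is a sketch: the name `octant` is made up, and sending ties to "north", "east", and "after" is an assumption, since the published index may resolve them differently.

```python
from typing import Tuple


def octant(point: Tuple[float, float, float],
           center: Tuple[float, float, float]) -> Tuple[str, str, str]:
    """Name the child of `center` that holds `point`; both are (latitude, longitude, time)."""
    lat, lon, when = point
    center_lat, center_lon, center_when = center
    return (
        "north" if lat >= center_lat else "south",
        "east" if lon >= center_lon else "west",
        "after" if when >= center_when else "before",
    )
```

Time can be any comparable value here, such as a Unix timestamp; the three comparisons are independent, which is what makes the oct-tree read like three b-tree steps taken at once.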

I'm still hosting the data on Amazon's S3, but a recent billing change is making me re-think the wisdom of doing this:

New Pricing (effective June 1st, 2007): $0.01 per 1,000 PUT or LIST requests, $0.01 per 10,000 GET and all other requests.

Eep.

In one week, S3 is going to go from a sensible storage/hosting platform for data consisting of many tiny resources, to one optimized for data consisting of fewer, chunkier resources; think movies instead of tiles. I can see the logic behind this: S3's processing overhead for serving a million 1KB requests must be substantial compared to serving a thousand 1MB requests. Still, it makes my strategy of publishing these indexes as large collections of tiny files, many of which will never be accessed, start to seem a bit problematic.
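
The arithmetic behind that comparison, as a small Python sketch. It covers only the per-request fees quoted above and leaves out storage and bandwidth charges; the function name is made up.

```python
from fractions import Fraction

# Per-request prices from the announcement, in dollars.
PUT_PRICE = Fraction(1, 100) / 1_000    # $0.01 per 1,000 PUT or LIST requests
GET_PRICE = Fraction(1, 100) / 10_000   # $0.01 per 10,000 GET and other requests


def request_cost(puts: int, gets: int) -> Fraction:
    """Request fees only; storage and transfer are billed separately."""
    return puts * PUT_PRICE + gets * GET_PRICE


tiny = request_cost(puts=0, gets=1_000_000)  # a million 1KB requests
chunky = request_cost(puts=0, gets=1_000)    # a thousand 1MB requests
print(float(tiny), float(chunky))            # 1.0 0.001
```

Both cases move roughly a gigabyte, but the tiny-file version pays a thousand times more in request fees. Publishing hurts more than reading: uploading 100,000 small index files costs as much in PUT fees as serving a million of them.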

The obvious answer is to stash them on the filesystem, which I plan to do. However, there is one feature of S3 that I'm going to miss: when publishing data to their servers, any HTTP header starting with "X-AMZ-Meta-" got to ride along as metadata, allowing me to easily implement a variant of mark and sweep garbage collection when posting updates to the indexes. This made it tremendously easy to simulate atomic updates by keeping the entire index tree around for at least 5 minutes after a replacement tree was put in place, a benefit for slow clients.
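
A sketch of the general shape of that mark-and-sweep idea, with a plain dictionary standing in for the bucket. Everything here is an assumption made for illustration: the `generation` metadata field, the function names, and the exact sweep rule are not taken from the real update script.

```python
from typing import Dict, List

GRACE_SECONDS = 5 * 60  # keep the old tree around for at least five minutes


def publish(store: Dict[str, dict], tree: Dict[str, str], generation: int) -> None:
    """Mark: write every resource in the new tree, stamped with its generation."""
    for key, body in tree.items():
        store[key] = {"body": body, "meta": {"generation": generation}}


def sweep(store: Dict[str, dict], current_generation: int,
          replaced_at: float, now: float) -> List[str]:
    """Sweep: once the grace period has passed, delete whatever was not re-marked."""
    if now - replaced_at < GRACE_SECONDS:
        return []
    stale = [key for key, resource in store.items()
             if resource["meta"]["generation"] != current_generation]
    for key in stale:
        del store[key]
    return stale
```

On S3 the `meta` dictionary would ride along as "X-AMZ-Meta-" headers on each upload; on a plain filesystem the same bookkeeping needs a home somewhere else, such as a sidecar file.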

When I move the index to a non-S3 location before my Amazon-imposed June 1st deadline, I will no longer have the benefit of per-resource metadata to work with.

I'm de-cloaking for a moment here to mention John Battelle's excellent Data Bill of Rights, published about a month ago. It popped into relief again for me with the announcement of Google's purchase of FeedBurner, and all the RSS traffic data that rides along.

The rights, enumerated:

Data Transparency. We can identify and review the data that companies have about us.

Data Portability. We can take copies of that data out of the company's coffers and offer it to others or just keep copies for ourselves.

Data Editing. We can request deletions, editing, clarifications of our data for accuracy and privacy.

Data Anonymity. We can request that our data not be used, cognizant of the fact that that may mean services are unavailable to us.

Data Use. We have rights to know how our data is being used inside a company.

Data Value. The right to sell our data to the highest bidder.

Data Permissions. The right to set permissions as to who might use/benefit from/have access to our data.

I like where this is going, but I believe that it's a bit toothless unless the ownership of that data is clarified. As long as the legal owner of personal data is assumed to be the company in possession (Google, FeedBurner, Facebook, etc.), the enumerated rights will be considered the responsibility of P.R. and marketing. If it were somehow possible to push the bill of rights into the legal department, this idea would gain some serious traction. It would also have the possibly-beneficial side effect of depressing valuations for data collection companies like FeedBurner or DoubleClick, or even Google itself. It might also have a similar effect on the financial world, giving companies such as ChoicePoint a well-deserved kick in the teeth.

The Organization Man is a classic by William Whyte that was first recommended to me by Abe almost two years ago. It took me until just recently to pop it off my Amazon stack and give it a read. It's a major critical investigation of American society in the 1950's, written from deep inside that era in 1956. Whyte covers work, education, religion, and suburbia in his sharp description of what he believes to be a problematic development: the post-war emergence and celebration of groupthink and conformity in all forms of corporate and organization life.

Half the fun of this book is Whyte's sharp prose. He has a lot of data, a good eye for observation, and a clear opinion he's not interested in holding back.

Sorry that this is kind of a long post, but it's a great read full of worthwhile passages.

Page 19, on the qualitative difference between small business and the corporation:

Out of inertia, the small business is praised as the acorn from which a great oak may grow, the shadow of one man that may lengthen into a large enterprise. Examine businesses with 50 or less employees, however, and it becomes apparent the sentimentality obscures some profound differences. ... The great majority of small business firms cannot be placed on any continuum with the corporation. For one thing, they are rarely engaged in primary industry; for the most part they are the laundries, the insurance agencies, the restaurants, the drugstores, the bottling plants, the lumber yards, the automobile dealers. They are vital, to be sure, but they essentially service an economy; they do not create new money within their area and they are dependent ultimately on the business and agriculture that does.

Page 34, on Hawthorne and economic man:

In the literature of human relations the Hawthorne experiment is customarily regarded as a discovery. In large part it was; more than any other event, it dramatized the inadequacy of the purely economic view of man.

Page 35, on social discipline:

In the Middle Ages people had been disciplined by social codes into working together. The Industrial Revolution, as Mayo described the consequences, had split society into a whole host of conflicting groups. Part of a man belonged to one group, part to another, and he was bewildered; no longer was there one group in which he could sublimate himself. The liberal philosophers, who were quite happy to see an end to feudal belongingness, interpreted this release from the group as freedom. Mayo did not see it this way. To him, the dominant urge of mankind is to belong: "Man's desire to be continuously associated in work with his fellows," he states, "is a strong, if not the strongest, human characteristic."

Page 78, on education:

How did he get that way? His elders taught him to be that way. In this chapter I am going to take up the content of his education and argue that a large part of the U.S. educational system is preparing people badly for the organization society - precisely because it is trying so very hard to do it. My charge rests on the premise that what the organization man needs most from education is the intellectual armor of the fundamental disciplines. It is indeed an age of group action, of specialization, but this is all the more reason the organization man does not need the emphases of a training "geared for the modern man." The pressures of organization life will teach him that. But they will not teach him what the schools and colleges can - some kind of foundation, some sense of where we came from, so that he can judge where he is, and where he is going and why.

Page 150, on executive aspirations:

We have, in sum, a man who is so completely involved in his work that he cannot distinguish between work and the rest of his life - and happy that he cannot. ... No dreams of Gothic castles or liveried footmen seize his imagination. His house will never be a monument, an end in itself. It is purely functional, a place to salve the wounds and store up energy for what's ahead. And that, he knows full well, is battle.

Pages 157-158, on the loneliness of authority:

Just when a man becomes an executive is impossible to determine, and some men never know just when the moment of self-realization comes. But there seems to be a time in a man's life - sometimes 30, sometimes as late as 45 - when he feels that he has made the irrevocable self-commitment. At this point he is going to feel a loneliness he never felt before. If he had the toughness of mind to get this far he knows very well that there are going to be constant clashes between himself and his environment, and he knows that he must often face these clashes alone. His home life will be shorter and his wife less and less interested in the struggle. In the midst of the crowd at the office he will be isolated - no longer intimate with the people he has passed and not yet accepted by the elders he has joined.

Pages 194-195, on personality testing:

Few test takers can believe the flagrantly silly statement in the preamble to many tests that there are "no right or wrong answers." There wouldn't be much point in the company's giving the test if some answers weren't regarded as better than others. Telling the truth about yourself is difficult in any event. When someone is likely to reward you if you give answers favorable to yourself the problem of whether to tell the truth becomes more than insuperable; it becomes irrelevant.

"Do you daydream frequently?" In many companies a man either so honest or so stupid as to answer "yes" would be well advised to look elsewhere for employment.

Pages 196-198, on strategies for personality tests:

When in doubt about the most beneficial answer to any question, repeat to yourself: I loved my father and my mother, but my father a little bit more. I like things pretty much the way they are. I never worry much about anything. I don't care for books or music much. I love my wife and children. I don't let them get in the way of company work.

Jacques Barzun says in his Teacher in America, "I have kept track for some ten years of the effects of such tests on the upper half of each class. The best men go down one grade, and the next best go up. It is not hard to see why. The second-rate do well in school and in life because of their ability to grasp what is accepted and conventional. ... But first-rate men are rarer and equally indispensable. ... To them, a ready-made question is an obstacle. It paralyzes thought by cutting off all connections but one. ... Their minds have finer adjustments, more imagination, which the test deliberately penalizes as encumbrances."

Pages 208-209, on pure vs. applied research:

The failure to recognize the value of purposelessness is the starting point of industry's problem. To the managers and engineers who set the dominant tone in industry, purposelessness is anathema, and all their impulses incline them to highly planned, systematized development in which the problem is clearly defined. ... In pure research, however, half the trick is finding out that there is a problem - that there is something to explain. The culture dish remained sterile when it shouldn't have. The two chemicals reacted differently this time than before. Something has happened and you don't know why it happened - or if you did, what earthly use would it be? By its very nature, discovery has an accidental quality. Methodical as one can be in following up a question, the all-important question itself is likely to be a sort of chance distraction of the work at hand. At this moment you neither know what practical use the question could lead to nor should you worry the point. There will be time enough later for that; and in retrospect, it will be easy to show how well planned and systematized the discovery was all along.

Page 250, on the organization man in fiction:

But this does not mean that our fiction has become fundamentally any less materialistic. It hasn't, it's just more hypocritical about it. Today's heroes don't lust for big riches, but they are positively greedy for the good life. This yen, furthermore, is customarily interpreted as a renunciation of materialism rather than as the embrace of it that it actually is. ... After making his spurious choice between good and evil, the hero heads for the country, where, presumably, he is now to find the real meaning in life. Just what this meaning will be is hard to see; in the new egalitarianism of the market place, his precipitous flight from the bitch goddess success will enable him to live a lot more comfortably than the ulcerated colleagues left behind, and in more than one sense, it's the latter who are less materialistic. Our hero has left the battlefield where his real fight must be fought; by puttering at a country newspaper and patronizing himself into a native, he evades any conflict, and in the process manages to live reasonably high off the hog. There's no Cadillac, but the Hillman Minx does pretty well, the chickens are stacked high in the deep freeze, and no doubt there is a hi-fi set in the table which he and his wife have converted. All this may be very sensible, but it's mighty comfortable for a hair shirt.

Page 279, on transience or purpose:

Their allegiance is more to The Organization itself than to any particular one, for it is in the development of their professional techniques, not in ideology, that they find continuity - and this, perhaps, is one more reason why managerial people have not coalesced into a ruling class. "They have not taken over the governing functions," Max Lerner has pointed out, "nor is there any sign that they want to or can. They have concentrated on the fact of their skills rather than the uses to which their skills are put. The question of the cui bono the technician regards as beyond his technical competence."

Page 282, on suburbia:

Looking at the real estate situation right after the war, a group of Chicago businessmen saw that there was a huge population of young veterans, but little available housing suitable for people with (1) children, (2) expectations of transfer, (3) a taste for good living, (4) not too much money. Why not, the group figured, build an entire new community from scratch for these people?

Page 302, on anomalies in the suburbs:

One court was thoroughly confounded by the arrival of a housewife who was an ex-burlesque stripper and, worse yet, volubly proud of the fact. She never learned, and the collision between her breezy outlook and the family mores of the court was near catastrophic. "They're just jealous because I'm theatrical folk," she told an observer, as she prepared to depart with her husband in a cloud of smoke. "All these wives think I want their husbands. What a laugh. I don't even want my own. The bitches." The court has never been quite the same since.

Page 335, on the roots of soul and the importance of initial conditions:

It is much the same question as why one city has a "soul" while another, with just as many economic advantages, does not. In most communities the causes lie far back in the past; in the new suburbia, however, the high turnover has compressed in a few years the equivalent of several generations. Almost as if we were watching stop-action photography, we can see how traditions form and mature and why one place "takes" and another doesn't. Of all the factors, the character of the original settlers seems the most important. In the early phase the impact of the strong personality, good or otherwise, is magnified.

Page 343, on the communications value of children:

With their remarkable sensitivity to social nuance, the children are a highly effective communication net, and parents sometimes use them to transmit what custom dictates elders cannot say face to face. "One newcomer gave us quite a problem in our court," says a resident in an eastern development. "He was a Ph.D., and he started to pull rank on some of the rest of us. I told my kid he could tell his kid that the other fathers around here had plenty on the ball. I guess we fathers all did the same thing; pretty soon the news trickled upwards to this guy. He isn't a bad sort; he got the hint - and there was no open break of any kind."

Pages 359-360, on the downside of group activity:

Perhaps the greatest tyranny, however, applies not to the deviate but to the accepted. The group is a jealous master. It encourages participation, indeed, demands it, but it demands one kind of participation - its own kind - and the better integrated with it a member becomes the less free he is to express himself in other ways.

Page 362, on the tyranny of involvement:

Well? Fromm might as well have cited Park Forest again. One must be consistent. Park Foresters illustrate conformity; they also illustrate very much the same kind of small group activity Fromm advocates. He has damned an effect and praised a cause. More participation may well be in order, but it is not the antidote to conformity; it is inextricably related with it, and while the benefits may well outweigh the disadvantages, we cannot intensify the former and expect to eliminate the latter. There is a true dilemma here. It is not despite the success of their group that Park Foresters are troubled but partly because of it, for that much more do they feel an obligation to yield to the group. And to this problem there can be no solution.

Is there a middle way? A recognition of this dilemma is the condition of it. It is only part of the battle, but unless the individual understands that this conflict of allegiances is inevitable he is intellectually without defenses. And the more benevolent the group, the more, not the less, he needs these defenses.

The small demo at that second link above hooks up to a quick database-driven web service written in PHP, and making it live drove home the point that hosting live databases is tedious and unsatisfying.

Meanwhile, Tom Coates is drumming away about natives to a web of data, Matt Biddulph is telling information architects about RDF and API's, and Mark Atwood is releasing S3-backed MySQL storage engines. Putting these threads together suggests an interesting, or at least more durable, way of publishing pure data on the web. The MySQL engine is an interesting stake in the ground, but it hides its data and its index (the two primary components of a relational database) behind the usual MySQL server process. The contents of storage aren't open to data consumers, ditching many of the cost and scale advantages of a service like S3 by piping it all through your annoying old DB server. Tom and Matt already have the data-on-the-web bit covered, so I'm going to do something about the index.

Indexes to a database table are exactly what they are to anything else: a faster way to look up information than scanning through it all in order. It's how you jump straight to the "M's" in the phone book without a lot of paging back and forth. The most popular style of index is something called a binary tree. Imagine looking for a particular word in the dictionary: you open the book up to some page in the middle of the book, check to see whether your word is before, on, or after the current page, and then move back and forward in the book in large chunks of pages until you've found what you're searching for. This is generally much faster than starting at "A" and turning single pages to find your word. A binary tree works the same way.
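
The dictionary lookup described above, sketched in Python as a binary search over a sorted list, which is the same halving idea that a binary tree stores as explicit branches. The function name and the step counter are illustrative additions, not part of any published code.

```python
from typing import List, Optional, Tuple


def binary_search(sorted_words: List[str], target: str) -> Tuple[Optional[int], int]:
    """Return (position or None, number of pages opened) for target in a sorted list."""
    low, high = 0, len(sorted_words) - 1
    steps = 0
    while low <= high:
        steps += 1
        middle = (low + high) // 2           # open the book in the middle
        if sorted_words[middle] == target:
            return middle, steps
        if sorted_words[middle] < target:
            low = middle + 1                 # the word is later in the book
        else:
            high = middle - 1                # the word is earlier in the book
    return None, steps
```

Each step halves the remaining pages, so a list of a million entries needs at most 20 comparisons, where a front-to-back scan could need a million.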

Indexes are rarely exposed, even on good web-of-data citizens. Both Flickr and Twitter make it somewhat difficult to move through giant lists, though not any more difficult than other sites. Meanwhile, the databases quietly running these services are wildly denormalized and indexed like crazy, making it possible to rapidly generate those long, long lists.

For the crime reports, I started by just getting the data up and public. It's at predictable URL's, like these:

If you are looking for crimes on a particular date with a particular type, you just ask for a guessable URL. This is in effect the primary key: the natural, internal storage format for the data. Most common types of crime happen on most days, so the majority of date/type combinations should Just Work, and a simple HTTP 4XX error tells you when there is no match. I've chosen to publish in XHTML format for two reasons: the markup is highly semantic, making it simultaneously machine-readable and human-readable. Realistically, I'll be adding JSON and POX pages soon.
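
A sketch of what "guessable" means in practice. The pattern is inferred from the one filename this post quotes, Oakland-2007-02-22-ROBBERY.html, so treat it as an assumption; how multi-word crime types are spelled and where the files are hosted are not covered here, and the function name is made up.

```python
from datetime import date


def resource_name(day: date, crime_type: str) -> str:
    """Build the file name for one day/type listing, using the inferred pattern."""
    return f"Oakland-{day.isoformat()}-{crime_type.upper()}.html"
```

A client would append this name to the data's base URL, request it, and treat a 4XX response as "no crimes of that type on that day".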

Unfortunately, if you're looking for a particular case number, or crimes at a particular location, it would require hunting through every page of crimes. In database terms, this is known as a table scan, and is something to be avoided at all costs. Instead, I've created a set of indexes to the data, demonstrating the key trade-off: an index helps you find what you want, but takes space to store and time to calculate. Following the Case Number link above takes you to a page with a long, nested list on it, a binary search tree. The idea is that you enter looking for a particular case number or range of case numbers. You start by comparing the one you want to the one at the top of the page. If they match, you're done. If yours is smaller, you proceed to the first nested list. If it's larger, you proceed to the second. Eventually, you arrive at the number you want and get back a pointer to one of the date/type pages above where that particular case number can be found. For example, searching for case number 07-015248 gets you Oakland-2007-02-22-ROBBERY.html.
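
The walk described above can be sketched in a few lines of Python. The `Node` layout is an assumption for illustration (the published index is nested XHTML lists, not Python objects), and the comparison relies on case numbers being fixed-width strings so that plain string ordering matches numeric ordering.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    case_number: str
    page: str                         # pointer to a date/type page
    smaller: Optional["Node"] = None  # first nested list
    larger: Optional["Node"] = None   # second nested list


def find_page(node: Optional[Node], wanted: str) -> Optional[str]:
    """Follow smaller/larger branches until the case number matches or the tree runs out."""
    while node is not None:
        if wanted == node.case_number:
            return node.page
        node = node.smaller if wanted < node.case_number else node.larger
    return None
```

A range query works the same way, except that the walk keeps descending into both branches whenever the current case number falls inside the range.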

I've also chosen to use b-trees for latitude and longitude, but these will soon be replaced: r-trees are a similar structure better suited to two-dimensional information, and are used by geographic systems such as PostGIS.

In a database, this link-following and tree-climbing process happens very quickly on a single server, ideally in RAM with a minimal number of disk hits. In the scheme I use, a lot of the processing overhead is offloaded to smarter clients: Flash or Ajax apps that know they're looking at an index, and understand a thing or two about traversing data structures. Disk access is replaced by network access. The information is chunkier (longer lists, fewer requests) to minimize network overhead as much as possible, but it's certainly not going to be as speedy as a connection to a real database. There's a short list of reasons to do this:

A "database" that offers nothing but static file downloads will likely be more scalable than one that needs to do work internally. This architecture is even more shared-nothing than systems with multiple database slaves.

Not needing a running process to serve requests makes publishing less of a headache.

The particular data involved is well-suited to this method. A lot of current web services are optimized for heavy reads and infrequent writes. Often, they use a MySQL master/slave setup where the occasional write happens on one master database server, and a small army of slaves along with liberal use of caching makes it possible for large numbers of concurrent users to read. Here, we've got infrequently-updated information from a single source, and no user input whatsoever. It makes sense for the expensive processing of uploading and indexing to happen in one place, about once per day.

I'm reasonably happy with this so far, but I haven't yet written a smart client to take advantage of it. The near-term plan is to replace the two latitude/longitude indexes with a single spatial index, and then revisit the whole thing after I have an idea of how complicated it is to consume.

We gave a talk about our work at SOM on Tuesday, and in return they offered a tour of the Oakland Cathedral construction site. This was a special treat for Gem and me, because we live a few blocks away from the site, and have been jealously plotting to sneak in ever since they broke ground last year.

A few things we learned: the Cathedral sits atop a crypt, whose contents are a major source of revenue for the Diocese. They refer to the spaces they sell as "product". There is space planned for an organ, but organ design is something that has to take place after the space is built, because acoustics are so touchy. Fortunately, their organ designer happens to live in Oakland. The reliquary itself is seismically isolated from the ground below and the remainder of the site, and is spec'd to stand for 300 years.

Shock of the Old is a technology book by David Edgerton that focuses on use in favor of invention, illustrated with examples of under-the-radar technologies (e.g. corrugated iron, DDT, etc.) that make a larger social impact than more visible, highly-touted inventions. These are a few interesting passages I've marked.

Pages 75-76:

As one philosopher of technology noted in the 1970s: "In almost no instance can artificial-rational systems be built and left alone. They require continued attention, rebuilding, and repair. Eternal vigilance is the price of artificial complexity." He noted too, that in a technological age we should ask not who governs, but what governs: "government becomes the business of recognising what is necessary and efficient for the continued functioning and elaboration of large-scale systems and the rational implementation of their manifest requirements."

Page 83:

So concerned were Ford with maintenance and repair that they investigated and standardised repair procedures, which were incorporated into a huge manual published in 1925. ... However, this plan did not work - it could not cope with the many vicissitudes and uncertainties of the car-repair business. The Fordisation of maintenance and repair, even of the Model T, did not work. As the British naval officer in charge of ship construction and maintenance in the 1920's put it: "repair work has no connection with mass-production."

Page 89, on jet engines:

Typically, there is at first a slight rise (because of unanticipated problems) and then a fall over ten years to 30 per cent of the original maintenance cost. This is due to increasing confidence in the engine itself and increasing knowledge of what needs maintenance. In other words, the maintenance schemes, programmes, and costs are not programmable in advance. In these complex systems a great infrastructure of documentation, control, and surveillance is needed, and yet informal, tacit knowledge remains extremely important.

Pages 114-115:

In the early 1930's there were all sorts of suggestions for the creation of an "international air police" along these lines, and similar thinking continued into the 1940's, usually with the British and Americans as that international police force. In more recent years the atomic bomb, television, and above all the internet and world-wide web have featured in this kind of techno-globalism. As we have seen, it was generally the older technologies which were crucial to global relations - today's globalisation is in part the result of extremely cheap sea and air transport, and radio and wire-based communications.

Page 169, on food production and slaughterhouses:

To understand the uniqueness and significance of these reeking factories of death, it is illuminating to cross ... the Mediterranean a century later, against a new tide of migration into Europe. In late twentieth-century Tunisia, on several main roads through the desert there were concentrations of nearly identical small buildings lining each side of the road. Tethered next to many were a few sheep; hanging from the buildings were the still fleece-covered carcasses of their cousins. For these were the butchers' shops and restaurants. As the heavy traffic roared by one could dine, on plastic tables, without plates or cutlery, on delicious pieces of lamb taken straight from the displayed cadaver and cooked on a barbecue crudely fashioned from sheet metal. Clearly this spectacle was not a left-over from the past, or the sort of thing which attracted tourists. It was something new; a drive-in barby for the Tunisian motorist and lorry-driver in a hurry.

Page 189, on belief in technical progress:

There is an old Soviet joke which goes to the heart of the issue: an inventor goes into the ministry and says: "I have invented a new button-holing machine for our clothing industry." "Comrade," says the minister, "we have no use for your machine: don't you realise this is the age of Sputnik?" Such sentiments shaped policy, not only in rockets, and not only in the Soviet Union.

Alex Bosworth created Who's Digging You?, a javascript-based app that crawls over your list of submitted stories and finds the people who've dugg them the most. It also throws in the usernames of submitters whose stories you digg the most, for good measure.

Derek Van Vliet made the Smart Digg Button, a Firefox browser extension that checks with Digg for every page you visit, and inserts a tiny display of digg counts for that URL from Digg. If this were Google, I'd be worried - the extension necessarily sends Digg a record of every page you visit, so it raises some privacy alarms. Still really neat though.

Diggest is a player that shows popular videos and the Digg comments attached to them. It's the first comment-based API use I've seen, and has a great MST3K/peanut gallery feel.

Derek Van Vliet also wrote PyDigg, one of many language-specific API toolkits. I've seen others for .NET, Ruby, Java, and so on, but Python is the language closest to my heart so I'm linking to this one.