Saturday, January 28, 2012

This is a bug that I introduced into Derby in October, 2009, when I re-wrote Derby's GROUP BY engine to add multi-level aggregation support (DERBY-3002).

The new bug is rather complicated, because it involves (a) multiply-scanned result sets and (b) DISTINCT aggregates.

Let's see if I can explain those two concepts, starting first with DISTINCT aggregates.

Aggregates are a feature of the SQL language. A SQL statement normally reports on individual rows in your table(s): each row in the result can be traced back to a particular row in a particular table (or to several tables, joined together). Aggregates, however, are functions which collapse the detail of individual rows of data into coarser-grained summarizations.

The classic aggregates in the core SQL language are: MIN, MAX, COUNT, SUM, and AVG, and these have been part of the language for decades. Most DBMS implementations add additional aggregate functions, but it's sufficient for our purposes to consider the basic five.

An example of an aggregate query would be to figure out the average mileage of flights between San Francisco and San Antonio:
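Such a query might look like the following sketch. The `flights` table, its sample data, and the airport codes are my assumptions for illustration (`flight_miles` and `departure_day_of_week` are the column names used later in this post); I've wrapped it in sqlite3 so it can be run directly:

```python
import sqlite3

# Hypothetical flights table; the table name and columns are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE flights
                (origin TEXT, destination TEXT,
                 departure_day_of_week TEXT, flight_miles REAL)""")
conn.executemany("INSERT INTO flights VALUES (?, ?, ?, ?)",
                 [("SFO", "SAT", "FRI", 1485.0),
                  ("SFO", "SAT", "MON", 1502.0),
                  ("OAK", "SAT", "MON", 1577.0)])  # not SFO: excluded by WHERE

# AVG collapses all the matching detail rows into one summary value.
(avg_miles,) = conn.execute(
    "SELECT AVG(flight_miles) FROM flights "
    "WHERE origin = 'SFO' AND destination = 'SAT'").fetchone()
print(avg_miles)  # 1493.5
```

Note that the result is a single row, no matter how many flights matched the WHERE clause.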

When performing these sorts of aggregate queries, it's often quite useful to break things down by grouping the results based on some other fields. For example, suppose we wanted to know whether the distance flown tends to vary depending on the day of the week (do weekend flights get to take a shorter flight path?):
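A runnable sketch of that grouped query (table name and sample data again assumed for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE flights
                (origin TEXT, destination TEXT,
                 departure_day_of_week TEXT, flight_miles REAL)""")
conn.executemany("INSERT INTO flights VALUES (?, ?, ?, ?)",
                 [("SFO", "SAT", "MON", 1502.0),
                  ("SFO", "SAT", "MON", 1498.0),
                  ("SFO", "SAT", "SUN", 1470.0)])

# GROUP BY produces one output row per distinct departure_day_of_week.
rows = conn.execute(
    "SELECT departure_day_of_week, AVG(flight_miles) FROM flights "
    "WHERE origin = 'SFO' AND destination = 'SAT' "
    "GROUP BY departure_day_of_week "
    "ORDER BY departure_day_of_week").fetchall()
print(rows)  # [('MON', 1500.0), ('SUN', 1470.0)]
```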

Executing a query such as this is performed (in Derby, as in many other databases) by sorting the data on the grouping column(s) and then processing it in order. So we'd: (a) find all the rows matching the WHERE clause, (b) pull the departure_day_of_week and flight_miles columns out of each row, (c) sort the rows by departure_day_of_week, and (d) pump all that data through the aggregation engine.

The aggregation engine would then be able to see the rows in a single pass: first all the "FRI" rows, then all the "MON" rows, then the "SAT", "SUN", "THU", and "TUE" rows, and lastly the "WED" rows. For each group of rows, the engine computes the AVG(flight_miles) in the obvious straightforward fashion (computing the sum and the number of instances, then dividing). The result of the query is 7 rows, with the average for each day.

Note that the author of the query generally includes the grouping columns in the result, to make the result easy to read.

You can GROUP BY multiple columns, but the overall principle is the same: the columns used to sort the records are exactly the columns used to group the results.

DISTINCT aggregates add some additional complexity. Suppose that, when two flights happen to have exactly the same flight_miles, we only want to count that distance once in the average; that is, we want to compute the average over the unique values of flight_miles for each particular departure_day_of_week.

(I admit, this is very contrived, but hopefully it's still obvious what we're talking about here.)

In this case, the programmer can use the word DISTINCT to specify this:
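Sketched out (sqlite3 also accepts DISTINCT inside an aggregate, so we can watch the two forms diverge; the table and data are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE flights
                (departure_day_of_week TEXT, flight_miles REAL)""")
# Two MON flights share the same flight_miles; DISTINCT counts it once.
conn.executemany("INSERT INTO flights VALUES (?, ?)",
                 [("MON", 1500.0), ("MON", 1500.0), ("MON", 1506.0)])

plain = conn.execute(
    "SELECT AVG(flight_miles) FROM flights "
    "GROUP BY departure_day_of_week").fetchone()[0]
distinct = conn.execute(
    "SELECT AVG(DISTINCT flight_miles) FROM flights "
    "GROUP BY departure_day_of_week").fetchone()[0]
print(plain, distinct)  # 1502.0 1503.0
```

The plain average divides 4506 by three rows; the DISTINCT average divides 3006 by the two unique values.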

This adds a small additional wrinkle to the query execution, as we need to ensure that we consider each unique combination of departure_day_of_week and flight_miles only once.

Derby accomplishes this by including the DISTINCT aggregate column in the sorting process, so that the rows are sorted by the compound key (departure_day_of_week, flight_miles). This way, all the rows which contain duplicate values for this pair of columns arrive together, and we can consider only the first such row and discard the others.
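The mechanism, in miniature: sort on the compound key, then drop any row whose key equals its predecessor's. (A sketch of the idea, not Derby's actual code.)

```python
# Rows as (departure_day_of_week, flight_miles) pairs.
rows = [("MON", 1500.0), ("FRI", 1502.0), ("MON", 1506.0),
        ("MON", 1500.0), ("FRI", 1502.0)]

# Sort by the compound key (grouping column, DISTINCT aggregate column)...
rows.sort()

# ...so duplicates arrive together, and we keep only the first of each run.
unique = [r for i, r in enumerate(rows) if i == 0 or r != rows[i - 1]]
print(unique)  # [('FRI', 1502.0), ('MON', 1500.0), ('MON', 1506.0)]
```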

When I re-wrote the GROUP BY engine as part of DERBY-3002, I considered this problem and implemented support for it, but I made a mistake. Note that, above, I observed that "the columns used to sort the records are exactly the columns used to group the results". However, with DISTINCT aggregates, this isn't precisely true, as there is one extra column in the sort key (the DISTINCT aggregate column) which isn't used to group the results, just to sort them.

In DERBY-3002, I handled that special case by removing the DISTINCT aggregate column from the sort key after the sort, but before the grouping pass.

This worked, but ...

For most queries, the natural flow of query execution processes each intermediate result once. This is efficient and so the database works hard to do this whenever possible.

However, there are some cases in which an intermediate result must be processed multiple times. One example is the so-called "Cartesian Product" query:

SELECT a.*, b.* FROM a, b

In this query, the database produces all possible combinations of rows from tables a and b: for each row in table a, and each row in table b, there is a row in the result containing those values from a's row and b's row.

In such a query, Derby uses a brute force technique and simply implements the query as: for each row in a, read each row of b, and emit the values.

This means that we read the inner table (table b) multiple times, once per row in table a.
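The brute-force plan, sketched in a few lines:

```python
a = [1, 2, 3]
b = ["x", "y"]

inner_scans = 0
result = []
for row_a in a:           # outer table: scanned once
    inner_scans += 1      # inner table: re-scanned for every outer row
    for row_b in b:
        result.append((row_a, row_b))

print(len(result), inner_scans)  # 6 rows out, 3 scans of b
```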

This was where my bug came in: it is possible that the inner table is a GROUP BY query of its own, which includes a DISTINCT aggregate.

When this happens, my code that removed the DISTINCT aggregate column from the sort key was causing problems. The first time we read the inner table, everything was fine, but then the next time, the ordering/grouping/sorting columns were all messed up (I removed the column from the key, but didn't add it back for the next go-round).
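A query with roughly the troublesome shape (hypothetical, for illustration only) joins some table against a grouped DISTINCT-aggregate subquery, forcing that inner result to be produced once per outer row. A correct engine, of course, returns the same grouped values on every rescan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER)")
conn.execute("CREATE TABLE b (grp TEXT, val REAL)")
conn.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO b VALUES (?, ?)",
                 [("g1", 10.0), ("g1", 10.0), ("g2", 30.0)])

# The inner GROUP BY / DISTINCT-aggregate result is (conceptually)
# re-read for each row of table a.
rows = conn.execute(
    "SELECT a.x, t.grp, t.avg_val FROM a, "
    "(SELECT grp, AVG(DISTINCT val) AS avg_val FROM b GROUP BY grp) t "
    "ORDER BY a.x, t.grp").fetchall()
print(rows)
```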

In DERBY-5584, you can see a clear and simple example of a query which demonstrates the bug.

The fix, I believe, is that instead of physically removing the column from the sort key, we need to teach the grouping logic that it may need to consider only a leading subset of the sort key's columns as the grouping columns.

I dug this great post from Cliff Mass about the state of modern weather observation instrumentation.

As I read it, I was remembering a discussion I'd had with my sailing friend Nick about whether the real-time wind reports at http://www.sailflow.com could/should include information from the boats themselves (to start with, from large commercial vessels or government vessels such as the USCG small boats, but eventually might each sailboat itself report on the wind conditions it's observing)?

So it was a kick to read about how large-scale weather forecasting in fact does include such information, and to come across this subtle observation about the feedback loop that occurs with that process:

As part of the Volunteer Observing System (VOS), mariners take observations every six hours. The light blue dots on this chart show where ships were reporting at one particular time:

...

[But] as forecasts get better, the ships avoid the areas we really need data---in the middle and near major storms. When forecasts were bad, ships would get trapped in dangerous conditions--now they can get out of the way.

Technology improves, and the way that humans use that technology improves as well, and the whole inter-connected world evolves.

Thursday, January 26, 2012

All right, I know, I said I was not writing about these topics, but, really, have you read this report in today's New York Times about the conditions in the Chinese factories that build the world's high-tech gadgetry?

Those accommodations were better than many of the company’s dorms, where 70,000 Foxconn workers lived, at times stuffed 20 people to a three-room apartment, employees said. Last year, a dispute over paychecks set off a riot in one of the dormitories, and workers started throwing bottles, trash cans and flaming paper from their windows, according to witnesses. Two hundred police officers wrestled with workers, arresting eight. Afterward, trash cans were removed, and piles of rubbish — and rodents — became a problem.

It's a long and detailed article; you need to read it all. But it concludes with this observation:

“You can either manufacture in comfortable, worker-friendly factories, or you can reinvent the product every year, and make it better and faster and cheaper, which requires factories that seem harsh by American standards,” said a current Apple executive.

“And right now, customers care more about a new iPhone than working conditions in China.”

These meetings can lead the company to move dozens of jobs to another country or, in some cases, to create new jobs in the U.S. When Standard decided to increase its fuel-injector production, it chose to do that in the U.S., and staffed up accordingly (that’s how Maddie got her job). Standard will not drop a line in the U.S. and begin outsourcing it to China for a few pennies in savings. “I need to save a lot to go to China,” says Ed Harris, who is in charge of identifying new manufacturing sources in Asia. “There’s a lot of hassle: shipping costs, time, Chinese companies aren’t as reliable. We need to save at least 40 percent off the U.S. price. I’m not going to China to save 10 percent.” Yet often, the savings are more than enough to offset the hassles and expense of working with Chinese factories. Some parts—especially relatively simple ones that Standard needs in bulk—can cost 80 percent less to make in China.

These are complex issues. I'm pleased that journalists are taking the time to really dig into them, and to help educate us all, for as Dan Lyons said: "The problem I’m having isn’t with Apple, but with me."

Wednesday, January 25, 2012

First, do you know where, I mean exactly where, the Web was invented? In case you don't, David Galbraith helps us track down the precise location, complete with a Google-mapped satellite view and a short interview with TBL himself:

I wrote the proposal, and developed the code in Building 31.
I was on the second (in the European sense) floor, if you come out of the elevator (a very slow freight elevator at the time anyway) and turn immediately right you would then walk into one of the two offices I inhabited. The two offices (which of course may have been rearranged since then) were different sizes: the one to the left (a gentle R turn out of the elevator) benefited from extra length as it was by neither staircase nor elevator.
The one to the right (or a sharp R turn out of the elevator) was shorter and the one I started in. I shared it for a long time with Claude Bizeau.
I think I wrote the memo there.

I love the photo of the nice young fellow who currently occupies the office...!

Secondly, this is all-too-faddish, but you've got to check out this delightful video full of the latest Silicon Valley cliches.

"Who has a party in Palo Alto?"

...

"It's like Pandora ... for cats!"

Great stuff, and it really is quite accurate and well-written.

Lastly, this brought back great memories of my time in New England just after we got married. We lived in Boston (Brighton, to be precise); I worked in Kendall Square in Cambridge; many of my friends switched jobs and worked at "the old Lotus building", which at the time was the new Lotus building, back when Lotus was its own company and was the hot new place to be; I rode my bike to and from the office along Western Avenue and through Central Square; and I went to grad school 5 stops down the Red Line at UMass Boston (while my wife was taking a film studies class at BU).

I miss those days in Boston.

But the point of Feld's essay, and the point, really, of all three essays/films, is that location matters.
It really does. I loved my time in Boston, and even though Brad Feld is right that East Cambridge is one of the most remarkable and wonderful spots in the country, the Bay Area is even better, which is why we (reluctantly) switched coasts in 1988.

Medfield is a credible SoC for smartphones and is good enough to begin the process of building vendor and carrier relationships for Intel. This is particularly true, given Intel’s attractive roadmap. In 2013, Intel will ship a 22nm FinFET SoC with the new, power-optimized Silvermont CPU and the recently announced PowerVR Series 6 graphics. The rest of the world will ramp 20/22nm in 2014 at the earliest, a gap of 6-12 months. Judging by Intel’s plans for 14nm SoCs based on the Airmont CPU core in 2014, this process technology advantage is only likely to grow over time. Whether that advantage will yield a significant smart phone market share for Intel is uncertain, but Medfield clearly demonstrates that it is possible.

Even if Intel doesn't succeed, their continued presence in the marketplace and competition for market share will spur the big players (Qualcomm, TI, ARM, etc.) to improve their own systems, not just rest on their laurels.

Why here? This mailing list is the best approximation of the HTTP community; it has participation (or at least presence) from most implementations, including browsers, servers, intermediaries, CDNs, libraries, tools, etc. I firmly believe that as HTTP evolves, it needs to accommodate the entire community, not just the selected needs of a subset, so rather than creating a new WG or having a private collaboration, it should happen here.

This won't be easy, but it's great to see them trying. I suspect that these topics are some of what they'll be talking about.

And if you don't have enough to read yet, here's a very nice, compact, and well-rounded Distributed Systems Reader. As it turns out, I was already familiar with all but two of these papers, but hey! Two new papers on the foundations of Distributed Systems; what's not to like!

CAPTCHAs, of course, are those strange little pictures full of numbers and letters, generally semi-garbled or semi-distorted, that various web pages ask you to type in, in order to prove you're a human being, not a robot. The idea is to dissuade people who are writing programs to manipulate web pages that are only intended to be manipulated by human beings. The acronym stands for "Completely Automated Public Turing test to tell Computers and Humans Apart".

Do they work? Well, according to the UCSD study, they actually do work fairly well, according to the strict interpretation of their stated goal. After testing several specialized programs designed to be able to "break" CAPTCHAs, the researchers found that the automated solvers generally were not successful:

We observed an accuracy of 30% for the 2008-era test set and 18% for the 2009-era test set using the default setting of 613 iterations, far lower than the average human accuracy for the same challenges (75–90% in our experiments).

So CAPTCHAs are working, right?

Well, not so fast.

It turns out that there is an immense industry of CAPTCHA-solving, and the solvers are actual human beings, not computer programs:

there exists a pool of workers who are willing to interactively solve CAPTCHAs in exchange for less money than the solutions are worth to the client paying for their services.

These people apparently sit in front of computers for hours at a time, doing nothing but solving CAPTCHAs that are displayed in front of them by Internet-based solving services that then turn around and sell these solutions to clients willing to pay for CAPTCHA solutions:

Since solving is an unskilled activity, it can easily be sourced, via the Internet, from the most advantageous labor market—namely the one with the lowest labor cost. We see anecdotal evidence of precisely this pattern as advertisers switched from pursuing laborers in Eastern Europe to those in Bangladesh, China, India and Vietnam

How much do these people end up getting paid? Almost nothing, but still enough to attract workers:

on Jan. 1st, 2010, the average monthly payout to the top 100 earners decreased to $47.32. In general, these earnings are roughly consistent with wages paid to low-income textile workers in Asia [12], suggesting that CAPTCHA-solving is being outsourced to similar labor pools

What do the authors conclude from all of this? The answer is that you can view the whole arrangement in an economics framework:

Put simply, a CAPTCHA reduces an attacker’s expected profit by the cost of solving the CAPTCHA. If the attacker’s revenue cannot cover this cost, CAPTCHAs as a defense mechanism have succeeded. Indeed, for many sites (e.g., low PageRank blogs), CAPTCHAs alone may be sufficient to dissuade abuse. For higher-value sites, CAPTCHAs place a utilization constraint on otherwise “free” resources, below which it makes no sense to target them. Taking e-mail spam as an example, let us suppose that each newly registered Web mail account can send some number of spam messages before being shut down. The marginal revenue per message is given by the average revenue per sale divided by the expected number of messages needed to generate a single sale. For pharmaceutical spam, Kanich et al. [14] estimate the marginal revenue per message to be roughly $0.00001; at $1 per 1,000 CAPTCHAs, a new Web mail account starts to break even only after about 100 messages sent.
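The break-even arithmetic in that example, worked out:

```python
# Figures from the paper: pharmaceutical spam earns roughly $0.00001
# per message, and CAPTCHA solutions cost $1 per 1,000.
revenue_per_message = 0.00001
cost_per_captcha = 1.0 / 1000

# Messages a new account must send just to pay for its own CAPTCHA.
break_even = cost_per_captcha / revenue_per_message
print(round(break_even))  # about 100 messages
```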

It's a cold, calculating, unfeeling analysis, but it's an absolutely fascinating paper, easy to read and full of lots of examples and descriptions of the details behind this corner of the Internet. I never knew this existed, and I'm wiser now that I do.

However, I still felt the irresistible urge to go and wash my hands after learning all this. :(

Monday, January 23, 2012

In this case, the solution is to work on the language of the bills to rule out the sorts of abuses that the big Web sites fear. (And to fix the other minor point, which is that the bills won’t work. For example, they’d make American Internet companies block your access to domain names like “piracy.com,” but you’d still be able to get to them by typing their underlying numerical Internet addresses, like 197.12.34.56. In other words, anybody with any modicum of technical skills would easily sidestep the barriers.)

This is a general problem — there is a reasonable conversation to be had about sites set up for large, commercial operations that are designed to violate copyright. And because there’s a reasonable conversation to be had, Pogue (and many others) simply imagine that the core of SOPA must therefore be reasonable. Surely Hollywood wouldn’t try to suspend due process, would they? Or create a parallel enforcement system? Or take away citizen recourse if they were unfairly silenced?

I think that the overall discussion around SOPA/PIPA/etc. has been valuable and I have certainly learned a lot by following it. I hope that the discussion continues, and more importantly I hope that the discussion continues in the open: as Mike Masnick points out on TechDirt, a major problem with the legislative process here is that our representatives constructed the bill in secret, rather than via open debate.

no hearings, no debate, no discussion. It was a seven minute session that wasn't recorded or available to the public. That's a sign that the fix is in, not that the public is being represented.

Hopefully, the major lesson that everybody learns from this is that things are better when discussion and debate occur openly. Isn't that what open government is supposed to be?

Sunday, January 22, 2012

Tobias Klein's A Bug Hunter's Diary is a simple idea which Klein carries through to execution quite well.

The book is structured as a series of 7 separate chapters; each chapter relates the story of how Klein:

Searches for a vulnerability

Isolates the vulnerability

Develops a demonstration of the vulnerability

Refines the demonstrated vulnerability to produce an exploit of the bug

Each chapter follows roughly the same structure, but the particulars and details of the vulnerability and its exploit are different each time.

Along the way, Klein includes information on extremely valuable tools and techniques, such as: how to use various debuggers to observe software in action; the various types of vulnerabilities such as stack overflows, heap overruns, out-of-range data, etc; how to find sample data files to use as input sources; how and when to write your own quick-and-dirty programs to enumerate possibilities or search for weaknesses, and how to disassemble code to understand its behavior.

Klein's choice of subjects is also impressively broad. The operating systems include Windows, Solaris, Mac OS X, Linux, and iOS; the vulnerable software packages include operating systems, browsers, image processing libraries, and device drivers. This wide ranging approach might be rather overwhelming for a beginning programmer, but this book is not intended for a novice audience; as Klein states at the outset, "you should have a solid grasp of the C programming language and be familiar with x86 assembly."

Klein also provides thorough references and material for additional study; each chapter ends with detailed references and notes to enable the reader to pursue these topics more deeply. Indeed, since this sort of work is best done "hands on", Klein has, commendably, taken the time to precisely note the exact versions of the software he works with so that you can "follow along" on your own machine, setting up the vulnerable software and watching it crash, just as he did.

If you're tired of ordinary programming books, and looking for something a little different, this might be a good book to try. It's got lots of code to read, lots of bugs to understand, and lots of tools and techniques on display. Among all of this, I'm confident that you will find much to learn from, and you'll finish the book resolving, as I did, to practice these skills and improve your programming ability.

It's written in the same style as Hugo Cabret: neither graphic novel nor ordinary prose, it's something in between, with the book's story interspersed between the author's own pencil drawings. The book is doubly-spun, for it tells two inter-related stories, 70 years separated, as it goes.

Although the book touches many subjects, I was particularly drawn to its observations about how we organize our knowledge, whether it be in museums, atlases, and libraries, or just in the spaces we inhabit.

In a way, anyone who collects things in the privacy of his own home is a curator. Simply choosing how to display your things, deciding what pictures to hang where, and in which order your books belong, places you in the same category as a museum curator.

The urge to collect, to organize, and to understand is so universal; it provides a delightful underlying motif in the book.

He noticed a discarded map of the museum next to the sink. He unfolded it and read the names of the halls: Meteorites, Gems and Minerals, Man in Africa, Northwest Coast Indians, Biology of Birds, Small Mammals, Earth History. Like his mom's library, the entire universe was here, organized and waiting.

Selznick particularly captures the plight of the young mind, exposed to so much knowledge, trying to drink it all in:

Ben wished the world was organized by the Dewey decimal system. That way you'd be able to find whatever you were looking for, like the meaning of your dream, or your dad.

Of course, your dad isn't cataloged in a library anywhere; some things we each discover on our own.

What would it be like to pick and choose the objects and stories that would go into your own cabinet? How would Ben curate his own life? And then, thinking about his museum box, and his house, and his books, and the secret room, he realized he'd already begun doing it. Maybe, thought Ben, we are all cabinets of wonders.

If you've got a young reader, just setting forth on his own voyage of discovery and knowledge, reading and exploring and collecting and questioning, I think the two of you will very much enjoy reading Wonderstruck together.

Saturday, January 21, 2012

But since we don't have a firm timeline right now, we'd rather leave this open and get back to you with a definitive date soon (rather than just promise you a date that's far enough in the future that we can feel confident about it). We'll let you know a firm date as soon as we possibly can.

This code attempts to “throw unique_ptr at the problem.” Many people believe that a smart pointer is an exception-safety panacea, a touchstone or amulet that by its mere presence somewhere nearby can help ward off compiler indigestion.

Exception safety is so tricky. I'm reminded of the nest of vipers that is JDBC exception handling, in which code which might encounter a JDBC exception tries to be a good citizen and clean up its resources, but the cleanup processing itself may encounter exceptions, and these exceptions may then conceal the original problem that led to the first exception, making your application an un-diagnosable mess.
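The trap isn't specific to Java or JDBC; here's the same pattern in a minimal Python sketch, where a failure during cleanup replaces the exception that actually caused the trouble:

```python
def close_connection():
    # Cleanup itself fails (e.g. the connection is already broken).
    raise RuntimeError("close failed")

def do_work():
    try:
        raise ValueError("the real problem")   # the original error
    finally:
        close_connection()                     # raises, masking the ValueError

try:
    do_work()
except Exception as e:
    caught = type(e).__name__

print(caught)  # RuntimeError -- the original ValueError is hidden
```

The error you catch is the cleanup failure, and the real cause of the trouble is concealed behind it.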

Thursday, January 19, 2012

The diagrams are very helpful and the description is quite clear, even amusing:

Program 20 is now curled up in the corner of the computer in a fetal position. Program 10 meanwhile continues allocating memory, and Program 20, having shrunk as much as it can, is forced to just sit there and whimper.

Peter Lawrence can now comfortably boast that one of the biggest and most respected companies on Earth valued his great book at $23,698,655.93 (plus $3.99 shipping).

All kidding aside, it is in fact tremendously hard to write software that uses the available machine resources efficiently and effectively, while still being a "good citizen" when other programs are trying to use the machine, too.

I have a feeling my great-grandchildren will still be trying to design algorithms to do this well...

These things aren't fundamentally related, but they are all interesting in rather similar ways, so I'm noting them all in a single post. Certainly they each deserve much more detailed analyses; who has the time?

Optimize for extreme scale. Use scalable structures for everything. Don’t assume that disk-checking algorithms, in particular, can scale to the size of the entire file system.

Never take the file system offline. Assume that in the event of corruptions, it is advantageous to isolate the fault while allowing access to the rest of the volume. This is done while salvaging the maximum amount of data possible, all done live.

Many people remarked that ReFS seemed to be "bringing ZFS to Windows", so it's interesting to see this recent work on bringing ZFS to Mac OS X. They seem to have spent a lot of time on their web page, but it's hard to find much about the underlying technology. But they're apparently a small young company, so let's give them time.

Now, to complement your filesystems news, here's some database news:

First and most important, don't miss all the talk about Amazon's new DynamoDB:

Wednesday, January 18, 2012

I despise the following experience, which happens to me all too frequently:

I bring up a web site in my browser.

The web site starts to load, but initially it is fragmentary, with lots of hidden frames, images, etc. still left to draw.

As the various bits and pieces start to fill in, the page shifts around and adjusts, redrawing some sections, moving and re-shaping other areas.

By this time, I've already spotted the place I want to click on, to get to the next web page, so I click on it.

But by the time the computer can process my click and decide what I was clicking on, the web page has re-drawn itself, and the computer decides I've clicked on a different part of the page and takes me somewhere else.

Interestingly, this seems to happen most commonly with ads, so the result is that the computer decides I've clicked on some ad, instead of the useful link that was sitting right next to the ad. (The ad, of course, takes up lots of screen real estate, while the useful link is tiny. So the odds are much higher that the computer interprets my click as being destined for the ad.)

Of course, I didn't want to click on the ad, so I end up going "back", re-clicking on the correct link, and going to the actual page I wanted.

But it makes me wonder how many apparently "real" clicks on ads are actually sluggish-computer-misunderstood-my-click clicks, and how much that is distorting the online advertising industry.

Is it true that, in practice, most of you use AdBlock or a proxy server or something similar, so this doesn't happen to you because the ads get filtered out before they ever get to your browser?

Doctorow is an entertaining speaker, and he's delivered variants of this speech before, but he does a particularly good job with the speech this time.

He opens by observing that general-purpose computers, like other very general-purpose tools, provide a power that can be hard for policymakers to comprehend:

General-purpose computers are astounding. They're so astounding that our society still struggles to come to grips with them, what they're for, how to accommodate them, and how to cope with them.

...

The important tests of whether or not a regulation is fit for a purpose are first whether it will work, and second whether or not it will, in the course of doing its work, have effects on everything else. If I wanted Congress, Parliament, or the E.U. to regulate a wheel, it's unlikely I'd succeed. If I turned up, pointed out that bank robbers always make their escape on wheeled vehicles, and asked, “Can't we do something about this?", the answer would be “No". This is because we don't know how to make a wheel that is still generally useful for legitimate wheel applications, but useless to bad guys. We can all see that the general benefits of wheels are so profound that we'd be foolish to risk changing them in a foolish errand to stop bank robberies. Even if there were an epidemic of bank robberies—even if society were on the verge of collapse thanks to bank robberies—no-one would think that wheels were the right place to start solving our problems.

But as Doctorow goes on to explain, general-purpose computers are even more powerful than other general-purpose technologies (such as the wheel), because a general-purpose computer can, in our modern world, become any other sort of tool:

The world we live in today is made of computers. We don't have cars anymore; we have computers we ride in. We don't have airplanes anymore; we have flying Solaris boxes attached to bucketfuls of industrial control systems. A 3D printer is not a device, it's a peripheral, and it only works connected to a computer. A radio is no longer a crystal: it's a general-purpose computer, running software.

Doctorow's core point is that we need to be very careful about how we regulate computers, because the general-purpose computer is much more important than many of the social problems that have been ascribed to it so far:

Regardless of whether you think these are real problems or hysterical fears, they are, nevertheless, the political currency of lobbies and interest groups far more influential than Hollywood and big content. Every one of them will arrive at the same place: “Can't you just make us a general-purpose computer that runs all the programs, except the ones that scare and anger us? Can't you just make us an Internet that transmits any message over any protocol between any two points, unless it upsets us?"

JZ: Nothing’s inherently wrong with single-purpose devices. The worry comes when we lose the general-purpose devices formerly known as the PC and replace it with single-purpose devices and “curated” general-purpose devices.

Saturday, January 14, 2012

Over the holidays I had the chance to read Suzanne Collins's trilogy: The Hunger Games. I'm sure you know about these books; they are an international phenomenon, even if I wasn't paying attention for the first couple years of their existence.

These books are extremely powerful. I finished all three in about 10 days of fevered reading, brushing aside friends and family, staying up late and waking up early in order to continue reading, dwelling on them whenever I wasn't actually reading them. It took me several days after finishing the series before I was even willing to start thinking and reflecting about the books, and several weeks more before I was calmed down enough to start writing about them.

I can't remember the last time I was simply desperate to find out what happened next. Captivating, gripping, enthralling: these are words that describe these books well.

It's interesting that the books are published by Scholastic and positioned as "young adult" novels. Although the books are clearly written for a 12-15 year old girl, the topics that are covered are universal: war, discrimination, poverty, alcoholism, government policy, and death are all central to the story.

Certainly there are plenty of teenage girl aspects to the books (the three biggest motifs in the books are: food and our feelings about it; makeup and fashion; relationships with boys), but I never for an instant felt like I was reading "some girl book". After all, it's not like the rest of us don't care about these things.

But, reader be warned: this is not the sort of book you'll pick up for bed-time reading with your elementary school child. Rather, what you should have in mind is: "what would The Lord of the Flies be like if Stephen King wrote it, and the heroine was a 16 year old girl?"

I feel the comparison to King is apt, for, like him, Collins is a master of the craft: her characters are vivid, her dialogue is precise, and her sense of the pacing and flow of storytelling is ideal. Nothing is awkward, nothing is out of place.

And in case the phrase "young adult fiction" brings to mind watered-down vocabulary, simplified sentences, and an absence of skill and technique, banish that thought from your mind, for this is literature at the highest level. Across the planet, I'm sure that tens of thousands of literature students are hard at work on theses with topics such as: the metaphoric use of fire (has there ever been such a perfect description of puberty as "Catching Fire"?); the linguistic skill and innovation in coinages such as "muttation" (the perfect word for this creature!); and on the symbolic use of the Jabberjay as representative of the modern nation-state's control over the means of communication and the domain of social discourse.

These are amazing books, and Collins is a superb writer. Although they are not for everyone, I hope they find their way into many lives, and are the basis of many discussions and debates about man's inhumanity to man and about what is really important in life.

I know that I will be thinking about Katniss Everdeen, Peeta Mellark, Haymitch Abernathy, and Gale Hawthorne for many years to come; perhaps you will be, too!

Bill Slawski, the Search Engine Optimization expert who does business at SEO By The Sea, has recently been publishing a fascinating series of articles entitled The 10 Most Important SEO Patents.

As you'll discover as you start to read the articles, the series title isn't perfectly accurate, since these aren't necessarily all "SEO patents", but it's no matter, for the articles themselves are extremely interesting.

This provisional patent may not have the weight or legal value of the continuation patents that followed it, but it captures the excitement and personality of its inventor, Larry Page, in a manner that those patents missed. It also provides head-to-head examples of search results from both Google and AltaVista for specific queries to illustrate how the link analysis involved in what Page was doing with PageRank made a difference.

The impact of this patent continues today: it is likely responsible for the recent Google freshness update, for the possible impact on a site's rankings when the content on a page changes and the anchor text pointing to that page no longer matches up well, for whether Google might consider some pages to be doorway pages when they are purchased, links are added, and the topics of those pages change, and more.

The patent does provide us with an idea of how a search engine might understand the different blocks that it finds on a page, and use those when it indexes, analyzes, and classifies content on that page. For example, a section of a page that contains very short phrases, with each word capitalized, and each phrase a link to another page on a site, and that appears near the top of the page or in a sidebar to the left of the page, might be the main navigation for that page.

the algorithm behind the Reasonable Surfer model might determine that even though the link is prominently placed and stands out from the rest of the text in an important part of a page, the text of the link has nothing to do with the content of the rest of the page, and that text evidences a very commercial intent.

Google, Bing, and Yahoo all look for named entities on web pages and in search queries, and will use their recognition of named entities to do things like answer questions such as “where was Barack Obama born?”

These are really wonderful articles, full of detail, clearly-written, packed with examples, and loaded with link after link to chase and study.

If you are interested in how the world's top search engines work "under the covers", you won't want to miss this series. Thanks Bill for writing such a great set of posts!

Tuesday, January 10, 2012

once in a while, you investigate and find something more. The bad smell is merely a symptom of a larger issue that was otherwise unnoticeable… or, at least, unnoticed. By investigating the smell, you’ve prevented a much bigger issue from shipping

I liked the way the author took the "bad smells" metaphor and worked with it to provide some nice rules of thumb for how to refine your instincts so that you have a better sense of when to dig deeper, and when it would be more efficient to just move along.

This winter's remarkable AO/NAO pattern stands in stark contrast to what occurred the previous two winters, when we had the most extreme December jet stream patterns on record in the opposite direction (a strongly negative AO/NAO). The negative AO conditions suppressed westerly winds over the North Atlantic, allowing Arctic air to spill southwards into eastern North America and Western Europe, bringing unusually cold and snowy conditions. The December Arctic Oscillation index has fluctuated wildly over the past six years, with the two most extreme positive and two most extreme negative values on record. Unfortunately, we don't understand why the AO varies so much from winter to winter, nor why the AO has taken on such extreme configurations during four of the past six winters.

The prior winners of the award look like a fascinating list, including of course Martin Gardner, but there are a number of other authors on that list that I should follow up on. Somehow Martin Gardner was not the first winner of the award, but you certainly can't take anything away from the actual first winner (James Gleick); I've enjoyed several of his books.

It seems like Rachel from Cardholder Services is calling constantly nowadays.

Am I the only one having this problem?

This rather unhelpful story in the Los Angeles Times says that I'm not alone. About the best it can offer, though, is to suggest that you put yourself on the Do Not Call Registry even though the article notes that this doesn't work.

Apparently, though, this is far from the first time the FTC has tried to prosecute such activity. The FTC press release notes that "Cox resides in California but runs his allegedly illegal operation through multiple foreign corporations purportedly in countries such as Panama, Hungary, Argentina, and the Republic of Seychelles" which may be at least part of the reason it's hard to stop.

Well, for the time being I guess I'll just continue to hang up on these clowns :(

Saturday, January 7, 2012

The "Man in the Middle Attack" is a security vulnerability which has to do with intercepting communications without being observed. It has been around, oh, at least 500 years or so.

MITM attacks are always interesting. Did you see the recent MI (Mission Impossible) movie? There's a great MITM attack in that movie (on the 130th floor of the tallest building in the world, no less!).

In the MI movie, two different criminal organizations are conducting a business transaction, in which one organization is selling information and the other is paying money. The MI team arranges a clever deception, pretending to be the seller of the information to the buyer, and pretending to be the purchaser of the information to the seller. The trick works because neither the buyer nor the seller knows the other ahead of time, and they cannot successfully authenticate each other, so they fall for the trick.
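The con works, in other words, because the introductions flow through the attacker, and neither side can verify who is really on the other end. Here's a minimal sketch of that relay structure in Python; all the names are hypothetical, and no real protocol is implied:

```python
# Minimal sketch of a man-in-the-middle relay. The point is only that
# the attack needs no code-breaking at all when the two endpoints never
# authenticate each other.

class Endpoint:
    def __init__(self, name):
        self.name = name
        self.inbox = []

    def deliver(self, message):
        self.inbox.append(message)


class Mallory:
    """Pretends to be the seller to the buyer, and the buyer to the
    seller, silently copying everything that passes between them."""

    def __init__(self, real_buyer, real_seller):
        self.real_buyer = real_buyer
        self.real_seller = real_seller
        self.captured = []

    def face_for(self, victim):
        # The endpoint handed to each victim is really Mallory, wearing
        # the other victim's name -- and neither victim can tell.
        peer = self.real_seller if victim is self.real_buyer else self.real_buyer
        relay = Endpoint(peer.name)
        relay.deliver = lambda msg, peer=peer: (
            self.captured.append(msg), peer.deliver(msg))
        return relay


buyer, seller = Endpoint("buyer"), Endpoint("seller")
mallory = Mallory(buyer, seller)

# Introductions are made through Mallory: no prior acquaintance,
# no authentication.
buyers_view_of_seller = mallory.face_for(buyer)
sellers_view_of_buyer = mallory.face_for(seller)

buyers_view_of_seller.deliver("payment: $10M")
sellers_view_of_buyer.deliver("the secret files")
```

The payment and the files both pass through Mallory, yet from each victim's point of view the deal succeeded. Mutual authentication is exactly what would prevent `face_for` from handing either victim a convincing impostor.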

It's classic MI stuff; they used to pull it off in the original TV series oh-so-many years ago. But I find that it's often easier to recognize techniques like this when they are portrayed in an entertaining fashion, as opposed to the more dry, if more technically correct, format in which they are usually discussed.

So, I happened to be reading Andrew ("bunnie") Huang's fascinating paper on MITM attacks in HDCP video transmission. It's deep and complex work, and not easy going; the slides are a much easier way to get an overview of what's going on here.

As bunnie observes, this particular attack is less about cryptography (though there is plenty of that going on here), than about understanding the policy and cultural frameworks within which cryptography and digital rights management are used:

While the applications of video overlay are numerous, the basic scenario is that while you may be enjoying content X, you would also like to be aware of content Y. To combine the two together would require a video overlay mechanism. Since video overlay mechanisms are effectively banned by the HDCP controlling organization, consumers are slaves to the video producers and distribution networks, because consumers have not been empowered to remix video at the consumption point.

Reading bunnie's work is always engrossing, even though most of it goes way over my head. That's the thing about trying to learn stuff: often you work and work and work and maybe you just get a little bit smarter, but that's certainly better than not getting smarter at all.

But there's another issue that you should be paying attention to: Open Access and the Research Works Act.

Start by reading Rebecca Rosen's essay in The Atlantic: Why Is Open-Internet Champion Darrell Issa Supporting an Attack on Open Science?. She calls the bill "a direct attack on the National Institutes of Health's PubMed Central, the massive free online repository of articles resulting from research funded with NIH dollars," and gives some great background and pointers to the previous incarnations of this legislation, and why they were problematic.

Move on to a nice essay by John Dupuis of York University in Toronto: Scholarly Societies: It's time to abandon the AAP over The Research Works Act, where he challenges the scientific societies of the world to explain why this act supports their stated mission:

These societies will certainly have among their vision and mission statements something about advancing the common good, promoting the scholarly work of their membership and scholarship in their fields as a whole.

To my mind, The Research Works Act is directly opposed to those goals.

This industry already makes generous profits charging universities and hospitals for access to the biomedical research journals they publish. But unsatisfied with feeding at the public trough only once (the vast majority of the estimated $10 billion dollar revenue of biomedical publishers already comes from public funds), they are seeking to squeeze cancer patients and high school students for an additional $25 every time they want to read about the latest work of America’s scientists.

Saving the best for last, make sure you don't miss Danah Boyd's superb rant: Save Scholarly Ideas, Not the Publishing Industry. She concedes that it's no surprise that the corporate publishers and their lobbyists are using their money and prestige to squeeze the politicians:

the scholarly publishing industry is in the midst of complete turmoil. Its business model is getting turned upside down and some of these organizations are going to die. So I get why their lawyers are trying to grab any profit by any means necessary, letting go of the values and purpose that drove their creation.

But what really bothers Boyd is the behavior of the academics themselves:

How did academia become so risk-adverse? The whole point of tenure was to protect radical thinking. But where is the radicalism in academia? I get that there are more important things to protest in the world than scholarly publishing, but why the hell aren’t academics working together to resist the corporatization and manipulation of the knowledge that they produce? Why aren’t they collectively teaming up to challenge the status quo? Journal articles aren’t nothing… they’re the very product of our knowledge production process.

I've mostly been following the debate as it relates to the Computer Sciences field, but it seems like the issue is much more intense and controversial in the Biological Sciences field, where the major institution is the U.S. Government's PubMed Central:

PMC is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM).

Friday, January 6, 2012

The story, entitled Mining dark fibre, describes how there are firms that specialize in locating hitherto-unused fiber optic cables and putting them to use where there is unmet demand for bandwidth.

Luckily for Ritchie, there are other opportunities to improve latency times, in the form of unused cabling infrastructure around the world. "We're finding unused cables all [the] time, everywhere [from] China and Russia to parts of Brazil," he explains.

There are a number of reasons why unused, or "dark", fibre optic cable might be lying around, he says. "Quite often, when electricity lines are put down, there's underlying optical fibre as well, because if you're digging a hole you may as well whack as many services in there as possible."

So far, this makes plenty of sense to me. I don't find it at all surprising that there is plenty of unused fiber optic cable around the world, and keeping track of where it is and who might be able to use it could be a perfectly reasonable and money-making "match-maker" sort of job.

After all, the great Neal Stephenson described all of this more than 15 years ago, in his epic article for Wired: Mother Earth Mother Board.

But then the article goes a little wonky:

sometimes when military objectives change, all of a sudden a bunch of infrastructure becomes available

...

Why would there be a secret substation on the Russian border? You would need to ask the Chinese government.

...

Ritchie is not prepared to reveal precisely how one goes about 'discovering' an unused fibre optic cable in the Mongolian desert. "Are we telling everybody how we do it? No."

What the ? Military objectives? Secret substations? Skull-duggery in the Mongolian desert?

Did the author of the otherwise-bland article just want to spice it up a bit?

Or is there something much darker going on here, slipping by just out of reach from me?

It looks like the Storage Spaces feature of Windows 8 will be very powerful.

Fundamentally, Storage Spaces virtualizes storage in order to be able to deliver a multitude of capabilities in a cost-effective and easy-to-use manner. Storage Spaces delivers resiliency to physical disk (and other similar) failures by maintaining multiple copies of data. To maximize performance, Storage Spaces always stripes data across multiple physical disks. While the RAID concepts of mirroring and striping are used within Storage Spaces, the implementation is optimized for minimized user complexity, maximized flexibility in physical disk utilization and allocation, and fast recovery from physical disk failures.
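The mirroring-plus-striping idea that passage describes (essentially RAID 1+0) is easy to sketch. This is a toy illustration only; it assumes nothing about how Storage Spaces is actually implemented:

```python
# Toy sketch of striping-with-mirroring: consecutive blocks rotate
# across the disks (striping, for throughput), and every block is also
# written to a second disk (mirroring, for resiliency).

NUM_DISKS = 4
COPIES = 2                       # each block lives on two different disks
disks = [dict() for _ in range(NUM_DISKS)]

def write_block(block_no, data):
    # Copy 0 lands on the "home" disk for this block; copy 1 lands on
    # the next disk over, so no single-disk failure loses any data.
    for copy in range(COPIES):
        disk = (block_no + copy) % NUM_DISKS
        disks[disk][block_no] = data

def read_block(block_no, failed=()):
    # Read from any surviving copy.
    for copy in range(COPIES):
        disk = (block_no + copy) % NUM_DISKS
        if disk not in failed:
            return disks[disk][block_no]
    raise IOError("too many disk failures")

for i, chunk in enumerate([b"alpha", b"bravo", b"charlie", b"delta"]):
    write_block(i, chunk)

# Every block is still readable even with disk 0 dead:
recovered = [read_block(i, failed={0}) for i in range(4)]
```

The trade-off is the classic one: with two copies of every block you pay half your raw capacity for the ability to survive (and quickly recover from) a disk failure.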

My family has been using a Data Robotics Drobo device for more than six months now and it's been very successful. These types of low-end, easy-to-use, simple solutions for home storage are wonderful; it's great to see that Microsoft will be providing this as part of their core operating system.

Major League Soccer’s uniform font persists, blocking the orderly progression of good design.

North American leagues largely allow teams leeway in customizing the look of their numbers and letters; that’s how you get Red Sox “3”s, Laker “3”s, old Blue Jay “3”s and Steeler “3”s. Thousands of colleges and minor league teams have unique takes too. On the world stage, soccer is far from uniform, and international competitions showcase a range of interesting font design. Conversely, the English Premier League has been perfecting the font lockup for at least 15 years.

Gotta love that double entendre on "uniform font" :)

I actually own one shirt (a bit of swag from my day job) with my name on the back shoulders; tonight I must remember to go have a look at the font...

It's an extremely quick read, short and fascinating, vivid and well-written, but oh my is it grim and depressing. The book is structured as five stories, about five representative individuals that the author meets and gets to know in some detail: a businessman, an engineer, a farmer, a factory worker, and a waitress. Each story is compelling and intriguing.

But the book is really about India as a whole, as the subtitle suggests, and unfortunately India as a whole is described as a terrifyingly awful place:

The data collected by the Indian government, which has been subject to some controversy for its tendency to downplay the number of poor people and the extent of their destitution, is nevertheless stark. In 2004-5, the last year for which data was available, the total number of people in India consuming less than 20 rupees (or 50 cents) a day was 836 million -- or 77 per cent of the population.

...

They live in slums, are expected to be available to work around the clock, and are denied access to the ration cards that would allow them to buy subsidized food from what remains of the country's public distribution system. And although they are everywhere -- huddled in tents erected on pavements and under flyovers in Delhi; at marketplaces in Calcutta, where they sit with cloth bags of tools ready for a contractor to hire them for the day; gathered around fires made from rags and newspapers in the town of Imphal, near Burma; and at train stations everywhere as they struggle to make their way into the 'unreserved' compartments offering human beings as much room as cattle trucks taking their passengers to the slaughterhouse -- they are invisible in the sense that they seem to count for nothing at all.

If you are interested in India, this is a great book, but prepare to do some serious soul-searching about the future of the human race as you read it.

Wednesday, January 4, 2012

Here's a great article about the various subtexts and hidden meanings that have, over the years, been conveyed through the selection and positioning of postage stamps on postcards and envelopes.

I'm not sure I completely believe all the various claims that the article makes, but it's well-researched and well-documented and very fun to read.

The custom of the language of stamps died out at different times in different countries. In Russia, where it was a great fashion, no such postcard was published after the revolution, just as in the socialist countries after 1945. On the one hand, etiquette itself was considered a bourgeois left-over, and on the other hand the regime did not tolerate any encoded message either. In western European countries, however, we find instances as late as the end of the sixties.

Woe unto those who thoughtlessly allowed their eight-year-old granddaughter to affix the stamp to the envelope, thus unintentionally sending the message: "I have discovered your deceit", when they of course meant to convey: "Many thanks for your kindness"!

As I (slowly) emerge from my holiday stupor, another short post on a collection of related topics:

Codecademy, an interesting new startup backed by Fred Wilson's Union Square Ventures, is seeing impressive signups for their new online programming course: Code Year. For the time being, I'm not doing much web programming, so I didn't sign up, but if you did, let me know what you think.

Meanwhile, Christian Heilmann is concerned about the quality of introductory programming language instruction, and writes a detailed critique of one such beginners' tutorial: Teach Them How To Hit The Ground Running And Faceplant At The Same Time?. Heilmann follows up his critique with another posting on his personal blog a few days later: Beginner tutorials who don’t help beginners?. In the two articles, Heilmann makes several points, but the ones which stuck with me were: (a) worry more about teaching people than about making them happy, and (b) don't take short-cuts at the start which will need to be un-learned soon afterwards:

“Quick tutorials for beginners” are killing our craft. Instead of pointing to existing documentation and keeping it up to date (in the case of the wiki-based docs out there) every new developer turned to an author wanting the fame for themselves. And a lot of online magazines cater to these to achieve “new” content and thus visitors. We measure our success by the number of hits, the traffic, the comments and retweets. And to get all of that, we want to become known as someone who wrote that “very simple article that allowed me to do that complex thing in a matter of minutes”.

As Heilmann points out, Smashing Magazine is generally not known for these sorts of things, and is usually a great source of quality educational material. It's pleasing to see that the Smashing Magazine editors are hosting this discussion on their site; clearly they agree that it's an important problem.

It should be noted that while we're starting with JavaScript as a first language - largely due to its ubiquity, desirability in the larger workforce, lack of prior installation requirements, and ability to create something that's easy to share with friends - we're not going to be myopic and only focus on JavaScript. There are so many things that can be learned from other languages, not to mention entire skillsets that aren't terribly relevant to in-browser JavaScript, that it behooves us to try and bring as much of them into our curriculum as possible.

I don't know what the future will bring, but I am extremely confident that it will bring a need for more programmers, and for more people who at least understand programming, even if they don't practice it full-time themselves.

So it's great to see the field continuing to discuss, take responsibility for, and invest in the teaching and training of new programmers.

Tuesday, January 3, 2012

I saw the movie in 3D and loved it, but it clearly would have been just as good, and possibly better, in 2D.

I found many parts to love about the movie (I am a huge Herge's Adventures of Tintin fan!), but to pick one particularly special part, I love the scene early in the movie where Tintin is getting his picture painted by a street artist in the market, and as the portrait is completed we see that the picture is the exact image of Tintin from Herge's original books, and then the camera pulls away to show a wall where dozens of street portraits have been pinned up, each one of them a character from the books, done in the original Herge style.

It's just an elegant and graceful nod to the fact that the Spielberg/Jackson team are aware that yes, they have made a hyper-realistic movie of a comic book.

the big mapmaking corporations of the world employ type-positioning software, placing their map labels (names of cities, rivers, etc.) according to an algorithm. For example, preferred placement for city labels is generally to the upper right of the dot that indicates location. But if this spot is already occupied—by the label for a river, say, or by a state boundary line—the city label might be shifted over a few millimeters. Sometimes a town might get deleted entirely in favor of a highway shield or a time zone marker.
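The greedy fallback rule described there can be sketched in a few lines of Python. The offsets and the grid are hypothetical simplifications for illustration, not anything a real cartographic package does:

```python
# Rough sketch of the greedy label-placement rule: try the preferred
# spot (upper right of the city dot), fall back to alternatives, and
# drop the label entirely if every candidate spot is taken.

# Candidate offsets from the city dot, in preference order.
CANDIDATES = [(1, 1),    # upper right (preferred)
              (-1, 1),   # upper left
              (1, -1),   # lower right
              (-1, -1)]  # lower left

def place_labels(cities, occupied):
    """cities: list of (name, (x, y)) dots. occupied: set of grid cells
    already taken by other features (river labels, boundary lines...).
    Returns a dict mapping each name to its cell, or None if dropped."""
    placement = {}
    for name, (x, y) in cities:
        for dx, dy in CANDIDATES:
            cell = (x + dx, y + dy)
            if cell not in occupied:
                placement[name] = cell
                occupied.add(cell)   # this label now blocks later ones
                break
        else:
            placement[name] = None   # deleted in favor of other features
    return placement

taken = {(6, 6)}                     # say, a river label lives here
result = place_labels([("Springfield", (5, 5)),
                       ("Shelbyville", (5, 5))], taken)
```

Springfield finds its preferred upper-right cell occupied by the river label and shifts to the upper left; Shelbyville, sharing the same dot, is pushed to the lower right. Imus's point, of course, is that a human eye catches all the cases where this mechanical rule produces something ugly.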

...

But Imus—a 35-year veteran of cartography who’s designed every kind of map for every kind of client—did it all by himself. He used a computer (not a pencil and paper), but absolutely nothing was left to computer-assisted happenstance. Imus spent eons tweaking label positions. Slaving over font types, kerning, letter thicknesses. Scrutinizing levels of blackness.

In Cartastrophe, Daniel Huffman offers his critique of Rick Aschmann's complex and sophisticated dialect map: North American English Dialects, Based on Pronunciation Patterns. Huffman's conclusion:

The more complexity you can show, the richer the story and the more versatile the product. The map quickly begins to be more than the sum of its parts. Putting two thematic layers on a map gives you three data sets — one each for the layers, plus allowing you to visualize the relationship between the two layers. One plus one equals three. But all of this is worthless if it becomes so complex as to be unclear. A map with one clear data set is worth more than a map with fifteen data sets you can’t read.

Over at BldgBlog, Geoff Manaugh offers an intriguing look at the complexities of collecting mapping data on ice floes: Ice Island Infrastructure.

In other words, they want to turn icebergs into floating science research stations, mapping earthquakes at sea.

...

While the authors compare this, briefly, to using buoys—and, thus, this method is not all that different from any other free-floating oceanographic instrumentation system—the transformation of icebergs into scientifically useful platforms is a compelling example of how a natural phenomenon can become infrastructure with even the smallest addition of equipment. The iceberg has literally been instrumentalized: a temporary archipelago, too short-lived to appear on maps, turned into a scientific instrument.