Life through nerd-colored glasses

I spent the last two days in a very interesting discussion group about visualization challenges for ALMA. ALMA is arguably the first observatory where the data products will routinely lie in “big data” territory — that is, the Gigabyte-Terabyte range where data sets can’t easily be analyzed on a single machine. We’ve created observational datasets this large before, but they have arguably been niche products that only a few researchers use in their entirety (large swaths of the entire 2MASS or Sloan surveys, for example). Many, many people who use ALMA data will have to contend with data sizes >> RAM. The community needs to come up with solutions for people to work with these data products.

The big theme at this discussion group was moving visualization and analysis to the cloud, where more numerous and powerful computers crunch through mammoth files, and astronomers interact with this resource through some kind of web service. We spent a lot of time looking at a nice data viewer and infrastructure developed in Canada that is great for browsing through 100GB (and larger) image cubes. Yet I find myself uneasy about this move to the cloud. I seemed to be in the minority within the group, as most others embraced or accepted this methodology as the inevitable future of data interaction in astronomy (I may or may not have been called a dinosaur — admittedly, I was being a bit obnoxious about my point!).

I get that cloud computing is unavoidable at some level — most astronomers do not have nearly enough computational resources or knowledge to tackle Terabyte image cubes, and we will need to rely on a centralized infrastructure for our big data needs. Centralized resources are also great for community science, where lots of people need to work on the same data. But in an attempt to defend (or at least define) my dinosaur attitudes, here are the issues that I think astronomy cloud computing needs to address:

Scope of access: How often and to what extent will an observer have access to cloud resources? Will she be able to visualize data whenever she wants? Will she be able to run arbitrary computation? How much of a lag will there be between requests and results? Many of us are used to a tight feedback cycle when visualizing, analyzing and interpreting data. Is it a priority to preserve this workflow? Is that technologically and financially feasible?

Style of access: How many ways will we be able to interact with data? What restrictions will be placed on the computation and visualizations we undertake? Will we be able to download smaller sections of the data product for exploration offline? Will this API be in a convenient form (python library, RESTful URL, SQL) or some more awkward solution (custom VO protocol, cluttered web form)? What will the balance be between GUI and programmatic access? How well will each be designed and supported (personally, I can tolerate a poor GUI interface much more than a bad programming library)?

Bottlenecks for single machines: Underlying all of this is the assumption that it is impossible to work with ALMA data on local machines. I think this is overhyped in some aspects. Storing even a Terabyte of data is trivial (1 TB hard drives are $100, compared to $2000 per year to store 1 TB on Amazon’s cloud, to say nothing of computation). While churning through all of this data is certainly a many-hour task with a single disk, many operations relevant for visualization, exploration, and simple analysis are trivial (extracting profiles, slices, and postage stamps from a properly indexed data cube is very cheap, and gives you a lot of power to understand data and develop analysis plans). Should we really fully abandon this workflow that almost all astronomers currently use? Is it worth developing new software to help interact with local data more easily?
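As a concrete illustration of how cheap those local operations can be, here is a minimal numpy sketch using a memory-mapped cube. The file name and tiny shape are invented so the example runs; a real ALMA cube would simply have a much larger shape.

```python
import numpy as np

# Fake a tiny on-disk cube so the example runs; np.memmap works the
# same way when the file is a Terabyte.
data = np.arange(4 * 5 * 6, dtype=np.float32).reshape(4, 5, 6)
data.tofile("cube.dat")

# Memory-map the file: nothing is read until a slice is requested.
cube = np.memmap("cube.dat", dtype=np.float32, mode="r", shape=(4, 5, 6))

# Profiles and postage stamps touch only the bytes they need,
# which stays cheap even when the full cube is far larger than RAM.
spectrum = np.asarray(cube[:, 2, 3])     # one line of sight
stamp = np.asarray(cube[1, 1:4, 2:5])    # small spatial cutout
```

The operating system's page cache does the heavy lifting here: only the disk blocks a slice touches are ever read.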

By no means are these issues insurmountable, and I was probably sweating the details too much for the high-level discussion at the meeting. But the details do matter, and the astronomical community has had a mixed track record with creating interfaces to remote data products (new visualization clients are getting pretty good, but services for analysis or data retrieval are still pretty cumbersome). My reaction to most of these clumsy products has been to avoid them, because it has been possible to fetch and analyze the data myself. Once we lose that ability, we will all become very dependent on external services. At that point, the details of remote data interfaces may become the new bottleneck for discovery.

Update: Paul Van Slembrouck, the designer of this graphic, has responded to the critique. Be sure to read his comments below!

I am a teaching fellow for a class at Harvard called “The Art of Numbers,” which teaches principles of data presentation to undergraduates from all concentrations. For a recent midterm, students were asked to analyze this graphic from Visual.ly:

Distribution of education levels for women who divorced in 2008

For Valentine’s Day, Visual.ly posted a series of visualizations of divorce statistics in the U.S. Several aspects of this graph bothered me, and I thought it would make for a good exam question.

The Hockey Stick Plot is one of the most iconic and controversial plots related to climate change. It shows the change in average temperature over the past few thousand years. Here’s a version from Wikipedia:

Temperature Change Over Time

One little aspect of this graph irks me. Let me be clear up front: I am not questioning the science behind this plot, or the conclusions drawn from it. I tend to trust consensuses (consensi?) in the scientific community. I also tend to think that people who accuse scientists of swindling the public for their own personal gain don’t understand the attitudes within the scientific community.

My little problem

The most striking feature of this plot is the rise in temperature over the last 150 years or so — the scale of this change is larger than other natural variations on ~100 year timescales, and strongly suggests an external influence (i.e. humans).

My problem is that the hockey stick diagram is often used to implicate the Industrial Revolution of the 1800s. After all, the knee in this diagram occurs right around 1800. David MacKay, in his great book on energy consumption, even goes so far as to label the year that James Watt invented his steam engine (note he’s using a different proxy for climate change — CO2 concentration instead of temperature change):

From David Mackay's "Sustainable Energy -- Without the Hot Air".

I don’t doubt that the Industrial Revolution marks a significant milestone in human climate change. However, I am less convinced that these diagrams really show that.

Here’s my reasoning. Human population growth has been roughly exponential over time:

Human Population Growth Over Time (Data from Wikipedia)

This is slightly steeper than exponential, but there’s no sharp knee at 1800. Likewise, most of the things that humans produce (and pollute with) have also grown exponentially over time — electricity, computers, tires, etc. Human growth has traditionally been exponential.

Given this simple observation, my naive intuition would be that the historical temperature record would be broadly described by the sum of two trends: a flat line representative of the earth’s equilibrium temperature, and an exponential curve that encapsulates the growing impact of humans.

This simple model can reproduce the general shape of the hockey stick pretty well:

This model doesn’t account for the bumps and wiggles (due to natural climate oscillations about the equilibrium). But the key point here is this: neither term in this model has a characteristic time scale. The “knee” in the graph represents the time when the exponential human-factor starts to overwhelm the constant term, but there’s nothing special about how the ‘human factor’ is changing around the time of the Industrial Revolution.

Perhaps it’s a nitpicky point to make, but the hockey-stick diagram on its own doesn’t isolate the Industrial Revolution as the cause of human-induced climate change (other analyses might, of course). Instead, it points to 1800 as the time when human growth (industrial, agricultural, whatever) became significant on a global scale. A more convincing indictment of the Industrial Revolution (and not population growth in general) would isolate the human contribution — and show a knee around 1800. This would more directly show that ‘something changed’ in a distinct way when the Industrial Revolution began.

After a hurricane or other natural disaster, debris poses a huge problem: it knocks out vital infrastructure like electricity, and blocks roads — inhibiting the work of rescue crews and cutting off victims from access to hospitals, supplies, and evacuation routes. It is vitally important that disaster relief efforts have plans for using their limited resources to efficiently clear debris off roads — giving aid to as many people as possible, as quickly as possible.

This was the scenario the Harvard Institute for Applied Computational Sciences presented to two teams of graduate students last week, as part of its first computational challenge. They were given digitized road maps of Cambridge, MA, information about the population density, and a realistic projection of the road debris that would be left behind after a major hurricane. Each team was given two weeks to design an algorithm to efficiently clear debris, minimizing the amount of time people are cut off from access to local hospitals. I was a member of one of these teams.

Within this scenario, we have enough resources to clear a limited amount of debris each day. All of the bulldozers start off at two local hospitals (admittedly somewhat contrived — in a real scenario, relief crews would likely work their way in from outside the disaster area). At any given time, we can only clear debris on the roads immediately adjacent to roads that we have already cleared — that is, we can’t magically airdrop bulldozers into the most heavily damaged areas. Instead, we have to clear our way to these areas.

“Solutions” to the problem consist of a schedule of which roads to clear, in which order. Whichever solution is most efficient in giving as many people fast access to hospitals, wins.

Coming up with a Solution

In principle, this problem can be solved very easily — just consider every possible schedule for clearing the roads, and choose the one which restores hospital access most quickly. Unfortunately, this approach is utterly infeasible — our map of Cambridge has 604 road segments, yielding about 604! possible orderings (that’s a 1 with roughly 1420 zeros after it). Even on the fastest computers in the world (now or in the foreseeable future), this calculation would take far longer than the current age of the universe to complete. We need a more intelligent way of searching through possible solutions.
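A quick sanity check on that count, using only the standard library:

```python
import math

# Number of brute-force schedules = 604! orderings of the road
# segments; lgamma gives log(n!) without computing the
# (astronomically large) factorial itself.
log10_schedules = math.lgamma(604 + 1) / math.log(10)
# i.e. a number with about 1420 digits.
```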

The approach we came up with turns out to be highly effective. It involves altering a given schedule to generate a new, similar, and possibly better solution. We called this “nudging” the schedule, and it works as follows:

1) Take an initial schedule that solves the problem, but in a non-optimal way (these are easy to come up with).

2) Truncate the schedule at some point, keeping only the first N decisions.

3) Determine which points in the city do not have access to a hospital after these steps. Choose one at random.

4) Clear out the most efficient path (the minimum-weight path, in graph-algorithmic jargon) from one of the hospitals to this location. Add these decisions immediately after the partial schedule from step 2.

5) Add the rest of the decisions that were truncated in step 2 (paying careful attention not to duplicate any work done in step 4).

Nudging solutions has a lot of nice properties: it’s fast (we can easily do it ~100 times per second with our relatively slow python code), and it provides a way to re-prioritize a schedule, since the location chosen in step 3 is rescued earlier in the new plan than it was in the old. It isn’t too hard to convince yourself that, with enough nudging, it is possible to arrive at the globally optimal solution from any starting solution. And the choice of using the most efficient path in step 4 tends to create effective schedules which don’t waste resources.
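The nudge itself is mostly bookkeeping around one shortest-path computation. Here is a runnable sketch on a made-up five-intersection network; the node names and edge weights are invented, and for simplicity it schedules intersections rather than road segments.

```python
import heapq
import random

random.seed(2)

# A made-up five-intersection network: edge weights stand in for the
# debris-clearing cost of each road segment.
roads = {
    "hospital": {"a": 4, "b": 1},
    "a": {"hospital": 4, "c": 1},
    "b": {"hospital": 1, "c": 5, "d": 2},
    "c": {"a": 1, "b": 5, "d": 1},
    "d": {"b": 2, "c": 1},
}

def min_weight_path(graph, start, goal):
    """Dijkstra's algorithm: the minimum-weight path from start to goal."""
    queue = [(0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt, weight in graph[node].items():
            if nxt not in seen:
                heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))

def nudge(schedule, n):
    """Steps 2-5: truncate the schedule, pick a stranded node, splice in
    the cheapest rescue path, then replay the remaining decisions."""
    head = schedule[:n]
    stranded = [node for node in roads if node not in head]
    target = random.choice(stranded)
    rescue = [node for node in min_weight_path(roads, "hospital", target)
              if node not in head]
    tail = [node for node in schedule[n:] if node not in head + rescue]
    return head + rescue + tail
```

Whichever stranded node is picked, the result is still a complete, non-duplicating schedule, with that node rescued via its cheapest path.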

Equipped with a strategy of nudging solutions, there are many algorithms which can search for the best strategy. We chose Simulated Annealing, which works more or less as follows:

1) Start with a schedule S1.

2) Nudge S1 to generate a new schedule, S2.

3) If S2 is a better schedule, throw away S1.

4) If it is worse, throw away S2 with some probability related to how much worse a solution it is.

5) Repeat the process with the solution that wasn’t discarded.

The rejection probability in step 4 is gradually raised throughout the process — at the beginning, almost all worse solutions are still accepted, allowing the algorithm to explore a wide range of possible scenarios. As rejection becomes more likely, the search is gradually confined to better and better solutions.
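A generic version of this loop is short. The geometric cooling schedule and the toy cost function below are stand-ins of my own, not the team's actual code:

```python
import math
import random

random.seed(0)

def anneal(initial, nudge, cost, steps=20000, t0=1.0, t1=1e-3):
    """Generic simulated annealing following the steps above: accept
    better schedules always, worse ones with a shrinking probability."""
    current, current_cost = initial, cost(initial)
    for i in range(steps):
        temp = t0 * (t1 / t0) ** (i / steps)   # gradual cooling
        candidate = nudge(current)
        c = cost(candidate)
        # Worse candidates survive with probability exp(-delta / temp).
        if c < current_cost or random.random() < math.exp((current_cost - c) / temp):
            current, current_cost = candidate, c
    return current

# Toy stand-in for the debris problem (my own invention): find the
# ordering of ten "roads" with the fewest out-of-place entries.
def cost(schedule):
    return sum(i != s for i, s in enumerate(schedule))

def nudge(schedule):
    """Swap two random entries -- a crude analogue of re-prioritizing."""
    s = list(schedule)
    a, b = random.sample(range(len(s)), 2)
    s[a], s[b] = s[b], s[a]
    return s

best = anneal(list(range(10))[::-1], nudge, cost)
```

In this toy run, 20,000 nudges are enough to anneal a fully reversed ordering down to (or very near) the optimal one.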

The algorithmic showdown

After two weeks of development, each team applied their algorithm to a slightly modified map of Cambridge. We were given 3 hours of computing time on one of Harvard’s super-computers to run our algorithm and come up with the best possible strategy.

Our solution turned out to be very effective. After only a few minutes of nudging, we had found a solution better than the competing team’s final answer. Furthermore, our strategy out-performed the solution generated by the competition’s organizers from Georgia Tech, which was previously thought to be near-optimal.

How nudging solutions with simulated annealing decreases the penalty function, as a function of time. Each black line depicts a series of nudges as a function of computation time. The penalty function relates to how long each resident is stranded without hospital access (lower numbers are better). The penalty function corresponding to the organizers' solution is drawn in green. The red line depicts a strict lower limit to the penalty -- no solution can be better than this, given the amount of debris on the roads.

We can put our performance in more concrete terms: with our strategy, the average resident is stranded without hospital access for 2 days and 18 hours. The previous ‘optimal’ strategy kept the average resident waiting for 2 days and 21 hours, and naive strategies (i.e. clearing off roads at random) will keep residents waiting for over 4 days on average. These extra hours would cost many lives, since common post-disaster health problems like dehydration and cholera progress on the timescale of hours to days.

I’m pretty satisfied with our work these past few weeks. This approach isn’t too different from some of the data analysis tasks I tackle within astrophysics, and it was great to see these same techniques successfully handle a problem with real humanitarian benefit. I hope our solution gets taken under consideration in future disaster relief research — to encourage this, I’ve posted our code online.

The trouble with big issues is that… well, they’re big. Big issues have many facets, which generate many different (and often contradictory) points of view among those who care. This creates problems when people use data to think about big issues. I want to focus on two:

A few months ago (at the height of the fighting and deadlock over the national debt ceiling), I decided to investigate how partisan the United States congress really is — that is, how often do congresspeople vote along strict party lines. Perhaps I’m naive, but I find it frustrating that politicians seem so much less willing to compromise than other groups of people. I would prefer a government where concern for the common good takes priority over the culture wars.

With the 2012 elections approaching, I wanted to get a better handle on how willing different congresspeople are to compromise. Fortunately, the congressional voting history is public, and govtrack.us provides convenient access to both browse and download these data.

After downloading these data into a local SQL database, I decided to take the following approach:

1) For each vote since 2008, count up the total yes/no counts from the Democrat and Republican parties.

2) Calculate the ‘net’ party vote for each of these. A value of 1 indicates a party is unanimously in favor of a motion, bill, etc. A value of -1 indicates unanimous disfavor. A value of 0 indicates a deadlock, with the party split evenly between yes and no.

3) Compare the net party votes for each party.
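The quantity in step 2 is just a normalized vote margin. A minimal sketch (the function name is mine):

```python
def net_party_vote(yes, no):
    """Net party vote: (yes - no) / (yes + no).
    +1 = unanimous support, -1 = unanimous opposition, 0 = even split."""
    return (yes - no) / (yes + no)
```

For example, a 230-0 party-line vote scores 1.0, a 0-41 vote scores -1.0, and a 50-50 split scores 0.0.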

Here’s one way to visualize this data, for the House of Representatives:

Each colored circle represents a single motion, bill, etc. The x/y locations give the net party vote for Democrats/Republicans. The color of the circle indicates, in cases where the majority opinion for each party differs, which party won (red = Republican, blue = Democrat, black = agreement between both parties). So, for example, all points in the upper right and lower left corners are black, since both parties agreed on a yes or no vote. The upper left and lower right corners are regions of disagreement between the parties. The histograms show the overall distribution of Democrat (top horizontal graph) and Republican (right vertical graph) votes, again color-coded by which party won the vote.

I noticed a few interesting features here:

Very many votes are (nearly) unanimously accepted by both parties. I found this to be refreshing, as it suggests that Washington is not as deadlocked as the news may imply. Of course, many of these motions are hardly controversial. Take the title of one such resolution, passed in the House by a vote of 421-2: “On Passage – House – H.R. 2715 To provide the Consumer Product Safety Commission with greater authority and discretion in enforcing the consumer product safety laws, and for other purposes – Under Suspension of the Rules.” A good next step would be to filter out the most benign / procedural votes, to better see ideological divides.

The votes for both parties are clustered around unanimous support or rejection, with few points near the center of the graph. I wish there was more disagreement within each party — dissenting opinions generally seem like a good idea.

The Republican party indeed has been more of a “Party of No” lately, with a greater concentration of “No” votes. Interestingly, however, Republicans have had more success against Democrats when voting yes for something — Republicans have lost most of the votes where they have voted nearly-unanimously “No” against a Democrat “Yes.”

I haven’t calculated how partisan individual members of congress are — that’s coming soon. In the meantime, here’s the voting record for the Senate (many more blue points, due to the steady Democrat majority since 2008)

Obviously, a lot. The first image is a map of the rho-Ophiuchus molecular cloud — one of the nearest sites of star formation. The second is, apparently, random noise.

There is one important way in which these images are similar, however. Here is the histogram of the pixel brightnesses in each image:

The two images have the same distribution of pixel values. In fact, the “noise” image is simply a scrambled version of the first image. They contain identical pixels, arranged in different order.

Who cares? Well, this illustrates a common limitation to using a histogram to characterize data. It turns out that most maps of molecular clouds have similar histograms — that probably says something interesting about the physical processes that determine cloud structure. However, as the images above show, similar histograms can hide a lot of interesting differences between two data sets.

Histograms contain no information about the arrangement of pixels in an image — that’s why I could scramble the pixels in rho-Oph and preserve the histogram exactly. But there are other ways to rearrange those pixels. How about this, for example?

Again, the histogram of this image is identical to the first two (download the data yourself if you don’t believe me!). Transforming an image while preserving its histogram turns out to be pretty simple. Here’s the strategy:

1) Find an image you want to match (in the case above, I used this)
2) If necessary, crop/resize the image to match the dimensions of the original image.
3) Find the location of the faintest pixel in the target image.
4) Replace this pixel with the faintest pixel in the original image.
5) Repeat for the second faintest, etc, until you replace all the pixels.
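These steps amount to a rank-for-rank substitution, which is a few lines of numpy (my own sketch of the recipe above, not the applet's code):

```python
import numpy as np

def match_histogram(original, target):
    """Steps 3-5 above: the faintest target pixel receives the faintest
    original value, the second faintest the second, and so on."""
    flat = np.empty(original.size, dtype=float)
    order = np.argsort(target.ravel(), kind="stable")  # faintest first
    flat[order] = np.sort(original.ravel())
    return flat.reshape(target.shape)

# Tiny made-up 3x4 "images" with distinct pixel values:
original = 0.5 * np.arange(12.0).reshape(3, 4)
target = np.random.default_rng(1).permutation(12).reshape(3, 4)
matched = match_histogram(original, target)
```

The result contains exactly the original's pixel values (same histogram), arranged in the target's brightness ordering (same structure).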

I put together a Processing applet that demonstrates this for a bunch of different images. You can find it here. This applet also shows you how the pixels in either image correspond to each other — hover your mouse over a pixel in one image to see the location of that pixel in the other.

You can even do this with color images by modifying step 4. Instead of simply substituting pixels in that step, alter the brightness to match the original image, while preserving the color. This will create a histogram of brightnesses that matches the first image, with colors that match the second.

There isn’t anything too profound going on here (although it always surprises me how well this works). But it does highlight the limitations of histograms in rather stark fashion. It’s interesting that maps of star forming regions all possess similar histograms, but this does not rule out the possibility that these regions have interesting structural differences between them. Complementary techniques are needed to tease out this possibility.

In the meantime, enjoy this picture of the Bieber-ized rho-Ophiuchus (any guesses as to what the histogram looks like?).