Murray's Journal

Tuesday, December 17, 2013

A new version, 0.3, of HistogramTools is now on CRAN. HistogramTools provides a number of methods for manipulating histograms, measuring the distance between histograms, calculating the information loss due to binning aggregate data sets, and other tools useful for statistical analysis of binned/histogram data. It also uses RProtoBuf to provide a protocol buffer representation of the default R histogram class, which allows histograms over large data sets to be computed and manipulated in a MapReduce environment with tools written in other languages.
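To give a flavor of what "measuring the distance between histograms" means, here is a minimal sketch in Python. It is not the HistogramTools API (the package itself is written in R); the function names and the choice of intersection and Minkowski distances are assumptions made purely for illustration, and both histograms are assumed to share the same bin boundaries.

```python
from typing import List, Sequence


def _normalize(counts: Sequence[float]) -> List[float]:
    """Scale bin counts so they sum to 1 (hypothetical helper)."""
    total = float(sum(counts))
    if total <= 0:
        raise ValueError("histogram must have a positive total count")
    return [c / total for c in counts]


def intersection_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """1 minus the overlap of two normalized histograms.

    Assumes both histograms use identical bins. Returns 0.0 for identical
    shapes and 1.0 when no bin is occupied in both.
    """
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    p, q = _normalize(h1), _normalize(h2)
    return 1.0 - sum(min(a, b) for a, b in zip(p, q))


def minkowski_distance(h1: Sequence[float], h2: Sequence[float],
                       order: float = 2.0) -> float:
    """Minkowski distance of the given order between normalized bin masses."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    p, q = _normalize(h1), _normalize(h2)
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1.0 / order)


if __name__ == "__main__":
    # Illustrative bin counts, not real data.
    a = [5, 10, 5]
    b = [4, 12, 4]
    print(intersection_distance(a, b))
    print(minkowski_distance(a, b, order=1))
```

Normalizing first means the comparison is about the shape of the two distributions rather than their sample sizes, which is one reasonable design choice among several.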

Friday, June 7, 2013

Later this month I'll be at the USENIX Annual Technical Conference in San Jose with some coauthors on the Storage Analytics and Colossus teams at Google to present some of our recent work on optimizing flash provisioning for cloud storage workloads. Our paper is titled "Janus: Optimal Flash Provisioning for Cloud Storage Workloads", and a pre-print is available from Google Research.

This work is about using statistical samples of I/O patterns from a large distributed filesystem to formulate and solve an optimization problem that helps us allocate flash better in our datacenters. I'm looking forward to returning to USENIX ATC as it has been nearly 10 years since I've been to this conference. Shoot me a mail if you will be there and want to meet up.

The first paper describes some of the work we've done on forecasting storage growth in datacenters for capacity planning purposes using ensemble forecasting methods and trend-change detection. It builds on some of the earlier work we did for websearch traffic forecasting and, to a lesser extent, building a market economy for datacenter resources.

The second paper, to which I made only minor contributions, is a more mathematical description of a method of quantifying the uncertainty in aggregate metrics from a sampled RPC tracing system for large-scale distributed systems (e.g. Dapper).

Both papers address problems that usually come up in very large-scale distributed systems, and their applicability is somewhat limited in smaller contexts, but I would be very interested in feedback regardless.

Thursday, October 28, 2010

It's been nearly a year since I posted here and much has changed. The obvious and most important change is a second new addition to our family, which I've been blogging about elsewhere. On the work front, I was able to publish a paper, "Availability in Globally Distributed Storage Systems", about some of my work last year at Google. This is an exciting space given the growth of cloud-based storage services and sophisticated distributed storage software.

I've been blogging a little more regularly about work-related topics on Google company blogs, with four posts so far this year:

As you can see I've been working on data analysis, distributed cloud storage, and open source, along with some other projects I'm not yet able to talk about. I'll try to post more updates about some of my interests and side projects in the remainder of the year.

Sunday, January 10, 2010

Amazon has been doing a really great job at selling excess compute capacity in their datacenters through products such as Amazon Elastic Compute Cloud (EC2), Elastic MapReduce, and their simple and structured distributed storage products. The economics of this kind of model, as represented in the two graphs here, are clearly compelling. Instead of buying large numbers of computers to sit mostly idle, new start-up companies, researchers, and individuals can rent the excess capacity from Amazon. Last year I worked on some related ideas for internal pricing and provisioning of resources at work. This was my first direct experience with the Amazon consumer offerings, however, and I was impressed. It took less than half an hour last night to sign up, start a few basic Linux instances, copy some application code over, compile it, and begin running it on the Linux Xen instances.

The results of cheap on-demand distributed computer clusters and a global English-language workforce that can be paid by the task engender almost too many business ideas to contemplate... If only there were more hours in the day...

Sunday, June 7, 2009

Simon Singh has been sued for libel by the British Chiropractic Association. Simon is an author, journalist, and TV producer who works to popularize math and science. I had the opportunity to hear Simon speak about an earlier book on the Big Bang at Keble College, Oxford. Simon wrote a more recent book on alternative medicine, in which he suggests that there is no evidence for the efficacy of chiropractic treatments for asthma, ear infections, and other infant conditions. British libel laws are stricter than those in the U.S., and this scientific debate has, unbelievably, been construed as a form of libel. Read more about the dispute and sign the petition here.