Service interruptions of digital systems can inconvenience millions of people and have a significant financial impact on the provider. If the Amazon web site, or Google’s Gmail, or the Visa payments network goes down even for a few minutes, it can make front page news.

As digital systems grow larger and more complex, it becomes very challenging to monitor their health and status, which is the first step in detecting potential problems, identifying root causes, and taking appropriate preventive actions. Systems like these can contain thousands of different data feeds, data flows and processes, and a problem with just one of them can interrupt payments, ads, or status updates. Often there are hourly, daily, weekly and seasonal variations in the data that complicate the detection of problems.

One way to gain some insight into this problem is to look at the origins of statistical quality control in the 1920s. Walter Andrew Shewhart (1891 – 1967) was an engineer at the Western Electric Company, which manufactured hardware for the Bell Telephone Company, from 1918 to 1924. From 1925 to 1956 he was a member of the Technical Staff of the Bell Telephone Company [ASQ].

One of the problems that concerned him was identifying potential problems in factory assembly lines. For example, the dimensions and weight of metal parts sampled from an assembly line can be recorded. He distinguished between two types of variations in these measurements:

A common cause of variation (or noise) occurs as a normal part of the manufacturing process.

A special cause of variation is not part of the normal manufacturing process, but represents a problem.

One of the goals of statistical quality control is to distinguish between these two types of variation and to quickly identify special causes of variation.

Shewhart introduced control charts as a tool for distinguishing between common and special causes of variation. A control chart has a central line and upper and lower control limits. When a measurement exceeds either the upper or lower control limit, it is considered a potential special cause of variation and investigated. Usually, the upper and lower control limits are three standard deviations above and below the mean.
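To make the idea concrete, here is a small sketch in R. The measurements and the final out-of-range point are invented for illustration; in practice the limits would be computed from a known in-control baseline.

```r
# A sketch of Shewhart-style control limits (illustrative data only)
measurements <- c(rep(c(9.9, 10.0, 10.1), 10), 10.8)  # last point simulates a problem

baseline <- measurements[1:30]         # in-control measurements
center   <- mean(baseline)             # central line
ucl      <- center + 3 * sd(baseline)  # upper control limit
lcl      <- center - 3 * sd(baseline)  # lower control limit

# Points outside the limits are flagged as potential special causes
which(measurements > ucl | measurements < lcl)
```

Only the final, shifted measurement falls outside the three-sigma band; the routine noise in the baseline does not trigger an investigation.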

As anyone who has investigated potential data quality problems knows, identifying the root causes of potential problems is not easy. Shewhart also introduced a four-step approach to these types of investigations that became known as the Shewhart Cycle, the Deming Cycle, or the Plan-Do-Check-Act Cycle:

Plan. Identify an opportunity or potential problem and make a plan for improving it or changing it.

Do. Implement the change on a small scale and collect the appropriate data.

Check. Use data to statistically analyze the results of the change and determine whether it made a difference.

Act. If the change was successful, implement it on a wider scale and continuously monitor and improve your results. If the change did not work, begin the cycle again.

These same ideas are still the basis for health and status monitoring systems today. Well designed digital systems are built from the ground up so that appropriate log data is produced. Instead of a single assembly line producing physical items, there are thousands or millions of digital processes producing (nearly) continuous digital data. Often this data is available through an HTTP interface and is continually collected.

This is a dashboard from the open source Augustus system for health and status monitoring.

Instead of a control chart, a change detection model is used, such as a CUSUM or GLR statistical model [Poor]. Instead of building a single model, a model is built for each cell in a multi-dimensional cube of models [Bugajski]. Instead of looking at charts each day, an online dashboard is used at the hub of an operations center.
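As a rough sketch of the idea, here is a textbook one-sided CUSUM in R (a minimal illustration, not the Augustus or [Poor] implementation; the data, reference value k and threshold h are invented):

```r
# One-sided CUSUM: S_t = max(0, S_{t-1} + (x_t - target) - k),
# with an alarm raised whenever S_t exceeds the threshold h.
cusum <- function(x, target, k, h) {
  s <- 0
  alarms <- integer(0)
  for (t in seq_along(x)) {
    s <- max(0, s + (x[t] - target) - k)
    if (s > h) alarms <- c(alarms, t)
  }
  alarms
}

x <- c(rep(10, 20), rep(10.5, 10))   # a small upward shift starting at t = 21
cusum(x, target = 10, k = 0.25, h = 1)
```

Because the CUSUM accumulates small deviations, it detects the sustained 0.5 shift a few observations after it begins, even though no single point would cross a three-sigma control limit.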

Baseline and change detection models for each cell in a multi-dimensional data cube can be built easily using the open source Augustus system.

Predictive models have the power to create new markets, a power that is not all that common in technology. This is the first of several posts that contain case studies describing how companies have used predictive analytics to create new markets.

Motorcycle Insurance. For many years, it was difficult to get insurance if you drove a motorcycle. Motorcycle drivers have more accidents than other drivers, and the simplest course of action was simply not to insure them. For most drivers, knowing a few facts about them, such as their gender, age, and the number of miles driven to work, was enough information for an insurance company to set premiums for an insurance policy.

This segment of the automobile insurance market was called the “standard segment” and included 80%-90% of the market. The other segment, called the “nonstandard segment”, included drivers with accidents, drivers of motorcycles and high-performance cars, older drivers, and younger drivers. Most insurance companies in the 1950s – 1980s focused on the standard segment, which was quite competitive during this period.

Progressive Insurance took a different tack in the 1970s. It developed an analytic model that could quantify the risk of someone who drove a motorcycle and then priced the policy accordingly. Motorcycle insurance was part of the nonstandard segment, where there was much less competition. This segment also had a higher barrier to entry, since pricing premiums (well) in this market required (simple) analytic models.

By developing an appropriate risk model, Progressive was able to create a new market, which became an important driver of its growth in the 1970s. From 1975 to 1978, premium income grew from $38 million to $112 million as Progressive solidified its leadership in the nonstandard market.

Online Text Ads. Google introduced its online text ads (to the right of search results) in January 2000. The ads were sold on a cost per thousand impressions (CPM) basis by a sales representative. This was the way most ads were sold at the time, although banner ads (not text ads) were the dominant form of online advertising. These ads didn’t generate much money at the beginning.

In the spring of 2000, the online banner ad market crashed. In response, Google changed its business model to a self-serve model: ads were no longer sold by a sales representative but instead through an online, self-serve web page. Google got this idea from GoTo.com (which later became Overture, which was later bought by Yahoo!).

In October 2000, Google introduced AdWords with the slogan: “Have a credit card and 5 minutes? Get your ad on Google today.” Ads were still priced by CPM, but there was no longer a sales representative. (By the way, Amazon used the same model in August 2006 when it introduced EC2. With a credit card and 5 minutes you could use an online computer and pay by the hour.)

In 2001, Google’s AdWords revenue approached $85 million, much less than Overture’s $288 million. In contrast to Google’s CPM model, Overture used an auction model: the higher you bid for an ad, the more likely your ad would appear.

A problem with the auction model as employed by Overture was that high bids could force an ad to the top, but no one necessarily clicked on it unless it was relevant. Relevance, using information extracted from text, was something Google understood well.

Google built a predictive model (a response model) to predict whether a given user would click on a given ad. Google then integrated this response model with the rankings provided by the auction. Ads with higher expected responses would be moved up higher in the rankings, and those with lower expected responses would be moved lower in the rankings. Ads with the highest rankings would then be displayed. This new model that integrated an online auction with a response model was introduced into AdWords in February 2002.
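A toy sketch of this integration in R (the numbers and the simple bid-times-predicted-response ranking are my own illustration, not Google’s actual formula):

```r
# Rank ads by bid times predicted click-through rate, rather than bid alone.
ads <- data.frame(ad  = c("A", "B", "C"),
                  bid = c(1.00, 0.60, 0.80),   # auction bid per click
                  ctr = c(0.01, 0.05, 0.02),   # response model's predicted CTR
                  stringsAsFactors = FALSE)

ads$expected <- ads$bid * ads$ctr              # expected value per impression
ads$ad[order(-ads$expected)]                   # display order
```

Here the highest bidder, A, is ranked last because its predicted response is so low; the relevant, frequently clicked ad B rises to the top despite its smaller bid.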

From a modeling perspective, Google introduced two disruptive technologies: 1) PageRank; and 2) integrating relevance, through a response model, into pay-per-click auctions for online ads.

With this integrated model, Google created a new market (online text ads) that it has dominated since 2002.

Source: John Battelle, The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture.

Health and Status Monitoring. Over the past several years, Open Data has developed predictive analytic models to monitor the operations of complex systems, such as data centers, network operations centers, world wide payment systems, etc. I’ll describe these types of models in a future post.

For more than ten years, I have been teaching R both formally and informally. One thing that I find often trips up students is the use of R’s accessors and mutators. (For those readers not from a formal computer science background, an accessor is a method for accessing data in an object, usually an attribute of that object.) A simple example is taking a subset of a vector:

letters[1:3]
[1] "a" "b" "c"

As you can see, the result is a character vector containing the first three elements of the letters vector.

Good programming languages have a standard pattern for accessors and mutators. R has three: [, [[, and $. This confuses beginners coming from other programming languages; Java and Python have one: ‘.’. Why does R need three?

The reason derives from R’s data-centric view of the world. R natively provides vectors, lists, data frames, matrices, etc. In truth, one can get by using only [ to extract information from these structures, but the others are handy in certain scenarios, so much so that after a while they feel indispensable. I will explain each, and hopefully by the end of this article you will understand why each exists, what to remember and, more importantly, when each should be used.

Subset with [

When you want a subset of an object, use [. Remember that when you take a subset of an object you get the same type of thing. Thus, the subset of a vector will be a vector, the subset of a list will be a list and the subset of a data.frame will be a data.frame.

There is one inconsistency, however. The default in R is to reduce the results to the lowest dimension, so if your subset contains only one result, you will get that one item, which may be something of a different type. Thus, taking a subset of the iris data frame with only one column

class( iris[ , "Petal.Length" ] )
[1] "numeric"

returns a numeric vector and not a data frame. You can override this behavior with the little-publicized drop parameter, which indicates that the result should not be reduced. Taking the subset of iris with drop = FALSE

iris[ , "Petal.Length", drop=FALSE ]

is a proper data frame.

Things to Remember:

Most often, a subset is the same type as the original object.

Both indices and names can be used to extract the subset. (In order to use names, the object must have a name-type attribute such as names, rownames, colnames, etc.)

You can use negative integers to indicate exclusion.

Unquoted variables are interpolated within the brackets.
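These points can be seen in a few short examples, using the built-in letters and iris objects:

```r
letters[c(1, 3)]                          # subset by index: "a" "c"
letters[-(4:26)]                          # negative indices exclude: "a" "b" "c"
iris[1:2, c("Sepal.Length", "Species")]   # rows by index, columns by name

i <- 5
letters[i]                                # an unquoted variable is evaluated: "e"
```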

Extract one item with [[

The double square brackets are used to extract one element from potentially many. Vectors yield a vector with a single value; data frames give a column vector; lists give one element:

letters[[3]]
iris[["Petal.Length"]]

The mnemonic here is that the double square brackets look as if you are reaching for something deep within a container. You are not taking a slice but reaching in to get at the one thing at its core.

Three important things to remember:

You can return only one item.

The result is not (necessarily) the same type of object as the container.

The dimension will be the dimension of the one item which is not necessarily 1.

And, as before:

Names or indices can both be used.

Variables are interpolated.
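Briefly, using iris:

```r
class(iris["Petal.Length"])     # "data.frame": [ returns the container type
class(iris[["Petal.Length"]])   # "numeric":    [[ returns the element itself

nm <- "Petal.Length"
length(iris[[nm]])              # 150: names and variables both work with [[
```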

Interact with $

Interestingly enough, the accessor that provides the least unique utility is probably the most often used. $ is a special case of [[ in which you access a single item by actual name. The following are equivalent:

iris$Petal.Length
iris[["Petal.Length"]]

The appeal of this accessor is nothing more than brevity. One character, $, replaces six, [[“”]]. This accessor is handiest when doing interactive programming, but should be discouraged for more production-oriented code because of its limitations, namely the inability to interpolate names or use integer indices.

Things to Remember:

You cannot use integer indices.

The name will not be interpolated.

Returns only one item.

If the name contains special characters, the name must be enclosed in backticks: `
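For example, with a hypothetical data frame whose column name contains a space:

```r
# check.names = FALSE preserves the non-syntactic column name
df <- data.frame(`petal length` = c(1.4, 1.3), check.names = FALSE)
df$`petal length`    # backticks are required because of the space
```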

That is really all there is to it: [ for subsets, [[ for extracting items, and $ for extracting by name.

R-2.9.2 was released in August. While R can be considered stable and battle-ready, it is also far from stagnant. It is humbling to see such an intelligent and vibrant community helping CRAN grow faster than ever. Every day a new package or a new comment on R-Help gives me pause to think.

As much as I like R, on occasion I will find myself lost in some dark corner. Sometimes I find light. Sometimes I am left gnashing teeth and wringing hands. Frustrated. In a recent foray, I found myself trying to do something that I thought exceedingly trivial: expanding character and factor vectors into dummy variables. There must be some function, but what? Trying ?dummy didn’t turn up anything. Surely someone else must have encountered this and provided a package. I went to the Internet, and sure enough the R-wiki was there to save me. Looking even harder, I found some who had trodden before me in the R-Help archives. It turns out, it’s simple. Expanding a variable as a dummy variable can be done like so:
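In essence (using a toy factor xf):

```r
# Expand a factor xf into dummy (indicator) variables:
xf <- factor(c("a", "b", "b", "c"))
model.matrix(~ xf - 1)   # the -1 drops the intercept, keeping a column per level
```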

Two problems. The first is that without an external source (Google), I would never have stumbled upon what I wanted. (Thanks, Google!) I understand it now, but for what I wanted to do, I would never have thought, “oh, model.matrix.”

The second problem is the arcane syntax, wtf <- ~ xf - 1. I get it now, but it took me some time to figure out what was going on. Why not just dummy(var)? That is what I want to do.

The solution on the wiki wasn’t quite what I was looking for. For instance, you can’t say:

model.matrix( ~ xf1 + xf2 + xf3 - 1 )

It turns out you can only expand one variable at a time. Well, this is not good. I know that you could solve this with some sapply’s and some tests, but next time I might forget how to do it. So with a couple of spare hours, I decided that the next guy wouldn’t have to think about it. He could just use my dummies package.
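For the record, the workaround alluded to above can be sketched in base R (the data frame and column names here are invented for illustration): expand each factor separately, then bind the pieces together.

```r
df <- data.frame(xf1 = factor(c("a", "b", "a")),
                 xf2 = factor(c("u", "u", "v")))

# Expand each column on its own with model.matrix, then cbind the results
expanded <- do.call(cbind, lapply(df, function(x) model.matrix(~ x - 1)))
dim(expanded)   # 3 rows; one column per level of each factor
```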

Like the R-wiki solution, the dummies package provides a nice interface for encoding a single variable. You can pass a variable -or- a variable name with a data frame. These are equivalent:

dummy( df$var )
dummy( "var", df )

Moreover, you can choose the style of the dummy names, whether to include unused factor levels, whether to have verbose output, etc.

But beyond the R-wiki solution, dummy.data.frame offers something similar for data.frames. You can specify which columns to expand, by name or class, and whether to return the non-expanded columns.

The package dummies-1.04 is available on CRAN. Comments and questions are always appreciated.

Saying that MapReduce and Sector’s implementation of User Defined Functions (UDFs) over a storage cloud are innovative is only partly correct. The programming models they implement are quite old, as any programmer versed in functional languages will recognize.

But MapReduce does come with two important innovations. The first is a framework specifically designed for large clusters of low-priced, commodity servers. What MapReduce has done is take concurrent programming models and apply them to the economic realities of the day. Large and formerly expensive computations can thus be accomplished cheaply when distributed to inexpensive machines. The complexity of managing individual machines and tasks is masked from the coder, who does not need to worry about which tasks get run on which machine.

The second innovation is the recognition that a large class of practical problems (but not all) can be solved using the MapReduce framework. Because the first innovation allowed solutions to problems that were intractable with conventional techniques, technologists began framing problems to run with MapReduce. They had a hammer; everything began looking like a nail. Fortunately, there were a lot of nails.

As mentioned above, the algorithmic pattern itself is not new. It is actually decades old, a throwback to the early days of functional programming (think Lisp!) and big mainframes. The method was rediscovered, applied over a distributed virtual filesystem, applied to Google’s toughest problems, renamed MapReduce, and the rest is history.

The MapReduce algorithm provides a framework for dividing a problem and working on it in parallel. There are two steps: a map step and a reduce step. Although the two steps must proceed serially (map must precede reduce), each step can be accomplished in parallel. In the map step, data is mapped to key-value pairs. In the reduce step, the values that share the same key are transformed (‘reduced’) by some algorithm. More complexity can be added; other functions can be used; arbitrary UDFs can be supported, as in Sector. But in essence, the algorithm is a series of function calls.

The pattern is fairly common, and most programmers have used the MapReduce pattern without knowing it, thinking about it, or calling it MapReduce. In fact, much of SAS is set up in a MapReduce style. SAS programs are composed of DATA steps and PROCEDURE steps. In certain problems, the DATA step can be a mapper and either a DATA or PROCEDURE step can function as a reducer. If you disregard the distribution of the problem across servers, I’d venture to say that every SAS programmer has followed this paradigm, often numerous times in the course of a single program. This simplicity allowed for the application to a wide range of problems, the second innovation.

The same can be said for our favorite statistical programming language, R. In fact, because R is a vectorized, functional language, mapReduce boils down to a single line of code:

apply( map(data), reduce )

where map and reduce are the user-defined functions for the mapper and reducer, respectively, and apply distributes the problem in parallel. Any R programmer who was taking advantage of R’s vectorization was probably writing mapReduce programs from day one. Most often, the jobs were vectorized on individual, single-core machines.
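The one-liner above can be made concrete with ordinary base R functions: split the data on the mapped key, then apply the reducer to each part.

```r
data   <- iris$Petal.Length
key    <- iris$Species    # the "map": the key each value belongs to
reduce <- mean            # the "reduce": collapse each group to one value

sapply(split(data, key), reduce)   # one mean petal length per Species
```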

Coupled with R packages such as Rmpi, rpvm and nws, the apply-map-reduce pattern can be distributed to several machines. And more recently, the multicore package has made implementation on multicore machines even easier.

We recognized this several years ago, wrote some simple code and have been distributing work across available servers for some time. More recently, we released our work as an open source package on CRAN for implementing this pattern. Our implementation closely follows Google’s MapReduce paper, is written in pure R and is agnostic to the parallelization backend, whether rpvm, Rmpi, nws, multicore, or others. (Revolution Computing recognized this as a good idea and adopted the same approach with their ParallelR package.)

Using mapReduce is exceedingly simple. The package provides a single function, mapReduce, defined as:

Usage

mapReduce( map, ..., data, apply = sapply )

Arguments

map      An expression to be evaluated on data, yielding a vector that is
         subsequently used to split the data into parts that can be operated
         on independently.
...      The reduce step(s): one or more expressions that are evaluated for
         each of the partitions made by the map step.
data     An R data structure such as a matrix, list or data.frame.
apply    The function used for parallelization.

Take the iris dataset, data(iris). Using mapReduce, we can quickly compute the mean and max petal lengths like so:
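The call would presumably look like the commented sketch below (my reconstruction from the usage section above, so treat the exact syntax as approximate); an equivalent base R computation is shown for comparison.

```r
# Presumably, with the mapReduce package from CRAN loaded:
#   mapReduce( Species,
#              mean.petal.length = mean(Petal.Length),
#              max.petal.length  = max(Petal.Length),
#              data = iris )

# The same result computed with base R, for comparison:
cbind(mean.petal.length = tapply(iris$Petal.Length, iris$Species, mean),
      max.petal.length  = tapply(iris$Petal.Length, iris$Species, max))
```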

The mean and max petal lengths are computed for each Species and returned as a matrix with two columns, mean.petal.length and max.petal.length, and one row for each Species.

Because we have used expressions in our implementation, you can use almost any R function for the map and reduce steps. (Most work; there are a few edge-case exceptions.) For example, suppose we wanted to do the above calculation but with versicolor and virginica lumped together.
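Sketched once more in base R (with mapReduce itself, the lumping would happen in the map expression):

```r
# Lump versicolor and virginica into a single key before reducing
key <- ifelse(iris$Species == "setosa", "setosa", "versicolor+virginica")
tapply(iris$Petal.Length, key, mean)
```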

Perl has hashes. Python has dictionaries. Why doesn’t R have an equivalent? Hash tables and associative arrays are indispensable tools for the programmer. One of the most common and basic tasks of a programmer is to “look up” or “map” a key to a value. In fact, there are projects whose sole raison d’être is making the hash as fast and as efficient as possible.

R actually has two equivalents, both lacking. The first is R’s named vectors and lists. Elements of vectors and lists can be accessed by name through the standard R methods:

obj$name
obj['name']
obj[['name']]

Vectors are not stored using internal hash tables, and as they grow large, performance can suffer. The performance impact is tangible even on small lists. For programs doing many look-ups, or look-ups on many objects, this can create a bottleneck.

R’s environments are much closer to Perl’s hashes and Python’s dictionaries. An environment is internally a hash table, and look-ups do not appreciably degrade with object size. To use an R environment, you create it and assign key-value pairs to it.
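For example:

```r
e <- new.env(hash = TRUE)       # create the environment
assign("alpha", 1, envir = e)   # set a key-value pair
e$beta <- 2                     # $ works for assignment too

get("alpha", envir = e)         # look up a value: 1
ls(e)                           # list the keys: "alpha" "beta"
```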

Usability. In designing the S language, John Chambers put much thought into how the analyst and statistician interact with data. All variables are designed to be vectors, and a standard set of accessors ($, [, [[) was defined to retrieve and set slices, subsets or elements of the data. The problem is that R environments don’t follow this pattern. And this is where the hash package comes in.

The hash package is designed to provide an R syntax for R’s environments and give programmers a proper hash. The package provides one constructor function, hash, that will take a variety of arguments, always doing the right thing. All of the following work:
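The constructor forms below are quoted from memory of the hash package documentation, so verify against ?hash after installing the package from CRAN:

```r
library(hash)

h1 <- hash()                                  # an empty hash
h2 <- hash(a = 1, b = 2)                      # from key-value pairs
h3 <- hash(keys = c("a", "b"), values = 1:2)  # explicit keys and values

h2$a                                          # look-ups use the familiar accessors
```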

Michael Driscoll recently wrote a nice article entitled The Three Sexy Skills of Data Geeks on the Dataspora blog. He lists these as statistics, data munging and “story-telling”. (A commenter adds a fourth: decision-making.) I was excited to see that Driscoll included “story-telling”. I have long felt that being able to relate a narrative around data and inference is an important talent. Driscoll, however, pulls his punch: he does not mean literal story-telling, only visualization.

I can’t argue that creating striking visual representations is unimportant. But the more general skill of story-telling is also important; visualization is only part of presenting the data. I often meet talented analysts who have nailed Driscoll’s three sexy skills. Invariably, they have one or more advanced degrees and a ton of experience working with data. They can implement algorithms, parse data, and spin graphs with no problem. It is far rarer to meet someone who can tell the narrative, and capture the interest and excitement of the story whispered by the data. The one who can is the complete geek.

There are many facets to relating a good story. You must sense the arc of the drama, feel the tension, contemplate historical context, promote relevance and emote nuance. Narrative story-telling is not mechanical; it demands creativity, patience, effort, confidence and command of the language. It requires keen observation and knowing when an anecdote strengthens or dilutes a conclusion. People are wired to respond to stories. Telling them well serves you in any field. This one especially.

* * * * *

I sometimes find myself in the enviable position of looking for the next super-geek, whether for Open Data or one of our large clients. When asked to give input on the process, I always stipulate one requirement: candidates must submit two writing samples, one technical and one not. I can usually gauge from a resume or a quick conversation whether they have the first three sexy skills. Figuring out whether they grasp the narrative is much harder.

Invariably, I read the non-technical writing sample first. If the candidate has a strong command of the language, can relate technical (often boring) details in a compelling way, and shows an interest in subjects outside the scope of his or her work, I know there is at least a latent potential to be that complete geek we all want on our staff or as a colleague.


Courses

We'll be giving courses and tutorials in cloud computing and analytics in the Fall. If you are interested in attending a course before then, please contact us at Open Data to arrange in-house training.