Data Scientist at Yahoo!

Assume that a gambler has the possibility to bet a fraction of his capital on the outcome of a specific event. The Kelly criterion, first presented in [1] and summarized below, finds the fraction $f$ that maximizes the exponential rate of growth of the gambler’s capital under different scenarios, which is equivalent to maximizing, period by period, the expected log utility based on the current capital.

Why this choice of optimization makes sense was formally discussed in [2] and might be the subject of a future post. Intuitively, it makes sense to use this criterion if you bet regularly and reinvest your profits.

Exponential rate of growth

Let’s define a quantity $G$ called the exponential rate of growth of the gambler’s capital, where

$$G = \lim_{N \to \infty} \frac{1}{N} \log_2 \frac{V_N}{V_0} \qquad (1)$$

and $V_N$ is the gambler’s capital after $N$ bets, $V_0$ is his starting capital, and the logarithm is to the base two. $G$ is the quantity we want to maximize.

Perfect knowledge

In the case of perfect knowledge, the gambler would know the outcome of the event before anyone else and would be able to bet his entire capital at each bet. Then, assuming even-money bets, $V_N = 2^N V_0$ and $G = 1$.

Binary events

Consider now a binary event where the gambler has a probability $p$ of success and a probability $q = 1 - p$ of failure. In this case the gambler would go broke for large $N$ with probability one if he bet all his capital in each bet, even though the expected value of his capital after $N$ bets is given by

$$E[V_N] = (2p)^N V_0$$

Because of that, let us assume that the gambler will bet a fraction $f$ of his capital each time. Then

$$V_N = (1 + f)^W (1 - f)^L V_0$$

where $W$ and $L$ are the number of wins and losses after $N$ bets. Following the definition given in Eq. (1), it can be shown that

$$G = p \log_2(1 + f) + q \log_2(1 - f)$$

which, for $p > q$, is maximized at $f^* = p - q$. If the payoff is $B$ for a win and $-1$ for a loss, then the edge is $Bp - q$, the odds are $B$, and

$$f^* = \frac{Bp - q}{B} = \frac{\text{edge}}{\text{odds}}$$
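As a quick numerical check of the binary case, the sketch below maximizes the growth rate over the betting fraction and compares the result with the edge/odds formula. It assumes the growth rate has the form $G(f) = p \log_2(1 + Bf) + q \log_2(1 - f)$; the function name `growth_rate` and the values of `p` and `B` are illustrative choices of mine, not taken from [1].

```r
# Exponential growth rate (base 2) when betting a fraction f of the capital,
# winning B per unit staked with probability p and losing the stake otherwise.
growth_rate <- function(f, p, B = 1) {
  q <- 1 - p
  p * log2(1 + B * f) + q * log2(1 - f)
}

p <- 0.6  # assumed probability of success
B <- 1    # assumed payoff per unit staked (even money)

# Numerical maximization over 0 <= f < 1
opt <- optimize(growth_rate, interval = c(0, 0.999), p = p, B = B,
                maximum = TRUE)

f_numeric <- opt$maximum
f_formula <- (B * p - (1 - p)) / B  # edge / odds

print(c(numeric = f_numeric, formula = f_formula))
```

Both numbers should agree up to the tolerance of the optimizer, and changing `p` or `B` shows how quickly the optimal fraction shrinks as the edge gets smaller.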

Multiple outcome events

Let’s now consider the case where the event has more than two possible outcomes, not necessarily equally likely.

– Fair odds and no “track take”

Let’s first consider the case of fair odds and no “track take”, that is

$$\alpha_s = \frac{1}{p_s}$$

where $\alpha_s$ denotes the odds offered on outcome $s$ and $p_s$ is the probability of observing the outcome $s$ in a given event, as estimated by the entity offering the odds.

Consider $a_s$ to be the fraction of the gambler’s capital that he decides to bet on outcome $s$, based on his belief of the probability of observing the outcome $s$ in a given event. The gambler’s estimated probability for an outcome $s$ will be denoted by $q_s$.

Since there is no “track take”, there is no loss in generality in assuming that

$$\sum_s a_s = 1$$

That is, the gambler bets his total capital divided among the possible outcomes.

It should be noted that if $q_s \alpha_s < 1$ for all $s$ no bets are placed. But if the largest $q_s \alpha_s > 1$ some bets might be made for which $q_s \alpha_s < 1$, i.e. the expected gain is negative. This violates the criterion of the classical gambler who never bets on such events.

Besides being an amazing interactive tool for data analysis, R can also execute commands as scripts. This is useful, for example, when we work on large projects where different parts of the project need to be implemented in different languages that are later glued together to form the final product.

In addition, it is extremely useful to be able to take advantage of shell pipelines, in which the output of one program becomes the input of the next, in the spirit of the Unix philosophy:

Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface. — Doug McIlroy

Basic R script

A basic template for an R script is given by

#! /usr/bin/env Rscript
# R commands here

To start with a simple example, create a file myscript.R and include the following code in it:

#! /usr/bin/env Rscript
x <- 5
print(x)

Now go to your terminal and type chmod +x myscript.R to give the file execution permission. Then, execute your first script by typing ./myscript.R on the terminal. You should see

[1] 5

displayed on your terminal, since the result is directed to stdout by default. We could, of course, have written the value of x to a file instead. To do this, just replace the print(x) statement with a writing command such as write() or write.table().
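As one possible illustration of writing to a file, the sketch below uses base R's write(). The variable name `out_file` is my own choice, and the file is placed in the session's temporary directory only so that the sketch runs anywhere.

```r
#! /usr/bin/env Rscript
x <- 5

# Hypothetical output location; replace with any path you like
out_file <- file.path(tempdir(), "output_file.txt")

# Write the value of x to the file instead of printing it to stdout
write(x, file = out_file)
```

Reading the file back with readLines(out_file) should return the value of x as text.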

There are different ways to process command-line arguments in R scripts. My favorite so far is to use the getopt package from Allen Day and Trevor L. Davis. Type

require(devtools)
devtools::install_github("getopt", "trevorld")

in an R environment to install it on your machine. To use getopt in your R script you need to specify a four-column matrix with information about the command-line arguments that you want to allow users to specify. Each row in this matrix represents one command-line option. For example, the following script allows the user to specify the output variable using the short flag -x or the long flag --xValue.
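A minimal sketch of a script along these lines is given below, assuming the getopt package is installed. The helper `parse_x` and the fallback value of 5 are illustrative choices of mine; wrapping the call in a function simply makes it easy to try different argument vectors.

```r
#! /usr/bin/env Rscript
library(getopt)

# One row per option: long flag, short flag, argument mask, data type
spec <- matrix(c("xValue", "x", 1, "double"),
               byrow = TRUE, ncol = 4)

# Hypothetical helper: parse a vector of command-line arguments and
# return x, falling back to 5 when the flag is absent
parse_x <- function(args = commandArgs(trailingOnly = TRUE)) {
  opt <- getopt(spec, opt = args)
  if (is.null(opt$xValue)) 5 else opt$xValue
}

x <- parse_x()
print(x)
```

Calling parse_x(c("-x", "7")) from an interactive session mimics running the script with -x 7 on the terminal.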

As you can see above, the spec matrix has four columns. The first defines the long flag name xValue; the second defines the short flag name x; the third defines the type of argument that should follow the flag (0 = no argument, 1 = required argument, 2 = optional argument); the fourth defines the data type to which the flag argument shall be cast (logical, integer, double, complex, character). There is also an optional fifth column (not used here) that allows you to add a brief description of the purpose of the option. Now our myscript.R accepts command-line arguments:

./myscript.R
[1] 5
./myscript.R -x 7
[1] 7
./myscript.R --xValue 9
[1] 9

Verbose mode and stderr

We can also create a verbose flag and direct all verbose comments to stderr instead of stdout, so that we don’t mix the output of the script with informative messages from the verbose option. Following is an illustration of a verbose flag implementation.
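One way to sketch this with base R only is shown below. The helper name `say` is hypothetical, and `verbose` is hard-coded here to keep the example self-contained; in a real script it would come from a command-line flag.

```r
# Hypothetical helper: send a message to `con` only when verbose is TRUE.
# `con` defaults to stderr() so messages never mix with the script's output.
say <- function(msg, verbose, con = stderr()) {
  if (verbose) write(msg, con)
  invisible(verbose)
}

verbose <- TRUE  # assumed here; normally set from a command-line flag
x <- 5

say("Printing the value of x to stdout...", verbose)
print(x)  # the actual output goes to stdout
```

Keeping the connection as an argument makes it easy to send the same messages to a log file instead of stderr.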

The main difference between directing verbose messages to stderr instead of stdout appears when we redirect the output to a file. When stdout is redirected to output_file.txt, the verbose message still appears on the terminal and the value of x goes to output_file.txt, as desired.

To take full advantage of the pipeline capabilities that I mentioned at the beginning of this post, it is useful to accept input from stdin. For example, a template of a script that reads one line at a time from stdin could be
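One way such a template might look is sketched below. The helper name `process_lines` and the upper-casing step are illustrative assumptions, and a text connection stands in for stdin so that the sketch runs on its own.

```r
# Hypothetical helper: read one line at a time from an open connection and
# apply `handle` to each line until the input is exhausted.
process_lines <- function(con,
                          handle = function(line) cat(toupper(line), "\n", sep = "")) {
  n <- 0
  while (length(line <- readLines(con, n = 1)) > 0) {
    handle(line)
    n <- n + 1
  }
  invisible(n)  # number of lines processed
}

# In a script run from the terminal, the connection would be set up as:
#   input_con <- file("stdin")
#   open(input_con)
#   process_lines(input_con)
#   close(input_con)

# Stand-in for stdin so the sketch is self-contained
demo_con <- textConnection(c("first line", "second line"))
process_lines(demo_con)
close(demo_con)
```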

Note that when we are running our R scripts from the terminal we are in a non-interactive mode, which means that

input_con <- stdin()

would not work as expected on the template above. As described on the help page for stdin():

stdin() refers to the ‘console’ and not to the C-level ‘stdin’ of the process. The distinction matters in GUI consoles (which may not have an active ‘stdin’, and if they do it may not be connected to console input), and also in embedded applications. If you want access to the C-level file stream ‘stdin’, use file("stdin").

And that is the reason I used

input_con <- file("stdin")
open(input_con)

instead. Naturally, we could allow the data to be inputted from stdin by default while making a flag available in case the user wants to provide a file path containing the data to be read. Below is a template for this:
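A minimal sketch of this idea follows. The helper `open_input` is hypothetical, and the file path, which would normally come from a command-line flag, is replaced here by a temporary demo file so that the example runs without any input on stdin.

```r
# Hypothetical helper: return an open connection to `path`, or to the
# process-level stdin when no path is given.
open_input <- function(path = NULL) {
  con <- if (is.null(path)) file("stdin") else file(path)
  open(con, "r")
  con
}

# Demo file standing in for a path supplied by the user
demo_path <- tempfile(fileext = ".txt")
writeLines(c("1", "2", "3"), demo_path)

input_con <- open_input(demo_path)
while (length(line <- readLines(input_con, n = 1)) > 0) {
  print(as.numeric(line) * 2)  # placeholder for the real processing step
}
close(input_con)
```

Calling open_input() without arguments from a script run on the terminal would read from stdin instead.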

In a previous post, I mentioned what is called the separation problem [1]. It can happen, for example, in a logistic regression, when a predictor (or combination of predictors) perfectly predicts (separates) the data, leading to infinite maximum likelihood estimates (MLE) due to a flat likelihood.

I also mentioned that one (possibly) naive solution to the problem could be to blindly exclude the predictors responsible for the problem. Other more elegant solutions include a penalized likelihood approach [1] and the use of weakly informative priors [2]. In this post, I would like to discuss the latter.

Model setup

Our model of interest here is a simple logistic regression

$$y_i \sim \text{Bernoulli}(p_i), \qquad \text{logit}(p_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_J x_{iJ}$$

and since we are talking about Bayesian statistics the only thing left to complete our model specification is to assign prior distributions to the $\beta_j$’s. If you are not used to the above notation take a look here to see logistic regression from a more (non-Bayesian) Machine Learning oriented viewpoint.

Weakly informative priors

The idea proposed by Andrew Gelman and co-authors in [2] is to use minimal generic prior knowledge, enough to regularize the extreme inferences that are obtained from maximum likelihood estimation. More specifically, they realized that we rarely encounter situations where a typical change in an input corresponds to the probability of the outcome changing from 0.01 to 0.99. Hence, we are willing to assign a prior distribution to the coefficient associated with that input that gives low probability to changes of 10 on the logistic scale.

After some experimentation they settled on a Cauchy prior with scale parameter equal to 2.5 (Figure above) for the coefficients $\beta_j$, $j = 1, \dots, J$. When combined with pre-processed inputs with standard deviation equal to 0.5, this implies that the absolute difference in logit probability should be less than 5 when moving from one standard deviation below the mean to one standard deviation above the mean in any input variable. A Cauchy prior with scale parameter equal to 10 was proposed for the intercept $\beta_0$. The difference is because a Cauchy with scale 2.5 for $\beta_0$ would restrict the probability of success to a fairly narrow range for units that are average for all inputs, and as a default prior this might be too strong an assumption. With scale equal to 10, a much wider range of baseline probabilities remains plausible in such a case.
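To get a feeling for what these priors imply, the sketch below computes a few prior probabilities with base R, assuming Cauchy priors centered at zero with scale 2.5 for the coefficients and scale 10 for the intercept. The variable names are my own.

```r
scale_coef <- 2.5       # assumed scale for the coefficients
scale_intercept <- 10   # assumed scale for the intercept

# Prior probability that a coefficient lies within (-5, 5)
p_within_5 <- pcauchy(5, scale = scale_coef) - pcauchy(-5, scale = scale_coef)

# Prior probability that a coefficient exceeds 10 in absolute value
p_beyond_10 <- 2 * pcauchy(-10, scale = scale_coef)

# Central 50% interval of the intercept prior on the logit scale
q_intercept <- qcauchy(c(0.25, 0.75), scale = scale_intercept)

print(round(c(within_5 = p_within_5, beyond_10 = p_beyond_10), 3))
print(q_intercept)
```

The heavy tails of the Cauchy are visible here: large coefficients are made unlikely, but they are not ruled out if the data strongly support them.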

There is also a nice (and important) discussion about the pre-processing of input variables in [2] that I will keep for a future post.

Conclusion

I am in favor of the idea behind weakly informative priors. If we have some sensible information about the problem at hand we should find a way to encode it in our models. And Bayesian statistics provides an ideal framework for such a task. In the particular case of the separation problem in logistic regression, it was able to avoid the infinite estimates obtained with MLE and give sensible solutions to a variety of problems just by adding sensible generic information relevant to logistic regression.

Datasets sometimes come with predictors that take a single value across samples. Such uninformative predictors are more common than you might think. This kind of predictor is not only non-informative, it can break some models you may want to fit to your data (see example below). Even more common is the presence of predictors that are almost constant across samples. One quick and dirty solution is to remove all predictors that satisfy some threshold criterion related to their variance.

Here I discuss this quick solution but point out that this might not be the best approach to use depending on your problem. That is, throwing data away should be avoided, if possible.

It would be nice to know how you deal with this problem.

Zero and near-zero predictors

Constant and almost constant predictors across samples (called zero and near-zero variance predictors in [1], respectively) occur quite often. One reason is that we usually break a categorical variable with many categories into several dummy variables. Hence, when one of the categories has zero observations, it becomes a dummy variable full of zeroes.

If we take a closer look at the predictors indicated as problematic by lda, we see what the problem is. Note that I have added +1 to the index since lda does not count the target variable when informing you where the problem is.

As we can see above no loan was taken to pay for a vacation and there is no single female in our dataset. A natural first choice is to remove predictors like those. And this is exactly what the function nearZeroVar from the caret package does. It not only removes predictors that have one unique value across samples (zero variance predictors), but also removes predictors that have both 1) few unique values relative to the number of samples and 2) large ratio of the frequency of the most common value to the frequency of the second most common value (near-zero variance predictors).

We can see above that if we call the nearZeroVar function with the argument saveMetrics = TRUE we have access to the frequency ratio and the percentage of unique values for each predictor, as well as flags that indicate whether the variables are considered zero variance or near-zero variance predictors. By default, a predictor is classified as near-zero variance if the percentage of unique values in the samples is less than 10% and the frequency ratio mentioned above is greater than 19 (95/5). These default values can be changed by setting the arguments uniqueCut and freqCut.
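To make the two metrics concrete, here is a base-R sketch that computes them for a single predictor. It only mimics the rule described above and is not the caret implementation, which may differ in details; the function and argument names are hypothetical.

```r
# Hypothetical helper mimicking the near-zero variance rule for one predictor
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(as.vector(table(x)), decreasing = TRUE)
  # Ratio of the most common value to the second most common value
  freq_ratio <- if (length(counts) > 1) counts[1] / counts[2] else 0
  percent_unique <- 100 * length(counts) / length(x)
  zero_var <- length(counts) == 1
  nzv <- zero_var || (freq_ratio > freq_cut && percent_unique < unique_cut)
  data.frame(freqRatio = freq_ratio, percentUnique = percent_unique,
             zeroVar = zero_var, nzv = nzv)
}

# Toy predictors with 1000 samples each
predictors <- list(
  constant        = rep(0, 1000),
  almost_constant = c(rep(0, 990), rep(1, 10)),
  balanced        = rep(c(0, 1), 500)
)

print(do.call(rbind, lapply(predictors, nzv_metrics)))
```

Only the balanced predictor survives the default thresholds, while the almost constant one is flagged even though its ten ones could carry useful information.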

Now, should we always remove our near-zero variance predictors? Well, I am not that comfortable with that.

Try not to throw your data away

Think for a moment: the solution above is easy and “solves the problem”, but we are assuming that all those predictors are non-informative, which is not necessarily true, especially for the near-zero variance ones. Those near-zero variance predictors can in fact turn out to be very informative.

For example, assume that a binary predictor in a classification problem has lots of zeroes and few ones (a near-zero variance predictor). Every time this predictor is equal to one we know exactly what the class of the target variable is, while a value of zero for this predictor can be associated with either of the classes. This is a valuable predictor that would be thrown away by the method above.

This is somewhat related to the separation problem that can happen in logistic regression, where a predictor (or combination of predictors) perfectly predicts (separates) the data. The common approach not long ago was to exclude those predictors from the analysis, but better solutions were discussed by [2], which proposed a penalized likelihood solution, and [3], which suggested the use of weakly informative priors for the regression coefficients of the logistic model.

Personally, I prefer to use a well-designed Bayesian model whenever possible, more like the solution provided by [3] for the separation problem mentioned above. One solution for near-zero variance predictors is to collect more data, and although this is not always possible, there are many applications where you know you will receive more data from time to time. It is then important to keep in mind that such a well-designed model would still give you sensible solutions while you don’t yet have enough data, but would naturally adapt as more data arrives for your application.

I am currently a research fellow and a Stats PhD candidate at NTNU under the supervision of Håvard Rue. I have spent four great years in the INLA group. No words can express how grateful I am to Håvard, who has been a good friend and an amazing supervisor (and a great chef I must say).

My PhD thesis will be a collection of six papers ranging from Bayesian computation to the design of sensible Bayesian models through an interesting framework to construct prior distributions. The thesis is almost done and I will at some point cover its main ideas on this blog. We are very happy with the work we have done in these four years. My PhD defense should happen around September this year.

As the saying goes, all good things must come to an end, and Friday is my last day at NTNU. However, I am very excited about my new job.

Starting on Monday (March 3rd), I will work as a Data Scientist at Yahoo! I will be located at the Trondheim (Norway) office, which is very fortunate for me and my wife, but will of course collaborate with the many Yahoo! Labs around the world. This is exactly the kind of job I was looking for: huge amounts of data to apply my data analysis skills to. I am sure I have a lot to contribute to, as well as learn from, Yahoo! I am looking forward to it.

My plan for this blog is to continue on the same path, posting about once a week on a subject that interests me, usually involving data analysis. My new job will probably affect my interests, and this will of course impact what I write on the blog. So, expect to see more stuff about Big Data and Text Analysis, although I will not restrict my interests to those subjects. Besides, it is always good to remember that this is my personal blog and there is no connection with any job I have at any moment, so opinions here are my own.

This post deals with the basics of character strings in R. My main reference has been Gaston Sanchez’s ebook [1], which is excellent and which you should read if you are interested in manipulating text in R. I got the section on encodings from [2], which is also a nice reference to have nearby. Text analysis will be one topic of interest to this blog, so expect more posts about it in the near future.

Creating character strings

The class of an object that holds character strings in R is “character”. A string in R can be created using single quotes or double quotes.

We can create an empty string with empty_str = "" or an empty character vector with empty_chr = character(0). Both have class “character” but the empty string has length equal to 1 while the empty character vector has length equal to zero.

The function character() will create a character vector with as many empty strings as we want. We can add new components to the character vector just by assigning a value to an index outside the current valid range. The index does not need to be consecutive, in which case R will fill the gap with NA elements.
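A small sketch of the behavior described so far, using only base R (the variable names are my own):

```r
empty_str <- ""            # an empty string
empty_chr <- character(0)  # an empty character vector

class(empty_str)   # "character"
class(empty_chr)   # "character"
length(empty_str)  # 1
length(empty_chr)  # 0

chr_vec <- character(3)  # three empty strings
chr_vec[5] <- "fifth"    # index 4 is skipped, so R fills it with NA
print(chr_vec)
```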

The functions as.character() and is.character() can be used to convert non-character objects into character strings and to test if an object is of type “character”, respectively.
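For instance, with a couple of illustrative values of my own choosing:

```r
num <- 3.14
as_chr <- as.character(num)   # convert a number to a string

is.character(num)             # FALSE, num is numeric
is.character(as_chr)          # TRUE

# Conversion is vectorized and works for other types too
lgl_chr <- as.character(c(TRUE, FALSE))
print(lgl_chr)
```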

Strings and data objects

R has five main types of objects to store data: vector, factor, multi-dimensional array, data.frame and list. It is interesting to know how these objects behave when exposed to different types of data (e.g. character, numeric, logical).

vector: Vectors must have their values all of the same mode. If we combine mixed types of data in vectors, strings will dominate.

arrays: A matrix, which is a 2-dimensional array, has the same behavior found in vectors.

data.frame: In versions of R before 4.0.0, a column that contains character strings is converted to a factor by default. To turn this behavior off we can use the argument stringsAsFactors = FALSE when constructing the data.frame object. Since R 4.0.0, stringsAsFactors = FALSE is the default.
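The coercion rules for these three objects can be checked with a short sketch. The object names are illustrative, and stringsAsFactors is set explicitly so that the result does not depend on the R version.

```r
# vector: mixing types makes everything a character string
mixed_vec <- c(1, "a", TRUE)
print(mixed_vec)

# matrix: same behavior as vectors
mixed_mat <- rbind(c(1, 2), c("a", "b"))
print(mixed_mat)

# data.frame: strings kept as character or converted to factor
df_chr <- data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)
df_fac <- data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = TRUE)

class(df_chr$name)  # "character"
class(df_fac$name)  # "factor"
```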

Modeling is one of the topics I will be writing a lot about on this blog. Because of that I thought it would be nice to introduce some datasets that I will use in the illustration of models and methods later on. In this post I describe the German credit data [1], very popular within the machine learning literature.

This dataset contains 1000 rows, where each row has information about the credit status of an individual, which can be good or bad. Besides, it has qualitative and quantitative information about the individuals. Examples of qualitative information are purpose of the loan and sex, while examples of quantitative information are duration of the loan and installment rate in percentage of disposable income.

This dataset has also been described and used in [2] and is available in R through the caret package.

require(caret)
data(GermanCredit)

The version above had all the categorical predictors converted to dummy variables (see for ex. Section 3.6 of [2]) and can be displayed using the str function:

For data exploration purposes, I also like to keep a dataset where the categorical predictors are stored as factors rather than converted to dummy variables. This sometimes facilitates the exploration, since it provides a grouping effect for the levels of the categorical variable. This grouping effect is lost when we convert them to dummy variables, especially when a non-full rank parametrization of the predictors is used.

The response (or target) variable here indicates the credit status of an individual and is stored in the column Class of the GermanCredit dataset as a factor with two levels, “Bad” and “Good”.

We can see above (code for Figure here) that the German credit data is a case of an unbalanced dataset, with 70% of the individuals being classified as having good credit. Therefore, the accuracy of a classification model should be superior to 70%, which would be the accuracy of a naive model that classifies every individual as having good credit.
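The baseline accuracy of such a naive model is simply the proportion of the majority class. The sketch below assumes class counts of 700 “Good” and 300 “Bad” typed in by hand; with caret loaded, the same counts could be obtained from table(GermanCredit$Class).

```r
# Assumed class counts for the 1000 individuals
class_counts <- c(Bad = 300, Good = 700)

# The naive model always predicts the most frequent class
baseline_class <- names(which.max(class_counts))
baseline_accuracy <- max(class_counts) / sum(class_counts)

print(baseline_class)
print(baseline_accuracy)
```

Any model fitted to this dataset should be compared against this number rather than against 50%.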

The nice thing about this dataset is that it has a lot of challenges faced by data scientists on a daily basis. For example, it is unbalanced, has predictors that are constant within groups and has collinearity among predictors. In order to fit some models to this dataset, like the LDA for example, we must deal with these challenges first. More on that later.