Prediction of random numbers. Part 2

In the first part of this post we considered a human-based binary random numbers generator and built a predictive model for it. The model appeared to work better than a coin toss, at least for a short sequence we had. This part develops more complex models with somewhat higher accuracy.

Our model had been based on sequence of 64 elements. It is a tiny piece of data. Also there was no separation between training and testing datasets, so we can't eliminate possible overfitting (i.e. good results only on data that the model has already seen).

Larger dataset

A good thing is that these issues can be easily resolved by gathering more data. You can fire up your favorite IDE (or text editor) and write a simple application that will assist you in data collection. If you are too lazy, just use my data. A bunch of zeroes and ones typed lead to 5000+ points (100 sequences of 50 elements on average). By the way, you can also find all the code from this post on GitHub.

All the means are now stored in mns variable. Also note that D now holds all datapoints concatenated together. We will use it later.

Mean of mns (i.e. mean of means) appears to be mean(mns) = 0.5033, so on average I produce same amount of zeroes and ones. Standard deviation for distribution of means is std(mns) = 0.044. Thus ones rate for most of sequences would lie in 42-59% interval. You can get these numbers by stepping two standard deviations from mean on both sides. Visual checking of histogram for distribution of means (hist(mns)) confirms our calculations.
Mean close to 50% says that coin tossing will deliver us error rates close to 50%. Doesn't seems to be an exciting result. 🙂

Now let's go a bit deeper and calculate sample autocorrelation function (ACF). In the previous post we used R acf function to do this. To do the same in MATLAB I've coded a simple ACF calculation code which you can find here. It returns ACF values for given number of lags and plots a graph if no output arguments provided. Here is an example of call and a figure that it produces.

get_acf(D);% D contains all data points

Red lines represent confidence interval for autocorrelation of white noise. It is clearly seen that first 5-6 lags have significant values of ACF. It is possible that this is related to capacity of our working memory. There is a famous paper by George Miller "The Magical Number Seven, Plus or Minus Two". The number of objects that a human can hold in his working memory is close to 7. Look at the figure above: correlations vanish for lags higher than 6-7. A coincidence?

In first part we had observed that closest neighbors possessed significant negative correlation. So why don't we try the model from the previous part on this new data? After running a test you will find out that error rate would be 38%. It is quite good, far better than a coin toss. Still there are other peaks of ACF that we haven't taken into account yet. Can we do better by exploiting correlations on higher lags?

Distribution modeling

To get use of more information we need a more complex model. Apparently, this model should involve information about sequence development history. There are lots of models based on number of ones in previous N steps, length of the last sequence of zeroes or linear combination of previous elements or some other statistics. Among them are LPC, multistep Markov chains or even SVMs and neural networks.

A more straightforward approach is to model a joint distribution for a few consecutive elements. To do that we should split sequences in chunks of data (frames). For example, if we want to model a joint distribution of three consecutive elements, then the sequence of seven elements {0, 1, 1, 0, 1, 0, 1} leads to five frames: {0, 1, 1}, {1, 1, 0}, {1, 0, 1}, {0, 1, 0}, {1, 0, 1}. Note that frames are overlapping. One of ways to model a distribution is to train a Restricted Boltzman Machine. It is a cool unsupervised model that had become popular in last several years. It was invented by Geoffrey Hinton and is an important building brick for today's machine learning mainstream: deep learning. However, we will use a more simple yet powerful model. Recall the way we got our trivial guesser: we were counting pairs of neighbor elements. To explore more sophisticated data patterns we will count occurrences of three, four or even more elements. To model arbitrary joint distribution of N binary variables you need parameters. Thus number of parameters grows exponentially fast with history length. With amount of data at hand we should expect models with memory length up to 7-8 would work reasonably well. Luckily, ACF suggests that we probably wont need any longer memory.

Following code goes through a sequence data frame-by-frame and checks ord consecutive elements. Then it looks at next element and updates corresponding counter p.

I'll explain counter indexing rules a bit. If next element (i.e. the one after ord consecutive elements) is zero, then first row is used. Otherwise second row is used. Column index ind is a number that we get after treating consecutive elements as a binary number and converting them to decimal base (e.g. 1001 leads to index 9). So, each element of the counter p is a number of occurrences of zero or one (depending on row index) after previous ord elements (determined by column index). This representation is flexible enough so that we can add more sequences and update it using only new data. Getting probabilities from counters is still easy. For example, the conditional probability of finding 1 given the subsequence {0, 1, 1} of previous elements can be obtained by evaluating:

Prediction of future elements

Suppose that we have estimated the distribution. What next? Now we need to use it to predict future outcomes given the history. The most straightforward way is to sample from the distribution. The other way is to apply a threshold, i.e. to always predict the value with higher probability. I've coded a script that runs a test for models with various memory lengths. It uses half of sequences to estimate distributions and then tests models on remaining data. So, tests are run on data that wasn't seen by models before.

First let's check how model will work in case if we use a random distribution instead, i.e. all values in p variable are just random numbers. So, we generate random distribution and sample from it.

What do all these numbers mean? Third line reads: "3-step memory: 1292/2368 errors (54.56%), CI: 2.84%, chance 6.25%". It describes results of test for model which takes into account previous three elements. After running prediction test on 2368 cases, 1292 mistakes occurred. That yields error rate 54.56%. Radius of the 95% confidence interval is approximately 2.84%. So, true error rate is somewhere between 51.72% and 57.40%. You probably noticed that number of cases is not the same for all the models. We can not predict outcomes when we don't know the history. So, for model with 5-step memory we can't predict first 5 elements in each sequence. This is the reason why number of cases gets smaller for models with longer memory.

It is clearly seen that for all the models we've got a high error rate. Lower value for 2-step memory is likely got by chance. What if we try applying threshold instead of sampling?

Error rates got lower, but they are still close to 50%. If you run the test multiple times it is possible that you'd occasionally see good results. Especially for models with shorter memory. The reason for this is in very limited space of possible model instances. Consider a 1-step model with a threshold. You can formulate only four distinct rules: predict 1 if previous is 1, predict 1 if previous is 0, predict 0 if previous is 1, predict 0 if previous is 0. By choosing two non-contradictory rules from this set you get a predicting model. There are only four possible ways to do that. If one of them reflects real distribution of data, then with probability 25% a random 1-step model with a threshold will work good. That is what chance value describes. For N-step memory you can build models. Thus, chance of guessing a good model is at least. Chance drops quickly and seems low for memory length equal to 3 or greater.

Now it is time to run a test for predictors based on real distribution. Results for predictions using sampling.

Error rates got decreased by 7-10% for all memory lengths. So, one thing is sorted out: models with threshold work better for our problem. The decrease is indeed big.

We have achieved the error rate comparable to what we had in the first part. The key difference is that now we used much more data. And thus results are more reliable. Upper bound for 95% confidence interval for error rate of the best model is lower than 38% (it was around 46% for model from part 1). Shorter (1- or 2-step) memory leads to slightly smaller accuracy, but the difference between "3+ models" is not significant. So, if you need a simple yet accurate model, choose the 3-step model. If you need the most accurate model, try 6-step model. Though it is likely it would work as accurate as the one with 3-steps.

Summary

The post appears to be a bit longer than I expected. Here is a short TL;DR summary. We have used manually generated data to understand if it possible to predict "random" numbers that are created by a human being. Quick check of ACF shows that there are significant correlations on lags up to 6-7. This could be related to estimates of working memory capacity. Also by manually examining data you can find that it is very unlikely to find a long sequence of zeroes or ones. More unlikely than in case of Bernoulli process. It is possible that in your case patterns would be different. When I asked my wife to do experiment from the first part it emerged that she tends to produce longer sequences of same elements. However this is a pattern too.

After reviewing the data we estimated the distribution describing conditional probabilities of finding 1 at some position if previous history is given. To run tests we used sequences that haven't been seen by model before. Models that have information about at least three previous elements achieve error rate of 35% and less. It means that approximately in 2 of 3 cases these models would do correct prediction. So, treating yourself as a random numbers generator is not a good idea. 😉