Monthly Archives: November 2013

Say you have a column which contains categorical variables / factors. Then How can we quickly and confidently identify the differences between different categories?

A very commonly used dataset ‘mpg’ from the package ‘ggplot2’ contains some categorical variables that could easily get started. There is a column in mpg called ‘cty’ which is the miles per gallon for a car while driving in city. And also another column ‘manufacturer’ which contain categorical variable the different manufacturers.

Sir Francis Galton, who is the cousin of Charles Darwin was an English Victoria polymath, eugenicist and statistician. He proved the linear combination of normal distribution is still normal distribution and he is also,the inventor of the triangle device – quincunx. Here, we will talk about the research that he has implemented long time ago(1885).

I always use histogram command to look at the distribution of a uni-variable data. And to distinguish the difference of distribution of a subgroup from the whole, I usually do two or more plots with the same scale.

You can clearly read the information that the more cylinders you have in your car, the more gas it will take in general, of course, a lower MPG (Miles Per Gallon).

Also, you can use the density plot to see the distribution of data for each cylinder and plot them all together in one plot. I also attached a boxplot using factor which give a better picture of what is going on. By the way, the varwidth=TRUE will set the width of the box as the number of records in that box. In this case, you will see we have almost same amount of records inside each category.

Say, you have a time series object that you want to work with. You might want to check if the time series happens to be generated by white noise only, this might happen not only during the interview but also in real life sometime.

If you have three contiguous time points, assuming they are of different values. Then there are 6 permutations to order them. And four of them form a turning point (PEAK/PIT).

The expectation of number of turning points for a time series generated by white noise is then 2/3 * n, and actually they falls into a normal distribution where the variance is 8/45*n.

So your 5% confidence test should check if the number of turning points is within the range of: 2/3*n +- 1.96*sqrt(8/45*n):