Mike's Technology and Finance Blog

Mike's Technology and Finance Blog covers a number of different topics in finance and technology. Most technical posts provide architecture, development, implementation, and troubleshooting techniques for different Enterprise IT systems that run on the Windows, UNIX, and Linux platforms. Some posts also include my personal opinions and rants.

Thursday, March 17, 2016

The lolcat Statistical Package - Public Release

Mike Burr

Background and Project Goals

The world needs a statistical tool that is valuable both for teachers and practitioners. Many of the statistical tools in use today are highly expensive, proprietary, and carry a large amount of baggage. Many tools can’t be reasonably automated. Tools that can be automated (often written in R, Python, Java, and other languages) are incomplete from both a teaching and working standpoint.

The main objective of this project is to provide a tool set that is accurate, reliable, and efficient (for the end user), and that provides enough additional functionality to be valuable above and beyond many of the commercial tools in existence today. Ideally, users of this tool should not need to be proficient developers. In fact, using the code samples included in the online package documentation along with a couple of small sections of my tutorial on R, practitioners and students (at all levels) should not need to know much R at all.

You’ll find something useful in this package regardless of whether you are a student just starting your journey or an experienced practitioner analyzing large amounts of data.

Cost to Use the Package

The public release of the package itself is offered free of charge and is open source. You are free to take it and modify it any way that you choose.

The online documentation is not included under the open source license and may not be redistributed or stored without express written consent; however it is free to access online anytime.

See the LICENSE file in the package or on GitHub for more details on the license.

If you are working on behalf of a commercial entity and you are using the package in a significant capacity, please support the project with a donation (contact me for more details). Also, if you would like paid support and/or feature development, please contact me for more details.

Where to Obtain the Package Source

If you want to browse the package source, see the official GitHub location:

Monday, June 15, 2015

Numeric Operators in R

R provides a number of operators for performing mathematical operations on numbers and vectors. Unlike most other programming languages, R can seamlessly take objects of different forms (single numbers, vectors, arrays) and perform operations with them.
In the case when two inputs have different lengths, R will “recycle” elements from the shorter input by repeating the shorter input to get the correct length. This is best demonstrated by example:

## Warning in a + b: longer object length is not a multiple of shorter object
## length

## [1] 2 3 4 2 3 4 2 3 4 2
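The inputs that produced the output above are not shown. As a sketch, one pair of vectors that would give the same result and the same warning is below; the specific values of a and b are assumptions chosen to match the printed output.

```r
# Hypothetical inputs (the original values of a and b are not shown):
# a has 10 elements, b has 3, and 10 is not a multiple of 3.
a <- rep(1, 10)
b <- 1:3

# R repeats b as 1 2 3 1 2 3 1 2 3 1 to match the length of a,
# and warns because the last repetition of b is cut short.
recycled <- a + b
recycled
```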

In general, I think recycling elements makes R code harder to understand, and I would generally treat it as an antipattern in R (i.e. don’t do it…). In addition to making code harder to understand, it encourages the bad habit of ignoring errors and warnings in R. There are better methods (rep, expand.grid, combn, etc.) that preserve code readability if you (or someone else) need to review or modify the code in the future.
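As a rough illustration of the explicit alternatives, the sketch below spells out the repetition with rep and enumerates pairings with expand.grid. The vectors a and b and the names b_full, explicit, and pairs are made up for the example.

```r
# Hypothetical inputs with mismatched lengths
a <- rep(1, 10)
b <- 1:3

# Spell out the repetition so both operands have the same length;
# no recycling happens and no warning is raised.
b_full <- rep(b, length.out = length(a))
explicit <- a + b_full
explicit

# When every pairing of two sets of values is wanted, expand.grid
# builds them as rows of a data frame instead of relying on recycling.
pairs <- expand.grid(x = 1:2, y = 1:3)
nrow(pairs)
```

The result of the addition is the same as with implicit recycling, but a reader can see from the code that the repetition was intended.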
Like most other programming languages, R operators follow a precedence that is similar to the precedence taught to math students (i.e. parentheses first, then multiply/divide, then addition/subtraction). This section covers the most common operators in order of precedence from high to low (see ?Syntax for the full list).
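A few small expressions make the precedence rules concrete. The variable names p1 to p5 are only labels for this sketch.

```r
p1 <- 2 + 3 * 4     # multiplication before addition: 14
p2 <- (2 + 3) * 4   # parentheses first: 20
p3 <- -2^2          # ^ binds tighter than unary minus: -4
p4 <- 2^3^2         # ^ groups right to left, so this is 2^9: 512
p5 <- 1:3 * 2       # : binds tighter than *, so this is (1:3) * 2
c(p1, p2, p3, p4)
p5
```

When in doubt, adding parentheses costs nothing and makes the intent explicit.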

^ Exponentiation

An exponent can be defined as one number (the base) multiplied by itself a number of times (the exponent). Here 4 is the exponent and 2 is the base:\[ 2^4 = 2 \cdot 2 \cdot 2 \cdot 2 = 16 \]
In the case of a non-whole exponent, exponentiation works as an nth root operation. For example:\[ 8^{\bar{.3}} = 8^{\frac{1}{3}} = \sqrt[3]{8^1} = \sqrt[3]{8} = 2 \]
Negative exponents work as division operations. For example:\[ 2^{-3} = \frac{1}{2^3} = \frac {1}{8} = 0.125 \]
R accomplishes exponentiation via the ^ operator. A few examples:

2^3#Exponent of a single number

## [1] 8

(1:10)^2#Exponent over an entire vector

## [1] 1 4 9 16 25 36 49 64 81 100

c(4,9,16,25,36)^(1/2)#nth root operation

## [1] 2 3 4 5 6
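The negative and fractional exponents described above work the same way in R. This is a small sketch with made-up variable names; results of fractional powers are floating point values, so compare them with all.equal rather than ==.

```r
neg <- 2^-3             # same as 1 / 2^3
cube_root <- 8^(1/3)    # cube root of 8
inverse <- c(1, 2, 4)^-1  # reciprocal of each element

neg
cube_root
inverse
```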

* Multiplication and / Division Operators

Multiplication can be defined as the repeated addition of a number. Example:\[ 5 * 3 = 5 + 5 + 5 = 15 \]
Division can be defined in terms of splitting a quantity (called the numerator) between a set of groups (denominator). Example: 20 split between 5 groups yields 4 for each of the 5 groups:\[ 20 / 5 = 4 \]
Various rules exist and are taught in introductory mathematics courses for carrying out multiplication and division of non-whole numbers by hand.
R accomplishes multiplication with the * operator. A few examples:

2*2 #Multiplication of two numbers

## [1] 4

(1:5)*rep(2,5) #Multiplication of vectors

## [1] 2 4 6 8 10

c(1,2,3,4,5)*c(2,2,2,2,2) #Same as above

## [1] 2 4 6 8 10

(1:5)*2 #Multiplication of a vector by a number (all elements multiplied)

## [1] 2 4 6 8 10

[Figure: the vector multiplication example above illustrated; O marks inputs, X marks outputs.]
R accomplishes division with the / operator. A few examples:

4/2 #Division of two numbers

## [1] 2

seq(2,10,by=2)/rep(2,5)#Division of vectors

## [1] 1 2 3 4 5

c(2,4,6,8,10)/c(2,2,2,2,2) #Same as above

## [1] 1 2 3 4 5

seq(2,10,by=2)/2 #Division of a vector by a number (all elements divided by number)

## [1] 1 2 3 4 5

[Figure: the vector division example above illustrated; O marks inputs, X marks outputs.]
The input objects need to be numeric to use numeric operators. For example, lists can’t be multiplied or divided directly:
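A minimal sketch of that failure, using a made-up list x. The tryCatch wrapper is only there so the script keeps running and the error text can be inspected; the exact wording of the message may vary with the R locale.

```r
x <- list(1, 2, 3)

# Multiplying a list raises an error; capture its message instead of stopping.
msg <- tryCatch(x * 2, error = function(e) conditionMessage(e))
msg

# Converting the list to a numeric vector first makes the operation valid.
doubled <- unlist(x) * 2
doubled
```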

Matrix multiplication uses a different operator (%*%). See the post on Matrix operations for more details.

+ Addition and - Subtraction Operators

Addition can be thought of as finding the magnitude of two combined quantities.\[ 1 + 1 = 2 \]
Subtraction can generally be thought of as finding the difference between two values.\[ 2 - 1 = 1 \]
R accomplishes addition with the + operator. A few examples:

+5 #Unary addition operator - no change/effect

## [1] 5

2+10 #Addition of 2 numbers

## [1] 12

1:10+2 #Addition of number and vector

## [1] 3 4 5 6 7 8 9 10 11 12

1:10+11:20#Addition of 2 vectors

## [1] 12 14 16 18 20 22 24 26 28 30

[Figure: the vector addition example above illustrated; O marks inputs, X marks outputs.]
R accomplishes subtraction with the - operator. A few examples:

-5 #Unary negation operator - creates a negated quantity

## [1] -5

12-2 #Subtraction of 2 numbers

## [1] 10

3:12-2 #Subtraction of number and vector

## [1] 1 2 3 4 5 6 7 8 9 10

11:20-1:10#Subtraction of 2 vectors

## [1] 10 10 10 10 10 10 10 10 10 10

[Figure: the vector subtraction example above illustrated; O marks inputs, X marks outputs.]
R has more operators that will be considered in other posts.

Saturday, June 6, 2015

Data Types in R

An important first step in understanding the R programming language is gaining an understanding of how R represents different types of data. This post will give an introduction to the range of data types available.

Primitive Values

The R programming language has a few primitive (called atomic) types that are going to be relevant to most people who aspire to develop in R:

Complex: Complex numbers… \( a + bi \) where \( i = \sqrt{-1} \)
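One way to see which atomic type a value has is the typeof function. The sketch below checks one illustrative value for each of the commonly used atomic types in base R; the example values and the name types are made up, and raw vectors are left out.

```r
types <- c(
  logical   = typeof(TRUE),     # TRUE / FALSE values
  integer   = typeof(1L),       # whole numbers (note the L suffix)
  double    = typeof(1.5),      # R's default numeric type
  character = typeof("a"),      # strings
  complex   = typeof(2 + 3i)    # complex numbers
)
types
```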

There are a few additional types that appear, but will be addressed in other parts of the tutorial:

Expressions: Parsed strings of R code

Symbols: Typically used to insert mathematical notation into plots

Functions: Performs a set of operations on a set of inputs and may or may not return a result

Vectors

The need arises frequently (perhaps more frequently than storing a single value) to store more than one value and access the values in an efficient way. This is accomplished using vectors, arrays (covered later), data frames (covered later), and lists (covered later).
The easiest way to create a vector is with the rep, c, and seq commands or the : operator.
rep creates a vector with the first argument repeated n times:
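A short sketch of rep, along with c and the : operator; the values and variable names are illustrative and not taken from the original post.

```r
r1 <- rep(5, 3)          # the value 5 repeated 3 times: 5 5 5
r2 <- rep(c(1, 2), 2)    # the whole vector repeated twice: 1 2 1 2
v  <- c(10, 20, 30)      # c combines its arguments into one vector
s  <- 1:5                # : builds the integer sequence 1 2 3 4 5

r1
r2
v
s
```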

seq generates ascending or descending sequences using a start point and an end point, then either a length or an increment:

a<-seq(from=1,to=10,by=1) #Equivalent to 1:10
a<-seq(1,10,1) #shorter version with from, to, increment...
a<-seq(1,10,length.out=10) #Equivalent to 1:10
a

## [1] 1 2 3 4 5 6 7 8 9 10

Individual vector elements can be accessed with [ and ]. Some illustrative examples:

a[5] #The fifth element

## [1] 5

a[-5] #Not the fifth element

## [1] 1 2 3 4 6 7 8 9 10

a[1:3] #Elements 1,2, and 3

## [1] 1 2 3

a[c(1,2,5,7,10)] #Elements 1,2,5,7, and 10

## [1] 1 2 5 7 10

Another interesting thing is that more than one element of a vector can be changed at a time:

a<-1:10
a #starting point

## [1] 1 2 3 4 5 6 7 8 9 10

a[5]<-20 #Change 1 element
a

## [1] 1 2 3 4 20 6 7 8 9 10

a[c(1,2,3)] <- c(10,20,30) #Change 3 elements
a

## [1] 10 20 30 4 20 6 7 8 9 10

Matrices and Data Frames

R has a rich capability to natively build and manipulate matrices (memory efficient structures of only one primitive type) and data frames (structures that can hold vectors of differing primitive types). Selection of data frame vs. matrix usually depends on what libraries you are using and/or what the functions you develop expect as input.
Matrices can be created and filled at the same time:

a<-matrix(data=1:9,nrow=3,ncol=3) #3x3 matrix filled by columns
a

## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9

a<-matrix(data=1:9,nrow=3,ncol=3, byrow=TRUE) #Fill by row using data
a

## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9

Assuming that the example above is a data matrix, an equivalent data frame can be built:

#Note: c1,c2,c3 become the column names. Data frames are filled by columns
a<-data.frame(c1=c(1,2,3),c2=c(4,5,6),c3=c(7,8,9))
a

## c1 c2 c3
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9

a<-data.frame(c1=c(1,4,7),c2=c(2,5,8),c3=c(3,6,9))
a

## c1 c2 c3
## 1 1 2 3
## 2 4 5 6
## 3 7 8 9

Individual elements in both matrices and data frames can be accessed using [ and ]:
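As a sketch, the matrix and data frame from the examples above can be indexed by row and column as follows. The names m and df are used here only to keep the two objects apart.

```r
m <- matrix(data = 1:9, nrow = 3, ncol = 3, byrow = TRUE)
df <- data.frame(c1 = c(1, 4, 7), c2 = c(2, 5, 8), c3 = c(3, 6, 9))

m_cell <- m[2, 3]    # row 2, column 3
m_row  <- m[1, ]     # all of row 1
m_col  <- m[, 2]     # all of column 2

df_cell <- df[2, 3]  # same row/column syntax works on data frames
df_col  <- df$c2     # data frame columns can also be selected by name

m_cell
m_row
m_col
df_cell
df_col
```

Leaving the row or column position empty selects everything along that dimension.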

Lists are accessed a little bit differently. Similar to data frames, named elements can be accessed:
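The construction of the list a used in the examples that follow is not shown. One definition consistent with the printed output would be the following; treat it as an assumption rather than the original code.

```r
# Assumed construction: a numeric vector, a 2x2 matrix filled by
# columns, and a single logical value, each stored under a name.
a <- list(
  e1 = c(1, 2, 3),
  e2 = matrix(1:4, nrow = 2),
  e3 = FALSE
)
str(a)
```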

# How to determine names of list elements
names(a)

## [1] "e1" "e2" "e3"

# Access elements by name
a$e1 #The first list element

## [1] 1 2 3

a$e2 #The second list element

## [,1] [,2]
## [1,] 1 3
## [2,] 2 4

a$e3 #The third list element

## [1] FALSE

#Access elements by name in a different way
a[["e1"]]

## [1] 1 2 3

a[["e2"]]

## [,1] [,2]
## [1,] 1 3
## [2,] 2 4

a[["e3"]]

## [1] FALSE

Lists can also be accessed with the [[ and ]] operators and numeric indices:

#How to determine a list length
length(a)

## [1] 3

#Access elements by index:
a[[1]] #The first list element

## [1] 1 2 3

a[[2]] #The second list element

## [,1] [,2]
## [1,] 1 3
## [2,] 2 4

a[[3]] #The third list element

## [1] FALSE

Lists are most commonly used in R to represent more complex data structures and to implement a version of typing for complicated data structures that can have states and behaviors of their own (a different post will discuss object-oriented R in more detail).

The Sample Variance and Standard Deviation

The sample variance and standard deviation can be thought of as measures of the spread between the mean and the points in the sample. The sample variance is defined as the sum of the squared deviations from the mean, divided by an adjusted sample size to make the statistic “unbiased”:\[ s^2_x = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \]
The sample standard deviation is the square root of the sample variance:\[ s_x = \sqrt{s_x^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \]
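The two formulas can be checked against R's built-in var and sd functions. This is a sketch with a made-up sample x; s2 and s are computed directly from the definitions above.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical sample
n <- length(x)

# Sum of squared deviations from the mean, divided by n - 1
s2 <- sum((x - mean(x))^2) / (n - 1)

# The standard deviation is the square root of the variance
s <- sqrt(s2)

# var() and sd() use the same n - 1 denominator
all.equal(s2, var(x))
all.equal(s, sd(x))
```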
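Visually, for normally distributed data, the standard deviation can be interpreted as the arrow from the mean.
[Figure: the standard deviation shown as an arrow from the mean of a normal distribution.]
[Figure: another view with 1 standard deviation on either side of the mean shaded.]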
We’ll talk more about dispersion measures in the posts on random sampling distributions.
There are a few guidelines for using the variance/standard deviation:

The variance and standard deviation are measures of dispersion/spread for data that is measured on a continuous (interval/ratio) scale, as opposed to an ordinal/nominal scale (review the post on data classification).

The standard deviation and variance are generally not appropriate for ordinal/nominal scale data. Using the variance/standard deviation on ordinal/nominal scale data can lead to meaningless statements of the form:

“20% of the time we would expect to see the number of bee hives per acre less than -1.”

“Our survey yielded a standard deviation for Satisfaction of 2, meaning that a large percentage of our survey respondents are off scale” [Satisfaction is a nominal (and at best ordinal) measure. Statistical procedures involving means and standard deviations have no place in survey analysis, most of the time…]