Monday, 3 October 2016

While writing a paper on sparse principal component analysis I came across an old dataset containing 1990s socio-economic data and rate of violent crime for 1994 communities in the US. I am not a sociologist, so my analysis may be superficial, but I found the results interesting with respect to Mr Trump's political views. Looking at the results, it turns out that the traditional approach of considering only the largest loadings of the PCs seems to support the view that immigrants are a major cause of violent crime Instead, applying SPCA gives an entirely different view of the problem identifying mainly socio-economical characteristics, rather than being immigrants or speaking poor English, as drivers for crime. Naturally, the two things are correlated but the causal inference may be different.

The dataset is called Communities and Crime and can be downloaded from the UCI Machiine Learning Repository. I deleted 26 variables with missing values, ending up with 99 explanatory variaables and ran Principal Components Analysis on their correlation matrix. The the first and second PC respectively explain 25.3% and 17% of the variance of the data, Their ordered contributions (loadings scaled to unit sum of the absolute values) are plotted below.

Clearly it is difficult to choose a threshold to isolate the most important contributions. I selected the ten largest contributions, corresponding to a threshold of about 2%. Then I ran a sparse PCA algorithm of my invention* requiring that each sparse component explained at least 95% of the variance explained by the corresponding PC, Both sets of loadings are shown in Table 1 below. Rj-sq is the R squared resulting from regressing a variable on all other in the set (the one used to compute the variance inflation factor in regression), the closer the value to one the more the variable is (multiply) correlated with the others.

TABLE 1

Contribution

Variable

Rj-sq

PCA

-2.2%

median family income (differs from household income)

0.98

-2.2%

median household income

0.98

-2.1%

percentage of kids in family housing with two parents

0.98

-2.1%

percentage of households with investment / rent income in 1989

0.83

2.1%

percentage of people under the poverty level

0.84

-2.1%

percentage of families (with kids) that are headed by two parents

0.98

-2.0%

percent of kids 4 and under in two parent households

0.90

-2.0%

per capita income

0.92

2.0%

percentage of households with public assistance income in 1989

0.75

2.0%

percent of occupied housing units without phone (in 1990, this was rare!)

0.76

SPCA

-51%

median family income (differs from household income)

0.51

-37%

percentage of kids in family housing with two parents

0.52

12%

percent of family households that are large (6 or more)

0.09

It turns out that the ten most influential variables in the first PC are highly correlated and are either indicators of income or of families with two parents. They all have more or less the same weight. The sparse component with only three variables explains 96.5% percent of the variance explained by the PC. The contributions, like for the PC, indicate the Income (51%) and family with two parents (37%) are influential but also add a 12% contribution from percentage of large families. This variable is hardly correlated with the others and its contribution to the PC ranks 46th in absolute value (so would not have been picked as influential in any analysis).

OK, Nothing really new here. I only managed to pick up that living conditions have a little importance. But let's look at the second components. The ten largest PC's contributions and those of the sparse component are given in Table 2.

TABLE 2

Contribution

Variable

Rj-sq

PCA

2.7%

percent of population who have immigrated within the last 10 years

1.00

2.7%

percent of population who have immigrated within the last 8 years

1.00

2.6%

percent of population who have immigrated within the last 5 years

1.00

2.6%

percent of population who have immigrated within the last 3 years

0.98

2.6%

percent of people foreign born

0.96

-2.3%

percent of people who speak only English

0.95

2.3%

percent of people who do not speak English well

0.94

2.1%

percent of persons in dense housing (more than 1 person per room)

0.86

2.0%

percentage of population that is of asian heritage

0.63

2.0%

percentage of population that is of hispanic heritage

0.90

SPCA

42%

percent of population who have immigrated within the last 10 years

0.11

-15%

percentage of population that is 65 and over in age

0.06

15%

owner occupied housing - upper quartile value

0.55

14%

percent of family households that are large (6 or more)

0.47

13%

number of people living in areas classified as urban

0.32

Wow! 9 out of 10 variables with large contributions to the second PC are percentages of immigrants! These are highly correlated (the first 7 with Rj-sq over 94%). That's right Mr Trump, immigrants do make a difference, the relative weight of their contributions to the ten most influential variables is about 91%.

But let's look at the sparse contributions: the story is quite different. The percent of population recently immigrated has a contribution of 42% but other socio-economical aspects are now influential. We find again large families (which obviously is correlated with living in dense housing) but also percentage of elderly and urbanised residents. Surprisingly, owner occupied housing has positive contribution, This may be due to families for which the cost of the mortgage is a relevant part of their income. Note how the variables in the sparse component are not very correlated.

PCA and SPCA are not methods designed for the prediction of an external variable but for summarising the variation in a set of variables with few components. However, if we regress violent crime rate on the components, it turns out that the sparse components are somewhat better at predicting crime rate than the full cardinality PCs, as shown by the R-squared coefficients in Table 3, where also other summary statistics for the SPCA results are given.

TABLE 3

Comp1

Comp2

Comp3

Comp4

Comp5

cumulative vexp

24.4%

40.8%

49.8%

57.2%

62.7%

relative cumulative vexp (sPC/PC)

96.5%

96.5%

96.5%

96.6%

96.6%

Cardinality

3

5

7

9

8

Correlation with PC

0.97

0.98

0.942

0.833

0.92

R-sq log(crime rate) on PCs

44.0%

49.0%

50.9%

51.3%

53.3%

R-sq log(crime rate) on sparse comp.

45.7%

51.9%

56.4%

57.2%

57.2%

In conclusion, SPCA does a better job at identifying uncorrelated socio-economical characteristics that make a difference in a community, while PCA tends to identify clusters of highly correlated variables sharing those characteristics. The sparse components are also (slightly) better at predicting crime rate. After all also non-immigrants may have low income and live crammed in a house.

I imagine that seeing the PCA results Mr Trump would immediately conclude that recent immigrants who don't have good command of English and maybe live in dense housing are criminals. 

The data used are quite dated. If anybody has more recent data, please let me know.

*a similar algorithm can be found in Merola, G., 2015. Least squares sparse principal component analysis: a backward elimination approach to attain large loadings. Australia & New

Zealand Journal of Statistics 57, 391–429. My R package for SPCA is available on GitHub

Sunday, 28 August 2016

Microbenchmark is a very useful function for timing the execution time of one or more programs. I often use it to compare different programs. So I created a function that produces the output relative to the time of one of the programs being tested.

For example, it is known that the singular value decomposition of a matrix is slower than the eigen-decomposition of its covariance matrix, but what about the time needed to compute the covariance? Of course it depends on the size of the matrix. The example below shows the use of mbm_rel to compare these times.

The time units can be changed, just like in the original function. For milliseconds, set unit = "ms"

mbm_rel(mbm_out, unit = "ms")

# Unit: milliseconds

# min lq mean median uq max neval

# svd time 0.706 0.716 0.807 0.735 0.754 3.459 100

# svd/eigen 1.943 1.936 2.009 1.965 1.919 3.694 100

# svd/eigenS 0.637 0.636 0.414 0.640 0.642 0.054 100

The first row gives the time for svd and the following the svd time relative to the other functions. So svd takes almost twice the time of eigen but 60% of the time of eigenS. A more informative output would be the time of the slowest. easily done, setrelpos= 3 (the third function)

Saturday, 2 July 2016

In R there isn't a function for computing the mode. This statistic is not often used but it is very useful for categorical and discrete data.

The mode is defined as "the most common value occurring in a set of observations." Mathematically, for numerical data, the mode is the centre of order zero mode = arg min_m sum [x_i - m]^0, where 0^0 is defined as equal to 0.

This definition is not complete because in a set of data there can be one or many or no mode. For example: in a set with 10 apples, 5 pears and 2 bananas the mode is apple, in a set with 5 apples, 5 pears and 2 bananas the modes are apple and pear, in a set with 5 apples, 5 pears and 5 bananas there is no mode. This is shown in the figure below.

Hence a function that computes the mode needs to return one or many or no value. Furthermore, the values returned can be either characters or numbers (in case of discrete data). My solution is this: