College Degrees in the US: Similarity Measures

In my last post, I used the 2016 ACS PUMS data to analyse how educational attainment and degree field choices vary between demographic groups.
I commented that the rates at which graduates pair fields together “provide insight into the intellectual connections between fields.”
This post compares different ways of estimating the strength of such connections.

Field pair co-occurrences

The repository for this post contains the files observations.csv and fields.csv, which I import as follows.

The diagonal elements of C estimate the total number of graduates with degrees in each field, while the off-diagonal elements estimate the number of graduates that chose each degree field pair.
For example, the elements of the leading submatrix

About 125,000 graduates hold degrees in Agriculture Production And Management, nearly 1,000 of whom also hold degrees in Animal Sciences.
Agricultural Economics attracts about as many graduates as Food Science, but no respondents in the PUMS data reported studying both.
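
To make the structure of such a matrix concrete, here is a minimal sketch on made-up data. It assumes a binary respondent-by-field matrix and a vector of survey weights; the names `X`, `w` and `C_toy`, the field labels and all values are hypothetical, and this is not necessarily how C itself was built from observations.csv and fields.csv.

```r
# Hypothetical incidence matrix: rows are respondents, columns are degree fields.
# An entry is 1 if the respondent holds a degree in that field.
X <- matrix(c(1, 1, 0,
              1, 0, 0,
              0, 1, 1,
              0, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("Field A", "Field B", "Field C")))

# Hypothetical survey weights: how many graduates each respondent represents.
w <- c(100, 250, 80, 120)

# Weighted co-occurrence matrix: entry (j, k) sums the weights of respondents
# who hold degrees in both field j and field k.
C_toy <- crossprod(X * w, X)

C_toy
# Diagonal: estimated field sizes (350, 180, 200).
# Off-diagonal: estimated graduates per field pair (A-B: 100, B-C: 80, A-C: 0).
```

As in the example above, a zero off-diagonal entry means no respondent in the toy data reported both fields.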

Similarity measures

The diagonal elements of C estimate the “size,” in units of graduates, of each degree field.
The distribution of field sizes is positively skewed, with the largest field having more than 30 times the size of the smallest 50% of fields:

Using the elements of C to measure the strength of connections between fields may lead to biased inferences by, for example, making large fields with proportionally few graduates in common appear to have stronger connections than small fields with proportionally many graduates in common.
One way to avoid such bias is to normalise each element \(c_{ij}\) of C by the corresponding field sizes \(s_i=c_{ii}\) and \(s_j=c_{jj}\), thereby producing a scale-invariant “similarity” measure between pairs of degree fields.

Dividing \(c_{ij}\) by the arithmetic mean \((s_i+s_j)/2\) yields the Dice coefficient$$\mathrm{Dice}(i,j) = \frac{2c_{ij}}{s_i+s_j},$$
while dividing \(c_{ij}\) by the geometric mean \(\sqrt{s_is_j}\) yields the Ochiai coefficient$$\mathrm{Ochiai}(i,j) = \frac{c_{ij}}{\sqrt{s_i\,s_j}}.$$
The Dice coefficient can be used to define the Jaccard index$$\begin{align} \mathrm{Jaccard}(i,j) &= \frac{c_{ij}}{s_i + s_j - c_{ij}} \\ &= \frac{\mathrm{Dice}(i,j)}{2 - \mathrm{Dice}(i,j)}, \end{align}$$
which is conceptually related to the overlap coefficient$$\mathrm{Overlap}(i,j) = \frac{c_{ij}}{\min(s_i, s_j)}$$
in that both capture the relative size of set intersections.
These four similarity measures take values on the closed unit interval \([0,1]\), with more “similar” fields achieving values closer to unity.
Indeed, one can show that
$$\mathrm{Jaccard}(i,j) \le \mathrm{Dice}(i,j) \le \mathrm{Ochiai}(i,j) \le \mathrm{Overlap}(i,j) \le 1,$$
with the two inner inequalities holding with equality if and only if \(s_i=s_j\) (provided \(c_{ij}>0\); when \(c_{ij}=0\), all four measures equal zero), and with all four inequalities holding with equality if and only if \(s_i=s_j=c_{ij}\). Thus, two fields have unit similarity precisely when the sets of graduates with degrees in each field coincide.
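
A short sketch of why the chain holds, using only the definitions above and the fact that graduates holding both degrees form a subset of each field's graduates, so that \(0\le c_{ij}\le\min(s_i,s_j)\le(s_i+s_j)/2\) and hence \(\mathrm{Dice}(i,j)\le 1\):
$$\begin{align} \mathrm{Jaccard}(i,j) &= \frac{\mathrm{Dice}(i,j)}{2-\mathrm{Dice}(i,j)} \le \mathrm{Dice}(i,j) && \text{since } 2-\mathrm{Dice}(i,j)\ge 1, \\ \mathrm{Dice}(i,j) &= \frac{c_{ij}}{(s_i+s_j)/2} \le \frac{c_{ij}}{\sqrt{s_is_j}} = \mathrm{Ochiai}(i,j) && \text{since } \tfrac{s_i+s_j}{2}\ge\sqrt{s_is_j}, \\ \mathrm{Ochiai}(i,j) &= \frac{c_{ij}}{\sqrt{s_is_j}} \le \frac{c_{ij}}{\min(s_i,s_j)} = \mathrm{Overlap}(i,j) && \text{since } \sqrt{s_is_j}\ge\min(s_i,s_j), \\ \mathrm{Overlap}(i,j) &= \frac{c_{ij}}{\min(s_i,s_j)} \le 1 && \text{since } c_{ij}\le\min(s_i,s_j). \end{align}$$
The second line is the inequality between arithmetic and geometric means, which is tight only when \(s_i=s_j\); the same condition makes the geometric mean equal to the minimum in the third line.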

I compute matrices of Dice, Jaccard, Ochiai and overlap similarities by defining
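
As an illustration, one way such matrices could be computed directly from the formulas above is sketched below. The matrix names match those used later in this post, but the toy matrix `C`, its labels and its values are made up, and the code is a sketch rather than the original definitions.

```r
# Toy co-occurrence matrix (made-up values, for illustration only):
# diagonal = field sizes, off-diagonal = graduates holding both degrees.
C <- matrix(c(100, 20,  0,
               20, 50, 10,
                0, 10, 40),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "D"), c("A", "B", "D")))

s <- diag(C)  # field sizes s_i = c_ii

# outer(s, s, f) builds the matrix whose (i, j) entry is f(s_i, s_j).
dice_mat    <- 2 * C / outer(s, s, `+`)    # c_ij / arithmetic mean of sizes
jaccard_mat <- C / (outer(s, s, `+`) - C)  # c_ij / size of the union
ochiai_mat  <- C / sqrt(outer(s, s))       # c_ij / geometric mean of sizes
overlap_mat <- C / outer(s, s, pmin)       # c_ij / size of the smaller field

round(c(Jaccard = jaccard_mat["A", "B"], Dice = dice_mat["A", "B"],
        Ochiai = ochiai_mat["A", "B"], Overlap = overlap_mat["A", "B"]), 3)
```

For the toy pair (A, B), the four values increase from Jaccard to overlap, in line with the chain of inequalities above.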

Ordinal properties

One way to compare similarity measures is to compare how they rank fields from most to least similar.
I do so using Kendall’s tau coefficient, which captures the extent to which two rankings agree on the relative positions of ranked entities.
Kendall’s tau is defined as
$$\tau(r_1,r_2) = \frac{2\times\text{Number of concordant pairs}}{\text{Number of pairs}} - 1,$$
where \(r_1\) and \(r_2\) are ranking functions, and where a pair \((x,y)\) of entities is “concordant” if \((r_1(x)-r_1(y))\) and \((r_2(x)-r_2(y))\) share the same sign.
If every pair is concordant then \(\tau(r_1,r_2)=1\) and if none are concordant then \(\tau(r_1,r_2)=-1\).
The more \(r_1\) and \(r_2\) agree on the relative positions of ranked entities, the greater is the number of concordant pairs and hence the larger is \(\tau(r_1,r_2)\).
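
The definition can be checked by hand on a small example. The sketch below counts concordant pairs for two made-up rankings of five entities (the vectors `r1` and `r2` are hypothetical and contain no ties) and compares the result with R's built-in `cor(method = "kendall")`.

```r
# Two hypothetical rankings of the same five entities (no ties).
r1 <- c(1, 2, 3, 4, 5)
r2 <- c(2, 1, 3, 5, 4)

# All unordered pairs of entities: a 2 x 10 matrix of index pairs.
idx <- combn(length(r1), 2)

# A pair (x, y) is concordant if both rankings order x and y the same way.
concordant <- sign(r1[idx[1, ]] - r1[idx[2, ]]) ==
  sign(r2[idx[1, ]] - r2[idx[2, ]])

tau_manual  <- 2 * sum(concordant) / ncol(idx) - 1
tau_builtin <- cor(r1, r2, method = "kendall")

# 8 of the 10 pairs are concordant, so both values equal 0.6.
c(manual = tau_manual, builtin = tau_builtin)
```

With tied ranks the two values can differ, because the built-in estimator adjusts for ties whereas the simple counting formula does not.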

Rearranging the definition of \(\tau(r_1,r_2)\) gives
$$\Pr(\text{Pair is concordant}) = \frac{\tau(r_1, r_2) + 1}{2}.$$
Thus, computing Kendall’s tau for the rankings produced by each similarity measure, and mapping the results linearly to the unit interval, allows me to estimate the rates of agreement between different measures.
I compute these rates as follows, excluding zero and unit similarities, and report the results as a matrix.

similarities <- tibble(
  Dice = as.vector(dice_mat),
  Jaccard = as.vector(jaccard_mat),
  Ochiai = as.vector(ochiai_mat),
  Overlap = as.vector(overlap_mat),
  `Co-occ.` = as.vector(C)  # Include for comparison
) %>%
  filter(as.vector(upper.tri(C) & C > 0))

similarities %>%
  cor(method = 'kendall') %>%
  {(. + 1) / 2} %>%  # Map to unit interval
  round(3)

The Dice and Jaccard measures produce identical rankings, and both reach about 91% and 78% agreement with the rankings produced using the Ochiai and overlap measures, respectively.
All four measures produce rankings that reach less than 80% agreement with the ranking produced using co-occurrence counts.

The following table presents the 10 most similar field pairs using the Dice and Jaccard measures, and those pairs’ ranks using the Ochiai, overlap and co-occurrence measures.

| Field 1 | Field 2 | Dice/Jacc. rank | Ochiai rank | Overlap rank | Co-occ. rank |
|---|---|---|---|---|---|
| Plant Science And Agronomy | Soil Science | 1 | 1 | 1 | 127 |
| Mathematics Teacher Education | Science And Computer Teacher Education | 2 | 3 | 15 | 66 |
| Biochemical Sciences | Molecular Biology | 3 | 2 | 5 | 56 |
| Ecology | Miscellaneous Biology | 4 | 4 | 21 | 146 |
| Mathematics | Physics | 5 | 5 | 8 | 11 |
| Political Science And Government | History | 6 | 8 | 48 | 2 |
| Journalism | Mass Media | 7 | 9 | 30 | 26 |
| Social Science Or History Teacher Education | Language And Drama Education | 8 | 10 | 43 | 53 |
| Accounting | Finance | 9 | 12 | 32 | 1 |
| Soil Science | Geosciences | 10 | 14 | 53 | 1048 |

Plant Science And Agronomy and Soil Science top the rankings for all four similarity measures, despite being only the 127th most common field pair.
Biochemical Sciences and Molecular Biology, and Mathematics and Physics are the only other field pairs that rank in the top 10 most similar across all four measures.
Accounting and Finance, the most common field pair, ranks in the top 10 most similar field pairs using the Dice and Jaccard measures only.

Network properties

Another way to compare similarity measures is to compare properties of the networks they define.
Each similarity matrix defines a network in which nodes represent degree fields and in which edges have weight equal to the similarity between incident nodes.
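
As a rough illustration of how such a weighted network can be summarised by PageRank centrality, the measure compared in the rest of this section, the sketch below runs a simple power iteration in base R. The function `pagerank_weighted`, the damping factor of 0.85, the toy similarity matrix `S` and its values are all assumptions made for illustration; this is not the code used to produce the results that follow, and a network library would typically be used in practice.

```r
# Weighted PageRank by power iteration (sketch; assumes a symmetric,
# non-negative similarity matrix in which every node has at least one edge).
pagerank_weighted <- function(W, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  diag(W) <- 0                 # drop self-similarities (no self-loops)
  n <- nrow(W)
  strength <- rowSums(W)       # total edge weight incident to each node
  stopifnot(all(strength > 0)) # the sketch does not handle isolated nodes
  P <- W / strength            # row i holds transition probabilities from node i
  p <- rep(1 / n, n)           # start from the uniform distribution
  for (iter in seq_len(max_iter)) {
    # Follow a weighted edge with probability `damping`, else jump uniformly.
    p_new <- as.vector(damping * crossprod(P, p) + (1 - damping) / n)
    converged <- sum(abs(p_new - p)) < tol
    p <- p_new
    if (converged) break
  }
  names(p) <- rownames(W)
  p
}

# Toy similarity matrix for four hypothetical fields (made-up values).
S <- matrix(c(1.00, 0.30, 0.20, 0.00,
              0.30, 1.00, 0.10, 0.05,
              0.20, 0.10, 1.00, 0.00,
              0.00, 0.05, 0.00, 1.00),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("A", "B", "C", "D"), c("A", "B", "C", "D")))

pr <- pagerank_weighted(S)
rank(-pr)  # 1 = most PageRank-central field
```

The returned scores sum to one, so they can be read as the long-run share of time a random walker on the network spends at each field.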

The rankings of fields from most to least PageRank-central under the Dice and Jaccard measures are almost identical, and reach just over 82% agreement with the ranking produced using co-occurrence counts.

The table below presents the 10 most PageRank-central fields using the Dice measure, and the corresponding ranks using the Jaccard, Ochiai, overlap and co-occurrence measures.
The column “Size rank” orders each field from largest to smallest.

| Field | Dice rank | Jaccard rank | Ochiai rank | Overlap rank | Co-occ. rank | Size rank |
|---|---|---|---|---|---|---|
| French German Latin And Other Common Foreign Language Studies | 1 | 1 | 1 | 9 | 15 | 35 |
| Mathematics | 2 | 2 | 2 | 6 | 10 | 22 |
| Political Science And Government | 3 | 3 | 3 | 5 | 5 | 10 |
| Mass Media | 4 | 5 | 11 | 23 | 28 | 50 |
| Molecular Biology | 5 | 4 | 13 | 26 | 53 | 113 |
| English Language And Literature | 6 | 6 | 4 | 4 | 3 | 9 |
| History | 7 | 7 | 9 | 10 | 9 | 15 |
| Economics | 8 | 8 | 7 | 7 | 8 | 14 |
| Psychology | 9 | 9 | 5 | 3 | 1 | 3 |
| Sociology | 10 | 10 | 10 | 13 | 12 | 19 |

Languages, Mathematics, and Political Science And Government are the most PageRank-central fields under the Dice, Jaccard and Ochiai measures.
The Ochiai and overlap measures rank Mass Media and Molecular Biology relatively low on PageRank centrality, possibly due to those fields’ relatively small size.
The PageRank centralities produced using co-occurrence counts appear to correlate positively with field size, consistent with my worry that such counts may bias the measurement of intellectual connectedness in favour of larger fields.