Statsmodels supports a variety of approaches for analyzing contingency
tables, including methods for assessing independence, symmetry,
homogeneity, and methods for working with collections of tables from a
stratified population.

The methods described here are mainly for two-way tables. Multi-way
tables can be analyzed using log-linear models. Statsmodels does not
currently have a dedicated API for log-linear modeling, but Poisson
regression in statsmodels.genmod.GLM can be used for this
purpose.

A contingency table is a multi-way table that describes a data set in
which each observation belongs to one category for each of several
variables. For example, if there are two variables, one with
\(r\) levels and one with \(c\) levels, then we have an
\(r \times c\) contingency table. The table can be described in
terms of the number of observations that fall into a given cell of the
table, e.g. \(T_{ij}\) is the number of observations that have
level \(i\) for the first variable and level \(j\) for the
second variable. Note that each variable must have a finite number of
levels (or categories), which can be either ordered or unordered. In
different contexts, the variables defining the axes of a contingency
table may be called categorical variables or factor variables.
They may be either nominal (if their levels are unordered) or
ordinal (if their levels are ordered).
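
To make the definition of \(T_{ij}\) concrete, here is a minimal sketch that tabulates raw observations into a two-way table of counts with pandas. The data frame and its column names (`treatment`, `outcome`) are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical raw data: one record per observation, two categorical variables.
obs = pd.DataFrame({
    "treatment": ["a", "a", "b", "b", "b", "a"],
    "outcome": ["low", "high", "low", "low", "high", "low"],
})

# T.loc[i, j] is the number of observations with level i of the first
# variable and level j of the second variable.
T = pd.crosstab(obs["treatment"], obs["outcome"])
print(T)
```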

The underlying population for a contingency table is described by a
distribution table \(P_{i, j}\). The elements of \(P\)
are probabilities, and the sum of all elements in \(P\) is 1.
Methods for analyzing contingency tables use the data in \(T\) to
learn about properties of \(P\).

statsmodels.stats.Table is the most basic class for
working with contingency tables. We can create a Table object
directly from any rectangular array-like object containing the
contingency table cell counts:

Independence is the property that the row and column factors occur
independently. Association is the lack of independence. If the row
and column factors are independent, the joint distribution can be
written as the outer product of the row and column marginal
distributions:

\[P_{ij} = \sum_k P_{ik} \cdot \sum_k P_{kj} \quad \text{for all } i, j\]

We can obtain the best-fitting independent distribution for our
observed data, and then view residuals which identify particular cells
that most strongly violate independence: