Scaling Text with the Class Affinity Model

Patrick O. Perry (pperry@stern.nyu.edu)
Kenneth Benoit (kbenoit@lse.ac.uk)

New York University and London School of Economics and Political Science

Addresses:
Stern School, NYU, New York, NY 10012
Department of Methodology, LSE, London WC2A 2AE, UK

Abstract

Probabilistic methods for classifying text form a rich tradition in machine
learning and natural language processing. For many important problems,
however, class prediction is uninteresting because the class is known, and
instead the focus shifts to estimating latent quantities related to the
text, such as affect or ideology. We focus on one such problem of interest,
estimating the ideological positions of 55 Irish legislators in the 1991
Dáil confidence vote. To solve the Dáil scaling problem and others
like it, we develop a text modeling framework that allows actors to take
latent positions on a “gray” spectrum between “black” and “white” polar
opposites. We are able to validate results from this model by measuring
the influences exhibited by individual words, and we are able to quantify
the uncertainty in the scaling estimates by using a sentence-level block
bootstrap. Applying our method to the Dáil debate, we are able to
scale the legislators between extreme pro-government and pro-opposition in a
way that reveals nuances in their speeches not captured by their votes or
party affiliations.

Text classification, where the goal is to infer a discrete class
label from observed text, is a core activity in statistical and machine
learning and natural language processing. Instances of this problem include
inferring authorship (Mosteller and Wallace, 1963) or genre
(Kessler et al., 1997), detecting deception (Newman, Pennebaker and
Berry, 2003), classifying
e-mail as “spam” (Heckerman et al., 1998), or detecting
sentiment (Pang, Lee and Vaithyanathan, 2002). The huge appeal of the methods developed for
these applications is that, from a small training set, it is possible to
classify a large number of unlabelled documents to reasonable accuracy without
costly human intervention.

In many applications, however, classification is an uninteresting goal, since
the correct identification of the class is obvious and costless. It is
fundamentally uninteresting, for example, to attempt to predict the political
party of a speaker or the identity of a Supreme Court justice. Furthermore,
in many social and political settings with observed discrete outcomes,
institutions may cause predicted and observed class membership to diverge in
significant ways. In parliamentary democracies where party discipline is
enforced, for instance, voting may follow party lines even if the best
predictions from observable features indicate more heterogeneous outcomes. In
such cases, it is trivial to predict class (a legislator’s vote) from
observable covariates (political party). In the presence of these covariates,
the text of a speech is ancillary to the goal of class label prediction.

Even when observing text does not improve prediction performance, it is not
the case that text is uninformative. In legislative debates, the text that
legislators generate through floor speeches may provide a direct opportunity
for them to express their contrary and divergent preferences (see for
instance Benoit and Herzog, 2012). With legal briefs, to take
another example, it is trivial to classify opinions as majority or dissenting
but using the observed text and other information it is possible to place the
briefs on a spectrum between the two extremes
(Clark and
Lauderdale, 2010). Simply attempting to predict the
category of opinion, for instance classifying amicus curiae briefs as
pro-petitioner or pro-respondent (e.g. Evans et al., 2007), is of less direct
interest since these categories are already known. The text of a document can
reveal nuances that are not captured by, and that sometimes disagree with, its
class label.

Government party members         Opposition party members
  Fianna Fáil (FF)         24      Democratic Left (DL)     3
  Progressive Dems. (PD)    1      Fine Gael (FG)          22
                                   Green                    1
                                   Labour (Lab)             7

Speech text
  Median length (leaders)    6,348 tokens
  Median length (others)     2,210 tokens
  Vocabulary size            9,731 word types

Table 1: Irish Dáil debate speech statistics.

Here, we focus on an application that is ill-suited to
text classification but where text is nonetheless informative. We analyze the
1991 Irish Dáil confidence debate, previously studied by
Laver and Benoit (2002) who used the debate speeches to demonstrate their
“Wordscores” scaling method. The context is that in 1991, as the country
was coming out of a recession, a series of corruption scandals surfaced
involving improper property deals made between the government and certain
private companies. The public backlash precipitated a confidence vote in the
government, on which the legislators (each called a Teachta Dála, or TD) debated and then voted to decide whether
the current government would remain or be forced constitutionally to resign.
Table 1 summarizes the composition of the Dáil in 1991
and provides some descriptive statistics about the speech texts. We can use
the debate as a chance to learn the legislators’ ideological positions.

Because the Irish parliamentary context is characterized by strict party
discipline, the move was largely symbolic and each legislator voted strictly
with his or her party: all members of the governing parties (Fianna Fáil and
the Progressive Democrats) voted to support the government, and all members of
the opposition parties (the Democratic Left, Fine Gael, Green, and Labour)
voted against.

Despite the votes being entirely predictable, the floor speeches from the
debate before the official tally reveal nuances to legislators’ positions.
Take, for example, the following excerpt
from Noel Davern, a moderate from the Fianna Fáil party:

It is not that the financial scandals have not occurred. They have occurred and the Government have taken quick action on them. In fact, we are not fully qualified to speak on them until we see the results of the full and independent inquiry.

Davern supports the government, but at the same time does
not excuse them from all culpability.
Contrast this with a typical opposition speech, calling for a vote against the confidence motion, from Labour TD Michael Ferris:

Our decision to oppose this motion of confidence is a positive assertion of the disapproval of the ordinary people of the actions of this discredited Government. The people have watched with amazement the unfolding of scandals which have tainted this Government. The Government cannot now be said to deserve the confidence of the people.

Both legislators express views that place them somewhere between the two
extremes of absolute government support and absolute opposition support.

Where do Davern, Ferris, and the other 56 TDs that participated
in the debate lie on this ideological spectrum? This is the essential question
that we attack in this manuscript. In answering the question, we have at our
disposal the speech texts, along with some additional information. We know
that the leader of the government (Haughey, the Fianna Fáil Taoiseach)
will give a speech at the pro-government extreme of the
spectrum, and we know that the heads of the two major opposition parties
(Spring and De Rossa, the Labour and Democratic Left leaders) will be at the
extreme of the other end. We will use these three texts as reference points
by which to scale the other 55 ambiguous texts whose positions are unknown
and must be estimated.

To solve our particular problem, we develop a new text scaling method
that is broadly applicable to situations where most documents are unlabelled
but we have a few examples of documents at the extremes of a hypothesized
ideological or stylistic spectrum. Instead of predicting class membership,
our objective in such problems is to scale a continuous characteristic,
through measuring the fit of a text to a set of known classes based on its
degree of similarity to typical texts from these classes.

In what follows, we develop the class affinity model and demonstrate
its use in
scaling the degree of support or opposition expressed in the speeches made
during the confidence debate. We start by outlining the foundations of our
scaling
model, contrasting it first to similar approaches designed for classification
(Section 2), and then to lexicographical
association methods in the form of sentiment dictionaries (Section
3). Section 4 then sets out the model and highlights how it differs from
related methods, both on statistical principles and through our application. Sections
5 and 6 detail how
this model and its reference distributions are estimated, while Section
7 connects the affinity model to existing methods. In
Section 8, we show how to measure the influence of
individual words, and provide recommendations for removing common terms that
might skew the results. We apply this procedure to choose a tailored
vocabulary for our application in Section 9.
Section 10
demonstrates how to estimate uncertainty for the class affinity scaled
estimates. Finally, we summarize the results of fitting the class
affinity model to our application (Section 11), and offer some
concluding remarks.

We have stated repeatedly that classification is not our objective in this
problem, but nonetheless there is a long tradition of fitting classification
methods to text, and we might try applying one of those methods here. We have
a “training set” of the three leadership speeches, one of which we can label
as Government and two as Opposition. We can fit a supervised classification method to
this training set and then use it to make predictions for the other 55
legislators.

Using the Naive Bayes text classification method popularized by
Sahami et al. (1998), we would model the tokens in each speech text as
independent draws from a label-dependent distribution estimated from the
reference texts. Letting label k=1 denote Government and label k=2 denote Opposition, for each label k∈{1,2} and word type
v in our vocabulary V, we would estimate pkv, the probability
that a random token drawn from a text with label k is equal to v.
Typically we use the empirical word occurrence frequencies in the reference
documents or some smoothed version thereof. Here and throughout the text,
unless otherwise noted we will take our vocabulary to be the set of word types
that appear at least twice in the leadership speeches, excluding common
function words from the modified Snowball stop word list distributed with the
quanteda software package (Porter, 2006; Benoit, 2017); we ignore words
outside this set.

Under the “naive” assumption
that tokens in a text are independent draws from the same distribution,
assuming equal prior odds for each label, the log-odds that the label is
Government given the word counts $x = (x_v)_{v \in \mathcal{V}}$ is

\[ \eta(x) = \sum_{v \in \mathcal{V}} x_v \log(p_{1v} / p_{2v}), \]

where xv denotes the number of times that word type v appears in the text.
The expression for η(x) arises as the log ratio of two multinomial
likelihoods with probability vectors p1 and p2. Using Naive Bayes
classification for this two-class prediction problem,
we would predict the label as Government when
η(x)>0, and we would predict the label as Opposition when
η(x)<0.

The quantity η(x) measures the strength of the evidence that the label of
a text is Government or Opposition, and we can use this quantity
to scale the 55 virgin texts.
Unfortunately, the Naive Bayes scaling method has serious drawbacks. First,
the estimated log odds tend to be implausibly extreme. On our example, the median
absolute log odds is 197.8, corresponding to an unrealistically high
probability of class membership exceeding $1 - 10^{-85}$.
Second, because η(x) measures the strength of the
evidence, longer texts will tend to have higher absolute log odds. We
illustrate both of these defects in Fig. 1, where we
plot the absolute log odds of class membership as a function of text length.
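
The length dependence is easy to reproduce on toy data. The following is a minimal sketch, not the implementation used for the results reported here: the reference texts, the vocabulary, and the helper names `smoothed_probs` and `naive_bayes_log_odds` are hypothetical, and additive smoothing with constant 0.5 is assumed. Repeating a text twice exactly doubles its log odds, even though its content is unchanged.

```python
import math
from collections import Counter

def smoothed_probs(tokens, vocab, alpha=0.5):
    """Additively smoothed word frequencies over a fixed vocabulary."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {v: (counts[v] + alpha) / total for v in vocab}

def naive_bayes_log_odds(tokens, p1, p2):
    """eta(x) = sum_v x_v * log(p1[v] / p2[v]); out-of-vocabulary tokens are ignored."""
    x = Counter(t for t in tokens if t in p1)
    return sum(n * math.log(p1[v] / p2[v]) for v, n in x.items())

# Toy reference texts (hypothetical, not taken from the debate).
gov_ref = "stability growth recovery growth stability".split()
opp_ref = "scandal resign scandal inquiry resign".split()
vocab = set(gov_ref) | set(opp_ref)
p1 = smoothed_probs(gov_ref, vocab)  # Government reference distribution
p2 = smoothed_probs(opp_ref, vocab)  # Opposition reference distribution

speech = "growth scandal growth recovery".split()
eta = naive_bayes_log_odds(speech, p1, p2)
eta_doubled = naive_bayes_log_odds(speech * 2, p1, p2)  # same content, twice as long
```

Because η(x) is linear in the counts, `eta_doubled` equals `2 * eta`; the score reflects how much text there is, not only how it leans.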

Related methods suffer from versions of this same problem. Multinomial
inverse regression (Taddy, 2013) regularizes the probability vector
estimates p1 and p2 and adds a calibration step to the log-odds, but it
still suffers from the same drawbacks as Naive Bayes. Discriminative methods,
like those used by Joachims (1998) and Jia et al. (2014), are affected to a degree depending on their choice of
features. With logistic regression, for example, when the features are linear
functions of the counts x, then it will still be the case that longer
documents have more extreme counts and hence more extreme predictions. Other
choices of features can give rise to predictors that are less sensitive to
variations in document length.

Even if these classification methods did not suffer the defects noted above,
there is still a fundamental disconnect between the classification philosophy
and the goals of scaling. In the classification world, a document is either
“black” or “white;” for an unlabelled document, the method will tell you
the probability that the label is black. In reality, though, a text is
“gray,” a mixture of black and white. This is a fundamental difference in
perspective that precludes using a classification method for our task. We
expand on this metaphor below.

Not all text scaling methods take the black-and-white classification view of
the world. One of the most successful alternatives is dictionary-based scaling
(Stone, Dunphy and Smith, 1966; Pennebaker, Francis and
Booth, 2001; Hu and Liu, 2004). In their simplest forms,
dictionary methods conceive of each text as a mixture of two contrasting
poles, such as positive and negative. Neutral words get discarded from the
vocabulary. The scaling of a text is determined by the average orientation
of its tokens.

There are many variations of dictionary-based scaling but for concreteness we will focus on Grimmer and Stewart's (2013)
formulation. To
apply that scaling to the problem at hand—scaling debate speeches—we would
need two non-overlapping lists: one of words associated with
Government and one of words associated with Opposition. Given these lists, we
would assign a score sv=+1 to each word type v in the
Government list, and a score sv=−1 to each word type v in
the Opposition list. The dictionary-based scaling of a text with token
count vector x would be

\[ t(x) = \frac{1}{n} \sum_{v \in \mathcal{V}} x_v s_v, \]

where n=∑v∈Vxv; this quantity is equal to the
difference in word type occurrence rates between the Government and
Opposition lists.
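
As a concrete illustration, the dictionary score can be computed in a few lines. This is a sketch under stated assumptions: the word lists are invented for the example, the function name `dictionary_score` is hypothetical, and a text containing no dictionary words is arbitrarily scored as 0.

```python
from collections import Counter

def dictionary_score(tokens, gov_words, opp_words):
    """t(x) = (1/n) * sum_v x_v * s_v, with s_v = +1 (Government) or -1 (Opposition).

    Tokens in neither list are neutral and dropped, so n counts only
    the tokens that appear in one of the two lists.
    """
    x = Counter(t for t in tokens if t in gov_words or t in opp_words)
    n = sum(x.values())
    if n == 0:
        return 0.0  # assumption: no dictionary words means a neutral score
    return sum(c * (1 if v in gov_words else -1) for v, c in x.items()) / n

# Hypothetical, non-overlapping word lists.
gov_words = {"stability", "growth"}
opp_words = {"scandal", "resign"}

speech = "growth scandal growth the stability".split()
score = dictionary_score(speech, gov_words, opp_words)
```

Here three of the four dictionary tokens come from the Government list and one from the Opposition list, so the score is the difference in occurrence rates, 3/4 − 1/4.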

It is labor-intensive and error-prone to build a custom dictionary for
each application, so
often when practitioners apply dictionary scaling methods, they use
off-the-shelf dictionaries instead of building their own. For our application,
the Lexicoder sentiment dictionary (LSD, 2015 version), “a broad lexicon
scored for positive and negative tone and tailored primarily to political
texts,” would be a natural choice (Young and Soroka, 2012, 211). However, as
those authors note, applying an
off-the-shelf dictionary to a new domain often leads to undesirable results.
Table 2 illustrates this point in the context of our
application by comparing the word orientations as determined by the LSD with
their empirical associations with Government and Opposition as
observed in the leadership speeches. The rows indicate the LSD-assigned
orientations of the words; the columns are the significant differences in
usage rates between the two classes as measured by the “keyness” $G^2$
likelihood ratio score at significance level 0.05, taking negations into
account as
recommended by Young and Soroka (2012). We display the number of word types in
each cell, along with the most common words.

If the dictionary were appropriate for our application, we should observe
positive words associated with government usage, and negative words associated
with opposition usage. The patterns in Table 2,
however, show a very different result. Only 11 “positive” words have high
usage in the government leadership speech, and no “negative” words
have high usage in the opposition leadership speeches. Most
“positive” and “negative” words do not have a clear association with
either Government or Opposition. Furthermore, there are some
worrying cases where the dictionary orientation runs counter to the observed
association with the classes. For example, the LSD declares the word
“deficit” to be negative, but in the
context of the debate it refers simply to a fiscal outcome;
likewise, “confidence” is related to the question under debate and is not
intended to convey positive valence.
Despite being designed to detect political valence,
the dictionary fails here since it has not been tailored for this
particular debate. Terms that are associated with one type of affect
generally are used differently in the context of the no-confidence debate.

Beyond the problem of domain adaptation, the more fundamental
issue with dictionary methods is that their basic premise, that each word has
a clear orientation, is inappropriate in our domain. Most words in our
application do not clearly belong to one category or the other. We can
see this in Table 2, where over 95% of the word
types do not have statistically significantly different usage
rates between the government and opposition leadership speeches.
The vast majority of words get used by both government and opposition, and
thus have mixed associations with both classes. Some dictionaries try to
adjust for this by giving non-binary scores to the words
(Bradley and Lang, 1999), but these adjustments are often ad hoc, and
they suffer from the same domain adaptation problems. In the sequel, we present an
alternative method that allows for mixed word association while simultaneously
adapting to the domain.

Classification methods assume that each text is a member of a well-defined
category. Dictionary methods do not make this strong assumption, but they too
take an unrealistic view of the world by supposing that each word has a
well-defined orientation. Table 3 highlights this
difference, and makes clear that there is room for a third worldview allowing
both texts and words to be gray. We will formalize this intuition in a
statistical model that we refer to as the “affinity model.”

                        Documents
                  Gray              B/W
Words   Gray      Affinity Model    Classification
        B/W       Dictionaries

Table 3: Word- and document-level assumptions from three scaling methods.

Our basic conceptual model is that over the course of a speech, a speaker’s
orientation switches back and forth between Government mode and
Opposition mode. When she is in Government mode, she chooses
words in the same manner as the government leadership. Likewise, when she is
in Opposition mode, she chooses words in the same manner as the opposition
leadership. We should place the speaker on the spectrum between the two
extremes of pro-government and pro-opposition according to what proportion of
time she spends in each mode.

Formally, let $\mathcal{V}$ denote the vocabulary of word types, a set with
cardinality $|\mathcal{V}| = V$. Encode the text of a speech as a sequence of
tokens W=(W1,W2,…,Wn), with each token Wi belonging to
$\mathcal{V}$. In our model, the speaker’s underlying orientation evolves in
parallel to the text and can be represented as U=(U1,U2,…,Un)
where for i=1,…,n the value Ui denotes the speaker’s underlying
orientation while uttering token Wi. We will in general suppose that there
are K possible orientations, identified with the labels 1,…,K.

In our conceptual framework, a speech and the corresponding underlying
orientation sequence are realizations of some speaker-specific random process.
For k=1,…,K, we define a speaker’s affinity toward orientation k
as θk, the expected proportion of time that her underlying orientation
is k:

\[ \theta_k = E\Big\{ \frac{1}{n} \sum_{i=1}^{n} 1\{U_i = k\} \Big\}. \]

Each speaker has an underlying affinity vector
θ=(θ1,…,θK).

In our specific application, there are K=2 orientations. Each debate
speaker has a separate affinity vector θ=(θ1,θ2). We
will scale each speaker by estimating his or her affinities for
Government (θ1) and Opposition (θ2).

We will impose two simplifying assumptions to make inference under our model
tractable. First, we will suppose that U1,U2,…,Un are independent
and identically distributed. This implies that, for every label k and
position i, the underlying orientation satisfies
Pr(Ui=k)=θk.
Second, we will suppose that W1,W2,…,Wn are independent conditional on U, and that the distribution of
Wi∣U depends only on Ui and is the same for all positions i.
This positional invariance allows us to define
for each label k and word type v the probability

\[ p_{kv} = \Pr(W_i = v \mid U_i = k) \]

and it allows us to define the reference distribution $p_k = (p_{kv})_{v \in \mathcal{V}}$. Our two simplifying assumptions result in a generative model: for
each position i=1,…,n, the speaker picks an underlying orientation
with probabilities determined by θ; given that the underlying
orientation is Ui=k, the speaker picks token Wi according to
distribution pk. Fig. 2(a) summarizes this generative
process.
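
The generative process can be simulated directly. The sketch below is illustrative only: the function name `sample_speech`, the two reference distributions, and the affinity vector are all hypothetical, and labels are 0-based in the code (index 0 plays the role of k=1).

```python
import random

def sample_speech(theta, ref_dists, n, rng):
    """Draw orientations U_i ~ theta, then tokens W_i ~ p_{U_i}, for i = 1..n.

    theta:     sequence of K affinities summing to one
    ref_dists: sequence of K dicts mapping word type -> probability
    """
    labels = range(len(theta))
    orientations = rng.choices(labels, weights=theta, k=n)
    tokens = []
    for k in orientations:
        words = list(ref_dists[k])
        weights = [ref_dists[k][w] for w in words]
        tokens.append(rng.choices(words, weights=weights)[0])
    return orientations, tokens

# Hypothetical reference distributions with overlapping support.
p_gov = {"growth": 0.6, "budget": 0.3, "scandal": 0.1}
p_opp = {"growth": 0.1, "budget": 0.3, "scandal": 0.6}

rng = random.Random(0)
U, W = sample_speech((0.7, 0.3), (p_gov, p_opp), n=12, rng=rng)
```

Only `W` would be observed in practice; the orientation sequence `U` is latent, which is why the affinities have to be estimated from the word counts alone.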

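[Figure 2, panels (a) Class affinity model and (b) Classification model. In (a), the speaker affinity θ generates intended classes U1, U2, …, Un, each of which generates one observed word W1, W2, …, Wn. In (b), θ generates a single intended class U, which generates all observed words W1, W2, …, Wn.]

Figure 2: Generative model for the underlying orientation U
and the token sequence W, contrasting the class affinity model to the classification model.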

For each position i=1,…,n, the chance that word
v appears in position i is

\[ \Pr(W_i = v) = \sum_{k=1}^{K} \Pr(U_i = k) \Pr(W_i = v \mid U_i = k) = \sum_{k=1}^{K} \theta_k p_{kv}. \]

Further, W1,W2,…,Wn are independent, so that the probability
of observing the token sequence
w=(w1,…,wn) is

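\[ \Pr(W = w) = \prod_{i=1}^{n} \Big( \sum_{k=1}^{K} \theta_k p_{k w_i} \Big) = \prod_{v \in \mathcal{V}} \Big( \sum_{k=1}^{K} \theta_k p_{kv} \Big)^{x_v}, \tag{1} \]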

where xv is the number of times word v appears in the text.
At a high level, this is the same generative model as that used for a
topic model (Blei, Ng and Jordan, 2003). The main difference
between these models is that topic models are typically unsupervised, but
the affinity model uses supervision to estimate p1,p2,…,pK.
We elaborate more on the connection to topic models in
Section 7.4.
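
Equation (1) says that the likelihood depends on the text only through its word counts. A small numerical check, using hypothetical reference distributions and hypothetical helper names, compares the token-by-token form with the count form on the log scale.

```python
import math
from collections import Counter

def loglik_from_tokens(tokens, theta, ref_dists):
    """Log of the product over token positions in equation (1)."""
    return sum(
        math.log(sum(t * p[w] for t, p in zip(theta, ref_dists)))
        for w in tokens
    )

def loglik_from_counts(counts, theta, ref_dists):
    """Log of the product over word types in equation (1)."""
    return sum(
        x * math.log(sum(t * p[v] for t, p in zip(theta, ref_dists)))
        for v, x in counts.items()
    )

# Hypothetical reference distributions over a three-word vocabulary.
p_gov = {"growth": 0.6, "budget": 0.3, "scandal": 0.1}
p_opp = {"growth": 0.1, "budget": 0.3, "scandal": 0.6}
theta = (0.7, 0.3)

tokens = ["growth", "budget", "growth", "scandal", "budget"]
ll_tokens = loglik_from_tokens(tokens, theta, (p_gov, p_opp))
ll_counts = loglik_from_counts(Counter(tokens), theta, (p_gov, p_opp))
```

The two values agree up to floating-point error, so the count vector x can stand in for the full token sequence when estimating θ.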

We note also that the affinity model can be seen as a generalization of the
Naive Bayes model depicted in Fig. 2(b). In the Naive
Bayes model, each document has a single underlying orientation, U, shared by all
words in the document. The parameter θ
can be seen as the prior distribution for U. In Naive Bayes, we do not
estimate θ, but instead we estimate Pr(U=k∣W1,…,Wn)
for each class k. The power of the affinity model is that it allows the underlying
orientation to vary with the word position.

The affinity model described in Section 4 lends itself naturally
to likelihood-based estimation. We first consider the problem of estimating
the affinity vector θ for a particular text, when we are given the
reference distributions p1,…,pK.

The parameter space for the affinity vector is the simplex $\Theta \subset \mathbb{R}^K$ consisting of all vectors θ with non-negative components
satisfying the equality constraint $\sum_{k=1}^{K} \theta_k = 1$.
One implication of the equality constraint is that the model
is over-parametrized, which makes estimating θ directly awkward.
To handle this constraint, we will
reparametrize the model in terms of a (K−1)-dimensional contrast
vector β.

In the K=2 case, we set
β=(θ2−θ1)/2,
so that
θ1=1/2−β
and
θ2=1/2+β;
the parameter space for β is B=[−1/2,1/2]. In the
general case we let β be defined by the relation

\[ \theta = \theta_0 + C \beta, \tag{2} \]

where θ0 is any point in the interior of the parameter space and
the contrast matrix $C \in \mathbb{R}^{K \times (K-1)}$ has full rank and
satisfies $C^T 1 = 0$. In principle θ0 and C can be arbitrary, but
for concreteness we will take θ0 to be the center of the parameter
space, θ0=(1/K,1/K,…,1/K), and we will take C to be the
Helmert matrix. The parameter space for the contrast vector, then, is
$\mathcal{B} = \{ \beta \in \mathbb{R}^{K-1} : \theta_0 + C\beta \succeq 0 \}$,
where $\succeq$ denotes the component-wise partial order.
With this particular choice of θ0 and C, the general case agrees
with the special case when K=2.
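
The reparametrization can be sketched as follows. The text does not say which normalization of the Helmert matrix is intended; the code assumes the unnormalized form (the layout produced by R's `contr.helmert`), which is the form that reproduces the K=2 special case above. The function names are hypothetical.

```python
def helmert_contrasts(K):
    """K x (K-1) unnormalized Helmert contrast matrix; every column sums to zero."""
    C = [[0.0] * (K - 1) for _ in range(K)]
    for j in range(K - 1):
        for i in range(j + 1):
            C[i][j] = -1.0
        C[j + 1][j] = float(j + 1)
    return C

def theta_from_beta(beta):
    """theta = theta0 + C beta, with theta0 = (1/K, ..., 1/K) and K = len(beta) + 1."""
    K = len(beta) + 1
    C = helmert_contrasts(K)
    return [1.0 / K + sum(C[k][j] * beta[j] for j in range(K - 1)) for k in range(K)]

theta_two = theta_from_beta([0.2])            # K = 2: (1/2 - beta, 1/2 + beta)
theta_three = theta_from_beta([0.1, -0.05])   # K = 3
```

Because the columns of C sum to zero, the components of θ sum to one for any β; the non-negativity constraint still has to be enforced separately.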

Following equation (1), the log-likelihood function
for the contrast vector is

\[ l(\beta) = \sum_{v \in \mathcal{V}} x_v \log \mu_v, \tag{3} \]

where
$\mu_v = \sum_{k=1}^{K} \theta_k p_{kv}$
and
θ=θ(β).
We will estimate β by maximizing l(β) or a penalized version
thereof.

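In the special case when K=2, the score and observed information
functions obtained by differentiating the log likelihood are

\begin{align*}
u(\beta) &= l'(\beta) = \sum_{v \in \mathcal{V}} \frac{p_{2v} - p_{1v}}{\mu_v} x_v, \\
I(\beta) &= -l''(\beta) = \sum_{v \in \mathcal{V}} \frac{(p_{2v} - p_{1v})^2}{\mu_v^2} x_v.
\end{align*}

The expected information is

\[ i(\beta) = E\{I(\beta)\} = n \sum_{v \in \mathcal{V}} \frac{(p_{2v} - p_{1v})^2}{\mu_v}. \]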
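
These K=2 formulas can be checked numerically against finite differences of the log likelihood. The sketch below uses hypothetical reference distributions and counts, and the function names are introduced here for illustration.

```python
import math

def mixture(beta, p1v, p2v):
    """mu_v for K = 2, with theta1 = 1/2 - beta and theta2 = 1/2 + beta."""
    return (0.5 - beta) * p1v + (0.5 + beta) * p2v

def loglik(beta, x, p1, p2):
    return sum(c * math.log(mixture(beta, p1[v], p2[v])) for v, c in x.items())

def score(beta, x, p1, p2):
    return sum(c * (p2[v] - p1[v]) / mixture(beta, p1[v], p2[v]) for v, c in x.items())

def observed_info(beta, x, p1, p2):
    return sum(c * ((p2[v] - p1[v]) / mixture(beta, p1[v], p2[v])) ** 2
               for v, c in x.items())

def expected_info(beta, n, p1, p2):
    return n * sum((p2[v] - p1[v]) ** 2 / mixture(beta, p1[v], p2[v]) for v in p1)

# Hypothetical reference distributions and word counts.
p1 = {"growth": 0.6, "budget": 0.3, "scandal": 0.1}
p2 = {"growth": 0.1, "budget": 0.3, "scandal": 0.6}
x = {"growth": 4, "budget": 3, "scandal": 3}

beta, h = 0.1, 1e-5
fd_score = (loglik(beta + h, x, p1, p2) - loglik(beta - h, x, p1, p2)) / (2 * h)
fd_info = -(score(beta + h, x, p1, p2) - score(beta - h, x, p1, p2)) / (2 * h)
```

The central differences `fd_score` and `fd_info` match `score` and `observed_info` closely, and the observed information is non-negative term by term, which is the concavity used in the maximization.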

To define the analogous functions in the general case, define the
matrix-valued function
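$Q = Q(\beta) \in \mathbb{R}^{K \times V}$
with $Q_{kv} = p_{kv} / \mu_v$.
In the general case, the analogous functions are

\begin{align*}
u(\beta) &= C^T Q x, \tag{4} \\
I(\beta) &= C^T Q X Q^T C, \tag{5}
\end{align*}

where $X \in \mathbb{R}^{V \times V}$ is the diagonal matrix
with $X_{vv} = x_v$ for $v \in \mathcal{V}$.
The expected information is

\[ i(\beta) = n C^T Q P^T C = n C^T P Q^T C, \]

where $P \in \mathbb{R}^{K \times V}$ is the matrix with kth row
equal to $p_k^T$ for k=1,…,K.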

The observed information function I(β) is positive semidefinite,
indicating that the log likelihood function l(β) is concave. We can
estimate β by maximizing the log likelihood using the Newton-Raphson
iterative method. The expensive part of this maximization procedure is
computing I(β), which takes time $O(VK^2)$, or faster if the count
vector x is sparse. In our experience on the Dáil speeches, the method
typically converges after about five iterations. The difficult part of the
optimization is that we must restrict the search to the parameter space
B; we accomplish this using an interior-point barrier
method (Boyd and
Vandenberghe, 2004, Ch. 11).

In exchange for adding a small bias to the estimates, we can reduce the
variance and remove the explicit inequality constraints on the parameter
space. In particular, Firth (1993) shows that in the
asymptotic regime where n tends to infinity, adding a penalty of
order O(1) to a log likelihood adds a term of size O(1/n) to the
bias of the estimator (sometimes reducing the estimator’s bias, but not
necessarily doing so in our setting). In our case, we choose a positive
scalar λ and define the penalty function

\[ \psi_\lambda(\theta) = \lambda \sum_{k=1}^{K} \log \theta_k. \]

Then, we estimate the affinities by maximizing the penalized log likelihood
$\tilde l_\lambda(\beta) = l(\beta) + \psi_\lambda(\theta)$,
where θ=θ(β). The penalty ensures that
$\tilde l_\lambda$ is strictly concave, and further that the
maximizer $\hat\beta_\lambda$ is unique and belongs to the interior
of the parameter space. For the analyses in this manuscript, we use the
penalty value λ=0.5. Section 6 provides
some theoretical justification for this penalty value in a related context.
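
For K=2 the penalized estimate is the root of a single strictly decreasing function, so it can be found without any constrained optimizer. The sketch below deliberately swaps in plain bisection on the penalized score in place of the Newton-Raphson and interior-point machinery described above; the function names, reference distributions, and counts are hypothetical.

```python
def penalized_score(beta, x, p1, p2, lam):
    """Derivative in beta of l(beta) + lam * (log theta1 + log theta2), K = 2."""
    theta1, theta2 = 0.5 - beta, 0.5 + beta
    u = sum(c * (p2[v] - p1[v]) / (theta1 * p1[v] + theta2 * p2[v])
            for v, c in x.items())
    return u + lam * (1.0 / theta2 - 1.0 / theta1)

def fit_beta(x, p1, p2, lam=0.5, tol=1e-10):
    """Bisection for the root of the penalized score on the open interval (-1/2, 1/2)."""
    lo, hi = -0.5, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # Strict concavity makes the score decreasing: a positive score
        # means the maximizer lies to the right of mid.
        if penalized_score(mid, x, p1, p2, lam) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical reference distributions and word counts.
p_gov = {"growth": 0.6, "budget": 0.3, "scandal": 0.1}
p_opp = {"growth": 0.1, "budget": 0.3, "scandal": 0.6}
x = {"growth": 6, "budget": 5, "scandal": 2}

beta_hat = fit_beta(x, p_gov, p_opp)
theta_gov, theta_opp = 0.5 - beta_hat, 0.5 + beta_hat
```

The penalty drives the score to +∞ at β=−1/2 and to −∞ at β=1/2, so the root always lies strictly inside the interval and no explicit inequality constraint is needed.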

The reference distributions p1,p2,…,pK themselves need to be
estimated from data. In our framework, this learning step requires not
large volumes of training data, but rather texts that are clearly polar examples
of each reference class, to form benchmarks for estimating the other texts’
affinities to these classes. In the context of our specific application,
the 1991 Irish Dáil confidence debate, recall that
the contrasting K=2 classes represent Government (k=1) and
Opposition (k=2). We will use the leaders of the government and opposition
respectively
to represent the archetype texts for each class. Taoiseach (Prime Minister)
Charles Haughey’s speech forms the government reference text for estimating
p1, and the speeches from the two opposition party leaders (Spring and De
Rossa) form the reference texts for estimating p2.

To estimate a particular reference distribution p, we will suppose in
general that we
have at our disposal m texts drawn from this distribution, with lengths $n_1, n_2, \dots, n_m$. We denote the vectors of word counts for these texts by
$x_1, x_2, \dots, x_m$. In our application, m=1 for estimating the
Government reference, and m=2 for estimating the Opposition reference. We will use
smoothed empirical frequencies to estimate $p_v$, as
advocated by Lidstone (1920). We choose a nonnegative smoothing constant
α and estimate the probability of word type v as

\[ \hat p_v = \Big( \alpha + \sum_{j=1}^{m} x_{jv} \Big) \Big/ \Big( V \alpha + \sum_{j=1}^{m} n_j \Big). \]

Specifically, we will set α=0.5.
It is not essential to smooth the estimates of p, but doing so reduces
estimation variability.
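
The smoothed estimator amounts to pooling the counts from the reference texts and adding α to every cell. A minimal sketch, with a hypothetical function name and toy counts:

```python
def lidstone_estimate(count_vectors, vocab, alpha=0.5):
    """p_hat[v] = (alpha + sum_j x_j[v]) / (V * alpha + sum_j n_j).

    count_vectors: list of dicts mapping word type -> count, one per reference text;
                   text lengths n_j are counted over in-vocabulary tokens only.
    """
    totals = {v: sum(x.get(v, 0) for x in count_vectors) for v in vocab}
    denom = alpha * len(vocab) + sum(totals.values())
    return {v: (alpha + totals[v]) / denom for v in vocab}

# Toy example with m = 2 reference texts (hypothetical counts).
vocab = ["scandal", "resign", "growth"]
x1 = {"scandal": 3, "resign": 1}
x2 = {"scandal": 1, "resign": 2}
p_hat = lidstone_estimate([x1, x2], vocab)
```

The word "growth" never appears in either toy text, yet it receives probability α/(Vα + n) rather than zero, which keeps log μ_v finite when a scaled text uses a word absent from one set of reference texts.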

There are many reasonable choices for the smoothing constant α,
including choosing α adaptively (Fienberg and Holland, 1972). In
natural language processing, it is common to take α=1, so that $\hat p$ is the posterior mean under a uniform prior
(Jurafsky and Martin, 2009, Sec. 4.5.1). From a frequentist standpoint, the
value α=0.5, which corresponds to using a Jeffreys prior
for p, is slightly more defensible. In the regime where V is fixed and
n tends to infinity, using the results from Firth (1993) one can show
that using α=0.5 results in an expected Kullback-Leibler divergence
from $\hat p$ to p of order $O(n^{-3/2})$ instead of $O(n^{-1})$ for
other choices of α.

Once we have estimates ^p1,^p2,…,^pK of the reference
distributions, to get an estimate of the class affinity vector θ for a
text, we use the methods from Section 5, using
the estimated class distributions in place of their true values.

7.1 Dictionary methods

In the special case that the reference distributions p1,p2,…,pK
have disjoint supports (that is, when no word type v has both
$p_{kv} > 0$ and $p_{lv} > 0$ for two distinct classes k and l),
affinity scaling is exactly equivalent to dictionary scaling.

To make this equivalence clear, suppose that for each word type v∈V, at most one of the reference probabilities p1v,p2v,…,pKv is nonzero. When this is the case, we can partition the vocabulary as
a union of disjoint sets, V=V1∪V2∪⋯∪VK, where

\[ \mathcal{V}_k = \{ v \in \mathcal{V} : p_{kv} > 0 \}. \]

Here, Vk is the set of word types associated with label k. The
disjoint support condition ensures that each word type v is associated with
exactly one label.

Under the disjoint support condition, when we observe the ith token
wi, we can
immediately infer the underlying orientation ui to be the only class with
this word in its support. The log-likelihood simplifies to

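\begin{align*}
l(\theta) &= \sum_{v \in \mathcal{V}} x_v \log\Big( \sum_{k=1}^{K} \theta_k p_{kv} \Big) \\
&= \sum_{k=1}^{K} \sum_{v \in \mathcal{V}_k} x_v \log(\theta_k p_{kv}) \\
&= \sum_{k=1}^{K} n_k \log \theta_k + (\text{constant}),
\end{align*}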

where nk=∑v∈Vkxv and the constant does not depend on
θ. In this case, the maximum likelihood estimate of the class affinity
vector is

\[ \hat\theta = \Big( \frac{n_1}{n}, \frac{n_2}{n}, \dots, \frac{n_K}{n} \Big). \]

That is, the estimated class affinities are the token occurrence rates in the
support sets V1,V2,…,VK.
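
The form of the maximizer follows from a short Lagrange-multiplier argument, sketched here for completeness. The multiplier γ is notation introduced only for this derivation, and the stationarity step assumes $n_k > 0$; a class with $n_k = 0$ gets $\hat\theta_k = 0$, which is again $n_k/n$.

```latex
\begin{align*}
\mathcal{L}(\theta, \gamma)
  &= \sum_{k=1}^{K} n_k \log \theta_k
     - \gamma \Big( \sum_{k=1}^{K} \theta_k - 1 \Big), \\
\frac{\partial \mathcal{L}}{\partial \theta_k}
  &= \frac{n_k}{\theta_k} - \gamma = 0
  \quad\Longrightarrow\quad \theta_k = \frac{n_k}{\gamma}, \\
\sum_{k=1}^{K} \theta_k = 1
  &\quad\Longrightarrow\quad \gamma = \sum_{k=1}^{K} n_k = n,
  \qquad \hat\theta_k = \frac{n_k}{n}.
\end{align*}
```

The last step uses the fact that the support sets partition the vocabulary, so the class totals $n_k$ add up to the text length n.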

7.2 Wordscores

The “Wordscores” scaling method developed by Laver, Benoit and Garry (2003) turns out to
be closely related to class affinity scaling. That method, which is primarily
used to
scale documents between K=2 reference classes, works well in practice but
has been criticized for having ad hoc theoretical
foundations (Lowe, 2008). We can show, however, that Wordscores scaling is
closely related to affinity scaling, and gives highly correlated results for
texts that are not close to the extremes (represented by the reference text
positions). We elaborate on this connection below.

In its simplest form, Wordscores takes as given
reference distributions for each class, denoted p1 and p2. The method
defines the wordscore of a word type v∈V as

\[ s_v = \frac{p_{2v} - p_{1v}}{p_{1v} + p_{2v}}. \tag{6} \]

Word types that only appear in class 2 have scores of +1, while types that
only appear in class 1 have scores of −1. Other types have intermediate
values indicating the relative degrees of association with the two classes.
The unnormalized “text score” of a length-n text with token count vector x
is then the average wordscore of its tokens:

\[ t(x) = \frac{1}{n} \sum_{v \in \mathcal{V}} \frac{p_{2v} - p_{1v}}{p_{1v} + p_{2v}} x_v. \tag{7} \]

Texts with positive t(x) values tend to be more like class 2, while
texts with negative t(x) values tend to be more like class 1.

The magnitude of the unnormalized score t(x) is not directly
interpretable. To fix this,
Martin and Vanberg (2007) advocate rescaling the score to ensure
that average reference texts from the two classes have scores of
−1 and +1. To realize the Martin–Vanberg scaling, for k=1,2 define

\[ t_k = \sum_{v \in \mathcal{V}} \frac{p_{2v} - p_{1v}}{p_{1v} + p_{2v}} p_{kv}. \]

An average text of length n from class k has token counts
satisfying xv/n=pkv, so that its score is t(x)=tk.
Using the relation
p1v/(p1v+p2v)=1−p2v/(p1v+p2v)
termwise in the sum, one can verify that t1=−t2.
The Martin–Vanberg wordscore scaling is

~t(x)

=−t2+t1t2−t1+t(x)⋅2t2−t1=t(x)/t2.

An average text x from class 1 satisfies
~t(x)=−1; an average text x′ from class 2 satisfies
~t(x′)=+1.
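
A quick numerical check of the rescaling, using made-up reference distributions, is sketched below. The function name `martin_vanberg_score` is hypothetical; the code verifies the identity between the two class scores and confirms that average texts from the two classes land at the two endpoints.

```python
import numpy as np


def martin_vanberg_score(x, p1, p2):
    """Rescaled wordscore t(x) / t_2 for a two-class problem."""
    x, p1, p2 = (np.asarray(a, dtype=float) for a in (x, p1, p2))
    s = (p2 - p1) / (p1 + p2)      # wordscores, equation (6)
    t = s @ x / x.sum()            # unnormalized text score, equation (7)
    t1, t2 = s @ p1, s @ p2        # scores of average class-1 and class-2 texts
    assert np.isclose(t1, -t2)     # the identity t_1 = -t_2
    return float(t / t2)


# Made-up reference distributions; average texts have counts n * p_k.
p1 = np.array([0.5, 0.3, 0.2, 0.0])
p2 = np.array([0.0, 0.2, 0.3, 0.5])
score_class1 = martin_vanberg_score(100 * p1, p1, p2)         # close to -1
score_class2 = martin_vanberg_score(100 * p2, p1, p2)         # close to +1
score_mixed = martin_vanberg_score(50 * (p1 + p2), p1, p2)    # close to 0
```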

The wordscore scaling $\tilde{t}(x)$ turns out to be deeply connected to
affinity scaling. To see this connection, note that using the parameterization
from Section 5, the score and expected information
functions for the affinity model evaluated at $\beta = 0$ are

\[
u(0) = 2 \sum_{v \in V} \frac{p_{2v} - p_{1v}}{p_{1v} + p_{2v}} \, x_v = 2 n \, t(x),
\qquad
i(0) = 2 n \sum_{v \in V} \frac{(p_{2v} - p_{1v})^2}{p_{1v} + p_{2v}} = 2 n (t_2 - t_1).
\]

There is a striking relationship between the scaled text score and
the derivatives of the mixture model log likelihood:

\[
\tilde{t}(x) / 2 = \{ i(0) \}^{-1} u(0).
\]

The right-hand side of this expression is equal to the first Fisher scoring
iterate computed while maximizing $l(\beta)$ starting from the initial value
$\beta = 0$. When the maximizer $\hat{\beta}$ is close to 0, it will
be approximately equal to this first iterate. Thus, when a text is
roughly balanced between the two reference classes ($\hat{\beta} \approx 0$),
it will also be the case that

\[
\tilde{t}(x) \approx 2 \hat{\beta} = \hat{\theta}_2 - \hat{\theta}_1.
\]

For moderate documents, the wordscore scaling is approximately a linear
transformation of the estimated class affinities.
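
The relationship can be checked numerically on a toy example. The sketch below assumes that, in the two-class case, the Section 5 parameterization reduces to θ = (1/2 − β, 1/2 + β); this is inferred from the relation between β and the affinity difference stated above and is not spelled out in this section. The data and function names are made up, and the Newton loop is a simple stand-in for whatever optimizer is used in practice.

```python
import numpy as np


def first_fisher_step(x, p1, p2):
    """First Fisher scoring iterate u(0) / i(0), starting from beta = 0."""
    x, p1, p2 = (np.asarray(a, dtype=float) for a in (x, p1, p2))
    n = x.sum()
    diff, total = p2 - p1, p1 + p2
    u0 = 2.0 * np.sum(diff / total * x)         # score at beta = 0
    i0 = 2.0 * n * np.sum(diff ** 2 / total)    # information i(0)
    return float(u0 / i0)


def affinity_mle(x, p1, p2, iters=50):
    """Maximize l(beta) = sum_v x_v log{(1/2 - beta) p1v + (1/2 + beta) p2v}.

    Plain Newton iterations, clipped to keep the affinities inside (0, 1);
    adequate for this toy example, not a robust optimizer.
    """
    x, p1, p2 = (np.asarray(a, dtype=float) for a in (x, p1, p2))
    diff = p2 - p1
    beta = 0.0
    for _ in range(iters):
        mu = (0.5 - beta) * p1 + (0.5 + beta) * p2
        score = np.sum(x * diff / mu)
        obs_info = np.sum(x * diff ** 2 / mu ** 2)
        beta = float(np.clip(beta + score / obs_info, -0.499, 0.499))
    return beta


# Made-up example: a text roughly balanced between the two classes.
p1 = np.array([0.4, 0.3, 0.2, 0.1])
p2 = np.array([0.1, 0.2, 0.3, 0.4])
x = np.array([26, 25, 25, 24])

s = (p2 - p1) / (p1 + p2)
t_tilde = float((s @ x / x.sum()) / (s @ p2))   # rescaled wordscore
step = first_fisher_step(x, p1, p2)             # equals t_tilde / 2
beta_hat = affinity_mle(x, p1, p2)              # close to the first step
```

For this balanced text, the first scoring step and the maximizer agree to several decimal places, so twice the fitted β is very close to the rescaled wordscore.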

We demonstrate the quality of this approximation in
Fig. 2(d), where we plot the wordscore scaling versus
the estimated government affinity for the moderate debate speeches.
We can see that there is very good agreement between the two scalings, and
that when $\tilde{t}(x) \approx 0$, the two scalings are almost identical.

7.3 Support vector machines and logistic regression

We have just shown analytically that affinity scaling gives similar results to
Wordscores. It turns out that, when the number of reference documents is
small, up to scaling, both methods are approximately equivalent to classifying
with a support vector machine or logistic regression.

Suppose that we are in the two-class (K=2) case, and that there is one
reference document for each class. Imagine fitting a linear classifier that
tries to predict class using a document’s word frequencies as features. With a
vocabulary size V greater than the number of training documents, the two
classes can be perfectly separated as long as the two reference distributions
$p_1$ and $p_2$ corresponding to the training documents are not identical. In this
case, the support vector machine fit and the logistic regression fit are
identical, up to differences that arise from regularizing the coefficients.

Given a document with length $n$ and word count vector $x$, its feature vector
is its vector of word frequencies, $n^{-1} x$. The feature vectors for the two
training documents are $p_1$ and $p_2$. Up to a constant of proportionality,
the maximum margin predictor, expressed as a function of $x$, is

\[
\eta(x)
  = (p_2 - p_1)^{T} \{ n^{-1} x - (1/2)(p_1 + p_2) \}
  = \frac{1}{n} \sum_{v \in V} (p_{2v} - p_{1v}) \, x_v + (\text{const.}). \tag{8}
\]

Since the classes are perfectly separated, any multiple of this predictor
gives the same classification performance on the training set; the precise
scaling chosen by the fitting procedure will depend on the regularization
parameters.

Comparing the support vector machine scaling (8) with the
unnormalized wordscores scaling (7), we can see
that the only substantive difference is the denominator $p_{1v} + p_{2v}$ in
the coefficient on $x_v$. Thus, up to a constant shift and scale, if
$p_{1v} + p_{2v}$ is roughly constant relative to $p_{2v} - p_{1v}$, then the
two methods will give similar results. In light of the connection between
Wordscores and affinity scaling developed in Sec. 7.2, this
implies that in these situations, the support vector machine results will be
highly correlated with the affinity scaling results.
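
To make the role of the denominator concrete, the sketch below uses made-up reference distributions for which the sum of the two class probabilities is exactly constant across word types. In that special case the maximum margin predictor is an exact multiple of the unnormalized text score. The function names and data are hypothetical.

```python
import numpy as np


def max_margin_predictor(x, p1, p2):
    """eta(x) = (p2 - p1)^T {x / n - (p1 + p2) / 2}, as in (8) up to scale."""
    x, p1, p2 = (np.asarray(a, dtype=float) for a in (x, p1, p2))
    return float((p2 - p1) @ (x / x.sum() - 0.5 * (p1 + p2)))


def text_score(x, p1, p2):
    """Unnormalized wordscore t(x), as in (7)."""
    x, p1, p2 = (np.asarray(a, dtype=float) for a in (x, p1, p2))
    return float(((p2 - p1) / (p1 + p2)) @ x / x.sum())


# Made-up reference distributions with p1v + p2v = 0.5 for every word type.
p1 = np.array([0.4, 0.3, 0.2, 0.1])
p2 = np.array([0.1, 0.2, 0.3, 0.4])
texts = [np.array([30, 30, 20, 20]),
         np.array([10, 20, 30, 40]),
         np.array([25, 25, 25, 25])]

eta = [max_margin_predictor(x, p1, p2) for x in texts]
t = [text_score(x, p1, p2) for x in texts]
# With a constant denominator, eta is an exact multiple of t (here eta = t / 2).
```

When the denominator varies across word types the two scalings no longer coincide exactly, which is why the text describes the agreement as approximate.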

We verified the connection between the two methods empirically, using the
SVMlight software with the default tuning
parameters (Joachims, 1999). Fig. 2(b) shows the support
vector machine estimated log odds plotted against the affinity scaling
results. Both scalings give similar results (correlation 0.92). The main
distinction is that the numerical value of the support vector machine log odds
is determined completely by the regularization parameter and is thus
uninterpretable. The affinity scaling of a document, by contrast, can be
interpreted directly.

7.4 Topic models

Topic models share a similar perspective with the affinity model in that both
represent texts as mixtures of topics, with each topic having an associated
word distribution. In our framework, the topics correspond to the reference
classes, and the text-specific topic weights correspond to class affinities.
We learn the class distributions from a set of labeled reference texts. This
approach differs from that taken by unsupervised topic models
(Blei, Ng and Jordan, 2003; Grimmer, 2010), where estimated topics may or may
not correspond to scaling quantities of interest.

Supervised variants of topic models allow for associations between labels and
topics (McAuliffe and Blei, 2008; Ramage et al., 2009; Roberts, Stewart and
Airoldi, 2016). These supervised models
force clear associations between the topics and the scaling quantities of
interest, but they all assume that the texts have discrete labels
indicating class membership rather than positions on a continuous scale.
This fundamental assumption places these methods
in the same category as other classification methods like Naive Bayes,
estimating the probability of class membership, not class affinity.

Despite their philosophical differences, in practice supervised topic models
can give scalings that are highly correlated with the affinity model scaling.
The connection to supervised topic models is easiest to understand in the case
of McAuliffe and Blei (2008)’s Supervised Latent Dirichlet Allocation (sLDA),
which models a text-specific label as a random quantity linked to a linear
function of the text-specific topic weights. Roughly speaking, the method
works in two stages. In the first stage, sLDA fits a topic model to the
reference texts. In the second stage, sLDA fits a logistic regression model
using the fitted topic weights as predictors and the class label as response.
In practice, sLDA fits the topics and the logistic regression simultaneously,
but when the number of topics is larger than the number of reference texts,
any differences between sequential and simultaneous fitting are determined by
the regularization parameters and the random initialization.

The connection between sLDA and affinity model scaling is closest with
two topics and two reference texts. In this case, since the number of topics
equals the number of reference texts, sLDA can get a perfect fit by allocating
one topic to each reference text, and can separate the two classes perfectly
given the topic weights $(\hat{\theta}_1, \hat{\theta}_2)$ by using a
linear predictor for the odds of class membership of the form $\eta = b(\hat{\theta}_2 - \hat{\theta}_1)$,
where the coefficient $b$ gets determined by the regularization parameters.
When the sLDA fit gets used for prediction on
the unlabelled texts, the fitted topic weights $(\hat{\theta}_1, \hat{\theta}_2)$
will be the same as the values from a fitted affinity model (again, ignoring
the effects of regularization and initialization).
The sLDA score will be highly correlated with the difference in estimated affinities.

In the case when there are more topics and more reference texts, the
relationship between affinity scaling and sLDA is not as simple, but the same
general intuition still holds and the two methods still give highly correlated
results. Fig. 2(e) illustrates this with a model using 10
topics, where the correlation between the non-reference text scalings from the
two methods is 0.98. Here, the sLDA method gives unreasonable results for
the extremes. Furthermore, the interpretation of the scaling value is different:
odds of class membership for sLDA, versus degree of membership for the
affinity model.

7.5 Unsupervised methods

Some approaches to scaling texts, including Latent Semantic Indexing
(Deerwester et al., 1990) and Slapin and Proksch (2008)’s “Wordfish” Poisson
scaling method, estimate latent text-specific traits using unsupervised
methods. Often, the estimated traits are correlated with recognizable
attributes, and so they can be used to scale ideology.
Letting $x_{iv}$ denote the count of word type $v$ in text $i$,
the Slapin and Proksch (2008) Wordfish model specifies that
$x_{iv}$ is a Poisson random variable with mean $\lambda_{iv}$, where
$\log \lambda_{iv} = \alpha_i + \psi_v + \theta_i \beta_v$
for some unknown text-specific parameters ($\alpha_i$ and
$\theta_i$) and word-specific parameters ($\psi_v$ and $\beta_v$).
Estimates of $\theta_i$ have been shown to provide valid estimates of latent
positions expressed in speeches (Lowe and Benoit, 2013).
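
For concreteness, the Wordfish mean structure and the corresponding Poisson log likelihood can be written down as in the sketch below. The parameter values are made up, and the function names are hypothetical; the sketch only evaluates the model and does not attempt the estimation or identification constraints used by Wordfish implementations.

```python
import math

import numpy as np


def wordfish_rates(alpha, psi, theta, beta):
    """Poisson means lambda_iv = exp(alpha_i + psi_v + theta_i * beta_v)."""
    alpha, psi, theta, beta = (np.asarray(a, dtype=float)
                               for a in (alpha, psi, theta, beta))
    return np.exp(alpha[:, None] + psi[None, :] + np.outer(theta, beta))


def wordfish_loglik(x, alpha, psi, theta, beta):
    """Poisson log likelihood of a (texts x word types) count matrix x."""
    x = np.asarray(x, dtype=float)
    lam = wordfish_rates(alpha, psi, theta, beta)
    log_factorial = np.vectorize(math.lgamma)(x + 1.0)
    return float(np.sum(x * np.log(lam) - lam - log_factorial))


# Made-up parameters for 2 texts and 3 word types.
alpha = [0.0, 0.5]          # text-specific intercepts
psi = [1.0, 0.0, -1.0]      # word-specific intercepts
theta = [-1.0, 1.0]         # latent text positions
beta = [0.5, 0.0, -0.5]     # word-specific slopes
lam = wordfish_rates(alpha, psi, theta, beta)
loglik_empty = wordfish_loglik(np.zeros((2, 3)), alpha, psi, theta, beta)
```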

The drawback to unsupervised scaling of this sort, however, is that it
provides no guarantee that the estimated latent trait corresponds to the
quantity of interest. We demonstrate this behavior in
Fig. 2(f), where we plot the Wordfish scaling estimates of
the debate speeches versus the affinity scaling estimates. The two methods give
similar results (correlation 0.82), but there are also some notable
differences. The government and opposition leaders are not the most extreme
examples as determined by Wordfish, indicating that even in this focused
context—a debate over a confidence motion—the primary dimension
of difference is something other than the government-opposition divide.

In the previous section, we used the simple analytic form of the affinity
scaling model to get an understanding of its
connections with other text scaling methods. Beyond this, we will now see
another advantage of the model’s form: its simplicity facilitates
computationally efficient diagnostic checking for the model fit.

Ideally, our fit should exhibit two characteristics. First, it should not be
driven by a small number of word types, but instead it should be determined by
an accumulation of information from many different word types. Second, the
word types that show the most influence in determining the fit should be ones
that make sense from a subject matter perspective. To check whether our
scaling results satisfy these properties, and to better understand them
generally, we will develop an influence measure to characterize the impact of
each word type in determining the overall fit.

Our strategy for assessing influence stems from Cook (1977), who,
in the context of linear regression, assesses the influence of each
observation by measuring the change that results from deleting the
observation. Proceeding analogously, we will measure the influence of a word
type $v \in V$ by setting the corresponding token count $x_v$ to zero and
observing the change in the class affinity estimate $\hat{\theta}$. Ideally,
we would do this by computing the maximizer $\hat{\theta}^{(v)}$ of the log
likelihood (or, when regularizing, the penalized log likelihood) obtained after
setting $x_v$ to zero, but the large number of word types makes this
impractical. We will settle for finding a computationally simple closed-form
approximation to $\hat{\theta}^{(v)}$.

Suppose that $x$ is a vector of token counts for the particular text of
interest, and that $\hat{\theta} = \theta_0 + C \hat{\beta}$ is the affinity
vector estimate gotten from $\hat{\beta}$, the maximizer of the
corresponding log likelihood $l(\beta)$ defined in (3).
Making the dependence on $x$ explicit, the score and observed information
functions are

\[
u(\beta; x) = C^{T} Q x, \qquad I(\beta; x) = C^{T} Q X Q^{T} C,
\]

where $X \in \mathbb{R}^{V \times V}$ is a diagonal matrix with
$X_{vv} = x_v$ for $v \in V$ and $Q = Q(\beta)$ is as defined in
Section 5.

For an arbitrary word type $v \in V$, consider the effect of setting
$x_v = 0$. This defines a new vector of token counts $x^{(v)}$ defined by
$x^{(v)}_v = 0$ and $x^{(v)}_w = x_w$ for all $w \neq v$. Let $e_v$ denote
the $v$th standard basis vector in $\mathbb{R}^{V}$ and define
$h_v = C^{T} \hat{Q} e_v$, where $\hat{Q} = Q(\hat{\beta})$. Note that
$x = x^{(v)} + x_v e_v$, so that

\[
u(\hat{\beta}; x) = u(\hat{\beta}; x^{(v)}) + x_v h_v,
\qquad
I(\hat{\beta}; x) = I(\hat{\beta}; x^{(v)}) + x_v h_v h_v^{T}.
\]

Since $u(\hat{\beta}; x) = 0$, this implies that evaluating the score
function with the new data at the old estimate gives

\[
u(\hat{\beta}; x^{(v)}) = -x_v h_v. \tag{9}
\]

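The maximizer $\hat{\beta}^{(v)}$ of the new log likelihood is roughly equal to
the first Newton step from $\hat{\beta}$. We can compute this
step explicitly by first computing the inverse of the observed information
matrix:

\[
\{ I(\hat{\beta}; x^{(v)}) \}^{-1}
  = \{ I(\hat{\beta}; x) - x_v h_v h_v^{T} \}^{-1}
  = \{ I(\hat{\beta}; x) \}^{-1}
    + ( x_v^{-1} - \tilde{h}_v^{T} h_v )^{-1} \tilde{h}_v \tilde{h}_v^{T}, \tag{10}
\]

where $\tilde{h}_v = \{ I(\hat{\beta}; x) \}^{-1} h_v$.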

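Approximating the maximizer by the first Newton step from $\hat{\beta}$
gives

\[
\hat{\beta}^{(v)}
  \approx \hat{\beta} + \{ I(\hat{\beta}; x^{(v)}) \}^{-1} u(\hat{\beta}; x^{(v)})
  = \hat{\beta} - ( x_v^{-1} - \tilde{h}_v^{T} h_v )^{-1} \tilde{h}_v,
\]

where we have used (9) and (10) to
simplify the expression. Using this approximation for $\hat{\beta}^{(v)}$ gives
us an approximation for the change in the estimated affinities:

\[
\hat{\theta} - \hat{\theta}^{(v)}
  = C \hat{\beta} - C \hat{\beta}^{(v)}
  \approx ( x_v^{-1} - \tilde{h}_v^{T} h_v )^{-1} C \tilde{h}_v.
\]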

Motivated by this approximation, we define our influence measure as

\[
d_v = (1/2) \, \bigl\| ( x_v^{-1} - \tilde{h}_v^{T} h_v )^{-1} C \tilde{h}_v \bigr\|_1, \tag{11}
\]

where $\| \cdot \|_1$ denotes the 1-norm.
When we are regularizing the estimates, using a penalized log likelihood
$\tilde{l}(\beta; x)$ in place of $l(\beta; x)$, we define the influence
similarly, using the negative Hessian $-\nabla_{\beta}^{2} \tilde{l}(\beta; x)$ in
place of $I(\beta; x)$.

Using a 1-norm instead of a Euclidean norm in the definition of $d_v$
allows us to interpret $d_v$ as the total amount of positive change to the
components of $\hat{\theta}$. Given that
$1^{T} (\hat{\theta} - \hat{\theta}^{(v)}) = 0$,
this is also equal to the total amount of negative change.
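
The influence computation can be sketched in a few lines of linear algebra. The sketch below makes two assumptions that are not stated in this section: that the matrix Q from Section 5 equals the transposed matrix of reference distributions with each column divided by the fitted mixture probability of that word type (which reproduces the score and information formulas above), and that, for two classes, θ0 = (1/2, 1/2) and C = (−1, 1)ᵀ. The data are made up, the function names are hypothetical, and the unregularized Newton loop is a stand-in for the actual fitting procedure.

```python
import numpy as np


def fit_beta(x, P, theta0, C, iters=50):
    """Newton iterations for the maximizer of l(beta) = sum_v x_v log(mu_v).

    Minimal sketch with no step-size control; assumes an interior maximum.
    """
    beta = np.zeros(C.shape[1])
    for _ in range(iters):
        mu = P @ (theta0 + C @ beta)      # mixture probability of each word type
        H = C.T @ (P.T / mu)              # C^T Q, assuming Q = P^T diag(mu)^{-1}
        score = H @ x
        info = (H * x) @ H.T              # observed information C^T Q X Q^T C
        beta = beta + np.linalg.solve(info, score)
    return beta


def influence_measures(x, P, theta0, C, beta_hat):
    """Approximate influence d_v of every word type, as in (11)."""
    mu = P @ (theta0 + C @ beta_hat)
    H = C.T @ (P.T / mu)                  # column v is h_v
    info = (H * x) @ H.T
    H_tilde = np.linalg.solve(info, H)    # column v is I^{-1} h_v
    d = np.zeros(len(x))
    for v in np.flatnonzero(x):           # word types with x_v = 0 have no influence
        scale = 1.0 / (1.0 / x[v] - H_tilde[:, v] @ H[:, v])
        d[v] = 0.5 * np.abs(scale * (C @ H_tilde[:, v])).sum()
    return d


# Made-up two-class example; the columns of P are the reference distributions.
P = np.array([[0.4, 0.1], [0.3, 0.2], [0.2, 0.3], [0.1, 0.4]])
theta0 = np.array([0.5, 0.5])
C = np.array([[-1.0], [1.0]])
x = np.array([26.0, 25.0, 25.0, 24.0])

beta_hat = fit_beta(x, P, theta0, C)
d = influence_measures(x, P, theta0, C, beta_hat)
```

With a vocabulary this small, deleting a word type is a large perturbation, so the one-step approximation is crude for the most imbalanced word types; for the mildly imbalanced second word type it agrees closely with an exact refit.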

As previously mentioned, the results presented in Fig. 3
and elsewhere above use as vocabulary the set of word types appearing
in the leadership speeches, excluding words appearing only once and words
on the English Snowball “stop” word list. Why did we exclude these words?

Initially, we did not exclude any words from the vocabulary. We fit the
affinity model to the complete vocabulary and used it to scale the 55
non-leadership speeches. Then, to help understand our results, we computed the
influence measures as defined in (11) for each speech word
count vector x and word type v. We also recorded the direction of the
influence (whether the appearance of the word pushes the fit towards
Government or Opposition). This gave us a 55×9731
matrix of (speech, word) influence measures. Most of the entries of this
matrix are zero since most count vectors x are sparse and words that do not
appear in a speech have no influence on its affinity estimate. For each word
type, we recorded the count of nonzero speech influence entries, along with
the median and maximum of the nonzero entries. We report these values in
Table 4, grouped by the direction of influence.

Government                            Opposition
Word           Count  Median   Max    Word           Count  Median   Max
and               55     1.3   2.5    the               55     2.5   4.7
our               49     0.9   2.7    that              55     1.3   3.5
graduate           3     0.8   0.9    to                55     1.2   2.6
deasy              3     0.7   1.6    they              55     1.0   2.6
attribute          1     0.7   0.7    a                 55     0.9   1.7
social            30     0.6   8.0    is                55     0.9   1.7
per cent          26     0.6   3.2    not               55     0.7   1.6
corresponding      1     0.6   0.6    people            54     0.7   3.0
nation            12     0.6   1.4    it                55     0.7   1.7
proof              2     0.6   1.0    he                42     0.6   2.0
1987              20     0.5   2.7    at                54     0.5   1.3
economic          33     0.5   2.1    his               43     0.5   1.4
will              55     0.5   1.5    taoiseach         43     0.5   1.3
international     18     0.5   1.1    by                55     0.4   0.7
union              9     0.5   0.9    as                55     0.4   1.2

Table 4: Median and maximum influence (×100) exerted by
the most influential words, grouped by direction of influence.
Medians are computed over texts containing the word.
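
The per-word summaries reported in Table 4 (count, median, and maximum of the nonzero influences, grouped by direction) could be computed from a signed influence matrix along the following lines. This is a sketch with made-up numbers; the function name and the sign convention (positive entries push a speech toward Government) are assumptions for illustration.

```python
import numpy as np


def summarize_influence(D, vocab):
    """Per-word count, median and maximum of the nonzero influences.

    D: (speeches x word types) array of signed influences; by a convention
       assumed here, positive entries push a speech toward Government.
    vocab: list of word types, one per column of D.
    Returns rows (word, direction, count, median, max), most influential first.
    """
    rows = []
    for j, word in enumerate(vocab):
        column = D[:, j]
        nonzero = column[column != 0]
        if nonzero.size == 0:
            continue                      # word type appears in no speech
        direction = "Government" if nonzero.sum() > 0 else "Opposition"
        size = np.abs(nonzero)
        rows.append((word, direction, int(size.size),
                     float(np.median(size)), float(size.max())))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Made-up influences for 3 speeches and 3 word types.
D = np.array([[0.02, 0.0, -0.01],
              [0.04, 0.0, -0.02],
              [0.00, 0.0, -0.05]])
table = summarize_influence(D, ["social", "unused", "people"])
```

Storing the influences in a sparse matrix would be a natural refinement for a vocabulary of several thousand word types, since most entries are zero.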

We can see, for example, that the word type social exhibited influence on
30 speeches. For one of these speeches, deleting the word social has the
effect of shifting the speech’s affinity estimate away from
Government by 0.08; the median shift for the 30 speeches is 0.006. Deleting
social shifts the fit away from Government; equivalently, the
appearances of social push the fit towards Government.

The influence of a word is determined by its usage rate and the degree to
which its usage is imbalanced across the reference classes. The word types
that show up as influential in Table 4 are those that
appear frequently and exhibit a small imbalance between Government and
Opposition, or else appear moderately often and exhibit a large imbalance
between the two classes. This holds generally: influential words tend to
be either highly imbalanced, or moderately imbalanced with high usage rates.

Many of the words in Table 4 make sense; for
example, social, nation, and economic influence the
affinity fit
towards Government, and people and taoiseach influence the
affinity fit towards Opposition. However, we can clearly see that
certain function words like and and the are exerting a big
influence
on the fit. These function words have slightly imbalanced usage rates in the
reference texts, which, compounded with a high usage rate, results in a large
net influence. This sensitivity to stylistic differences is a manifestation
of a common critique of the related Wordscores scaling method
(Beauchamp, 2012; Grimmer and
Stewart, 2013). To reduce sensitivity to stylistic
differences, we eliminated function words (the Snowball English “stop” words)
from our analysis.

We can also see in Table 4 that a
few rare words like attribute and proof have large influence.
These
words are not meaningful discriminators on substantive grounds, but they show
up as influential because they only appear once in the reference speeches. The
estimated probabilities for these words are unreliable, and their influence is
determined purely by estimation variability. To get around this, in our final
analysis we chose to exclude these
words (the hapax legomena) that only appear once in the reference
speeches.

Government                            Opposition
Word           Count  Median   Max    Word           Count  Median   Max
deasy              3     0.9   1.9    people            54     1.3   5.0
per cent          26     0.8   3.7    taoiseach         43     0.8   3.1
nation            12     0.8   1.8    democrats         23     0.7   1.9
social            30     0.8  10.7    minister          44     0.6   2.5
corresponding      1     0.7   0.7    system            37     0.6   2.7
1990              17     0.7   2.0    house             54     0.5   1.9
union              9     0.7   1.0    o’kennedy          5     0.5   0.9
belief             3     0.7   1.0    progressive       24     0.5   1.4
economic          33     0.7   2.8    say               39     0.5   1.3
reform            19     0.7   2.4    issue             27     0.5   1.4
1987              20     0.6   4.0    million           26     0.5   1.6
policy            27     0.6   2.0    printed            2     0.5   0.7
roads              6     0.6   2.6    wealth             6     0.5   1.4
new               38     0.6   1.6    headings           2     0.4   0.4
international     18     0.6   1.5    said              41     0.4   1.6

Table 5: Influential words after feature selection.
Reporting is as described for Table 4.

After excluding stop words and hapax legomena, we were left with a
reduced vocabulary V of 1321 word types. We re-fit the model and
re-scaled the
speeches, computing the influences of the word types in the reduced-vocabulary
model. Table 5 shows the most influential
Government and Opposition words, computed as before.
It is possible that the Snowball word list could have missed some influential
function words, but inspecting the words in Table 5 and
the other words further down in the order, we found that this was not the case
for our application. The only suspicious words are say and said,
but in the context of the debate, it makes sense that these words are
pro-Opposition. When the word said gets used, it is typically used
to quote the government (“they said” or “they continue to say”), usually
by an opposition member criticizing the government. Likewise, at first glance
it may seem suspicious that per cent is near the top of the
Government list, but in fact this is often used to cite national
statistics about the economy and the GDP, using the state of the economy
to explain the unrest.

In principle, it is possible to get standard errors for the affinity estimates
directly from the expected or observed information
function (5). However, these
likelihood-based standard errors are likely too small, because they ignore
uncertainty in the estimates of the reference distributions $(p_1, \dots, p_K)$, and they rely on the independence assumptions in the model. Ignoring
uncertainty in the reference distribution estimates is inappropriate when the
reference set is small, as it is here (three leadership speeches). Similarly,
the independence assumption—that word tokens in different positions of a
text are independent of each other—simplifies the analysis, but it is likely
violated in real-world data. To accurately assess the uncertainty in our
estimates, we need a method that accounts for the uncertainty in the
reference distribution estimates and the dependence between nearby words in
text.

To estimate the sampling distribution of the scaling estimates under dependence
between word tokens, we will use a block bootstrap that respects the natural
linguistic structure of the text. Following Lowe and Benoit (2013)’s
recommendation, we resample texts at the sentence level, which simulates sampling
variation while capturing meaningful dependencies among words within
natural syntactic units. To properly account for uncertainty in the reference
distribution estimates, we will also construct sentence-level bootstrapped
reference speeches. The full procedure is as follows:

1. For bootstrap replicates $b = 1, \dots, B$:

   (a) For each reference text $y_1, \dots, y_R$, construct a
       bootstrapped reference text $y^{*b}_1, \dots, y^{*b}_R$,
       where $y^{*b}_i$ has sentences drawn with replacement from
       $y_i$, with the same total number of sentences.

   (b) Use the bootstrapped reference texts $y^{*b}_1, \dots, y^{*b}_R$
       to estimate the reference distributions
       $\hat{p}^{*b}_1, \dots, \hat{p}^{*b}_K$ as
       described in Sec. 6.

   (c) Construct a bootstrap version of the scaled text $x^{*b}$ by
       resampling sentences from $x$, with replacement.

   (d) Compute the affinity estimate $\hat{\theta}^{*b}$ for $x^{*b}$ using
       the bootstrapped reference distributions.

2. Use the sample standard deviation of $\hat{\theta}^{*1}, \dots, \hat{\theta}^{*B}$
   as the bootstrapped estimate of the standard
   error of the affinity scaling estimate $\hat{\theta}$ for $x$.
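
The procedure above can be sketched as follows. This is a minimal illustration, not the implementation used for the reported results: texts are represented as lists of tokenized sentences, the reference distributions are estimated with simple additive smoothing as a stand-in for the estimator of Sec. 6, and the affinities are fitted with a basic EM loop for mixture weights rather than the optimizer described earlier. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter
from statistics import stdev


def resample_sentences(sentences, rng):
    """Draw the same number of sentences with replacement."""
    return [rng.choice(sentences) for _ in sentences]


def reference_distribution(texts, vocab, smooth=0.5):
    """Smoothed word distribution for one class (additive smoothing assumed)."""
    counts = Counter(tok for text in texts for sent in text for tok in sent)
    total = sum(counts[v] + smooth for v in vocab)
    return {v: (counts[v] + smooth) / total for v in vocab}


def fit_affinity(sentences, dists, iters=200):
    """EM updates for the mixture weights theta, with the class distributions fixed."""
    counts = Counter(tok for sent in sentences for tok in sent if tok in dists[0])
    K = len(dists)
    theta = [1.0 / K] * K
    n = sum(counts.values())
    if n == 0:
        return theta
    for _ in range(iters):
        new = [0.0] * K
        for v, c in counts.items():
            mix = sum(theta[k] * dists[k][v] for k in range(K))
            for k in range(K):
                new[k] += c * theta[k] * dists[k][v] / mix
        theta = [value / n for value in new]
    return theta


def bootstrap_se(text, refs_by_class, vocab, B=200, seed=0):
    """Sentence-level bootstrap standard error of each affinity component."""
    rng = random.Random(seed)
    draws = []
    for _ in range(B):
        dists = []
        for class_texts in refs_by_class:
            boot_refs = [resample_sentences(t, rng) for t in class_texts]
            dists.append(reference_distribution(boot_refs, vocab))
        boot_text = resample_sentences(text, rng)
        draws.append(fit_affinity(boot_text, dists))
    return [stdev(d[k] for d in draws) for k in range(len(refs_by_class))]


# Made-up data: one reference text per class, each a list of tokenized sentences.
gov_refs = [[["economy", "growth", "jobs"], ["growth", "reform"], ["jobs", "economy"]]]
opp_refs = [[["scandal", "resign"], ["scandal", "jobs"], ["resign", "trust"]]]
text = [["economy", "scandal"], ["jobs", "growth"], ["trust", "resign"]]
vocab = sorted({tok for refs in (gov_refs, opp_refs)
                for t in refs for sent in t for tok in sent})
se = bootstrap_se(text, [gov_refs, opp_refs], vocab, B=50, seed=1)
```

With two classes the affinities sum to one, so the two components share the same bootstrap standard error.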

We performed this procedure for all 55 non-leadership speeches, getting a
separate bootstrap standard error for each. For comparison, we computed
likelihood-based (Wald) standard errors for the estimates from the Fisher
information conditional on the reference estimates.
Unsurprisingly, the bootstrap standard errors are generally larger than the
likelihood-based estimates. The two uncertainty estimates are on the same
order of magnitude, with the bootstrap standard error being less than 1.5
times as large as the likelihood-based standard error for most of the
speeches (87%); the median ratio of the two standard errors is 1.3.
In the sequel, we use bootstrap standard errors to quantify the uncertainty
in the affinity estimates.

Fig. 4 displays the estimated government affinities for all
55 speeches after performing feature selection. The figure includes 95%
confidence intervals, computed using the sentence-level bootstrap. We discuss
these results in detail in the next section.

At both the government-versus-opposition level and the inter-party level,
the results are entirely in line with expectations: not only are the parties
arrayed in the expected order, with
opposition parties on the Opposition side and the governing parties on the
other, but also we see that speeches from the different parties align with the
extremity of their positions with regard to the establishment. The speeches of
the most centrist opposition party, Fine Gael, express a more moderate
anti-Government position than those of either the left party Labour or the far-left
Democratic Left party. This median difference emerges clearly even though we
considered the speeches of the Labour and Democratic Left leaders as
equivalent for the purposes of training the Opposition class.

The more interesting distinctions emerge when we examine intra-party
differences in expressed position. Among the government ministers, it is not
surprising to see that John Wilson, the FF Deputy Prime Minister
(Tánaiste, or “FF Tan” in the plot), and
Gerard Collins, the Foreign Minister and a senior Fianna Fáil minister,
had extreme Government-oriented
estimated positions, exceeded only by that of the Taoiseach, Charles
Haughey, himself. What is more interesting is that the next minister in the
estimated ranking, Albert Reynolds, would later become the next
Taoiseach. At the other extreme, among the most Opposition-oriented
government ministers we see notable examples in Raphael (Ray) Burke,
who was removed from his ministerial position the following year,
and Mary O’Rourke, who months later would challenge Albert Reynolds for the party
leadership.

The “back-bench” FF members voted with the government but generally gave
speeches that were far more lukewarm than those of the FF ministers. Correspondingly,
we see that the estimated Government affinities for the
back-benchers are generally lower than those of the ministers. There were three
exceptions, members with extreme estimated Government-oriented affinities: Nolan,
Cullimore, and Cowen. One of these members, Brian Cowen, became Minister
for Labour the following year, and occupied senior positions, including Prime
Minister, over the next two decades.

On the opposition side, we see a similar set of heterogeneous estimated
affinities. Two salient examples of extreme estimated Government-oriented affinities
are Fine Gael TD Garret FitzGerald, a former and future Prime Minister, and
TD Peter Barry, who had fought FitzGerald in 1987 for party leadership. Both
emphasized fairly standard economic concerns, attacking the government’s poor
economic performance rather than its corrupt behavior.
It is notable that the member with the highest estimated pro-opposition
affinity is DL member Pat Rabbitte, who would later become leader of the
Labour Party; in his speech, he engaged in a set of personal attacks against the
Taoiseach, specifically attacking his character and judgment.

Applying the class affinity scaling model to the confidence
debate speeches provides results consistent with expectations and with
previous scholarly investigations of this episode (Laver and Benoit, 2002).
Using only the texts of the speeches, we have succeeded at revealing
differences between the speakers that were not apparent from their party
affiliations.

In our application and in others like it, the correct prediction of a class is
no longer a relevant benchmark because the process of producing political text
is expected to produce heterogeneous text within each class. For us, the
class—here, voting for or against the confidence motion, which was perfectly
correlated with government or opposition status—is observed and uninteresting,
while the heterogeneity is the primary interest. Despite what would seem obvious
from a measurement model or scaling perspective, however, a standard approach in
evaluating machine learning applications in political science has been
predictive accuracy benchmarked against known classes
(e.g. Evans et al., 2007; Yu, Kaufmann and Diermeier, 2008).
This focus on estimating correct classes not only wrongly shifts attention away
from the substantively interesting variation in latent traits, but also may
ultimately impair classification generality by encouraging over-fitting to
reduce predictive error.

Our proposed alternative, class affinity scaling, is based on a
probability model similar to those underlying class predictive methods, but
allows for mixed class membership.
We have shifted focus from
class prediction, something typically uninteresting in the social sciences,
to a form of latent parameter estimation, while retaining
the advantages of supervised learning approaches where the analyst controls the
inputs that anchor the model.
While there is a strong tradition in some
disciplines, such as political science, of adapting machine learning to produce
continuous scales, practitioners are often unaware of
the differences in modeling assumptions between classification and scaling
methods (e.g. Laver, Benoit and Garry, 2003),
or they have not fully explored the implications of these assumptions
(e.g. Beauchamp, 2012).
We have highlighted the differences and similarities
in a form that encourages future development.

The relative simplicity of our method makes it amenable to direct mathematical
analysis. This simplicity allowed us to draw connections between Naive Bayes
classification, dictionary-based scaling, and a host of other methods.
We were further able to exploit the analytic simplicity of the
affinity scaling model to develop an influence measure assessing the
sensitivity of the fit, which we then used to
guide our vocabulary selection and to validate our fits to the Dáil debate.

Using our method to explore the nuances of the speeches in the 1991
Dáil confidence motion, we produced estimates for each speaker that accord
with both
a qualitative reading of the speech transcripts and an expert understanding of
Irish politics.
Our application is a
hard domain problem, where no known lexicographical map exists to
differentiate government versus opposition speech and dictionary-based
scaling, even with a dictionary derived from political text, gives unsatisfactory results.
With limited training from the leadership speeches, class affinity scaling
is able to adapt to the context of the debate and give a meaningful scaling.
The method has
applications far beyond political text, however, and could be used to score more
standard sentiment problems on a continuous scale, or applied to any other
problem for which contrasting reference texts can be identified.

Lidstone, G. J. (1920). Note on the general case of the Bayes-Laplace formula for inductive
or a posteriori probabilities. Transactions of the Faculty of Actuaries 8 182–192.
