Summary and Keywords

Computational models of human sentence comprehension help researchers reason about how grammar might actually be used in the understanding process. Taking a cognitivist approach, this article relates computational psycholinguistics to neighboring fields (such as linguistics), surveys important precedents, and catalogs open problems.

Computational psycholinguistics is a research area; its goal is “to build models of language that reflect in some interesting way on the ways in which people use language” (Kay, 2005). This article focuses specifically on processing models of human sentence comprehension, as delimited further in section 2.5. But before that, section 2 first differentiates this research area from three other fields. Section 3 then goes on to review the history of the field, identifying several important precedents. On the basis of this review, section 4 concludes by suggesting three open questions for further investigation.

2. Computational Psycholinguistics and Its Disciplinary Neighbors

Computational psycholinguistics has not been institutionalized within academia in the way that some of its disciplinary neighbors have been. This lack of branding can sometimes create confusion. However, newcomers will not go too far wrong if they keep in mind the three distinctions presented below.

2.1 Distinct From Linguistics

The field of Linguistics is the heir to a very long tradition whose success stories include historical linguistics in the 19th century and linguistic anthropology in the 20th. In America, Linguistics as a field came together thanks to the energy of organizing figures such as Leonard Bloomfield. As an academic discipline, it is a broad confederation of different subareas all concerned with human language as an object of study. The methodology of this study varies widely, and Linguistics itself is viewed as standing in the gap between the humanities and the sciences.

Computational psycholinguistics, on the other hand, is exclusively scientific in its orientation. Aspiring to understand how language is used, it does not abstract away from performance factors (see, e.g., Chomsky, 2006, p. 102), but rather considers computational systems that may include or represent a grammar in some way (see Stabler, 1983). The role of time offers a clear comparison between this field and the classic school of Generative Grammar within Linguistics. Typical work in generative grammar addresses sentence structure on its own, apart from the question of how that structure is recognized in time. It considers idealized states of the learner without concern for how individuals proceed from one state to another.1 Computational psycholinguistics is different insofar as time plays a crucial linking role in connecting theory to data. In computational psycholinguistics, but not in generative grammar, one would investigate whether a grammatically permissible inference is in fact drawn at an earlier or later word in a sentence. This might be explained by an automaton theory, whose transitions occur in abstract time. Observed preferences for one structure or another might similarly be interpreted in terms of probabilities, whose definitions relate human experience to the acquired language. These types of theory–data relationships differentiate computational psycholinguistics from linguistics in its unqualified form.

2.2 Distinct From Psycholinguistics

Psycholinguistics was born in 1951, fathered by the Social Science Research Council and explicitly intended as an interdisciplinary synthesis (Osgood, 1980, lecture III).2 Its leading figures were influenced both by information-processing psychology and by generative grammar (Miller & Chomsky, 1963; Bever, 1970). However, as a research area it ultimately found a home within psychology. Since the 1970s, computationally explicit work in psycholinguistics has given way to hypothesis-testing of experimental contrasts that are straightforwardly interpretable without appeal to highly mathematized theories. An exception to this sweeping generalization is connectionism, the school of cognitive science founded on neural networks. Using what would now be called machine learning, researchers in this tradition contributed precise accounts of how children might learn verbal morphology from exposure (Rumelhart & McClelland, 1986). Chater and Christiansen (2008) discuss this tradition at greater length. The biggest difference between bona fide computational psycholinguistics and unqualified psycholinguistics is the embrace of mathematized theory.

2.3 Distinct From Artificial Intelligence

Artificial Intelligence (AI), also inaugurated in the 1950s (Kline, 2011), seeks to automate tasks that would ordinarily be thought of as requiring human intelligence. Since its beginnings, there has been a debate over AI’s positioning as part of science or engineering (Boden, 2006, 13.vii). One approach advocates the mimicry of human cognition, much in the way that the pioneers of flight sought to mimic the flight of birds. This led to an important school of cognitive modeling (see chapter 5 of Hale, 2014, and section 3.7 in this article). However, since the statistical revolution of the 1990s, the engineering aspect of AI has become dominant. Computational psycholinguistics continues to borrow ideas from AI, especially from the subfield known as natural language processing, but it contrasts with modern AI by adopting an explicitly scientific, rather than engineering, outlook.

2.4 “Computational” Implies Cognitivist

Like many of the subfields mentioned above, computational psycholinguistics is part of cognitive science. This carries with it an endorsement of the computational theory of mind: that the mind is a computational system, one whose workings we can explore by building models. Kay (1980) illustrates how this mentalistic commitment flows through the work by writing:

Suppose one holds, for example, that the grammar that represents a person’s linguistic competence assigns some number of interpretations to a string but that only certain of these will be recognized under conditions of actual performance. Presumably such facts would be explained by positing a specific parsing algorithm or agenda-management policy that gives rise to that algorithm (p. 67).

Kay’s choice of the word explain in the last sentence of the quote is significant. In his example, psychological facts would be explained by an algorithm or formalized search policy, which are all varieties of computational model. This is cognitivism. In such a scientific scenario, cognitivists say that the object of study itself is a form of computation. This designation is meaningful because it invokes a mathematical notion identified by Post, Church, Turing, and others during the 20th century (see, e.g., Haugeland, 2008). Pylyshyn (1989) offers a perennially relevant discussion of the relationship between computational models and observable data. This discussion makes clear that within the broad philosophical outlines of cognitivism, there is still plenty of room for alternative research strategies.

2.5 Sentence Comprehension as an Empirical Domain

Linguistic phenomena at every scale fall within the purview of computational psycholinguistics. The present article restricts itself, however, to sentences (rather than words) as a unit of analysis, and to comprehension (rather than production) as a direction of language use. This restricted domain is still quite rich. Among other phenomena, it includes an extensive catalog of garden-path effects (see empirical chapters of Cowper, 1976; Gibson, 1991; Lewis, 1993).

3. Important Precedents in the Field

Having located the subject matter and delimited its boundaries in section 2, this section offers a rational reconstruction of the field in roughly chronological order. Each subsection introduces one core idea that is important with respect to ongoing work. Because of the interrelationships between these ideas, successive subsections build upon each other.

3.1 Augmented Transition Networks

The augmented transition network (ATN) is the archetypal computer model of human sentence comprehension. ATNs were explicitly offered as simulations of “the process by which human beings recognize the syntactic structure of sentences” (Bratley, Dewar, & Thorne, 1967). They underwrote a conceptual consensus that persisted up through the 1970s, according to Steedman (2011). Because of the various counter-reactions to this consensus, and because certain ideas associated with ATNs are so frequently reproposed in more modern work, they make for an advantageous entry point to the field as a whole.

An ATN, then, is a parsing program that searches for grammatical analyses using “networks” of states connected by arcs. These networks specify allowable word orders by defining sequences of states through which the program must pass in order to finish the analysis. They would be a notational variant of context-free phrase structure grammars were it not for mutable “registers” that may be set and tested in the course of the analysis process. These registers are storage cells that allow ATNs to recover movement relations, such as that between a filler and a gap in a relative clause. Some version of this mechanism would be needed in any adequate account of human sentence comprehension.

3.1.1 The HOLD Hypothesis About Relative Clauses

The ATN in Figure 1 models certain aspects of relative clause comprehension. In the diagram, circles denote abstract states that can be (but are not necessarily) visited in the course of processing. Arc annotations, such as CAT N, require the machine to find an appropriate syntactic category in the string being analyzed. The instruction SEEK NP causes control to transfer to the initial state of the noun phrase network.

On arc 9 of this network, where a REL PRO is required, an action takes place that loads up the HOLD cell. In an example like (1)

(1)

where the ATN has only gotten as far as the first two words, their analysis would be copied into the register named HOLD. If the string were to subsequently continue as in example (2), where who qualifies as a REL PRO

(2)

then by state S2 the machine would be in a position to again SEEK NP (arc 3) having recognized the verb met. This time, with the HOLD cell full, arc 12 could be taken. This arc bypasses all CAT requirements, allowing the empty string to qualify as a noun phrase. Popping up one level of recursion, the machine returns to arc 10 having recognized a sentence without an object; I shall refer to the resulting analysis as RC for convenience. If the CHECK HOLD test at arc 10 succeeds then RC can be labeled a MOD and the machine can proceed with the remainder of an example like (3), having successfully identified girl as the filler of the gap between met and fixed.

(3)

Textbook presentations go into further detail about ATNs’ operation (Winograd, 1983, chapter 5; Gazdar & Mellish, 1989, §3.7; Charniak & McDermott, 1985, §4.3, 4.4; Covington, 1994, §3.7). The HOLD hypothesis predicts greater cognitive load during the period of time when the ATN’s HOLD register is full, compared to when it is empty (Kaplan, 1974; Wanner & Maratsos, 1978). One can see echoes of this idea in later theories such as Gibson’s (2000) dependency locality theory (see also section 3.7).
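
The role of the HOLD register can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not a reconstruction of the network in Figure 1: the toy lexicon, the function names parse_np, parse_s, and parse, and the test sentence are all hypothetical, and the figure's arc numbers are not reproduced. The sketch keeps three ingredients of the account above: a filler is placed in HOLD when a relative pronoun is found, an empty noun phrase is allowed only while HOLD is full, and a relative clause is accepted only if HOLD has been emptied by the time it ends.

```python
# Toy lexicon (an assumption for illustration): word -> syntactic category.
LEXICON = {"the": "DET", "girl": "N", "man": "N", "car": "N",
           "who": "RELPRO", "met": "V", "fixed": "V"}

def cat(words, i):
    """Category of the word at position i, or None past the end of the string."""
    return LEXICON.get(words[i]) if i < len(words) else None

def parse_np(words, i, hold):
    """Yield (next_position, hold, tree) for each way of finding an NP at i."""
    if cat(words, i) == "DET" and cat(words, i + 1) == "N":
        head = ("NP", words[i], words[i + 1])
        j = i + 2
        # Arc ordering: the longer NP with a relative clause is tried first.
        # For simplicity, nested fillers are not supported (HOLD must be empty).
        if cat(words, j) == "RELPRO" and hold is None:
            for k, inner_hold, clause in parse_s(words, j + 1, head):
                if inner_hold is None:  # the filler was used inside the clause
                    yield k, hold, ("NP", head, ("MOD", clause))
        yield j, hold, head  # simple NP without a postmodifier
    if hold is not None:
        # Gap: the empty string counts as an NP, and HOLD is emptied.
        yield i, None, ("GAP", hold)

def parse_s(words, i, hold):
    """Yield (next_position, hold, tree) for each subject-verb-object clause at i."""
    for j, hold1, subj in parse_np(words, i, hold):
        if cat(words, j) == "V":
            for k, hold2, obj in parse_np(words, j + 1, hold1):
                yield k, hold2, ("S", subj, words[j], obj)

def parse(sentence):
    """Return every analysis that consumes the whole string with HOLD empty."""
    words = sentence.split()
    return [tree for k, hold, tree in parse_s(words, 0, None)
            if k == len(words) and hold is None]

analyses = parse("the girl who the man met fixed the car")
```

On the hypothetical sentence above, the single analysis places a GAP node carrying the filler "the girl" in the object position of "met". If the relative clause already contains an overt object, as in "the girl who the man met the car fixed the car", the HOLD register is never emptied and no analysis is returned.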

ATNs were an engineering triumph that facilitated new kinds of question-answering systems (Woods, 1977). From a scientific point of view, they formalized the view of human sentence comprehension as a search process, one which chooses between syntactic alternatives.

Figure 2 visualizes this conception of comprehension as a sequence of attachment decisions. In this diagram, unlike Figure 1, circles represent parser states that were actually visited in the course of a particular search. Arrows represent alternative parser actions, such as attaching tree nodes or moving on to the next word. At the circle labelled DEAD END no further actions can be taken. Once a sock is attached to mend, the model has committed to the locally-attractive but globally-incorrect “garden path” analysis. To continue, search must get back on the path marked with a bold line, leading to the GOAL STATE where a sock is correctly recognized as the subject of the next clause (IP for Inflectional Phrase).

On this consensus view, search proceeds under the control of preferences, whose content is essentially psychological. An example is Kimball’s (1973, p. 24) principle “Right Association” which is stated below in terms of terminal (leaf) and nonterminal (internal) nodes of a phrase structure tree.

Kimball’s use of the word optimal in this formulation lends it a sort of least-effort flavor. This principle is “closely related” (Frazier, 1979, p. 75) to another principle, Late Closure, which is a central part of Frazier and Fodor’s Garden Path theory. These parsing preferences can be programmed into ATNs by ordering their arcs, so that certain pathways are explored before others (Wanner, 1980). In Figure 1 for instance, the ordering of arc 9 before arc 8 prioritizes the search for longer NPs with postmodifying relative clauses ahead of simple NPs that lack postmodifiers. Through such ordering, Kaplan (1972) was able to emulate many of Bever’s (1970) Perceptual Strategies, such as the NVN heuristic (Strategy D), which categorizes a postverbal noun as the patient of an action expressed by the verb. A similarly ingenious combination of parsing preferences led Pereira and Shieber to a formalization of Garden Path theory (see discussion in chapter 4 of Hale, 2014).

It should be emphasized that ATNs constitute a cluster of ideas, each of which can be examined separately. They are (a) procedural, (b) multipath, and (c) related to competence grammars only loosely. Counter-reactions to them took issue with each of these points.

3.2 The Marcus Parser

Marcus (1980) represents a rejection of the “multipath” aspect of the ATN-based research. According to his Determinism Hypothesis, humans do not revise previous syntactic attachment decisions, but rather proceed along just one analysis path without backtracking. This strong claim is intended to apply only to sentences that people can understand without conscious difficulty. Such a neat line gains plausibility from the fact that garden-pathing, where people do experience conscious difficulty, is so rare. Although the Marcus parser did not (to this author’s knowledge) lead to detailed predictions regarding human behavioral experiments, its influence on computational psycholinguistics is hard to overstate.

To build a working model that respects Determinism, Marcus adopted a bottom-up strategy. This means that syntactic attachment decisions could be made after rather than before processing individual words. It contrasts with the top-down approach that was typical of ATNs at the time. Hale (2014, chapter 3) offers an example-based presentation of top-down and bottom-up parsing.
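
The bottom-up ordering of decisions can be made concrete with a small shift-reduce sketch in Python. This is not the Marcus parser; it is a minimal illustration, assuming a hypothetical three-rule grammar and toy lexicon, of how structure is built only after the relevant words have been read. The names RULES, LEXICON, and shift_reduce are introduced here for illustration only.

```python
# Hypothetical toy grammar: (parent, children). Not taken from Marcus (1980).
RULES = [("NP", ("DET", "N")), ("VP", ("V", "NP")), ("S", ("NP", "VP"))]
LEXICON = {"the": "DET", "dog": "N", "cat": "N", "chased": "V"}

def shift_reduce(words):
    """Greedy bottom-up recognition: shift a word, then reduce while possible.

    Returns the final stack and a trace of the actions taken. There is no
    backtracking, so each action, once taken, is never revised.
    """
    stack, trace = [], []
    for word in words:
        stack.append(LEXICON[word])
        trace.append("shift " + word)
        reduced = True
        while reduced:
            reduced = False
            for parent, children in RULES:
                # Reduce when the top of the stack matches a rule's children.
                if tuple(stack[-len(children):]) == children:
                    del stack[-len(children):]
                    stack.append(parent)
                    trace.append("reduce " + parent)
                    reduced = True
                    break
    return stack, trace

stack, trace = shift_reduce("the dog chased the cat".split())
```

In the resulting trace, every reduce action follows the shift of the last word it covers: the object NP, the VP, and the S node are all built only after "cat" has been read. A top-down parser would instead predict S, NP, and VP before seeing the words that confirm them.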

Later work (e.g., Nozohoor-Farshi, 1987) showed that the Marcus parser could be viewed as an LR parser, a standard approach to parsing developed by Donald Knuth in connection with compilers for programming languages (see, e.g., Grune & Jacobs, 2006, chapter 9). LR parsers avoid the need for multiple analysis pathways by delaying attachment decisions until disambiguating material comes along. It seems reasonable to postulate this sort of delay in a cognitive model, under the assumption that the psychological equivalent of backtracking is indeed rare. But this assumption coexists alongside the intuition that a faithful cognitive model ought to include an element of anticipation or expectation for upcoming words and phrases.3 This leads back to ATNs, where top-down parsing offers a natural formalization of expectation. No one particular resolution of this dilemma has prevailed in computational psycholinguistics overall.

3.3 Logic Programming

By the late 1970s, researchers in several different communities rejected the “procedural” aspect of ATNs in favor of an alternative “declarative” approach. This switchover was fueled by the success of logic programming in natural language processing (e.g., the Chat-80 system as described by Warren & Pereira, 1982). Such approaches were dramatically more successful than parsing systems designed around the idea of inverting transformational derivations (briefly summarized in the introduction to Berwick & Fong, 1995). Several new grammatical theories arose at this time, with Lexical-Functional Grammar (LFG) being perhaps the clearest example of the procedural-to-declarative switchover. As one of LFG’s cocreators put it,

An ATN establishes the underlying representation through a particular left-to-right sequence of register operations, each one applying to the results of previous operations. LFG and other feature-structure grammars instead determine [emphasis added] the appropriate structures as those that satisfy a system of constraints or conditions on the various combinations of features and values.

A declarative grammar such as LFG simply constrains well-formed representations, rather than providing instructions for rewriting trees in a derivation or setting registers in an automaton. The “determination” that Kaplan refers to happens indirectly, as lexical properties conjoin with principles of grammar to yield just a small set of satisfying grammatical structures. Figure 3 presents an example of this sort of “determination.”

This figure shows the constituent and functional structure of a topicalized English sentence as analyzed in LFG (Bresnan, Asudeh, Toivonen, & Wechsler, 2015, p. 68). In the righthand panel, Figure 3b, the curved arc visualizes the fact that the topic of the whole sentence is constrained to be equal to the object of the verb “like.”

This constraint, expressed by the equation ((x↑)TOP)=↑ at the bottom of Figure 3a, is not an instruction but rather an equality. There is no RETRIEVE HOLD action to be executed, as in ATNs. Rather, grammars are collections of constraint equations to be solved by generic procedures such as SLD resolution (Pereira & Warren, 1980). This “order-free” arrangement results in a “process-neutral” grammar that is in principle compatible with many different processing algorithms (Shieber, Schabes, & Pereira, 1995; Sag & Wasow, 2011).
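
The order-free character of declarative constraints can be illustrated with a short Python sketch. It swaps in plain feature-structure unification over nested dictionaries for SLD resolution, and it does not implement the structure sharing that a real LFG equality such as the one in Figure 3 requires. The attribute names and values below are hypothetical and are not read off the figure. Under those assumptions, the sketch shows only that the same solution results whatever order the constraints are combined in.

```python
from functools import reduce
from itertools import permutations

def unify(a, b):
    """Merge two feature structures; return None if they clash.

    Feature structures are nested dicts with atomic (string) values.
    Inputs are not modified.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for key, value in b.items():
            if key in out:
                merged = unify(out[key], value)
                if merged is None:
                    return None  # incompatible values for the same attribute
                out[key] = merged
            else:
                out[key] = value
        return out
    return a if a == b else None

# Hypothetical partial descriptions contributed by different words or rules.
CONSTRAINTS = [
    {"PRED": "like", "SUBJ": {"PRED": "we"}},
    {"OBJ": {"PRED": "Rosie"}},
    {"TOPIC": {"PRED": "Rosie"}, "TENSE": "PRES"},
]

def solve(constraints):
    """Combine constraints left to right, starting from the empty structure."""
    return reduce(unify, constraints, {})

# Solve the same constraint set in every possible order.
solutions = [solve(order) for order in permutations(CONSTRAINTS)]
```

All six orderings yield the same structure, which is the sense in which such a grammar is process-neutral: nothing in the constraints themselves says which one a processor must consult first.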

Although processing concerns were not among the explicit motivations for theories of Government and Binding (GB) (e.g., Chomsky, 1981, 1986), increasingly declarative formulations led to important advances in this tradition as well. Fong and Berwick (1991) evaluated the efficiency of testing grammatical principles in alternative orders. Johnson (1991) explored parsing algorithms that had no direct parallel with those previously studied in compiler theory. And Abney (1989) introduced a type of grammatically motivated parser action called Licensing. A prototypical example of Licensing uses theta theory, the subsystem of GB having to do with semantic roles. If the grammar requires that all words’ roles must be filled, then given a lexical entry containing a list of such roles (the “theta grid”), a parser is licensed to build new syntactic representations that will eventually fill each one. One might interpret such an action as a precompiled rule, consistent with some given set of logical formulas. However, those formulas’ declarative interpretation affords a parsing preference, too. Pritchett (1992, p. 12) formulates such a preference below, evoking the declarative interpretation with the phrase “be satisfied at every point”.

Theta Attachment The theta criterion attempts to be satisfied at every point during processing given the maximal theta grid.

Pritchett’s monograph extends this basic idea, turning many other GB principles into licensing relations and parsing preferences suitable for incremental parsing. The idea was refined yet further and evaluated crosslinguistically by Crocker (1996) and Merlo (1996).

3.4 The Competence Hypothesis

The ATN is related to transformational grammar, but only loosely. It recovers information that might have been specified in an Aspects-style deep structure, but it does so without explicitly unwinding transformations or constructing a derivation. The question then arises regarding what, if anything, a competence grammar contributes to scientific explanations based upon such a loose-fitting model?

From the outset, there had been an expectation that the linguist’s grammar ought to contribute something. Chomsky expresses this hope in writing:

No doubt, a reasonable model of language use will incorporate as a basic component, the generative grammar that expresses the speaker-hearer’s knowledge of the language.

Bresnan and Kaplan (1982) dub this the Competence Hypothesis, initiating an important line of work on what it might mean for a model to incorporate a generative grammar “as a basic component.” Their own contribution, LFG, came at a time of widespread skepticism about generative grammars in psychology. An alluring hypothesis, known as the Derivational Theory of Complexity (DTC), was written off as having been unsupported by psychological experiments (Fodor, Bever, & Garrett, 1974, chapter 6, part 2). The DTC is the idea that the count of transformational rules applied in the competence-theoretical derivation of a particular structure should index behavioral difficulty, as measured in experiments with human comprehenders. The central disappointment of these experimental results was that counts of transformational rule applications (such as passivization, relativization, or dative movement) had failed to correlate with measured difficulty across a variety of experiments conducted during the 1960s.

LFG, by contrast, does not have transformational rules. Passive voice is handled in the lexicon. Filler–gap relationships, as in relative clauses, are handled with an equality statement that holds at just one level of representation. Under these assumptions, Bresnan and Kaplan argue (p. xxiv), complexity effects in a reasonable model of language use would reflect just surface structure—not transformational derivations. In this way the Competence Hypothesis could be sustained, in a substantive way, but only by adopting LFG rather than (classical) transformational grammar.

Berwick and Weinberg (1984, pp. 56–65) offer an alternative resolution: Rather than changing the grammar, change the processing architecture. They proceed to endow the Marcus parser with the ability to execute two actions at once. Under this architectural assumption, both the Standard Theory and the Extended Standard Theory could (in principle) be compatible with the DTC results. The Marcus parser stands in a relationship of “type transparency” with transformational grammar. Under this view there could be two parser actions that together do the job of recognizing the grammatical relationships specified by a single transformation.

Of course, the stronger view, that grammars should bear a one-to-one relationship to the steps performed by processing algorithms, retains a broad appeal. Under this “Strong” Competence Hypothesis, as it came to be called, the psychological theory of word-by-word language comprehension and the linguistic theory of sentence structure ought to coconstrain one another.

3.5 Incremental Interpretation

Steedman (1989) argues that Strong Competence motivates Combinatory Categorial Grammar (CCG) over other theories of grammatical competence. The argument turns on “incremental interpretation,” that is, the delivery of semantic representations for initial sentence fragments (for more see Chater, Pickering, & Milward, 1995). For instance, the lambda term in example (4b) from Steedman (2000) corresponds to the boldfaced initial substring of the sentence in (4a).

(4)

In this formula, def′ is a definiteness operator that selects a contextually salient member from a set of individuals satisfying some property, and sb′ is a Skolem constant representing the unknown agent of the sending action (“somebody”). It would be natural to use these sorts of incremental interpretations to define meaning-related parsing preferences. Such preferences would explain the discourse-dependent suppression of garden-pathing documented by Crain and Steedman (1985), among other phenomena.

CCG is able to assign interpretations incrementally, Steedman contends, because of its flexible approach to constituency. There are so many CCG derivations that all yield the same semantic representation that it is often possible to find one that is highly left-branching.4 This makes for a very simple incremental parsing program that just composes constituents and proposes the topmost semantic representation as output. Sentence types that would have called for complex extragrammatical machinery in an ATN are handled straightforwardly in such a CCG processor.

This argument provoked a debate that remains highly instructive for proposers of word-by-word comprehension models. Stabler (1991) offered an initial response to Steedman. He suggests that syntactic and semantic processing could be intermixed in a “nonpedestrian” search strategy that allows incremental interpretations to be deduced as soon as possible. This allows a grammar-writer to retain right-branching derivations. Stabler also offers an encoding of predicate–argument relationships that delays semantic composition, compared to the simplest CCG processor. This part replies to Abney’s (1989) suggestion that LR parsing is incompatible with incremental interpretation.

Shieber and Johnson (1993) continued the debate. While they agree with Stabler that syntactic and semantic processing can be intermixed, they suggest dropping the presumption that a parser should synchronize syntactic and semantic processing at constituent boundaries. Rather, processing at both levels should have the freedom to proceed asynchronously. To make this work, Shieber and Johnson draw attention to the set of derivations that remain in play at successive initial fragments. In an LR parser, these sets will necessarily have a regular structure. For instance, a modifier might be constrained to attach to a VP even while the embedding level of the VP itself remains unknown. Using this information fully, one can propagate syntactic indeterminacy all the way through to semantics even before a constituent boundary.

Steedman (2000, pp. 244–245) acknowledges these replies and the technical solutions that they contain. CCG is no longer the only formalism that supports incremental interpretation. However, Steedman emphasizes that proposed technical solutions are just that: technical. CCG’s flexible constituency, on the other hand, is not a mere technicality. It is independently motivated by facts about coordination, ellipsis, and other syntactic phenomena. If one adopts CCG on linguistic grounds, then one gains an elegant approach to incremental interpretation as well.

3.6 Generative Capacity

If a processing model is tightly linked to a grammar, as it is under the Strong Competence Hypothesis, then it is pertinent to ask how powerful that grammar is. Can rules within the grammar formalism derive any possible word order? Or are there some mathematical limits to the sorts of patterns that may be described? This is important because processors will naturally inherit these limits.

Levelt (2008) includes a new postscript containing a brief, accessible sketch of key background material in this area; see especially pages 2–5. As Levelt explains, there were no limits to the generative capacity of classical transformational grammars of the 1960s (Peters & Ritchie, 1973). This ability to derive anything was indeed inherited by ATNs, which typically included an escape hatch to run Lisp code. Such an escape hatch undermines any sort of general claim about the computational complexity of human language processing. But since the 1970s, a tradition proceeding from the work of Aravind Joshi has pursued a more restrictive approach. This tradition arrives at what Stabler (2013, §17.2) calls a “hidden consensus” regarding the generative capacity needed to properly characterize natural language.5 Several different formalisms, including Joshi’s own Tree-Adjoining Grammars (TAG), Steedman’s CCG, and Stabler’s Minimalist Grammars (MGs), all are adequate to describe the most challenging word-order generalizations attested in real human languages, while remaining only “mildly” context-sensitive. According to the hidden consensus, these restricted grammar formalisms are better hypotheses about grammar inside the head than more general formalisms based on, say, tree rewriting or feature unification as in Figure 3.

A key argument for restrictive grammar formalisms turns on data from Germanic languages. In languages like Dutch it is possible to form embedded sentences by clustering verbs such as “see,” “want,” or “teach” in a cross-serial rather than nested fashion (for a textbook presentation, see chapter 2 of Kallmeyer, 2010). If these cross-serial dependencies are a truly productive aspect of even one attested human language, then context-free grammars and their associated pushdown-stack automata are ruled out.6 Indeed, Bach, Brown, and Marslen-Wilson (1986) report offline data suggesting that Dutch crossed dependencies are even easier to comprehend than their German translations with corresponding nested dependencies (see Kaan & Vasić, 2004; Kobele, Gerth, & Hale, 2012, for online data). Joshi (1990) presents a processing model that derives this processing asymmetry. Naturally, this model is based on TAG. This achievement underlines the attractiveness of mildly context-sensitive grammars in models of human sentence comprehension.
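
The difference between nested and cross-serial orders can be made tangible with a small Python sketch. It is an informal illustration rather than a proof about generative capacity, and it is not Joshi's (1990) model. Nouns and verbs are abstracted to hypothetical tokens such as N1 and V1, where matching numbers mark a dependency, and the function name resolve is introduced here for illustration. The sketch checks whether a single last-in-first-out stack, or alternatively a first-in-first-out queue, retrieves the right noun for each verb.

```python
from collections import deque

def resolve(tokens, discipline):
    """Check noun-verb dependencies using one store of pending nouns.

    tokens: strings like "N1" or "V1"; a verb must retrieve the noun that
    bears the same index. discipline: "stack" (last in, first out) or
    "queue" (first in, first out).
    """
    store = deque()
    for token in tokens:
        kind, index = token[0], token[1:]
        if kind == "N":
            store.append(index)
        else:
            if not store:
                return False
            retrieved = store.pop() if discipline == "stack" else store.popleft()
            if retrieved != index:
                return False  # the verb was paired with the wrong noun
    return not store  # every noun must have found its verb

# Abstract word orders (hypothetical tokens, not actual German or Dutch words).
NESTED = ["N1", "N2", "N3", "V3", "V2", "V1"]   # German-like order
CROSSED = ["N1", "N2", "N3", "V1", "V2", "V3"]  # Dutch-like order
```

With these inputs, the stack discipline succeeds on the nested order and fails on the crossed order, while the queue discipline does the reverse. This is one intuitive way to see why a device limited to a single pushdown stack cannot track cross-serial dependencies of unbounded depth.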

Kobele et al. (2013) and Graf, Monette, and Zhang (2017) continue this line of work by evaluating processing models based on MGs. They derive well-known processing asymmetries, for instance between cross-serial and nested verb clusters, and between subject-extracted and object-extracted relative clauses, on the basis of exactly the sorts of complexity metrics pioneered by Joshi. These metrics quantify how long a grammatical representation would need to be retained in memory during processing by an automaton model. This represents a generalization of the HOLD hypothesis introduced in the earlier ATN work.
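
The flavor of such memory-based metrics can be conveyed with a deliberately simplified Python sketch. The metrics in the cited work are defined over the steps of a parser for a particular grammar; the version below instead counts word positions between a noun and its verb, which is an assumption made purely for illustration. The token sequences and the function name tenures are likewise hypothetical.

```python
def tenures(tokens):
    """Map each dependency index to how many positions its noun waits.

    tokens: strings like "N1" or "V1". The tenure of dependency i is the
    distance, in word positions, from noun Ni to verb Vi.
    """
    noun_position = {}
    waits = {}
    for position, token in enumerate(tokens):
        kind, index = token[0], token[1:]
        if kind == "N":
            noun_position[index] = position
        else:
            waits[index] = position - noun_position[index]
    return waits

NESTED = ["N1", "N2", "N3", "V3", "V2", "V1"]
CROSSED = ["N1", "N2", "N3", "V1", "V2", "V3"]

nested_waits = tenures(NESTED)    # longest wait belongs to the first noun
crossed_waits = tenures(CROSSED)  # every noun waits equally long
```

Under this crude measure the total waiting time is the same for both orders, but the maximum differs: the first noun of the nested order waits five positions, whereas no noun in the crossed order waits more than three. Whether a maximum, a sum, or some other summary is the right linking hypothesis is exactly the kind of question that the cited work evaluates.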

3.7 Architecture

“Mild” context-sensitivity, in the sense discussed above in section 3.6, is a leading candidate among proposed formal universals, that is, something about language that may be true across all humans. An alternative approach toward this same goal starts from the primitives out of which sentence processing models are built. The term cognitive architecture refers to these primitives. A particularly clear example is Soar, an architecture which Allen Newell offered as a candidate Unified Theory of Cognition (1990). Using Soar, Lewis (1993) defines a sentence processing model whose operation matches human performance across a wide variety of phenomena, such as garden-path sentences. The grammatical representations in this model were inspired by theories of Government and Binding (see section 3.3). However, the control regime was simply the Soar system itself: an AI program that sequentially chooses operators to transform one state of a problem into another one that is hopefully closer to the goal. The important point is that Lewis’s model assimilates attachment decisions to the Soar decision cycle. In this model, ambiguous attachments for pieces of phrase structure are just another variety of Soar “impasse.” They are dealt with using the full power of Soar, such as to reflect on its own lack of knowledge and to “memoize” the results of such deliberation for the future. This memoization, or chunking as shown in Figure 4 from Lewis, was instrumental in bringing one particular model up to human-level reading speeds.

Of course, there are Soar models of other parts of human cognition as well. Explanatory success in the realm of language thus strengthens the overall case for Soar as a unified theory (for highlights see Lewis, 1999). Hale (2014, chapter 5) presents a general way of compiling phrase-structure grammars into Soar models, yielding processors that respect the Strong Competence Hypothesis. These same models can then take advantage of architectural facilities like reinforcement learning (see Hale, 2014, chapter 6).

Besides a decision procedure, cognitive architectures typically also specify memory: how many different types there are, the nature of information that can be stored in them, the time-course of retrieval from them, etc. These aspects of memory are obviously relevant to human sentence comprehension, and they receive computationally explicit answers within the ACT-R architecture. ACT-R is the latest result of a 40+-year research project integrating realistic memory models into a broader computational theory of mind (Anderson & Bower, 1973; Anderson, 1983; Anderson & Lebiere, 1998; Anderson, 2007). Leveraging this work, Lewis and Vasishth (2005) apply ACT-R to human sentence comprehension. Their processing model makes sense of a wide variety of center-embedding phenomena, among other data points. The core of the proposal is the idea that grammatical dependencies motivate memory retrievals. An example would be the relationship between a subject NP inflected for person and number and an agreeing matrix verb. Theorizing these memory retrievals in ACT-R, one can work out detailed predictions about the time-course of human sentence processing. This model inspired many subsequent investigations of other linguistic dependency types. Parker, Shvartsman, and Van Dyke (2016) identify open questions within this tradition.
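
To make the notion of retrieval time-course concrete, the following Python sketch uses two equations in the form in which they are commonly presented for ACT-R: base-level activation is the logarithm of a sum of power-law-decayed memory traces, and retrieval latency decreases exponentially with activation. The function names, the decay of 0.5, and the latency factor of 1.0 are illustrative assumptions rather than settings taken from Lewis and Vasishth (2005), and other components of activation are omitted.

```python
import math


def base_level_activation(use_times, now, decay=0.5):
    """Log of summed decayed traces; every use time must precede `now`."""
    return math.log(sum((now - t) ** -decay for t in use_times))


def retrieval_latency(activation, latency_factor=1.0):
    """Retrieval time shrinks exponentially as activation grows."""
    return latency_factor * math.exp(-activation)


# A chunk (for example, a subject NP) encoded 1 s ago versus 4 s ago.
recent = base_level_activation([0.0], now=1.0)
distant = base_level_activation([0.0], now=4.0)
recent_latency = retrieval_latency(recent)
distant_latency = retrieval_latency(distant)
```

Under these assumptions, the more distant chunk has lower activation and is therefore retrieved more slowly, which is the kind of prediction that links dependency length to processing time.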

3.8 Information-Theoretical Complexity Metrics

With the rise of statistical natural language processing in the 1990s, probability came to prominence in AI and in cognitive science more generally. Against this backdrop, Hale (2001) introduced surprisal as a complexity metric for human sentence processing difficulty. A word has high surprisal if its probability is low given its left context, on some explicit probability model. The term itself dates back to Tribus (1961) as a proposal about the information value of an observed event. Levy (2008) shows that surprisal can be derived from Kullback-Leibler divergence, another information-theoretical notion.
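
As a minimal illustration, the surprisal of a word can be computed as the negative base-2 logarithm of its conditional probability given the preceding context. The Python sketch below assumes an unsmoothed bigram model estimated from a four-sentence toy corpus; this is a deliberately simple stand-in for the grammar-based probability models used in the work cited here, and the function names and the corpus are hypothetical.

```python
import math
from collections import Counter


def train_bigram_counts(corpus):
    """Count bigrams and their left contexts, with <s> marking sentence start."""
    bigram_counts = Counter()
    context_counts = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + list(sentence)
        for previous, word in zip(tokens, tokens[1:]):
            bigram_counts[(previous, word)] += 1
            context_counts[previous] += 1
    return bigram_counts, context_counts


def surprisals(sentence, bigram_counts, context_counts):
    """Per-word surprisal in bits; unseen bigrams get infinite surprisal."""
    tokens = ["<s>"] + list(sentence)
    values = []
    for previous, word in zip(tokens, tokens[1:]):
        count = bigram_counts[(previous, word)]
        if count == 0:
            values.append(float("inf"))
        else:
            values.append(-math.log2(count / context_counts[previous]))
    return values


toy_corpus = [
    ["the", "dog", "barked"],
    ["the", "dog", "slept"],
    ["the", "cat", "slept"],
    ["the", "dog", "barked"],
]
bigram_counts, context_counts = train_bigram_counts(toy_corpus)
cat_surprisals = surprisals(["the", "cat", "slept"], bigram_counts, context_counts)
dog_surprisals = surprisals(["the", "dog", "slept"], bigram_counts, context_counts)
```

In this toy corpus, “cat” follows “the” one time in four, so its surprisal is 2 bits, whereas the more frequent continuation “dog” receives a lower value.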

As Hale (2016) explains at greater length, surprisal led to a resurgence of computational psycholinguistics because it links probabilistic grammars and probabilistic parsers to experimentally observable quantities such as reading time. For instance, Boston, Hale, Vasishth, and Kliegl (2008) and Demberg and Keller (2008) both show how surprisal values from incremental parsers can improve regression models of eye-fixation durations. Surprisal values are also positively related to the amplitude of the N400 component of event-related potentials on the scalp (Frank, Otten, Galli, & Vigliocco, 2015), as well as to blood-oxygen level-dependent signals from across the brain’s language network (Brennan, Stabler, Van Wagenen, Luh, & Hale, 2016; Henderson, Choi, Lowder, & Ferreira, 2016).

Surprisal is not a freestanding theory of human sentence comprehension, but rather a way of linking probabilistic models of language to observable data. In combination with eye-tracking corpora, it has been used to explore a wide array of ideas: limited memory (Wu, Bachrach, Cardenas, & Schuler, 2010), parallel processing (Boston et al., 2011), connected tree-structure (Demberg, Keller, & Koller, 2013), and even the denial of tree structure (Frank & Bod, 2011; but see Fossum & Levy, 2012 and van Schijndel & Schuler, 2015).

Nor is surprisal unique among information-theoretical metrics. Another metric, entropy reduction, supposes that greater processing effort should be observable when a person goes from a state of grammatical uncertainty to one where there is greater certainty (Hale, 2003b). This metric correlates with electrical signals measured intracranially from people who are in the hospital awaiting treatment for epilepsy (Nelson, Dehaene, & Hale, 2017). Entropy reduction has done a better job than surprisal in accounting for the difficulty profile in sentences containing long-distance dependencies such as relativization (Hale, 2003a, 2006; Yun, Chen, Hunter, Whitman, & Hale, 2015). Of course, the underlying grammar must be expressive enough to define these relationships. Yun et al. (2015), for instance, use Minimalist Grammars (see section 3.6) to express a transformational account of relativization. This approach acknowledges the consequences of all viable analyses at a word, apart from any architectural constraints that might limit their number or impose a sequential ordering upon them. This exemplifies how incremental complexity metrics can be used to derive idealized difficulty levels directly from a probabilistic grammar with minimal assumptions regarding cognitive architecture.
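
The entropy reduction idea can be sketched in Python under a strong simplifying assumption: instead of a probabilistic grammar defining infinitely many derivations, the “language” is a finite probability distribution over four complete sentences. Uncertainty at each point is the entropy of the distribution over sentences still compatible with the words seen so far, and the metric at a word is the drop in that entropy, floored at zero. The function names and the toy distribution are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def condition_on_prefix(dist, prefix):
    """Renormalize over sentences that start with `prefix`.

    Assumes at least one sentence in `dist` is compatible with the prefix.
    """
    prefix = tuple(prefix)
    compatible = {s: p for s, p in dist.items() if s[:len(prefix)] == prefix}
    total = sum(compatible.values())
    return {s: p / total for s, p in compatible.items()}


def entropy_reductions(dist, sentence):
    """Word-by-word decrease in uncertainty, never below zero."""
    reductions = []
    previous = entropy(dist)
    for i in range(1, len(sentence) + 1):
        current = entropy(condition_on_prefix(dist, sentence[:i]))
        reductions.append(max(0.0, previous - current))
        previous = current
    return reductions


toy_language = {
    ("the", "dog", "barked"): 0.25,
    ("the", "dog", "slept"): 0.25,
    ("a", "cat", "slept"): 0.25,
    ("a", "cat", "purred"): 0.25,
}
reductions = entropy_reductions(toy_language, ["the", "dog", "barked"])
```

Here the first and last words each remove one bit of uncertainty, while “dog” removes none because it was already the only noun compatible with “the”. With a real grammar, the same quantity would be computed over the derivations compatible with each prefix rather than over an enumerated list of sentences.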

4. Open Problems

The rich conceptual heritage of computational psycholinguistics gives researchers a head start on many challenges.

4.1 Neural Realization

Sentence comprehension happens inside the brains of real people. By applying models of human sentence comprehension, it should be possible to offer an increasingly fine-grained functional characterization of how the process works. One approach relies on naturalistic texts which participants hear or read while neural signals are collected. This “naturalistic” approach offers modelers a rich time course of observations, compared to self-paced reading or eye-tracking (for reviews, see Brennan, 2016; Murphy et al., forthcoming).

4.2 Differences Between Languages

As section 3.7 suggested, a computational model can differentiate between the shared architecture of cognition and language-particular knowledge. Cross-linguistic comparison brings this distinction forward in a striking way, a point that has been appreciated by functionalists and formalists alike (MacWhinney & Bates, 1989; De Vincenzi & Lombardo, 2000).

Computational psycholinguists can contribute here with multilingual process models. A simple example would be the automaton version of Grillo and Costa’s (2014) “pseudo-relative first” heuristic in chapter 4 of Hale (2014). In this case, what initially seemed like language-specific differences turn out to be subsumable under the rubric of “do the most specific thing,” a basic principle of many cognitive architectures.

Analogous formal work could help resolve other apparent differences between languages, for example those concerning what appears to be a subject preference across constructions and languages (Bornkessel-Schlesewsky & Schlesewsky, 2015; Longenbaugh & Polinsky, 2017).

4.3 Form vs. Meaning

No fixed theory of syntactic preferences does very well at explaining human attachment decisions in corpora. As Marcus (1980) recognized, there must be semantic influences on human parsing. But what are they? Figure 5 shows his own suggestions from more than 35 years ago. Now, advances in broad-coverage parsing, including standardized meaning representations, help bring this question closer to scientific tractability (Ambati, Deoskar, Johnson, & Steedman, 2015; Artzi, Lee, & Zettlemoyer, 2015). Tractability in this sense might mean quantifying the gap between ambiguity resolution methods purely based on form and those that take into account specific meaning-related considerations that can be represented in a computer.

Berwick, R. C., & Fong, S. (1995). A quarter century of computation with transformational grammar. In Linguistics and computation (pp. 103–143). Stanford, CA: Center for the Study of Language and Information.

Demberg, V. (2012). Incremental derivations in CCG. In Proceedings of the 11th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+11) (pp. 198–206). Conference held in Paris, September 2012. Retrieved from http://www.aclweb.org/anthology-new/W/W12/W12-4623.

Hale, J. A. (2001). Probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics.

Notes:

(1.)
This “atemporal” characterization of generative grammar corresponds to the Formalist view laid out in §2 of Phillips & Lewis (2013).

(2.)
Psycholinguistics, of course, has had many lives. Blumenthal (1970, chapter 5) argues it should rightfully be viewed as a continuation of an earlier German tradition. Levelt (2013) explores these historical antecedents, prior to the cognitive revolution.

(3.)
Both Stabler (1991) and Shieber & Johnson (1993) seek to reconcile LR parsing with some notion of incremental interpretation, as discussed in section 3.5.

(4.)
There do seem to be some sentence types where standard CCG does not afford a completely incremental derivation. However, these exceptions do not impugn Steedman’s main point: that incremental interpretation comes far more naturally in CCG than in other approaches to grammar (Demberg, 2012).

(5.)
Stabler’s footnote 5 acknowledges a few exceptions to this consensus, cases such as Chinese number names, where an adequate natural language grammar might need to be more than mildly context-sensitive.

(6.)
This basic view, where pushdown automata are taken as mathematical models of parsing, is presented in chapter 3 of Hale (2014), among other places.