“Numberiness”

David Kurnick

“We can indeed count” words, Eric Bulson observes, and concludes that therefore “the counting must go on” (4). The reasons to move from the first remark to the second will not be self-evident to everyone. But “Ulysses by Numbers” gives an unprecedentedly intimate sense of Joyce’s compositional practice, offering not just a fascinating picture of how Ulysses grew but also an account of why it grew in the increments it did. Perhaps the most surprising discovery here for Joyce scholars is the fact that, as Bulson puts it, “even after serialization stopped, Joyce was still writing by the numbers” (26): even released from the 6,000-word increments suggested by Pound for the novel’s serial installments, Joyce kept creating at scales of 6,000. It turns out that “Circe,” which seems to obey no rules save the volcanic logics of the unconscious and Joyce’s own ambition, is dutifully designed to fit into eight installments of The Little Review. Figure 9, where you can see this finding visualized, offers a startling picture of genius in compromise with the materiality of publication.

Bulson thus indisputably helps us get a sharper sense of how “the serial logic of length” (6) conditioned this particular masterwork. Accordingly, my questions about his essay are less about the findings themselves than his account of them, and they concern the charisma that the rhetoric of number itself exerts in the essay. Surely Bulson’s most provocative claim is that his method will help us get at Ulysses’ “numerical unconscious” (4). The formulation suggests an opaque but determining structure whose revelation will be decisive for our sense of the meaning of the whole. And Bulson does tend to connect number with causality in just this way. “More words on the page but fewer seconds passing in the plot: that is a discovery Joyce made while writing Ulysses” (19). This can’t really be said to be a discovery, though, since Joyce could have learned that discursive time affects diegetic time from (to pick a name not quite at random) Homer, who interrupts a classic action-movie moment—an arrow whizzing by Menelaos—with a startling simile about Athena deflecting it “the way a mother / would keep a fly from settling on a child / when he is happily asleep”[i]: the words take longer to read (or to hear recited) than an arrow to miss its mark, and even longer if you pause to think about them. And “more words” is only one way texts slow down story-time: arcane or boring or made-up words can achieve a similar end with relative verbal economy, as can disorienting shifts in point of view, or a lot of jokes, or odd images. Every attempted reader of Finnegans Wake knows that the number of words on the page has relatively little to do with how long it takes to read that page and how much time it seems is passing in the “plot” as you do so (if I had to quantify, I’d say that word count in the Wake isn’t even the half of it).

It’s not that word count is irrelevant to narrative pacing. But its status as the factor driving Joyce (or any other writer) in a particular novelistic project needs to be established. My sense is that Bulson is drawn to number because we have new and powerful tools to help us count with relative facility—and he has used those tools with precision and ingenuity. But we might be wary of installing the facts those tools let us assemble as the engine of textual construction; this is the methodological metalepsis Pierre Bourdieu identifies when he warns against “giving as the source of agents’ practice the theory that had to be constructed in order to explain it.”[ii] To put it simply: is number essentially operating in Bulson’s argument as a metaphor for length? We could formulate Bulson’s signal discovery about the pace of Ulysses’ growth in two ways: a) Joyce’s earliest episodes average 5,233 words, and later the average jumps to 11,179; b) with “Scylla and Charybdis” Joyce starts writing episodes at double the length he’d agreed upon with Pound, thereby facilitating the publication of the later episodes over two installments of The Little Review. The information referred to by each sentence is identical, but my sense is that the specifics of the first version would be news to Joyce, while he’d readily acknowledge the second, amused that the professoriat has finally caught up with him.

This doesn’t in itself argue for the priority of either version: literary scholarship is under no obligation to limit itself to insights that would have occurred to literary producers, especially scholarship aiming for the unconscious determinants of literary forms. But what makes 5,233 versus 11,179 a more compelling way to describe a relation between literary objects than twice as long? It seems that the appeal of the more precise version derives from the charisma of quantification more generally (as, I suspect, does the temptation to describe that precise version as exercising the shaping power of the unconscious). Ours is a numbery historical moment, and that numberiness has a marked technicist (and digital) bent.[iii] But to fathom the status of “the numerical” in Joyce, one would want to know, in addition to the numbers themselves, what conceptions of the numerical Joyce was working with, and what ideas of number may have been working through him. Does it matter to our sense of number in Ulysses that one of the pioneering efforts to map word frequency in the English Bible was published by the Reverend J. Knowles (who was developing a system to teach blind people to read) in 1904, the year in which Joyce set his novel? Or that one of the major advances in this same field—Edward Thorndike’s ranking of word frequencies in a corpus comprising 10,000 words—appeared in 1921, in the hiatus between Ulysses’ run in The Little Review and the publication of the completed book version? (Would Joyce have known or cared about either of these events?)[iv] What did “6000 words” mean to a word-processorless writer—a painstaking tally, a rough calculation based on the number of manuscript pages, a guess? (Did Joyce or Pound ever literally count anything?) And how did Joyce use these experiences of number to play with (or ignore) the rhetorics of number operative in his moment? With a writer as deliberately self-revolutionizing as Joyce, this will get complex fast: it’s not only that our numberiness is different from his, but that the degrees and meanings of numberiness varied over the course of his work.

These are not Bulson’s questions, but my suspicion is that his findings would become most resonant when seen in their context. To answer such questions we would need an ecology of number in Joyce—one that would account, in A Portrait of the Artist, for the “thousand times” Stephen Dedalus feels he has yielded to Ellen’s charms as well as the “ten thousand idolators” baptized (according to the catechizing rector at Clongowes) by Saint Francis Xavier. The rhetoric of number in Portrait reaches its peak in the hellfire sermon, where we are asked to imagine the walls—“four thousand miles thick”—that pen in the damned, as well as the stench emanating “from the millions upon millions of fetid carcasses massed together in the reeking darkness”; and the novel famously concludes with Stephen’s resolution to “encounter for the millionth time the reality of experience.”[v] Even this quick review makes clear that number in Portrait operates on a decimal system, the zeros intensifying what Joyce thus encourages us to understand as a unified vital energy: eroticism (“a thousand times”), religion (“four thousand”), and artistic vocation (“the millionth time”) are divisible by one another. We are still essentially here within a lyricism of number, number as expressivist amplification device.

Things are altogether different by the time we reach the numberiest of Ulysses’s sections: “Ithaca,” the penultimate episode, proceeds by alternating a series of questions about the ongoing action with the madly technical responses those questions elicit from the narrator (Joyce called the episode’s method “mathematical catechism”). The joke of “Ithaca” is the yawning discrepancy between the abstraction of the language and the experiential texture of the events it narrates. Nowhere is the effect more evident than in the answer to the question, “What relation existed between [Stephen and Bloom’s] ages?” Bloom is 38 in Ulysses, Stephen 22, but naturally the narration will not say so thus straightforwardly. Instead Joyce spins out a mathematical series: starting from 1883, the first year when the ratio comprised by their ages calculated in years was expressible in a non-infinite form (i.e., when Bloom was 17 and Stephen was 1), the narrator points out that had Stephen aged normally from that year while the ratio between their ages had somehow remained constant, by the 1904 in which the novel transpires, Bloom would be 374 (that is, Stephen’s age of 22 multiplied by the factor of 17 that separated them in 1883). By the time Stephen reaches 70 (in 1952), Bloom would be 1190 years old; and if Stephen were himself in turn to reach that fantastic age, Bloom would have to be 83,300 years old, “having been obliged to have been born in the year 81,396 B.C.”[vi]

Does it matter that Joyce’s numbers don’t work? (By my calculation, when Stephen reaches 1190, Bloom would be 20,230 …). Barry McCrea, commenting recently on these errors, has read them as indicating that “the world is neither perfectible nor fully describable … [T]he mistaken calculations serve as a reminder that this is a novel” and not a mathematical formula.[vii] His point is supported by the fact we can make “novelistic” sense even of the outlandishly large and faux-precise numbers: the increasing gap between the men might stand as a figure for Bloom’s sweetly anxious protectiveness toward Stephen, or of Stephen’s self-absorbed inability to imagine himself on the same time-scale as his companion. The very counterfactual ground of the thought experiment, whereby one man ages normally while the other outpaces him at ever more fantastic rates, captures the temporal warpings subtending any relation between reader and story (having first read Ulysses at 22, I will always feel that Leopold Bloom is older than me by a factor of about 1.7; now, unaccountably, I am 1.105 times older than him.) Most mysteriously, the coldness of these calculations doubles as a form of tenderness: the number-crunching of “Ithaca” might be parodying our humanistic orientation to literary character, but it does so by embarrassing us into a solicitude on behalf of literary character (by episode 17, Stephen and Bloom have been engineered to appear to exceed any numerical account of them.)
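For readers who want to retrace the arithmetic at issue, the constant-ratio thought experiment can be checked in a few lines. This is only a sketch: the function name is invented for illustration, and the single assumption is the fixed 17:1 ratio of 1883 described above. It reproduces the figures of 374 and 1190 and the corrected 20,230, and it notes, as a purely arithmetical observation rather than a claim about intention, that the episode's 83,300 happens to equal 70 × 1190.

```python
# Assumption: the ratio between the two ages stays fixed at the 17:1 of 1883.
RATIO = 17

# The figure the episode gives for Bloom once Stephen reaches 1190.
JOYCE_FIGURE = 83_300


def bloom_age_at_constant_ratio(stephen_age: int, ratio: int = RATIO) -> int:
    """Hypothetical helper: Bloom's age if the 1883 ratio never changed."""
    return stephen_age * ratio


for stephen_age in (1, 22, 70, 1190):
    print(stephen_age, bloom_age_at_constant_ratio(stephen_age))

# Gap between the episode's figure and the constant-ratio result.
print(JOYCE_FIGURE - bloom_age_at_constant_ratio(1190))

# Arithmetic observation only: the episode's figure equals 70 * 1190.
print(JOYCE_FIGURE == 70 * 1190)
```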

All of which is to say that Joyce’s episode offers both a micro-history of and a commentary on the coming-to-being of our number ecology, insisting on the headiness of its sublime technicism even as it conjures its most intensely felt reality effects. Joyce’s straddling of these two aesthetics might be one way to describe his continued interest for historians of the novel: Ulysses’ self-constitution as a professional-object-in-waiting makes it feel at once perfectly suited to and faintly mocking of the most technically precise accounts we might offer of its workings. But the suitability and the mockery both remain to be read, as do the relations between them. So, yes, “the counting must go on”—so long as we agree to remain unsure about what the counts mean.

[iii] For incisive recent contributions to the discussion of method in the digital humanities, see Alan Liu, “The Meaning of the Digital Humanities,” PMLA 128.2 (2013): 409-423, and Andrew Goldstone and Ted Underwood, “The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us,” New Literary History 45.3 (2014): 359-384. Both articles address the interpretive question as it pertains to much larger corpuses than Ulysses, but their discussions of how meaning does or does not inhere in quantitative digital methods raise interesting questions even for what Bulson calls the “single data set” constituted by Joyce’s novel (6).

[iv] J. E. De Rocher, The Counting of Words: A Review of the History, Techniques, and Theory of Word Counts (New York, 1973), 5-9.

[v] James Joyce, A Portrait of the Artist as a Young Man (1914; New York: Penguin, 2003), 72, 115, 128, 130, 275-6.

“A Hail of Information”: Ulysses, Topic Modeled

Hoyt Long and Richard Jean So, University of Chicago

What can a quantitative analysis of style tell us about James Joyce’s Ulysses? Quite a lot, according to Eric Bulson. In his “Ulysses by Numbers,” Bulson uses some of the simplest forms of “stylometrics”—word counts and measures of lexical diversity—to provide new insights into some fundamental questions: why do the novel’s episodes get longer? What’s the relationship between an episode’s length and its plot? Bulson productively correlates the concrete evidence given by word counts with questions of composition and the material constraints of serialization. While the straightforward empiricism of his argument is a strength, it left us to wonder what it misses by treating words as homogenous numerical units abstracted from their semantic contexts. But not because we believe numbers and counting are unsuited to an interpretation of the novel. One of Bulson’s great insights is that counting is hardly alien to the project of reading Ulysses, an insight encapsulated in an epigraph from Hugh Kenner (“‘Words’ are blocks delimited by spaces. So we can count them.”). For us, the question is how to push this counting further. Can we count the words in ways that do not elide their contextual signifying power? Kenner too was interested not just in the number of words on the page, but the likelihood of certain words appearing with others, in what he called “space-time block[s] of words.”[1]

As quantitative approaches to text analysis have evolved, they have similarly shifted from counting words to counting collocations of words, and even collocations of collocations. One popular innovation along these lines is probabilistic topic modeling, which we propose here as a method for exposing what Kenner calls Ulysses’s larger “verbal systems.”[2] What we discover in the process is in part obvious—that topic modeling as a method of counting is also constrained by its assumptions about words as numerical units and their relation to each other. Ulysses troubles these assumptions, which amount to a highly particular theory of information. Precisely because it does so, however, topic modeling the novel also reveals something of how the novel functions as its own form of literary information. If word counts help us understand Joyce as a “mechanical counter,” topic models help us understand him as a careful “arranger” of latent verbal structures.[3]

Topic modeling was developed over a decade ago by computer scientists hoping to aid in tasks like “information retrieval, document classification, and corpus exploration.” A major enhancement on prior methods of automated document comparison, it was intended to “discover the themes that run through” a large collection of texts without any advance knowledge of the texts themselves.[4] While this unsupervised approach was initially applied to the exploration of scientific and news articles, literary scholars have been quick to adopt its most common implementation—which employs latent Dirichlet allocation (LDA)—to track the “migratory formulae” of literary history in thousands of novels; to explore patterns of discourse in critical literature or across multiple types of corpora; and even to find themes in highly figurative poetic language.[5]

What topic modeling is good at is identifying words that occur together in multiple places across multiple documents. It connects words that tend to appear in similar contexts while helpfully distinguishing between uses of words that have multiple meanings. These clusters of co-occurring words are the “topics” produced by the model. For instance, if a topic model were applied to a large corpus of scientific articles, it might find that “matter” and “energy” frequently appeared together in a subset of those articles. Considered alone, these words are ambiguous and would not help the human reader intuit what the articles are about. But the topic model also returns additional contextual clues—words like “particle” and “dark”—thus allowing us to say that these articles probably have something to do with physics, as opposed to energy policy.
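The raw signal described here, words turning up together across documents, can be illustrated with a simple co-occurrence count. The sketch below is not topic modeling itself, only the kind of evidence it builds on. The three toy "documents" and the function name are invented for illustration, and whitespace tokenization is an assumption.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for every pair of words, how many documents contain both."""
    counts = Counter()
    for doc in documents:
        # Assumption: lowercase, whitespace-separated tokens; each word
        # is counted once per document.
        vocab = sorted(set(doc.lower().split()))
        counts.update(combinations(vocab, 2))
    return counts


# Hypothetical toy corpus echoing the example in the text.
docs = [
    "dark matter particle energy",
    "energy policy tax",
    "matter energy particle",
]

counts = cooccurrence_counts(docs)
for pair, n in counts.most_common(3):
    print(pair, n)
```

In this toy corpus "energy" appears alongside both "matter" and "policy"; it is the company the word keeps in each document that separates the two senses.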

This idea that coherent topics like “physics” are latent within clusters of co-occurring words naturally relies on a set of ontological assumptions about what a topic is, but also what a document (or “text”) is and how it is generated. Topic modeling operates on the assumption that it can use the words it observes in documents to infer the “hidden structure” of topics that likely generated those documents.[6] Thus it assumes that there are topics that already exist in the world, like “physics,” and that individual words, such as “neutron,” are probabilistically associated with these topics. It then treats every document as if it were composed from some proportion of these topics, with the words in each topic more or less likely to be chosen based on their statistical distribution within that topic. So, for example, the model will say that a document contains a 50 per cent share of a topic because it contains lots of words frequently associated with that topic, like matter, energy, particle, and neutron. Based on these words, we could then infer that this document most likely has something to do with “physics,” even if it also contains smaller shares of other topics. The topic model essentially reverse engineers the process of composition based on the set of documents it is given, returning three things: lists of words for every topic it infers; the probability of those words being associated with that topic; and the proportion (or share) of a topic present in every document.
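The generative story that the model assumes can be sketched in a few lines. The code below is a simplified illustration, not an implementation of LDA: the two topics, their word probabilities, and the function name are all invented, and the topic shares are fixed by hand, whereas LDA treats them as draws from a Dirichlet distribution. Each word is produced by first choosing a topic according to the document's shares and then choosing a word according to that topic's distribution.

```python
import random

# Hypothetical toy topics: each maps words to probabilities summing to 1.
TOPICS = {
    "physics": {"matter": 0.3, "energy": 0.3, "particle": 0.2, "neutron": 0.2},
    "policy": {"energy": 0.4, "tax": 0.3, "grid": 0.3},
}


def generate_document(topic_shares, n_words, seed=0):
    """Compose a toy document from a fixed mixture of topics."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    topic_names = list(topic_shares)
    topic_weights = [topic_shares[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        # Step 1: pick a topic in proportion to its share of the document.
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        # Step 2: pick a word in proportion to its weight within the topic.
        vocab = TOPICS[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()), k=1)[0]
        words.append(word)
    return words


doc = generate_document({"physics": 0.5, "policy": 0.5}, n_words=20)
print(" ".join(doc))
```

Fitting a topic model runs this story in reverse: given only the words, it estimates the topic word lists, the word probabilities, and the per-document shares.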

If there is any quantitative method that can relate the “blocks of words” in Ulysses to some larger verbal system, topic modeling would seem to be it. Imagine segmenting the novel into hundreds of smaller blocks—about 500 words each—and building an algorithm that infers a hidden structure of 60 topics from these blocks, which here represent our set of “documents” or “articles.” The assumption here would be that each block is composed from a subset of these 60 topics, and that words appeared in the block based on an existing association with those topics (and the words likely to appear within them). On the one hand, this is a preposterous way to consider how Ulysses was written. Surely Joyce did not draw on a limited set of topics as he wrote. And surely he did not write with fixed associations about which words were more likely to belong to which topic. By any measure, Ulysses is a “hail of information”—a spigot of words in which information itself takes precedence over narrative. Yet it is difficult to imagine this information being arranged as coherently as topic modeling would assume. That said, and as critics have long argued, the novel clearly possesses some kind of “hidden plan.”[7]
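The segmentation step imagined above is simple to sketch. The version below assumes plain whitespace tokenization and lets the final block run short. The tokenization and boundary handling actually used for the model are not specified in the text, so both choices, along with the function name, are illustrative assumptions.

```python
def split_into_blocks(text, block_size=500):
    """Split a text into consecutive blocks of at most block_size words."""
    # Assumption: a "word" is any run of characters delimited by whitespace.
    words = text.split()
    return [words[i:i + block_size] for i in range(0, len(words), block_size)]


# Hypothetical stand-in for the novel: 1,200 placeholder "words".
sample_text = " ".join(f"w{i}" for i in range(1200))
blocks = split_into_blocks(sample_text)
print([len(block) for block in blocks])
```

Each resulting block would then be handed to the topic model as one "document."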

Indeed, when we ran a topic model on Ulysses, we saw that there was a latent structure in the patterns of co-occurring words, but that this structure was meaningful only to the extent it warped the core assumptions of LDA. For instance, we found that of the 539 total word blocks in the novel, less than a fifth had more than a 20 per cent share of any one topic. Instead, most contained a small share of many different topics, suggesting a lack of topic consistency within each block. This means that Ulysses has difficulty sticking with any single topic. If we look at the top five topics in each block, things get even more interesting. What we find is that the model identifies topics that cohere at the level of each canonical episode, even after we have excluded grammatical function words and character names.[8] All the blocks from “Sirens” and “Ithaca,” for example, huddle around a limited number of topics (Figure 1), meaning that they more or less draw on similar clusters of words. This visualization confirms what we already know from literary scholarship: that Ulysses is generally organized by “episodes,” each possessing its own loosely coherent form of language, style, and theme. In a sense, the topic modeling algorithm replicates the work that scholars such as Stuart Gilbert and Frank Budgen have done in revealing the novel as constituted by discrete “episodes,” thus making Ulysses’s overall “schema” relatively comprehensible to the reader.[9]

Figure 1. One way to visualize the relation of our “blocks of words” to the 60 inferred topics is to create a network diagram. In this case, we began with a network that linked all 539 text blocks to their five most prevalent topics (numbered). We then filtered out all blocks except for those in the “Sirens” and “Ithaca” episodes to show how little overlap there is in their highest ranked topics. The complete network of blocks and topics, along with a list of all 60 topics and their highest ranking words, is available at our website: chicagotextlab.uchicago.edu.
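The two summaries reported above, the fraction of blocks in which some topic exceeds a 20 per cent share and the five most prevalent topics of each block, can be read off a table of per-block topic shares. The sketch below uses an invented table of four blocks and six topics in place of the actual 539 by 60 output, and the function names are hypothetical.

```python
def top_topics(shares, k=5):
    """Return the indices of the k topics with the largest shares."""
    ranked = sorted(range(len(shares)), key=lambda t: shares[t], reverse=True)
    return ranked[:k]


def fraction_dominated(doc_topic, threshold=0.20):
    """Fraction of blocks in which some topic's share exceeds the threshold."""
    dominated = sum(1 for shares in doc_topic if max(shares) > threshold)
    return dominated / len(doc_topic)


# Hypothetical shares: one row per block, one column per topic.
toy_shares = [
    [0.50, 0.20, 0.10, 0.10, 0.05, 0.05],
    [0.18, 0.18, 0.16, 0.16, 0.16, 0.16],
    [0.19, 0.17, 0.17, 0.17, 0.15, 0.15],
    [0.05, 0.15, 0.20, 0.20, 0.20, 0.20],
]

print(fraction_dominated(toy_shares))
print([top_topics(row) for row in toy_shares])
```

Linking each block to the topic indices returned by `top_topics` yields the edge list for a block-to-topic network of the kind the figure describes.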

While such confirmation is useful, topic modeling also exposes patterns of language and meaning that have remained hidden to scholars. Consider Topic 35, which is highly associated with the words street, passed, corner, past, bridge, and walked. We might describe this topic as a “walking” topic, or more broadly, an “urban-spatial” topic. A graph showing the fluctuation of this topic across the novel tells us that it peaks in the “Wandering Rocks” episode (Figure 2). This again confirms what we know. This episode, at its most literal level, narrates the physical meanderings of several characters, such as Father Conmee, across the urban landscape of Dublin. The words associated with this movement are what the topic model has picked up on. Yet if we look at the topic’s overall distribution, what is surprising is its sustained presence across multiple earlier episodes, such as “Hades” and “Aeolus.” We know that both are framed by Leopold Bloom walking to or from a specific location, but the strength and persistence of the topic across and through these episodes is nevertheless striking.

Figure 2. A visualization of the relative topic share of “Topic 35” in each 500-word block of Ulysses. The points on the graph indicate the exact share for each block. The line represents a running average of these values.
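A running average of the kind the caption mentions smooths the block-by-block shares so that longer trends show through. The sketch below uses a trailing window. The window length and the trailing (rather than centered) alignment are assumptions, since the caption does not specify either, and the sample values are invented.

```python
def running_average(values, window=5):
    """Trailing running average; early points use however many values exist."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Hypothetical topic shares for a short run of consecutive blocks.
shares = [0.02, 0.10, 0.04, 0.12, 0.30, 0.08]
print(running_average(shares, window=3))
```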

Few scholars would say that a main organizing theme of “Hades” is its urban landscape. More typically, it is concepts, such as death and memory (particularly Bloom’s recollection of his father’s suicide), that are said to animate the form of the episode.[10] Discrete words like “street” or “car” are seen as mere scaffolding for this larger thematic action. What the topic model results suggest, however, is that this scaffolding may have a larger symbolic function as part of a cluster of words related to movement. Consider a passage from “Hades” that the model identifies as thickly materializing Topic 35: “Mr Power’s choked laugh burst quietly in the carriage. Nelson’s pillar … We had better look a little serious, Martin Cunningham said. Mr Dedalus sighed. Ah then indeed, he said, poor little Paddy wouldn’t grudge us a laugh. Many a good one he told himself. The Lord forgive me! Mr Power said, wiping his wet eyes with his fingers. Poor Paddy!” What seem to be inconsequential markers of space or movement at the level of the individual passage (here captured in the word “carriage”) are exposed as the verbal supports of more significant themes, like death. The topic model suggests that there is a latent but necessary spatial-conceptual link reinforced within the episode and also, in its most intriguing contribution to a re-evaluation of the novel’s serial origins, across episodes.

In connecting word counts to the history of the novel’s serialization, Bulson also finds something interesting about the “Hades” episode. It is here that the episodes start getting longer. The reason, he speculates, is that the physical world of the narrative also began to “expand,” with interactions between characters becoming more complex. As a result, Joyce needed more words. It is with “Hades” that Joyce began rethinking the overall design of his work, imagining each future installment as potentially longer than the last, and thus transforming the kinds of things he could write about. What topic modeling adds to this argument is a way to see not just how serialization was changing Joyce’s attitude about how many words he could write, but in how many different combinations he could write them. Given the chance to pack in more and more information, we have to wonder how he decided to organize that information differently. Looking at Topic 35 again, it seems significant that Joyce was laying the groundwork in “Hades” for one part of a verbal system that would become greatly extended in “Wandering Rocks.” What might other topics reveal about the evolution of Joyce’s “blocks of words” as serialization of the work progressed?

These are the kinds of questions that will most likely appeal to literary critics and validate the felt usefulness of numbers as an interpretive method. More work, naturally, is needed to take these questions further. But Bulson’s essay, as well as our brief extension of his project, points to a new interpretative model for the modernist novel. Some will continue to insist that word counts or empirical models do violence to readings of Ulysses: such texts do not signify meaning via the quantity or collocations of words, but through the attention of individual human readers to the words on the page. We would not quarrel with that. Ulysses does not signify through word counts and topic models, but it can still be known through them. Indeed, the history of scholarship has left us a view of the novel as a mass of words that needs to be classified and schematized. As Kenner puts it, the novel is a “hail of information” that can be “retrieved and systematized,” and only through this labor do we “know a fraction of what we may think we know.” Bulson has shown how as simple a process as counting words can make new sense of this “hail.” With topic modeling, this project can be extended even further, inviting Ulysses into a much longer history of information theory. Not, however, because Joyce’s text can be construed as “information” in ways commensurable with current quantitative or computational methods. But precisely the opposite. Its incommensurability with the ontology of these methods exposes the continuities and ruptures that the novel shares with the contemporary information age and the tools that seek to order it. It is within this zone of continuity and discontinuity that new forms of reading will emerge.

[7] Kenner, 23. See also Derek Attridge, “Introduction,” in James Joyce’s Ulysses: A Casebook (Oxford: Oxford University Press, 2004), 3-11.

[8] Before running the topic model, we excluded all grammatical function words as well as all personal names and all character names. For the personal names, we used Jockers’s expanded stopword list. See Macroanalysis, ch. 8.

[9] See Attridge, 10-11, for an excellent review of this scholarship and what it accomplished.

[10] See Michael Seidel, Epic Geography: James Joyce’s Ulysses (Princeton: Princeton University Press, 2014), 161, for just one recent example.