Around the word: A corpus linguist's research notebook (https://corpling.hypotheses.org)

Plotting collocation networks with R and ggraph

In a previous post, I showed how to use the igraph package for R to plot a graph of the nominal collocates of the verbs hoard and stockpile in the Coronavirus Corpus (Davies, 2020). In this post, I show how to do the same with the ggraph package.

ggraph

The grammar of graphics as implemented in ggplot2 is a poor fit for graph and network visualizations due to its reliance on tabular data input. ggraph is an extension of the ggplot2 API tailored to graph visualizations and provides the same flexible approach to building up plots layer by layer.

Although presented as an improvement upon ggplot2, ggraph also acts as an igraph add-on, bringing the flexibility of the tidyverse grammar and a wider range of graphics settings to it. Naturally, because of its tidyverse roots, ggraph is well suited to the logic of tidygraph.

Walkthrough

We install and load three packages: igraph, ggraph, and tidygraph.
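A sketch of the setup (standard CRAN installation); the graph object itself, called G below, is assumed to have been built from the edge list and vertex attributes as in the previous post, which is not reproduced here:

install.packages(c("igraph", "ggraph", "tidygraph"))
library(igraph)
library(ggraph)
library(tidygraph)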

The output says that we have a graph that is undirected (U) and named (N) with 154 nodes and 202 edges. The vertices have coll_freq (i.e. collocation frequency) as a quantitative attribute, and the edges are weighted with MI (i.e. mutual information).

For the sake of illustration, let us compare the igraph object to its tidygraph counterpart:
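A sketch of the conversion, assuming the igraph object created earlier is called G:

tg <- as_tbl_graph(G)   # tidygraph conversion of the igraph object
tg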

tg is a tibble graph (tbl_graph). As the tidygraph documentation indicates:

The tbl_graph class is a thin wrapper around an igraph object that provides methods for manipulating the graph using the tidy API. As it is just a subclass of igraph every igraph method will work as expected.

geom_edge_diagonal(alpha = .2, color='white'): this draws edges as diagonal Bézier curves rather than straight lines. A level of transparency is added (alpha = .2) and the edges are printed in white (color='white');

geom_node_point(size=log(v.size)*2, color=colorVals): the size of each node is indexed on the log of collocation frequency times two (size=log(v.size)*2) and the color of each node is indexed on its centrality score (color=colorVals).
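Putting the layers together, a minimal sketch of the plotting call might look as follows. The layout choice ("fr", i.e. Fruchterman-Reingold) and the dark background are assumptions on my part; v.size and colorVals are the node-size and centrality-colour vectors defined earlier in the walkthrough (not all of that code is reproduced in this excerpt):

ggraph(tg, layout = "fr") +                                    # Fruchterman-Reingold layout
  geom_edge_diagonal(alpha = .2, color = 'white') +            # edges as diagonal Bézier curves
  geom_node_point(size = log(v.size) * 2, color = colorVals) + # node size and colour
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +  # node labels
  theme_graph(background = "grey20")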

Plotting collocation networks with R: ‘hoard’ vs. ‘stockpile’ in the Coronavirus Corpus

This post is a follow-up to the previous one on graph theory and corpus linguistics. I show how to plot a graph of collocation networks with R and the igraph package. The case study focuses on the nominal collocates of two near-synonymous verbs in the brand new Coronavirus Corpus: hoard and stockpile.

Graphs are linguistically relevant

A graph consists of vertices (nodes) and edges (links). In Fig. 1a, each circle is a node and each line is an edge. Each edge denotes a relationship between two nodes. The relationship is symmetric, i.e. undirected.

The edges of a graph may be asymmetric, i.e. have a direction associated with them. The graph in Fig. 1b illustrates this asymmetry. This graph is directed.

Fig. 1. Two basic graphs: (a) undirected; (b) directed

In a study on collocations, words are going to be the nodes and edges are going to stand for their co-occurrence. The attributes of a graph may be assigned linguistically relevant features. For example, the frequency of a constituent has a correlate in the importance of the node: frequent nodes may be represented with a larger size than infrequent nodes. The co-occurrence frequency of at least two nodes has a correlate in the number of edges between nodes: frequent co-occurrence can be visualized by means of either multiple edges or one edge whose thickness is indexed on frequency or some association metric.

The collocates of hoard and stockpile in the Coronavirus corpus

On May 15th, Mark Davies announced the release of the Coronavirus Corpus (Davies, 2020). Here is an excerpt from the official announcement, which I received by email:

The Coronavirus Corpus is designed to be the definitive record of the social, cultural, and economic impact of the coronavirus (COVID-19) in 2020 and beyond, and it is part of the English-Corpora.org suite of corpora, which offer unparalleled insight into genre-based, historical, and dialectal variation in English.

The corpus is currently about 270 million words in size, and it continues to grow by 3-4 million words each day. (For example, there are already 4 million words of text for yesterday, May 14). At this rate, the corpus may be 500-600 million words in size by August 2020.

Mark Davies, 05/15/2020

I decided to use this brand-new and still-growing corpus to compare the collocates of two near-synonyms: hoard and stockpile. These two words came to prominence in the early days of the COVID-19 crisis, as people around the world started panic-buying toilet paper and other survival-related items. My goal is to plot a graph of the relations between the near-synonyms and their nominal collocates.

The data consist of the top 100 nominal collocates of the verbs hoard and stockpile in the Coronavirus Corpus on May 17, 2020. The dataset is available for download below.

igraph

I will be using the igraph package because it is flexible and thoroughly documented (Kolaczyk and Csárdi 2014; Arnold and Tilton 2015). Note that other interesting packages exist: network (Butts 2008, 2015), tidygraph (Pedersen, 2019), and ggraph (Pedersen, 2020).1

First, install and load the igraph package.

rm(list=ls(all=TRUE))
install.packages("igraph")
library(igraph)

Next, two files are needed: (a) an edge list and (b) node attributes. An edge list is a list of connections between the nodes. Node attributes (or vertex attributes) list all the nodes and their properties. We load them.
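A sketch of the loading step; the file names and the object name v.attr are mine and stand for wherever you saved the downloadable dataset:

v.attr <- read.table("node_attributes.txt", header = TRUE, sep = "\t")  # hypothetical file name
e.list <- read.table("edge_list.txt", header = TRUE, sep = "\t")        # hypothetical file name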

We have two attributes: the name of each node and the number of times it is found as a collocate of hoard and stockpile (coll_freq). Hoard and stockpile are also listed in the file, along with their overall frequencies. The names will be used to label the nodes, and the frequencies to adjust node sizes.

e.list is the edge list. Each row stands for an edge in the graph. Each co-occurrence of hoard/stockpile and an object noun is signalled by an edge. The first column contains the origins of the edges (from) and the second column the end points (to).
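The graph object itself (called G below, as in the rest of the post) can then be built from the two files. A sketch:

# build an undirected graph from the edge list and attach the vertex attributes
G <- graph_from_data_frame(e.list, directed = FALSE, vertices = v.attr)
G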

Among other things, the output says that we have a graph that is undirected (U) and named (N) with 154 nodes and 202 edges. In igraph's idiom, V stands for ‘vertex’ and E for ‘edge’. We index the size of each node on collocation frequency,

v.size <- V(G)$coll_freq

and we assign a label to each node.

v.label <- V(G)$name

We also specify that mutual information is used to weight the edges of the graph.

E(G)$weight <- E(G)$MI

One crucial aspect of graphs is centrality. Graph centrality is a measure of how important a node is in the context of the entire graph. Here, it will be applied to detect the most prototypical collocations. Arguably, the three most popular measures of centrality are:

degree centrality (nodes are ranked according to the number of edges to which they are connected),

eigenvector centrality (nodes connected to important nodes are assigned a higher weight), and

betweenness centrality (nodes are ranked according to how many shortest paths between pairs of other nodes pass through them).

We shall use eigenvector centrality to spot the most influential nodes. Eigenvector centrality is calculated with the evcent() function.

eigenCent <- evcent(G)$vector

The function outputs a list. We are interested in the vector element of the list. It is a vector that assigns each node a numerical score between 0 and 1. Let us take a look at the first twenty nodes.
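A sketch of the inspection step:

head(eigenCent, 20)   # eigenvector centrality scores of the first twenty nodes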

Each bin is now assigned a color along a rainbow continuum skewed to the reds (for higher scores) and yellows (for lower scores). The color values are assigned to the color attributes of the nodes. The plotting function will shade the nodes according to their eigenvector centralities.
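The binning step itself is not shown in this excerpt. A sketch of one way to do it, assuming quantile-based bins and the base-R heat.colors() palette (which runs from red to yellow); the number of bins is illustrative:

# cut the centrality scores into bins and map each bin to a colour (red = high, yellow = low)
bins <- unique(quantile(eigenCent, seq(0, 1, length.out = 25)))
cuts <- cut(eigenCent, bins, labels = FALSE, include.lowest = TRUE)
colorVals <- rev(heat.colors(length(bins)))[cuts]
V(G)$color <- colorVals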

Next, we need to select a layout. I choose the Fruchterman-Reingold layout.

l <- layout.fruchterman.reingold(G)

A layout is an algorithm that defines the shape of the network graph. Although there are tons of layouts to choose from (see https://igraph.org/r/doc/layout_.html), I like using the Fruchterman-Reingold layout for plotting collocation networks involving near-synonyms because it captures the force-dynamics at work when attractions and repulsions are at play. This idea is summarized below:

The Fruchterman-Reingold Algorithm is a force-directed layout algorithm. The idea of a force directed layout algorithm is to consider a force between any two nodes. In this algorithm, the nodes are represented by steel rings and the edges are springs between them. The attractive force is analogous to the spring force and the repulsive force is analogous to the electrical force.
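With the layout computed, the plotting call might look like the sketch below; the exact argument values behind Fig. 3 are not shown in this excerpt:

plot(G, layout = l,
     vertex.size = log(v.size) * 2,   # node size indexed on collocation frequency
     vertex.label = v.label,
     vertex.color = colorVals,        # node colour indexed on eigenvector centrality
     vertex.label.cex = 0.7,
     edge.width = E(G)$weight / 2)    # edge thickness indexed on mutual information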

Fig. 3. A graph of hoard, stockpile, and their nominal collocates in the Coronavirus Corpus.

Among other things, the graph allows you to see that:

The nodes that correspond to hoard and stockpile are roughly the same size, an indication that they collocate with the same number of tokens given the top 100 types that were extracted from the corpus;

the shared collocates in the middle of the graph are characterized by high centrality scores and are strongly specific to the COVID-19 crisis. They belong to the following classes:

No such clear specialization emerges among the most central nouns that are specific to stockpile (sns, high-grade, sinograin, recyclables, overkill, armaments, ness). However, among the less central nouns, we find terms that are representative of the concerns that are specific to the COVID-19 crisis: respirators, PPE (personal protective equipment), antibiotics, chemicals, gowns, warehouse(s), frenzy, etc. This is not surprising given the (intentional) thematic bias of the corpus.

The division of labor between hoard and stockpile depicted above is typical of near-synonyms. We learn as much about each synonym by inspecting the collocates they have in common as by inspecting their distinctive collocates.

Tips

When you work with the igraph package, there are some things you need to know. First, for some unknown reason, if you do not switch off R between two graphs, you may end up with some strange results, even when you enter rm(list=ls(all=TRUE)).

Second, the vertex size and the labels will often be way too large. To avoid that, on top of log-modifying the sizes, enclose the plot call within a postscript graphics device as follows:
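The device call is not reproduced in this excerpt; a sketch (the output file name is hypothetical):

postscript("collocation_graph.ps", height = 60, width = 60)  # a giant virtual sheet
plot(G, layout = l,
     vertex.size = log(v.size),
     vertex.label = v.label,
     vertex.label.cex = 4,
     edge.width = 1)
dev.off()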

The above prints the graph on a giant virtual sheet (height=60, width=60). If this does not work, modify the height and the width until you reach a satisfying result. Also, do not hesitate to fiddle with the igraph arguments, especially edge.width, edge.arrow.size, and vertex.label.cex.

Third, each time you call the plot function, it will generate a graph with identical relative distances, but the overall node positions will be slightly different. To prevent this from happening, and for reproducible results, choose a number (any number) and include it as an argument of set.seed() (e.g. set.seed(17)) before generating the graph object.

As you will soon realize, plotting network graphs involves its share of heuristic manipulations. But once you have found the right combination, the result is gratifying!

ggraph is based on igraph but the former improves upon the latter with respect to visualization.

Graph theory and corpus linguistics

This short post is the first of a series on network graphs for corpus linguistics. Because of the COVID-19 pandemic, such graphs have been in the spotlight in the last few months for their ability to illustrate and explain how and how fast an infection spreads across a population. Because I am not an epidemiologist, I will merely show to what extent network graphs can prove useful to corpus linguists, starting with collocation networks.

I became interested in networks after I read Albert-László Barabási’s popular science book Linked: The New Science of Networks (2002). Because corpus linguistics is arguably the study of the distribution of words, and because I had read about the networks of words and constructions ‘in the mind’ before, I found that network graphs had a major role to play in my field.

Graph theory: where it all started

Network science originates in graph theory, which dates back to 1736, when mathematician Leonhard Euler solved a famous problem: the Seven Bridges of Königsberg. Fig. 1 is a map of Königsberg in 1736.

Fig. 1. A map of Königsberg in 1736

The Pregel River partitions the city into four parts:

Kneiphof Island (B)

3 districts (A, C, and D).

These parts are connected by 7 bridges (numbered from 1 to 7).

The problem is to devise a route around the city that would cross each of those bridges once and only once. Euler diagrammed the city in such a way that each of the four districts was a node and each bridge an edge (Fig. 2).

Fig 2. The map of Königsberg transposed as a graph.

Thanks to this graph, Euler showed that such a route does not exist. The nodes which are connected to an odd number of edges can only be starting points or end points. An uninterrupted walk from start to finish crossing all bridges can only have one starting point and one finish point. For that reason, this walk cannot be plotted on a graph with more than two nodes that are connected to an odd number of edges. Because all four nodes of the graph in Fig. 2 are connected to an odd number of edges, the route does not exist.

In 1875, a new bridge was built between A and C. This reduced the number of nodes linked to an odd number of edges to two. It became possible to solve the problem of the Seven Bridges of Königsberg. What we still do not know to this day is whether the bridge was built to allow for the problem’s resolution.

The small world of language phenomena

Arguably, the most influential offshoot of graph theory is the Erdős–Rényi random-graph model (Erdős & Rényi, 1959). In a random graph, all the nodes are equally likely to be connected by edges. Their distribution is random (Fig. 3) and likely to follow a bell curve (Fig. 4).

If you are familiar with language-related phenomena, you already know that such a random distribution is unlikely to happen because "language is never, ever, ever, random" (Kilgarriff 2005).1 Indeed, words do not occur in a sentence or a text in a random fashion.

The distribution of words in natural languages is characterized by Zipf’s law (Zipf, 1949): in a corpus of naturally occurring utterances, the frequency of any word is inversely proportional to its rank in the frequency table, following a power law. A Zipfian distribution is characterized by a large number of rare units.

By way of illustration of Zipf’s law, Fig. 5 displays the distribution of words for each of the eight text types found in the British National Corpus. The shape of each curve is Zipfian.

Fig. 5. The (Zipfian) lexical distributions for the eight text types in the BNC.

Words have been found to co-occur along the lines laid out by ‘small world’ networks (Cancho & Solé 2001). As observed by biologists, physicists, and social-network experts, and summarized by Watts & Strogatz (1998), ‘small-world’ networks have two distinctive features:

they are very dense;

the average distance between two nodes is short.

Fig. 6. shows what a typical ‘small-world’ network looks like.

Fig. 6. A small-world graph

The above has been verified with lexical behavior (Drieger, 2013). To put it simply, words interact in a dense fashion with their (close) environment.

In a future post, I will show how graph theory can be used to provide a graph-based representation of collocation networks. In more concrete terms, I will show how to plot network graphs in R using the igraph and ggplot2 packages.

Words and infections (short addendum)

In Linked, Barabási explains how a disease (just like a fad) spreads in a small-world network. How fast an infection spreads depends on the presence of ‘hubs’, i.e. members of a network whose number of connections to other nodes is well above average. In epidemiology, these are known as ‘super spreaders’ (e.g. Gaëtan Dugas in the case of AIDS in the early 1980s).

References

Cancho, Ramon Ferrer i & Ricard V. Solé (2001). "The Small World of Human Language". Proceedings of the Royal Society of London, Series B, Biological Sciences 268, 2261–2265.

Recent distributional semantic models such as word2vec rely on a bag-of-words approach. Although NLP engineers are comfortable with this, linguists find it questionable.

Towards a distributional construction grammar

In 2017, I was appointed as Junior Research Fellow to the Institut Universitaire de France for five years (2017-2022). The goal of this post is two-fold. I am now halfway through my 5-year research project, and I would like to take this opportunity to invite colleagues and prospective PhD students to collaborate on this project. Although I cannot offer fully-funded PhD positions, I can nevertheless provide substantial funding for specific missions having to do with the project. If you are interested, please send me a cover letter and a CV (prospective PhD students, read this post to the end!). The list of available work packages is appended to this post (see below).

Summary

Vector-based distributional semantics holds that words occurring within similar contexts are semantically close and that meaning can be represented by means of distributed vectors, which record lexical distribution in linguistic contexts. Vector-based models have been directed at representing words in isolation to the detriment of complex expressions. I extend word-centered vector-based models to the representation of complex constructions. I use state-of-the-art distributional semantics techniques to develop models that compute syntactically contextualized semantic representations. Following the claim that constructions acquire their meaning from their prototypical constituents, I hypothesize that the meaning of constructions is derived from the distributional preferences of their constituents.

The project (longer version)

The project description below is as I wrote it in early 2017. Some advances have been made since!

1. Multiword expressions (MWEs)

Multiword expressions (MWEs) are strings of two or more lexemes that are idiosyncratic in some respect. Such complex strings are frequent. Sag et al. (2002) estimate that 41% of the entries in WordNet 1.7 are MWEs. MWEs assume a wide range of forms such as institutionalized phrases and clichés (love conquers all, no money down), idioms (kick the bucket, sweep under the rug), fixed phrases (by and large), compound nouns (black and white film, frequent-flyer program), verb-particle constructions (eat/look/write up), light verbs (have a drink/*an eat, make/*do a mistake), named entities (San Francisco), lexical collocations (telephone box/booth/*cabin, emotional baggage/*luggage), etc. MWEs are easily mastered by native speakers. Yet, their linguistic status is still problematic and their interpretation still poses a major challenge for NLP techniques due to their heterogeneous nature.

The grammatical status of MWEs has been an issue at least since the “rules vs. the lexicon” debate (Langacker 1987; Pinker 1999; Pinker and Prince 1988; Rumelhart and McClelland 1986). Because rules capture all the regularities in language, MWEs should have no place in the grammar proper because they are lexical. Because the lexicon consists of words or morphemes, it does not include MWEs because they are phrasal. Jackendoff (1997, chapter 7) advocates the inclusion of “phrasal lexical items” (i.e. “lexical items larger than X0”) in the lexicon. An alternative, although related, solution proposed by construction grammar approaches delegates MWEs to a “constructicon” (Goldberg 2006, p. 64). Grammar consists of a large inventory of constructions, varying in size and complexity, and ranging from morphemes to fully abstract phrasal patterns (Goldberg 2003). Yet, not all constructionist theories agree as to how grammatical information is stored in the constructicon. Four taxonomic models are recognized to exist. In the full-entry model, information is stored redundantly at various levels of the taxonomy. In the usage-based model, grammatical knowledge is acquired inductively, speakers generalizing over recurring experiences of use. In the normal-inheritance model, constructions with related forms and meanings are part of the same network. In the complete inheritance model, grammatical knowledge is stored only once at the most superordinate level of the taxonomy. At this stage, these taxonomies are mostly theoretical constructs.

2. Goal #1 – model the constructicon

My first goal is to propose a corpus-based framework to test the validity of the constructicon and ultimately decide which construct is the most empirically plausible. Ideally, constructions should be detected from large corpora and assembled in a network based on their forms and their contextual meanings. This is no easy task. What distinguishes MWEs from other complex expressions is that, even though they consist of existing words with standard syntax, they are idiosyncratic at the lexical, syntactic, semantic, pragmatic, and/or collocational levels. Constructions of adjectival intensification in English are a good case in point (Desagulier 2014, 2015a,b,c). For example, quite is likely to be interpreted as a maximizer when it modifies an extreme/absolutive adjective (this novel is quite excellent) or a telic/limit/liminal adjective (quite sufficient), but it is likely to be a moderator when it modifies a scalar adjective (quite big) (Paradis 1997). Yet, context dependency is not always decisive. For example, quite is ambiguous between a maximizer and a moderator when it modifies the adjective different (Allerton 1987, p. 25). This kind of issue is what still makes the automatic, large-scale interpretation of MWEs an insuperable challenge for state-of-the-art machine learning techniques (Sag et al. 2002).

However, recent advances in deep learning and neural networks leave room for hope in the field of corpus-based models of semantic representation. These models are known as distributional semantic models (DSMs). They are computational implementations of the distributional hypothesis: semantically similar words tend to have similar contextual distributions (Harris 1954; Miller and Charles 1991). In DSMs, the meaning of a word is computed from the distribution of its co-occurring neighbors. The words are generally represented as vectors, i.e. numeric arrays that keep track of the contexts in which target terms appear in the large training corpus. The vectors are proxies for meaning representations. However, even the best distributional-vector representations are limited by their current inability to detect MWEs and represent the non-compositional meanings of phrases.

3. Goal #2 – improve existing DSMs

My second goal is therefore to combine my linguist’s expertise and my programming experience in corpus linguistics to improve existing DSMs so that they can learn better semantic representations of constructions. In the sections that follow, I address specific issues and ways of solving them.

4. Multiword constructions

I have worked extensively on MWEs as constructions, i.e. multiword constructions. In Desagulier (2014), I use techniques in quantitative corpus linguistics to cluster quite, rather, fairly, and pretty based on their statistical associations with the adjectives that they modify. In Desagulier (2015b), I extend the methodology to the study of the predeterminer vs. preadjectival alternation with respect to quite and rather. Functional similarities and differences between these four intensifiers are approximated via their selectional preferences. Co-occurrence counts are submitted to association measures, whose scores are explored thanks to exploratory multifactorial techniques such as (multiple) correspondence analysis. In these studies, I bypass the issue of non-compositionality by inferring meaning from clusters of significant intensifier-adjective collocations. Ideally, the context-dependent meaning of these constructions should be assessed directly.

In Desagulier (2015a), I focus on A as NP (thin as a rake, black as pitch, white as snow, etc.). Despite the pairing of an identifiable syntax (adjective + as + NP) and a specific reading (“very A”), Kay (2013) considers A as NP “non-constructional” and “non-productive” because (a) knowing the pairing is not enough to license and interpret existing tokens (especially when there is no obvious semantic link between the adjective and the NP, as in easy as pie), and (b) speakers cannot use the pattern freely to coin new expressions. Two idiosyncrasies further block A as NP from qualifying as a construction according to Kay. First, some expressions are motivated by a literal association between the adjective and the NP (tall as a tree, white as snow) whereas others hinge on figurative associations between A and NP, including possible puns (safe as houses), and yet others are the sign that the NP has grammaticalized to intensifying functions (jealous as hell, sure as death). Second, some expressions are compatible with a than-comparative (flat as a pancake > flatter than a pancake) whereas others are not (happy as a lark > ??happier than a lark). Such idiosyncrasies provide evidence that the construction’s tokens are not generated by a rule, which makes their automatic extraction from a corpus difficult. However, the same idiosyncrasies do not prevent A as NP from being productive and from forming a consistent network.

5. Construction networks

The idea that constructions are stored in a network fashion is present in landmark works in Construction Grammar (e.g. Goldberg 1995, p. 67). Recent advances in the application of graph theory have made it possible to plot network graphs of constructions.

A graph consists of vertices (nodes) and edges (links) whose attributes may be assigned linguistically relevant features. From a usage-based perspective, frequency is recognized to be one of the most central factors in the construction of linguistic representations. Such representations are influenced by how often speakers are exposed to language events. The more often we experience an event, the stronger its entrenchment in memory and the faster its mental accessibility. The frequency of a constituent has a correlate in the importance of the node (frequent nodes are more important than infrequent nodes). The co-occurrence frequency of at least two nodes has a correlate in the number of edges between nodes (frequent co-occurrence is visualized by means of either multiple edges or one edge whose thickness is indexed on frequency). Recent research in co-occurrence has shown that collocations may be asymmetric (Ellis 2006; Gries 2013). The same can be said about the relation that holds between form and meaning, or between one or several construction slots and the whole construction (Desagulier 2015a). Likewise, the edges of a graph may have a direction associated with them and be asymmetric.

Fig. 1 is based on the study of the A as NP construction (Desagulier 2015a).

Figure 1: A graph of asymmetric collostructions between adjectives and NPs in A as NP (adjectives are in red, NPs in blue)

It is a graph of the asymmetric collostructions between the adjective slot and the NP slot (Desagulier to appear, Section 10.7.2.2). Among other things, the graph shows that some types are somewhat fixed as per the adjectives or NPs that they collocate with (large as life, honest as the day is long, gauche as a schoolgirl, thin as a rake, taut as a bowstring, etc.) On the other hand, some other types are more productive. They are part of more complex combinatorial constellations of adjectives and NPs (towards the center part of the graph). These complex networks are based on hubs, i.e. constituents that are connected to several other nodes, e.g. the adjectives white, cold, clear or smooth or the NP hell. We can also see that some attractive constituents are themselves attracted by other constituents. For example, the adjective sure attracts the NPs death and night follows day. At the same time, it is attracted by the NP hell.

One problem with the above network is that it is a post-hoc visualization based on observed co-occurrence counts. As such, it has no predictive value. Another problem is that it contains no semantic information. This information is inferred from the linguist’s interpretation. Ideally, the counts should be weighted for context informativeness. In concrete terms, the adjectives and the NPs should be semantically annotated, as well as their specific combinations. This is where semantic vectors provide an added value.

6. Distributional semantics models

DSMs are not new to linguists, at least on the NLP side (Baroni et al. 2014; Padó and Lapata 2007). Vector space models of word co-occurrence have been applied to tasks such as synonymy detection, concept categorization, verb selectional preferences, argument alternations, etc. What is new is their dramatic improvements thanks to deep learning and neural networks.

6.1. Principles and applications

Two methods have proved very successful in learning high-quality vector representations of words from large corpora: word2vec (Mikolov, Chen, et al. 2013; Mikolov, Yih, et al. 2013) and GloVe (Pennington et al. 2014). Based on neural networks, they (a) learn word embeddings that capture the semantics of words by incorporating both local and global corpus context, and (b) account for homonymy and polysemy by learning multiple embeddings per word. Once trained on a very large corpus, these algorithms produce distributed representations for words in the form of vectors.

As a principal investigator to a Partenariat Hubert Curien (#32168XF, 2014–2015) between Paris Nanterre University and Ajou University (Suwon, Korea), I trained GloVe on the British National Corpus to detect lexical proximities in the context of sentiment analysis (Desagulier 2016). Fig. 2 shows the 10 nearest neighbors of the adjective ironic based on cosine distance as a proximity measure between vectors.

Figure 2: Nearest neighbors to ironic in the BNC

These neighbors include synonyms and antonyms across several categories (adjectives, adverbs, nouns, and verbs). Although the corpus is relatively modest in size (100 million word tokens) and the matrix of vectors is small (50 dimensions), the vectors are surprisingly consistent, implying that the “meaning” of ironic has been captured satisfactorily.
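The nearest-neighbour extraction is not detailed in this excerpt. A sketch of how it can be done in R, assuming the trained embeddings are stored in a matrix called vectors with one row per word (the row names being the words):

# cosine similarity between two vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
target <- vectors["ironic", ]
sims <- apply(vectors, 1, cosine, b = target)
head(sort(sims, decreasing = TRUE), 11)   # the word itself plus its 10 nearest neighbours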

As a follow-up to Chambaz and Desagulier (2015) and Desagulier (2015b), I used GloVe to tag the adjectives intensified by quite and rather in the BNC. Because at the time I did not have access to a server that was powerful enough to train the algorithm on a very large corpus, I extracted the vectors corresponding to the adjectives in my data frame from a set of vectors pre-trained with GloVe on the Common Crawl database (http://commoncrawl.org/). Table 1 is a snapshot of the resulting data frame.

Figure 3: A 3D visualization of the nearest neighbors of indispensable based on GloVe and Common Crawl

6.2. Remaining challenges

The first challenge is the detection of multiword constructions. Suppose you investigate quite and rather constructions. The typical solution is to treat the MWE as “words-with-spaces” (Sag et al. 2002) and concatenate the words in the syntactic pattern in which they are found: e.g. quite/rather_a vs. a_quite/rather. Then, you determine the vector profile of the whole phrase. Although this might work for intensifiers, it will fail to detect light verbs (have a drink, have a go, *have an eat, etc.) for example. The erratic selectional preferences of light-verb constructions cause a lexical proliferation problem in their detection by dramatically skewing the ratio towards recall to the detriment of precision. To accommodate phrases in a vector-space model, Mikolov, Sutskever, et al. (2013) propose a detection technique that consists in subsampling frequent words. For example, closed-class words such as determiners and prepositions easily occur millions of times in any large corpus. Such words are generally considered meaningless with respect to rarer open-class words. This subsampling technique should be handled with care so as not to discard the closed-class words that are often part of idiomatic constructions (e.g. at in congressman/editor at large, or the in kick the bucket).

The second challenge has to do with context, which, even in recent word-vector algorithms, is defined as a small window of words surrounding the target word. It is assumed that all context words contribute to the target word, irrespective of syntax and long-distance dependencies. However, the assumption that contextual information contributes indiscriminately to the meaning of a phrase is linguistically limited. To enrich vector-based models with morpho-syntactic information, I suggest handling the syntactic templates of multiword constructions by first targeting supervised learning on a thesaurus of pre-identified constructions.

Once a satisfactory operationalization of context has been found, and providing context resolves ambiguities, there remains a third, most important issue: (non-)compositionality (Padó and Lapata 2007). Let W1 and W2 (e.g. red and tape) be the two lexical constituents of a nominal compound N (red tape). The syntax-dependent composition function yielding a nominal compound, adapted from Mitchell and Lapata (2010) and Dinu and Baroni (2014), should be:

\vec{N} = f_{comp} (\vec{w_1}, \vec{w_2}),

where \vec{w_1} and \vec{w_2} are the vector representations associated with W_1 and W_2.

Dinu and Baroni (2014) and Mikolov, Sutskever, et al. (2013) have found that composition can be defined as the application of linear transformations to the two constituents by summing up their respective vectors:

f_{comp} (\vec{w_1}, \vec{w_2}) = \vec{w_1} + \vec{w_2}.

I intend to test the above formula by applying it to constructions. The syntax-dependent composition function yielding a multiword construction \vec{C} becomes:

\vec{C} = \vec{c_1} + \vec{c_2},

where \vec{c_1} and \vec{c_2} are the vector representations associated with two constituents of C.

Of course, I do not believe that the issue of (non-)compositionality can be resolved by one equation. A significant share of the project will be dedicated to joint seminars with mathematicians and linguists on the best way to handle the problematic capture of the semantic subtleties of multiword constructions, notably in one of the monthly seminars at my lab.

7. Interdisciplinary goals

My project is interdisciplinary. It combines expertise in linguistics, mathematics, and computational engineering. It has applications in a wide variety of fields such as theoretical linguistics, lexicography, the digital humanities (especially text mining), machine translation, and data analysis. After benchmarking DSMs on a set of pre-identified constructions (specifically intensifying constructions), the models will be applied to detect more complex, yet unseen constructions (including but not limited to light-verb constructions and argument-structure constructions).

Want to join me?

Let me know if you are willing to work with me on one of the following work packages. I will update the list regularly.

a. R packages

I am working on two R packages. The first package, constR2vec, is an R interface to the detection and vectorization of multiword constructions from a corpus. The second package, constRucticon, is meant to make network graphs of multiword units based on frequency counts, association measures, and vectors. The goal is to make the two packages work together.

b. Visualization

I plan to visualize construction networks by means of the tools from graph theory. The data consist of edge lists, vertex attributes, and word vectors.

c. Supervised vector estimation

The first step consists in framing the vector estimation problem as a supervised task. This is done by targeting the machine learning on repositories of pre-identified constructions such as those proposed by Pattern Grammar (Francis et al. 1996; Hunston and Francis 2000).

d. Unsupervised vector exploration

Once vector-based machine learning on a database of pre-identified multiword constructions has proved satisfactory, the methodology can be applied to detect these constructions in very large corpora of English. This second step is unsupervised.

The algorithm that I intend to write in R will use the outcome of supervised training as a basis for construction detection. Whether a new multiword sequence counts as a construction will be decided thanks to a semi-parametric method from biostatistics known as targeted learning (van der Laan and Rose 2011).

e. Supervisions

If you work on any of the above, or any related topic, I will be happy to consider your application for a PhD supervision (including a joint supervision).

Desagulier, Guillaume (2014). “Visualizing distances in a set of near synonyms: rather, quite, fairly, and pretty.” In: Corpus Methods for Semantics: Quantitative Studies in Polysemy and Synonymy. Ed. by Dylan Glynn and Justyna Robinson. Amsterdam: John Benjamins.

Mapping lexical variation with Tableau software

This post is a short introduction to plotting choropleth maps with Tableau, a commercial data visualization software. I show how to plot such maps using the BBC Voices dataset.

Tableau software

In a previous post, I described how to plot data from the BNC 2014 with R. As is often the case with R, the procedure is definitely not beginner-friendly, but at least one gets to plot exactly what one wants exactly the way one wants.

Tableau is a commercial data visualization interface,1 not a programming interface. This means that if you are used to shaping data in R, you are likely to feel puzzled at first because its internal logic is different.

If you are allergic to R programming and want to plot elegant maps without going through the trouble of climbing a steep learning curve, Tableau is the way to go. Within minutes, one goes from loading a dataset to visualizing the results, and the palette of graphics is comprehensive.

The great asset of Tableau is its mapping functionality. It can plot latitude and longitude coordinates and it handles most kinds of GIS-compatible spatial files (Esri Shapefiles, Keyhole Markup Language, and GeoJSON).

The BBC Voices data

Sadly, although the BNC 2014 contains information about speakers’ geographic origins, not all UK counties are represented, which means that the corpus is not suitable for a comprehensive study of linguistic variation.

We turn to another dataset: the BBC Voices survey dataset (more specifically the RADAR 1 part), which is based on the BBC Voices project. Access to RADAR 1 is restricted to the readers of Upton and Davies (2013). I do not own the rights and I cannot share the data.

The BBC Voices dataset is based on a survey in which about 734,000 responses from about 84,000 informants to 38 open-ended questions were collected. Each question is meant to elicit the variants of a lexical alternation. For example, the question what do you call a young person in cheap trendy clothes and jewellery? is intended to elicit such responses as chav, asbo, townie, scally, or ned. The table below shows a portion of the alternations and variants present in the BBC Voices survey dataset.2

Alternations and their variants:

To play truant: Skive, bunk, wag, play hookey, skip
Hit hard: Whack, smack, thump, wallop, belt
Drunk: Pissed, wasted
Pregnant: Up the duff, pregnant, bun in the oven, expecting
Left-handed: Cack-handed, left, cag-handed
Grandmother: Nanny, granny, grandma
Grandfather: Grandad, grandpa, grampa, pop
Young person in cheap trendy clothes and jewellery: Chav, townie, scally, ned
Trousers: Trousers, pants, keks, jeans, trews
Long soft seat in the main room: Sofa, settee, couch
Toilet: Loo, bog, toilet, lavatory
To rain lightly: Drizzle, spit, shower
To rain heavily: Pour, piss, chuck, bucket
Running water smaller than a river: Stream, brook, burn, beck

I am using the dataset as adapted by Grieve et al. (2019), which Jack Grieve has kindly agreed to share with me. Here is a snapshot (a random selection of 20 observations).

REGION (postcode area) | variant (alternation) | score (%)

BR | path (Narrow walkway alongside buildings) | 10.00
DH | well_off (Rich) | 9.52
LA | keks (Trousers) | 26.32
CB | nap (Sleep) | 21.83
HA | trews (Trousers) | 9.09
MK | baby (Baby) | 49.49
HX | pumps (Child’s soft shoes worn for PE) | 75.00
NG | expecting (Pregnant) | 9.25
L | chilly (Cold) | 12.38
BT | grandpa (Grandfather) | 8.31
WF | trainers (Child’s soft shoes worn for PE) | 10.26
NR | clobber (Clothes) | 17.91
AB | whack (Hit hard) | 12.98
BA | expecting (Pregnant) | 10.48
NP | nippy (Cold) | 8.11
NP | scally (Young person in cheap trendy clothes…) | 4.76
G | shattered (Tired) | 25.63
LA | fit (Attractive) | 62.07
TS | poorly (Unwell) | 49.47
SA | baby (Baby) | 51.76

Originally, the BBC Voices dataset provides the percentage of informants (third column) in 124 UK postcode areas (first column) who supplied each variant (second column). The percentages are based on the complete set of variants. In Grieve et al. (2019), the percentages were recalculated for each variant in each postcode area based only on the variants kept in the analysis.

Loading the data

The original data file is a comma-separated text file. It has one column for REGION, and one column per word (only the first 5 words are displayed below).

This is what Tableau looks like when no data has been loaded yet. To load data, click on Data Source in the lower left corner.

A blank Tableau worksheet

Alternatively, you may want to start from the home screen (which the Data Source link will take you to).

The Tableau home screen

Click on Text file and select the desired data file in the interactive window.

The data have been successfully imported

In the REGION column, click on ABC…

Then Geographic Role > UK Postcode Area

You can see that the REGION icon has changed. Tableau can now use the column data to retrieve longitudes and latitudes.

You may now click on Go to Worksheet. This takes you to a worksheet where you can edit the data.

On the left, drag Word from the Dimensions environment to the Columns environment. Drag Longitude (generated) from the Measures environment to the same Columns environment. Drag Latitude (generated) to the Rows environment. Drag Score from the Measures environment to the Marks environment. Do the same for Region from Dimensions. Right now, this is what you should obtain.

Drag SUM(Score) to the Color box and Region to the Label box. This will create a set of choropleth maps (one for each word) with the postcode tags.

Finally, click on the arrow on the Word tab and activate Show Filter.

The filter feature allows the user to select which words should be plotted, knowing that one map per word is plotted. A window listing all the words appears to the right.

Suppose we want to compare the distributions of ned and chav. All we have to do is unselect all the words (leave All unchecked) and select the two words that we are interested in. Two maps are plotted: one for chav (left) and one for ned (right). The darker the shade of blue, the more specific a word is to an area. It appears that ned is specific to Scotland. Chav is used mostly in England and Wales.

The video below provides a recap of the above and shows how to fine-tune the maps.

Validating clusters in hierarchical cluster analysis

In a previous post, I showed how to run HCA with the base-R hclust() function. Here, I introduce a package whose benefit is to provide a way of validating clusters: pvclust. This package allows the user to include confidence estimates through multiscale bootstrap resampling.

The motivation for this post is a comment I received after I advertised hclust() on Twitter.

I think we need to be very careful about validating clusters. Like HCA will always return clusters even when there are none, at which point it’s essentially making arbitrary partitions of the data where highly similar observations are classified into different clusters.

Admittedly, HCA finds clusters even when we expect there to be none. Another related issue, which I have come up against many times, is making sense of clusters that are at odds with the intuition that the research question builds upon. In other words, you expect certain clusters to appear, but other clusters appear instead, and they don’t seem to make sense.

So, HCA is designed to find clusters, and it will find some no matter what. The researcher should at least be allowed to decide whether these clusters are valid, based on some metric.

This possibility is implemented in the pvclust() function from the eponymous package. I used it in Desagulier (2014). It should be noted that pvclust includes hclust(). The former augments the latter with p-values.

pvclust provides p-values for hierarchical clustering based on multiscale bootstrap resampling. Let us see how this works. We load the same data set as the one we used here (prepositions) and we select the Brown corpus data.
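The loading and fitting steps are not reproduced in this excerpt. A minimal sketch, assuming the prepositions data set has been loaded into a data frame called data with text categories as columns; the distance and agglomeration methods below are illustrative choices, not necessarily the ones used for the figure:

install.packages("pvclust")
library(pvclust)
set.seed(17)   # bootstrap resampling is random
fit <- pvclust(data, method.hclust = "ward.D2",
               method.dist = "canberra", nboot = 1000)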

The fit object is in fact a list that contains lots of numeric information (enter str(fit)).

It is now time to plot the dendrogram with plot().

plot(fit)

A cluster dendrogram of text categories in the Brown corpus based on the distribution of prepositions with pvclust

The plot should be read from bottom to top. There are three numbers around each node. The number below each node specifies the rank of the cluster (here, from 1 to 13, i.e. from the 1st generated cluster at the bottom to the 13th at the top). The two numbers above each node indicate two types of p-values, which are calculated via two different bootstrapping algorithms: AU and BP.1

The number on the left indicates an ‘approximately unbiased’ p-value (AU) and is computed by multiscale bootstrap resampling. The number on the right indicates a ‘bootstrap probability’ p-value (BP) and is computed by normal bootstrap resampling. The number on the left is a much better assessment of how strongly the cluster is supported by the data.

In either case, the closer the number is to 100 (i.e. the closer the p-value is to 1), the more valid the cluster. For example, an AU p-value of, say, 90 implies that the hypothesis that the cluster is invalid is rejected with a significance level of 0.1.

Here, we see that not all clusters represent the data fairly accurately. Indeed, the mean AU score is 76.93 (sd = 13.5). The standard deviation is relatively high because AU p-values range from 54 to 100. We want to select only those clusters that are valid.
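For reference, a sketch of how such summary figures can be computed; this assumes fit is the pvclust object created above, whose edges component stores the AU values as proportions (hence the multiplication by 100):

round(mean(fit$edges$au) * 100, 2)   # mean AU score
round(sd(fit$edges$au) * 100, 2)     # standard deviation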

Like rect.hclust(), pvclust allows the user to group clusters into user-defined classes. This is done with the pvrect() function.

The code below finds clusters with AU p-values (pv="au") greater than or equal to (type="geq") the threshold given by the alpha argument (here alpha=.80) and draws red rectangles around the branches that meet the condition.

pvrect(fit, alpha=.80, pv="au", type="geq")

A cluster dendrogram of text categories in the Brown corpus based on the distribution of prepositions with pvclust (alpha greater than or equal to 0.8)

Three clusters meet the condition. The issue that HCA will find clusters no matter what remains (other clustering methods such as correspondence analysis do not suffer from this shortcoming), but at least the user can select a level of significance above which the clusters can be taken into consideration.

This post provides an introduction to doing regional dialectology in the UK with R. More specifically, I focus on mapping lexical variables from the spoken component of the British National Corpus 2014. The goal is to see if we observe patterns of regional variation with respect to pre-identified lexical alternations.

I was inspired by two colleagues: Jack Grieve and Mathieu Avanzi. Jack is Professor of Corpus Linguistics in the Department of English Language and Linguistics at Birmingham University. I was happy to meet him in person at the Corpus Linguistics Summer School 2019, where I taught a course on exploratory statistics. Jack pointed me to a tutorial he wrote on mapping regional variation in the US based on Twitter data. Mathieu is Associate Professor in (French) Linguistics at Paris Sorbonne University. In France, he has made a name for himself with Le Français de nos Régions, a large-scale project aimed at visualising regional variation from lexical and phonetic variables collected via online surveys.

My goal is somewhat different because I do not use "live" data from online surveys or social networks. For the purposes of a course on corpus-based sociolinguistics, I want to plot lexical alternations on a map of the United Kingdom based on data from a corpus of spoken English.

Choropleth maps

The literature I have read on the topic has convinced me that I should go for a choropleth map.1 The map below displays the level of education for each "département". White denotes the highest education level, black the lowest education level, and shades of gray intermediate levels. Colors are indexed on a measurement.

Like a heatmap, a choropleth map visualizes the variation of a measurement across a geographic area. Unlike a heatmap, it displays measurements within pre-assigned geographic boundaries.2

One argument in favor of choropleths is their ability to visualize data in a simple way within easily recognizable geographic entities. One argument against them is that these geographic entities are coarse-grained and somewhat artificial.

Shapefiles

The first step is to obtain a shapefile. Technically speaking, a shapefile is a collection of files that allows a GIS (Geographic Information System) to store and display data related to positions on the surface of the Earth.

The next step is to decide upon what resolution you want (cities, counties, districts, etc.). This decision depends on (a) what you consider relevant for the purpose of your study, and (b) the geographic breakdown of the country of interest. Most shapefiles for the UK break down into five levels:

NUTS 1 (regions),

NUTS 2 (counties, but see below),

NUTS 3 (districts),

LAU Level 1 (local authority districts),

LAU Level 2 (local authority wards).

For the purpose of this tutorial, we shall be working at the level of counties (NUTS 2).3 To get the corresponding shapefile, visit the Open Geography portal from the Office for National Statistics, click on download, and choose "shapefile" (the folder occupies 20.6 MB on your disk). Once the download is over, unzip the folder.

R Packages

Before proceeding further, install the required packages.

rm(list=ls(all=TRUE))
library(rgdal)
library(ggplot2)
library(dplyr)

The rgdal package is used to load and process the shapefile. The ggplot2 package is used for producing the map. The dplyr package is used for data manipulation.

Loading the shapefile

To load the shapefile, use the readOGR() function from the rgdal package. The dsn argument should be the path to the shapefile folder. The layer argument collects the relevant files in the folder.
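A sketch of the call; the dsn path depends on where you unzipped the folder, and the layer name below is hypothetical (use the name of the .shp file in your download):

uk.nuts2.shp <- readOGR(dsn = "path/to/shapefile/folder",
                        layer = "NUTS_Level_2_January_2018_Boundaries")  # hypothetical layer name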

Convert the whole thing into a data frame. On this occasion, we use fortify(). We specify the desired level of granularity with region = "nuts218nm". Mind you, this line of code will keep R busy for quite some time!

uk.nuts2.shp.df <- fortify(uk.nuts2.shp, region = "nuts218nm")

Inspect the data frame.

head(uk.nuts2.shp.df)

We are ready to plot a map of the UK with ggplot2. First, run the ggplot() function and save the map.
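A minimal sketch of that step; uk.map is a name I introduce here, and the long, lat, and group columns come from the fortified data frame:

uk.map <- ggplot(uk.nuts2.shp.df, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", colour = "grey50", size = 0.1) +
  coord_equal() +
  theme_void()
uk.map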

Counties or regions?

The NUTS-2 divisions are the following:

Bedfordshire and Hertfordshire
Berkshire, Buckinghamshire and Oxfordshire
Cheshire
Cornwall and Isles of Scilly
Cumbria
Derbyshire and Nottinghamshire
Devon
Dorset and Somerset
East Anglia
East Wales
East Yorkshire and Northern Lincolnshire
Eastern Scotland
Essex
Gloucestershire, Wiltshire and Bath/Bristol area
Greater Manchester
Hampshire and Isle of Wight
Herefordshire, Worcestershire and Warwickshire
Highlands and Islands
Inner London – East
Inner London – West
Kent
Lancashire
Leicestershire, Rutland and Northamptonshire
Lincolnshire
Merseyside
North Eastern Scotland
North Yorkshire
Northern Ireland
Northumberland and Tyne and Wear
Outer London – East and North East
Outer London – South
Outer London – West and North West
Shropshire and Staffordshire
South Yorkshire
Southern Scotland
Surrey, East and West Sussex
Tees Valley and Durham
West Central Scotland
West Midlands
West Wales
West Yorkshire

In R, the above list can be accessed by entering:

levels(as.factor(uk.nuts2.shp.df$id))

In NUTS-2, some counties are grouped, which can be misleading. For example, Cambridgeshire is a county, but it does not appear in the list. You have to know that it has been included as part of East Anglia, which is a region.

It is time to populate the map with sociolinguistic data.

Data

To replicate one of the case studies found in a 2017 paper by Grieve, Montgomery, Nini, and Guo on lexical variation and social media in British English, I extracted all instances of sofa, couch, and settee produced by speakers located in the UK from the BNC 2014. I used my BNC.2014.query() script, described in a previous post, to harvest the data.

The dataset is available from my Nakala repository. To load it into R, enter:

We select the two columns that we need in data: data$word and data$city.

data.2 <- data[,c(2,5)]

We merge city.counties and data.2 using city as the common column.

data.counties <- merge(data.2, city.counties, by="city")

I have used this data frame (more specifically the columns word and id) to perform a multinomial test. The purpose of this multinomial test is to see which words are specific to which counties. Run the next line of code to load the output of the test.

The measurements are the log-transformed p-values of the associations between each word (couch, settee, and sofa) and NUTS-2 divisions. Positive values indicate attraction and negative values indicate repulsion. For example, we see that settee is distinctive of Bedfordshire and Hertfordshire (log-transformed p-value = 0.67), whereas sofa is not (log-transformed p-value = -0.51). We might say that sofa is "anti-distinctive" of Bedfordshire and Hertfordshire. Nothing much can be said about couch with respect to Bedfordshire and Hertfordshire as the log-transformed p-value is close to 0.

One issue that we need to address now is this: not all NUTS-2 divisions are illustrated in the data. This is bound to be a problem when we plot the map as some parts of the UK are going to be absent. The map will look strange.

We load the full list of NUTS-2 divisions.

districts <- read.csv("https://nakala.fr/nakala/data/11280/b2ed82ae")

We join distinction and districts with the full_join() function from the dplyr package.

full.join <- full_join(distinction, districts, by="id")

Although there are no measurements for the counties at the bottom of the table (see all the NAs), this will guarantee that they are plotted on the map. It may seem strange that all three words are attested (and unattested) in exactly the same counties. This artificial effect is due to the kind of multinomial test that I have run.

It is now time to combine the shapefile and the measurements. Remember: the shapefile is heavy. This process will take some time.

df.for.map <- merge(uk.nuts2.shp.df, full.join, by="id", all=T)

Plotting the maps

We now have a file that contains measurements for all three words. Let us start by plotting a map for couch. First, we create the map object.
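What follows is a minimal sketch rather than the original code: it assumes that df.for.map is a fortified shapefile data frame with the usual long, lat, and group columns, and that the log-transformed p-values for couch are stored in a column named couch (these column names are assumptions).

library(ggplot2)
map.couch <- ggplot(df.for.map, aes(x = long, y = lat, group = group, fill = couch)) +
  geom_polygon(colour = "grey60", size = 0.1) +
  coord_equal() +
  scale_fill_gradient2(low = "darkred", mid = "white", high = "darkblue", midpoint = 0,
                       na.value = "grey90", name = "log-transformed p-value") +
  theme_void() +
  ggtitle("couch")
map.couch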

Discussion

results

It is hard to observe distinctive patterns of variation from these maps. Some local tendencies emerge, however. Kent and East Anglia display opposite preferences and dispreferences with respect to settee and sofa. The preference of Cheshire speakers for couch is all the more noteworthy as the neighbouring divisions display a dispreference.

measurement

The question of which measurement to include is worth addressing. Map experts do not consider raw frequencies a viable option; minimally, they should be transformed into percentages. I have used log-transformed p-values from a test that selects only those divisions where the three words are used.

data

The data do not cover all NUTS-2 divisions. Comparatively, maps made from Twitter data rely on huge datasets. They are therefore more comprehensive, geographically speaking.

The BNC 2014 is much smaller, and it was not designed for sociolinguistic analysis at this level of geographical specificity.

Dr Robbie Love is part of the research team that was responsible for the compilation of the Spoken BNC 2014. I asked him what level of geographic specificity he recommended. He replied:

The categories that are provided in the metadata are as follows – level 4 is as specific as I think you could reasonably go, although as you know there’s hardly any data outside of the English regions pic.twitter.com/G21ghwby3V

Judging from the above table, maps made from corpora like the BNC 2014 are bound to be partial and skewed.

As Jack Grieve puts it:

No level will fix that though. Couch is Scotland. Settee is Midland/North. Sofa general and esp. South. They’re all hitting round east anglia. Not criticising BNC in general, but tricky for regional dialectology imo.

In my opinion, the bias comes from the fact that not all areas are represented, and from the uneven number of contributions per area. In all honesty, I expected this, but I thought it would be nice to give it a try. In a future post, I will illustrate choropleths with more robust datasets.

What is it about?

The paper is summed up in the abstract. See the front page below.

The front page

I was keen to embark on NLP technology while remaining faithful to my corpus linguist’s spirit. My curiosity was piqued by the big buzz around deep neural networks and the promise they held in terms of image and sound recognition in linguistics. In the media, deep neural networks come across as tools primarily geared towards killing the spirit of games such as chess, poker, or go, but there is so much more that they can do.

The project started with me wondering how word2vec and GloVe could be put to use in the kinds of semantic-annotation tasks required in corpus linguistics. The kind of task that I’m talking about is the following. I started with a large data set like this one (except the actual data set was way larger):

A dataset ready for annotation

My goal was to annotate the adjectives to see whether the intensifiers quite and rather have distinctive semantic preferences with respect to the kinds of adjectives that they modify. Option 1 is manual annotation. You get the best results because a human annotator is sensitive to context and polysemy, and is therefore very good at disambiguating. But it is excruciatingly slow. Option 2 is automatic annotation with a semantic tagger such as USAS. Now, unless taggers are probabilistic and well trained, they are infamous for assigning incorrect tags to highly polysemous items (take a look at the meanings of « hot » in the above table to assess the difficulty of the task at hand).

Just like any other distributional-semantic model (DSM), word2vec and GloVe take a text corpus as input and output one vector for each word found in the corpus based on the context where each word appears. Unlike traditional DSMs, these methods are said to be more powerful because they are inspired by deep neural networks.

SGNS and GloVe are considered neural, prediction-based embeddings. All of these methods are essentially bag-of-words models, in which the representation of each word reflects a weighted bag of context words that co-occur with it (weighting and context are two important concepts here).1

I often read that word2vec is an example of deep learning, but this is not the case. It is a two-layer (therefore shallow) neural net. The details of word2vec/GloVe implementations are in the paper. Note that I focused on GloVe because I found it more intuitive and less suspicious than word2vec at the time. In fact, in both cases, the underlying computations are hidden.

Writing about fast-changing technology

Here are a few elements of context regarding my paper. I wrote it three years ago, which, in the field of word vectors, is equivalent to a century in real life. When addressing the reviewers’ comments, I must admit that I left the draft largely untouched. There are two reasons for this. First, I knew that in such a fast-changing field as neural networks, no matter how many updates I made, the paper would unavoidably miss the latest developments in neural word embeddings by the time it was published. Heraclitus once said « No one ever steps in the same river twice ». Well, this is what it feels like for me to work on neural-network-flavored word vectors. You know you’re working on a clearly identified field with clearly identified algorithms, but new findings on how best to tune the parameters of said algorithms appear monthly (not to say daily).

Second, I wanted the paper to be faithful to my state of mind as a linguist when I embarked on neural networks and their applications to meaning capture in corpora. My opinion on word2vec and GloVe has definitely changed in the last three years, but the doubts I have with respect to their paradigm-changing aspirations remain.

Shared concerns about neural networks

Echoing my doubts, recent research on distributional semantics and machine learning tends to show that state-of-the-art deep learning techniques do not necessarily perform better than older alternatives (Dacrema et al., 2019). Levy et al. (2015) compare older methods such as Positive Pointwise Mutual Information (PPMI) and Singular Value Decomposition (SVD) to word2vec (in fact SGNS: skip-gram with negative-sampling), and GloVe. They find that performance depends on the task and how the hyperparameters are tuned. Tuning these hyperparameters is not straightforward.
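To make the contrast with the older, count-based approach concrete, here is a minimal R sketch of PPMI weighting followed by SVD on a toy co-occurrence matrix. The counts are invented, and this is only a schematic illustration, not the pipeline used in the paper or in Levy et al.’s experiments.

# Toy co-occurrence counts (rows = target words, columns = context words); the figures are invented.
m <- matrix(c(10, 0, 2,
              3, 5, 0,
              0, 1, 8),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("hot", "cold", "warm"), c("coffee", "weather", "bath")))
p.ij <- m / sum(m)                    # joint probabilities
p.i  <- rowSums(m) / sum(m)           # marginal probabilities of the target words
p.j  <- colSums(m) / sum(m)           # marginal probabilities of the context words
pmi  <- log2(p.ij / outer(p.i, p.j))  # pointwise mutual information
ppmi <- pmax(pmi, 0)                  # positive PMI: negative (and -Inf) values floored at 0
svd.out <- svd(ppmi)                  # SVD yields dense, low-dimensional word vectors
word.vectors <- svd.out$u %*% diag(svd.out$d)
rownames(word.vectors) <- rownames(m)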

About SGNS, they write:

« SGNS is a robust baseline. While it might not be the best method for every task, it does not significantly underperform in any scenario. Moreover, SGNS is the fastest method to train, and cheapest (by far) in terms of disk space and memory consumption. » (p. 222)

This suggests that word2vec is good, but not as revolutionary as anticipated when it came out. Earlier in their paper, they write:

It is commonly believed that modern prediction-based embeddings perform better than traditional count-based methods. This claim was recently supported by a series of systematic evaluations by Baroni et al. (2014). However, our results suggest a different trend. (…) in word similarity tasks, the average score of SGNS is actually lower than SVD’s when win = 2, 5, and it never outperforms SVD by more than 1.7 points in those cases. In Google’s analogies SGNS and GloVe indeed perform better than PPMI, but only by a margin of 3.7 points (compare PPMI with win= 2 and SGNS with win= 5). MSR’s analogy dataset is the only case where SGNS and GloVe substantially outperform PPMI and SVD. Overall, there does not seem to be a consistent significant advantage to one approach over the other, thus refuting the claim that prediction-based methods are superior to count-based approaches. (p. 220)

According to Levy et al., GloVe, which I assumed would perform best because of its greater flexibility, does not fare better than SGNS in their experiments.

Downloading the paper

I am allowed to share 50 free online e-copies of this article with friends and colleagues via this link. After 50 downloads, the link will expire. The offer is therefore available while stocks last. If the link does not work anymore, a draft version is available for download from my HAL-SHS repository. Just click here to download a pdf copy.

]]>https://corpling.hypotheses.org/2682/feed0Clustering corpus data with hierarchical cluster analysishttps://corpling.hypotheses.org/2622
https://corpling.hypotheses.org/2622#respondMon, 17 Jun 2019 12:53:26 +0000https://corpling.hypotheses.org/?p=2622Hierarchical cluster analysis (HCA) belongs to the family of multifactorial exploratory approaches. What it does is cluster individuals based on the distance between them. I illustrate HCA with the preposition data set described here.

Hierarchical Cluster Analysis

HCA comes in two flavors: agglomerative (or ascending) and divisive (or descending). Agglomerative clustering fuses the individuals into groups, whereas divisive clustering separates the individuals into finer groups. What these two methods have in common is that they allow the researcher to find an optimal number of clusters to help explore a given data set. For reasons of space, and also because it is far more popular, I focus on agglomerative HCA.

HCA takes as input a table T that consists of i individuals (rows) and j variables (columns). The table can be a count matrix, a table of real numbers (with decimals), or a table containing both integers and real numbers. The table is converted into a distance matrix.1 The distance matrix is then amalgamated in such a way that the individuals in the distance object are merged into clusters. The analysis starts with each individual in its own cluster (represented by an uppercase letter) and then combines individuals progressively into larger clusters until a final stage where all individuals are merged into a single group. This stepwise process is represented graphically in the form of a tree-like diagram, also known as a dendrogram.

In Fig. 1, the individuals are represented by uppercase letters. The plot should be read from bottom to top. The further up you go, the larger the clusters. For this reason, the method is called ascending/agglomerative hierarchical cluster analysis.

Fig. 1 A generic dendrogram

HCA is available from several R functions and packages: hclust() in base R, agnes() and diana() in the cluster package, and pvclust. In this section, I show how to use hclust() because it is part of base R.2

Text categories and prepositions in a corpus of US English

To illustrate HCA, we return to the preposition data set used here previously. This time, we focus on prepositions in the Brown corpus. Our goal is to cluster the fifteen text categories based on the prepositions that appear in each of them. We ignore the lengths of the prepositions.

Creating a distance matrix

Step 1 is done with the dist() function. Minimally, its main argument is the input matrix.

dist.mat<-dist(mat)

By default, the distance measure that is used to generate the distance matrix is the Euclidean metric. It is the simplest and most commonly used measure. For two individuals, the Euclidean distance is the square root of the sum of the squared differences between the pairs of corresponding values (Divjak and Fieller 2014, 417).
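Concretely, for two individuals x and y, this is \sqrt{\sum_j (x_j - y_j)^2}, which is easy to verify on a toy example:

x <- c(2, 5, 1)
y <- c(4, 1, 3)
sqrt(sum((x - y)^2))   # Euclidean distance computed by hand: 4.898979
dist(rbind(x, y))      # dist() returns the same value by default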

Amalgamating the clusters

Step 2 is done with the hclust() function, which clusters individuals in the distance matrix by means of an agglomeration method.

clusters <- hclust(dist.mat)

By default, the agglomeration method is known as complete linkage: the distance between two clusters is defined as the greatest distance between a member of a cluster and a member of the other cluster (Everitt et al. 2011, 76).

Plotting the dendrogram

It is now time to plot the dendrogram with plot(). By specifying a negative value for the hang argument (hang = -1), the labels hang down from 0 and are neatly aligned (Fig. 2). The Height axis corresponds to the distance at which each fusion is observed.

plot(clusters, hang = -1)

Fig. 2 A cluster dendrogram of text categories in the Brown corpus based on the distribution of prepositions

Choosing the right measure

With HCA, one issue has to do with the choice of an appropriate distance measure and an appropriate amalgamation method. The default combination with dist() and hclust() is Euclidean–complete.

However, dist() offers five other distance measures (maximum, manhattan, canberra, binary, and minkowski) and hclust() seven other amalgamation methods (ward.D, ward.D2, single, average, mcquitty, median, and centroid). The choice of one measure over another has an impact on the shape of the dendrogram, as evidenced in Fig. 3, which compares all six distance measures with Ward’s agglomeration method (Ward, 1963). As you can see, belles_lettres and learned_scientific are part of the same immediate cluster with all distance measures except binary.
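A comparison along these lines can be generated with a simple loop (a sketch, not the original code; Ward’s method is requested here as ward.D2, although ward.D is also available):

par(mfrow = c(2, 3))   # one panel per distance measure
for (d in c("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski")) {
  plot(hclust(dist(mat, method = d), method = "ward.D2"), hang = -1, main = d)
}
par(mfrow = c(1, 1))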

If you start toying with agglomeration methods too, you realize that the number of combinations is high (6 distance measures × 8 agglomeration methods = 48 combinations).3 There are good theoretical reasons for choosing one measure over the others. For an inventory of distance metrics, see Divjak and Fieller (2014, 417–418). For an inventory of agglomeration methods, see Everitt et al. (2011, 79).

In practice, however, the input matrices that tend to be compiled in corpus linguistics are sparse (i.e. matrices in which most of the elements are zero). In our input matrix, 2080 cells out of 3885 are zeros. Because the Canberra distance metric handles the relatively large number of empty occurrences well, it is an interesting option (Desagulier 2014, 163). With respect to the agglomeration method, Ward’s is widely used. Although sensitive to outliers (i.e. observations that deviate significantly from the other members of the sample in which they occur), it has the advantage of generating clusters of moderate size. As Divjak and Fieller (2014, 417–418) put it: “[u]se of squared distances penalises spread out clusters and so results in compact clusters without being as restrictive as complete linkage.”

In R, we select the distance measure as an argument of dist() and the amalgamation method as an argument of hclust(). Let us select the combination Canberra–Ward. We obtain Fig. 4.
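One way to do this (again assuming ward.D2 for Ward’s method; the resulting canberra.ward object is reused below):

dist.canberra <- dist(mat, method = "canberra")
canberra.ward <- hclust(dist.canberra, method = "ward.D2")
plot(canberra.ward, hang = -1)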

Fig. 4 A cluster dendrogram of text categories in the Brown corpus based on the distribution of prepositions (distance: Canberra; amalgamation: Ward)

Divjak and Fieller (2014, 426) note that the choice of a metric does not usually have much influence on the shape of the clusters. When it does, “thought must be given to why such differences occur and which of the methods is the most appropriate for the research questions of interest”. While this is true to some extent, Fig. 3 shows that, depending on the kind of data, deciding which metric to use is not a trivial moment in the analysis.

Grouping clusters into classes

The rect.hclust() function allows you to group clusters into user-defined classes. The code below draws five red rectangles around the branches of the dendrogram, highlighting five cluster classes.

cluster.classes <- rect.hclust(canberra.ward, 5)

The result is displayed in Fig. 5.

Fig. 5 A cluster dendrogram of text categories in the Brown corpus based on the distribution of prepositions with 5 cluster classes (distance: Canberra; amalgamation: Ward)

Inspection of the dendrogram reveals that the use of prepositions does not match the neat delimitation of text categories in the Brown corpus. For example, prepositions are not used identically in all press subgenres or all fiction subgenres. The most consistent cluster is the one in the middle of the dendrogram (fiction).

Because of its triangular nature, the distance matrix is very similar to the tables of driving distances between cities that you find in the road atlases of the old days.

See Desagulier (2014) for an illustration of what you can do with pvclust.

There are more measures than those included in the dist() and hclust() functions, which means that the actual number of possible combinations is higher.

]]>https://corpling.hypotheses.org/2622/feed0A data-driven approach to identifying development stages in diachronic corpus linguisticshttps://corpling.hypotheses.org/2551
https://corpling.hypotheses.org/2551#respondThu, 23 May 2019 15:43:56 +0000https://corpling.hypotheses.org/?p=2551In a previous post, I showcased the development of the split infinitive 1 I wanted to check whether the split infinitive had spiked after the airing of the original Star Trek series in the late 1960s (to find out whether that was indeed the case, I invite you to read the post!). In other words, I had a theoretical time-partition in mind (before and after 1967), and I wanted to check whether it had any empirical relevance. Another strategy consists in letting the data decide what time-partitions are empirically relevant from the start.

The corpora that diachronic linguists work with are pre-partitioned into years or decades. The stacked barplot below compares the distributions of the split infinitive (blue) and the unsplit infinitive (orange) across the 20 decades spanned by the Corpus of Historical American English (COHA, 1810s-2000s).

Distribution of the split (blue) and unsplit (orange) infinitives

The barplot does a good job of showing that the unsplit infinitive is more frequent overall than its split counterpart. Also apparent in the plot is that the popularity of the split infinitive increases steadily from the 1940s onwards, whereas the reverse trend is observed for the unsplit infinitive over the same period.

The question is whether the plot can help us identify stages in the development of the split infinitive. To some extent, it can. In the first stage, which spans from the 1810s to the 1840s, the split infinitive is hardly used at all. A second stage spanning from the 1840s to the 1900s (with the exception of the 1890s) shows a pattern of moderate increase. A third stage between the 1900s and the 1940s corresponds to a period of decrease. Finally a fourth stage of dramatic increase is observed between the 1940s and the 2000s. With respect to the unsplit infinitive, we observe an increase from the 1810s until the 1920s and a slight decrease after that (notwithstanding periods of ups and downs at regular intervals).

A more elaborate method is proposed by Gries & Hilpert (2008): Variability-based Neighbor Clustering (VNC). It is similar to hierarchical cluster analysis (HCA, see this post), i.e. a method that displays a hierarchy of clusters, typically in the form of dendrograms with branches and leaves.2 The figure below exemplifies what a typical dendrogram looks like.

HCA dendrogram (distance metric: Euclidean; amalgamation rule: Ward)

The dendrogram is based on the frequencies of the 1600 most frequent types of the split infinitive found in COHA. HCA takes as input a contingency table which is then converted into a distance matrix with a distance metric (here Euclidean). Next, an amalgamation rule is applied. It specifies how the elements in the matrix are clustered. The decades that are the most similar are amalgamated first. The plot is complete when all clusters have been joined.

The clusters make sense. We see that the decades 1990s and 2000s belong to the same cluster. This is hardly surprising insofar as these two decades correspond to a period of dramatic increase for the split infinitive.

VNC proceeds likewise, except that it does justice to the chronological linearity of linguistic developments: only adjacent decades may be merged. When VNC is run on the basis of a single string of frequency values, it uses the standard deviation as a similarity measure and averaging as the amalgamation rule.

We apply VNC to the frequency development of the split infinitive. The script proceeds as follows. It takes as input the sequence of frequencies shown in the frequency table below.

For each pair of adjacent decades, the algorithm determines the standard deviation.3 For example, for the decades 1830 (frequency = 10) and 1840 (frequency = 25), the standard deviation is 10.61 (see leftmost column in the table below). In R, the standard deviation is calculated with the sd() function.

round(sd(c(10,25)), 2)
[1] 10.61

An illustration of how VNC clusters decades (in dark blue: smallest standard deviation for each iteration; in orange: decades that are clustered as a result)

In the first iteration (leftmost column in the above table), VNC finds that the two decades with the smallest standard deviation are 1810 and 1820. These two are therefore merged first and assigned a mean frequency of (4 + 4)/2 = 4. This new value, combined with the others (the column « freqs » in Iteration 2), serves as the basis for the second VNC iteration. This time, the closest neighbors are [1810-1820] and 1830 (standard deviation = 3.46).

round(sd(c(4,4,10)), 2)
[1] 3.46

These two periods are merged and a mean value is computed for the third iteration.

As VNC proceeds through the iterations (five of which are displayed in the above table), the time groupings become larger until all 20 decades are merged into a single large cluster. This happens when the last decade (2000) is merged with the last-but-one cluster.
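To make the iteration logic concrete, here is a rough sketch of the merging loop in R. This is not Gries & Hilpert’s script: for simplicity, a merged period keeps its pooled raw frequencies when the standard deviations are recomputed (which reproduces the values 0 and 3.46 above), and no dendrogram is drawn.

vnc.sketch <- function(freqs, labels) {
  groups <- as.list(freqs)                      # one group per decade to start with
  names(groups) <- labels
  while (length(groups) > 1) {
    sds <- sapply(seq_len(length(groups) - 1),  # sd of each pair of adjacent groups
                  function(i) sd(c(groups[[i]], groups[[i + 1]])))
    i <- which.min(sds)                         # adjacent pair with the smallest sd
    cat("merging", names(groups)[i], "and", names(groups)[i + 1],
        "- sd =", round(sds[i], 2), "\n")
    groups[[i]] <- c(groups[[i]], groups[[i + 1]])   # pool the frequencies of the merged periods
    names(groups)[i] <- paste(names(groups)[i], names(groups)[i + 1], sep = "-")
    groups[[i + 1]] <- NULL
  }
}

vnc.sketch(c(4, 4, 10, 25), c("1810", "1820", "1830", "1840"))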

The script displays the clusters in the form of a dendrogram.

VNC dendrogram illustrating the development stages of the split infinitive in COHA

The dendrogram shows which periods have been merged. The height at which two clusters merge is a measure of how different they are from each other: the clusters merge at the heights of the cumulative sums of standard deviations.

We see that, indeed, the decades 1810 and 1820 are merged first, and that the height at which they are merged corresponds to their standard deviation (0). The next two periods to be merged are [1810-1820] and 1830, which display the smallest standard deviation (3.46) in the second iteration of the algorithm. The first two standard deviations add up to 3.46, which is the height of the second cluster.

We are left with a question: how many development stages are considered relevant to the diachronic study? The scree plot below allows the linguist to decide by comparing the distances between successive mergers. This is measured, once again, with standard deviations.

VNC scree plot displaying the distances of all clusters in reverse order, starting with the last one (2000)

As we move from left to right, the difference between mergers decreases gradually. We start from the peak of the slope (on the left), go down a steep decrease, and count the number of clusters until we reach a point where the slope levels off. The graph shows that the greatest differences are found between the 5 leftmost clusters, which are in fact the last 5 clusters. The data suggest that partitioning the 1810-2000 period into 5 development stages is relevant. Of course, we are free to decide otherwise and add more clusters, because the farther right we go, the more information we get. But the farther right we go, the less substantial the information captured by each additional cluster.

We end up with the following time partitions:

1810-1860

1870-1910

1920-1960

1970-1980

1990

2000

Given that the decades 1990 and 2000 are characterized by dramatic increase, we can arguably merge them. It looks like the split infinitive did undergo some change between the 1960s and the 1970s. However, this shift is but the tip of the iceberg. The split infinitive has undergone other shifts at varying rates over the last two centuries.

VNC dendrogram with development stages

There is more to VNC than what I have shown here. Extended applications such as VNC on the basis of multiple measurements, detection of outliers with VNC, and VNC and constructional change can be found in Hilpert (2013: Sect. 2.3).

« The standard deviation (σ) is the most widely used measure of dispersion. It is the square root of the variance. » (Desagulier, 2017: 148)

]]>https://corpling.hypotheses.org/2551/feed0Corpus Linguistics Summer School 2019https://corpling.hypotheses.org/2530
https://corpling.hypotheses.org/2530#respondMon, 11 Mar 2019 12:54:17 +0000https://corpling.hypotheses.org/?p=2530I am pleased to announce that I will teach a course on exploratory statistics for corpus linguistics at the Corpus Linguistics Summer School. The summer school will take place at the University of Birmingham from 24 to 28 June 2019.

]]>https://corpling.hypotheses.org/2530/feed0BNC.query(). An interactive R script for a sociolinguistic exploration of the spoken component of the BNC-XMLhttps://corpling.hypotheses.org/2252
https://corpling.hypotheses.org/2252#respondTue, 08 Jan 2019 16:22:14 +0000https://corpling.hypotheses.org/?p=2252BNC.query() is an interactive R script that I wrote for a course in computational sociolinguistics last semester. It is designed to run queries over the BNC-XML (spoken component). It extracts a single word or a complex expression, along with speaker information (gender, age class, and social grade), tabulates the results, computes frequencies, and makes a barplot or an association plot (along with a χ2 test).

For this demo, I have used R version 3.5.1 (2018-07-02) — « Feather Spray » on RStudio (version 1.1.463). I cannot guarantee that the script will work on previous versions.

Obtaining the corpus

For the script to run properly, the BNC-XML files must be installed on your system. To download the BNC-XML for free, visit the Oxford Text Archive and download the zipped archive. Once you have downloaded the corpus, unzip it and place it preferably at the root of your hard drive. The original architecture of the BNC-XML directory should be modified so that all the text files are located in a single folder.

The path to the folder that contains the corpus files

At some point, the script will need the path to the directory that contains the tagged corpus files. If you have installed the corpus at the root of your hard drive, the path is likely to be the following:

C:\\BNC-XML\\texts (PC);

/BNC-XML/texts (Mac).

If you want to avoid retyping the path each time you run a query, save it in a text file somewhere to copy/paste it later on.

Launching the program

Before you run BNC.query(), I advise you to check your working directory. The working directory is a folder which you want R to read data from and store output into. This is where the script will save your data (4 files). To know what your current working directory is, enter:

getwd()

To change the working directory, enter its path between quotes as an argument of setwd(). For example, if I want my working directory to be /Users/guillaumedesagulier, I enter:
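setwd("/Users/guillaumedesagulier")   # adjust to your own path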

Note that if you want to compare a one-word expression and a multiword expression (e.g. « hello » and « good morning »), you are going to have to run two separate queries.1

If you investigate a single lexeme or several simple lexemes, you may specify a POS tag. This option is not available for multiword expressions. Given that BNC-XML was tagged with CLAWS5, read the tagset and enter the correct tag (e.g. ITJ for an interjection, AJ0 for an adjective, etc.). If you do not want to bother looking for the tag, if you think the tag is irrelevant, or if the lexemes you investigate belong to different grammatical categories, enter \w+ (this is computer language for ‘one or more word characters’, i.e. any tag).

For the purpose of illustration, we explore non-standard plural agreement in the context of be (you/we/they was) in a comparative perspective. The search expression is the following:

(you|we|they) (was|were)

which translates as « you or we or they, followed by a space, followed by was or were ».
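Outside the script, a quick test shows what this regular expression matches:

grepl("(you|we|they) (was|were)", c("you was", "they were", "she was"))
# [1]  TRUE  TRUE FALSE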

The script converts your search expression into XML…

Running the query over the whole corpus or just a sample

Next, the script asks you whether you want to run the query over the whole spoken BNC-XML (908 files) or just a random sample (200 files). The latter is faster than the former and useful if you are not sure about the outcome of your query. Here, we enter 1 to run the query over the whole corpus.

The script prompts you to enter the path to the directory that contains the corpus files. If you work with a PC, don’t forget to double each backslash (see above). Because I work with a Mac, and because the corpus is installed at the root of my hard drive, I enter /BNC-XML/texts.

Select your operating system. This guarantees that the working files are saved with the proper specifications.

The script prints the paths to the corpus files. This is just to check that the path to the directory that contains them has been typed correctly. Press ENTER to launch the query.

The script loops over each corpus file. Sit back, relax, and wait. A progress bar indicates how much time you have to drink that coffee while the program is working. On my Mac, the script takes between 5 and 10 minutes to complete when all the corpus files are selected.2

Once the script has finished looping over the files, press ENTER to save the results in a text file named <interim.results.txt> (the file is also saved as an R data file named <interim.results.rds>). If you open the text file, you will see that there is one line per observation. Each word is described by the ID of the speaker who used it. Next, press ENTER to save the results in a text file named <data.final.txt>.3

After pressing ENTER, the first 20 lines of <data.final.txt> are displayed. Each observation consists of the speaker ID, the word or expression itself, the speaker’s gender, age group, and social grade. Read the BNC-XML documentation to know more about these variables.

You may now plot the results. Three options are available:

a barplot based on raw frequencies,

an association plot,

no plot (+ archive working files).

The association plot makes sense only when your study consists of at least two words or expressions.

For the working files to be archived means that they are time-stamped. This prevents the script from over-writing your working files when you run another query.

We choose the association plot. We can investigate three effects:

gender,

age,

social class.

We choose social class.

The script provides the table of counts with the variable(s) in the rows and the age groups in the columns. It also runs a χ2 test with simulated p-values (see Corpus Linguistics and Statistics with R, Sect. 8.9). If your table does not meet the test’s assumptions, an error message may be issued. This is likely to occur if you choose to work on a portion of the corpus only because the frequency counts might be too low.
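By way of illustration, here is how the same kind of test and plot can be reproduced by hand on a contingency table. The counts below are invented; the script builds the real table for you.

counts <- matrix(c(35, 18, 52, 24,
                   240, 310, 280, 190),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("was", "were"), c("AB", "C1", "C2", "DE")))
chisq.test(counts, simulate.p.value = TRUE, B = 2000)  # chi-squared test with simulated p-values
assocplot(counts)  # Cohen-Friendly association plot: black tiles above the baseline, red below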

The association plot below can be exported as an image or as pdf from R or RStudio. Each cell of the table of counts is represented by a tile whose area is proportional to the difference between observed and expected frequencies. The dotted line is the baseline. It represents independence between the variables. If the observed frequency of a cell is greater than its expected frequency, the tile appears above the baseline and is shaded black. If the observed frequency of a cell is smaller than its expected frequency, the tile appears below the baseline and is shaded red. One interesting finding here is that speakers from social group C2 (skilled working class) prefer the non-standard variant (for more details on how to interpret an association plot, see Corpus Linguistics and Statistics with R, pp. 184-185).

You now have the choice between running another query or exiting the program. If you choose to exit, the script adds a time stamp to your working files, for archiving purposes and later re-use. If my working files are time-stamped as follows:

<interim.results_08Jan2019141334.txt>, and

<data.final_08Jan2019141334.txt>,

this means that they were created on January 8th, 2019 at 2.13pm (and 34 seconds!). Again, as long as you do not run the script from scratch again, your current working files are still intact.

Built-in plotting functions

If you want to make several plots, you can do so by using a built-in function. This will only work for the current session, i.e. as long as the working files have not been over-written. Here is the list of functions:

function call              what it does
make.barplot()             opens the barplot menu
barplot.gender()           makes a barplot based on gender
barplot.age()              makes a barplot based on age groups
barplot.social.class()     makes a barplot based on social grades
make.assocplot()           opens the association plot menu
assplot.gender()           makes an association plot based on gender
assplot.age()              makes an association plot based on age groups
assplot.soc.class()        makes an association plot based on social grades

The barplot below is obtained by entering barplot.social.class(). It is made with ggplot2.
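For the record, the built-in functions wrap ggplot2 code along the following lines. This is a sketch only: it assumes that <data.final.txt> is tab-separated and that the columns are named word and social.grade, which may differ from the actual working file.

library(ggplot2)
df <- read.delim("data.final.txt")                 # the working file saved by the script
ggplot(df, aes(x = social.grade, fill = word)) +   # "social.grade" and "word" are assumed column names
  geom_bar(position = "dodge") +
  labs(x = "social grade", y = "raw frequency")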

Video recap

The video below has been edited. In real life, the script takes slightly longer to run. The processing time depends on your hardware.

Cite this article as: Guillaume Desagulier, "BNC.query(). An interactive R script for a sociolinguistic exploration of the spoken component of the BNC-XML," in Around the word, 08/01/2019, https://corpling.hypotheses.org/2252.

If you want to plot the results, you can easily merge the output files later on.

]]>https://corpling.hypotheses.org/2252/feed0BNC.2014.query(). An interactive R script for a sociolinguistic exploration of the spoken component of the BNC-2014https://corpling.hypotheses.org/1632
https://corpling.hypotheses.org/1632#commentsThu, 03 Jan 2019 08:40:41 +0000https://corpling.hypotheses.org/?p=1632Last edit: June 7th, 2019

BNC.2014.query() is an interactive R script that I wrote for a course in computational sociolinguistics last semester. It is designed to run queries over the BNC-2014 (spoken component). It extracts a single word or a complex expression, along with speaker information (gender, age class, and social grade), tabulates the results, computes frequencies, and makes a barplot or an association plot (along with a χ2 test).

For this demo, I have used R version 3.5.1 (2018-07-02) — « Feather Spray » on RStudio (version 1.1.463). I cannot guarantee that the script will work on previous versions.

Obtaining the corpus

For the script to run properly, the BNC-2014 files must be installed on your system. To download the BNC-2014 for free, visit this page and complete the signup form at the bottom of the page. Once you have downloaded the corpus, unzip it and place it preferably at the root of your hard drive. The architecture of the bnc2014spoken directory is displayed below.

The path to the folder that contains the corpus files

At some point, the script will need the path to the directory that contains the tagged corpus files. If you have installed the corpus at the root of your hard drive, the path is likely to be the following:

C:\\bnc2014spoken\\spoken\\tagged (PC);

/bnc2014spoken/spoken/tagged (Mac).

If you want to avoid retyping the path each time you run a query, save it in a text file somewhere to copy/paste it later on.

Launching the program

Before you run BNC.2014.query(), I advise you to check your working directory. The working directory is a folder which you want R to read data from and store output into. This is where the script will save your data (4 files). To know what your current working directory is, enter:

getwd()

To change the working directory, enter its path between quotes as an argument of setwd(). For example, if I want my working directory to be /Users/guillaumedesagulier, I enter:

For the purpose of illustration, we compare the use of « hello » and « hi ». The search expression is (hi|hello).

Note that if you want to compare a one-word expression and a multiword expression (e.g. « hello » and « good morning »), you are going to have to run two separate queries.1

If you investigate a single lexeme or several simple lexemes, you may specify a POS tag. This option is not available for multiword expressions. Given that BNC-2014 was tagged with CLAWS7, read the tagset and enter the correct tag. Both « hello » and « hi » are interjections. The CLAWS7 tag for interjections is UH. If you do not want to bother looking for the tag, if you think the tag is irrelevant, or if the lexemes you investigate belong to different grammatical categories, enter \w+ (this is computer language for ‘one or more word characters’, i.e. any tag).

The script converts your search expression into XML…

Entering the path to the directory that contains the corpus files

Next, the script prompts you to enter the path to the directory that contains the corpus files. If you work with a PC, don’t forget to double each backslash (see above).

Running the query over the whole corpus or just a sample

After entering the path to the corpus files, the script asks you if you want to run the query over the whole corpus (1251 files) or just a sample (100 files). The latter is much faster than the former and useful if you are not sure about the outcome of your query. Here, we enter 1 to run the query over the whole corpus.

The script prints the paths to a random selection of ten corpus files. This is just to check that the path to the directory that contains them has been typed correctly. Press ENTER to launch the query.

Now, the script loops over each corpus file. Sit back, relax, and wait. A progress bar indicates how much time you have to drink that coffee while the program is working.

On my Mac, the script takes between 5 and 10 minutes to complete when all the corpus files are selected.2

Once the script has finished looping over the files, press ENTER to save the results in a text file named <interim.results.BNC.2014.txt> (the file is also saved as an R data file named <interim.results.BNC.2014.rds>). Here is a snapshot of the file. There is one line per observation. Each word is described by the ID of the speaker who used it.

Next, press ENTER to save the results in a text file named <data.final.BNC.2014.txt>.3

The first 20 lines of <data.final.BNC.2014.txt> are displayed. Each observation consists of the speaker ID, the word itself, the age group, the gender, the speaker’s city, the speaker’s dialect, and the social grade. Read the BNC-2014 documentation to know more about these variables.

You may now plot the results. Three options are available:

a barplot based on raw frequencies,

an association plot,

no plot (+ archive working files).

The association plot makes sense only when your study consists of at least two words or expressions.

For the working files to be archived means that they are time-stamped. This prevents the script from over-writing your working files when you run another query.

We choose the association plot. We can investigate three effects:

gender,

age,

social class.

We choose age.

The script provides the table of counts with the variable(s) in the rows and the age groups in the columns. It also runs a χ2 test (see Corpus Linguistics and Statistics with R, Sect. 8.9). If your table does not meet the test’s assumptions, an error message may be issued. This is likely to occur if you choose to work on a portion of the corpus only.

The association plot below can be exported as an image or as pdf from R or RStudio. Each cell of the table of counts is represented by a tile whose area is proportional to the difference between observed and expected frequencies. The dotted line is the baseline. It represents independence between the variables. If the observed frequency of a cell is greater than its expected frequency, the tile appears above the baseline and is shaded black. If the observed frequency of a cell is smaller than its expected frequency, the tile appears below the baseline and is shaded red. We see that speakers from age groups 19-29 and, to a much lesser extent, 60-69 prefer hi over hello.4

You now have the choice between running another query or exiting the program. If you choose to exit, the script adds a time stamp to your working files, for archiving purposes and later re-use. If my working files are time-stamped as follows:

<interim.results.BNC.2014_29Dec2018121603.txt>, and

<data.final.BNC.2014_29Dec2018121603.txt>,

this means that they were created on December 29th, 2018 at 12.16pm (and 3 seconds!). Again, as long as you do not run the script from scratch again, your current working files are still intact.

Built-in plotting functions

If you want to make several plots, you can do so by using a built-in function. This will only work for the current session, i.e. as long as the working files have not been over-written. Here is the list of functions:

function call              what it does
make.barplot()             opens the barplot menu
barplot.gender()           makes a barplot based on gender
barplot.age()              makes a barplot based on age groups
barplot.social.class()     makes a barplot based on social grades
make.assocplot()           opens the association plot menu
assplot.gender()           makes an association plot based on gender
assplot.age()              makes an association plot based on age groups
assplot.soc.class()        makes an association plot based on social grades

The barplot below is obtained by entering barplot.age(). It is made with ggplot2.

Video recap

The video below has been edited. In real life, the script takes slightly longer to run. The processing time depends on your hardware.

Feedback

This program is still in beta. If you experience bugs or think of better or faster functionalities, please leave a comment.

Citing the program

If you use the program for your research, please cite it as follows:

Desagulier, Guillaume. 2019. BNC.2014.query(), v. 0.3. An R script for a sociolinguistic exploration of the spoken component of the BNC 2014.

Cite this article as: Guillaume Desagulier, "BNC.2014.query(). An interactive R script for a sociolinguistic exploration of the spoken component of the BNC-2014," in Around the word, 03/01/2019, https://corpling.hypotheses.org/1632.

Marsaz, Drôme, France – Jan 2019

If you want to plot the results, you can easily merge the output files later on.

]]>https://corpling.hypotheses.org/1632/feed7The spoken component of the British National Corpus 2014 is out!https://corpling.hypotheses.org/1388
https://corpling.hypotheses.org/1388#respondSat, 22 Dec 2018 22:11:50 +0000https://corpling.hypotheses.org/?p=1388British National Corpus 2014 is a project led by the Centre for Corpus Approaches to Social Science at Lancaster University to create a 100M word corpus of contemporary British English, the BNC-XML, which is now over 20 years old. On November 19th, 2018, the spoken component of the BNC 2014 was made available for download for offline analysis. Before then, it was available via Lancaster University’s CQPweb. It is now accessible online in full, free of charge.

The 11.5-million-word spoken component of the BNC2014 consists of transcripts of recorded conversations involving 672 speakers from different parts of the UK between 2012 and 2016. The corpus breaks down into 1,251 files, i.e. one per conversation.

The ‘old’ BNC

I remember how thrilled my syntax professor was, back in 1999, when he announced that my department had purchased a CD-Rom copy of the original British National Corpus (BNC 1.0 if I remember correctly). When we asked what we could do with it, he replied: « Well, I’m sure we can glean some interesting examples. » He was right, of course, but browsing the corpus in search of interesting examples, just like one would chase butterflies with a net, was not even the tip of the iceberg of things to do with such a textual treasure trove. It was not until years later, when I became a corpus linguist, armed with the proper techniques, that I realized the full range of queries, extractions, tabulations, and quantifications that came with the mining of these raw files. I remember purchasing the BNC-XML as a CD-ROM shortly after its release in 2007. It is now available for download in full, free of charge from the Oxford Text Archive, along with other versions (BNC sampler and baby edition).

The BNC-XML is large (4,049 corpus files, about 100M word tokens) and annotated: it consists of more than the actual words in the document. Each corpus file comes with a TEI header, which provides a rich and structured description of its contents (mode, topic, genre, subgenre, date of original production, author/speaker, encoding, revisions, etc.). This information makes the file machine-processable.

Each word is tagged in an XML fashion, following the CLAWS5 tagset. Here is sentence 126 from file A00.xml:

Each word is delimited by a start tag (<w...>) and a closing tag (</w>). The start tag contains:

a code letter (w for « word »);

a POS tag based on the CLAWS5 tagset (e.g. AV0 for an adverb);

the head word (hw), or lemma;

the value of the CLAWS5 wordclass (e.g. ADV for an adverb).
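By way of illustration, an adverb like however would be encoded along the following lines (a made-up example, not an actual line from the corpus; if I recall correctly, the attributes are named c5, hw, and pos):

<w c5="AV0" hw="however" pos="ADV">However </w>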

Put together, these annotations yield a file that looks like the following:

Figure 1. A snapshot from file A00.xml

For the above reasons, the BNC-XML has been one of my favorite corpora for many years. See chapters 3, 4, and 5 of my book to see what you can do with XML markup and tagging.

However, the BNC-XML is getting old. The texts and transcripts that the BNC-XML is made of were produced between the 1960s and 1993. There is nothing inherently wrong with using an ageing corpus, of course, but English has evolved since then, and an update is never a bad thing.

So what’s new?

The new BNC is meant to be an updated version of its predecessor with a target objective of 100M words, once the written component is appended to the current project. The structure is therefore very similar, except this time the metadata are kept apart from the corpus files.

Figure 3. The file structure of the spoken BNC 2014

The corpus comes in two flavors: tagged and untagged. The structure of the tagged corpus files is slightly different from what we had in the original BNC. This time, we have one word per line. Below is a screenshot from file S2A5-tgd.xml.

BNC 2014 frequency lists

In future posts, I will present concrete illustrations of what can be done with this corpus. In the meantime, here are two BNC2014 frequency lists: one unlemmatized, the other lemmatized. The code that was used to make these freqlists is from my book ;-).
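For readers who want a rough idea of what such a frequency-list script does (this is a sketch, not the code from the book), the gist is: read the tagged files, strip the XML markup, and tabulate the remaining word forms.

# a rough sketch, not the code from the book
files <- list.files("/bnc2014spoken/spoken/tagged", full.names = TRUE)  # path used earlier in this post
words <- unlist(lapply(files, function(f) {
  lines <- readLines(f, warn = FALSE)
  tolower(gsub("<[^>]+>", "", lines))   # strip the XML tags, keep the word forms (one word per line)
}))
words <- words[nchar(words) > 0]        # discard lines that contained only markup
freq.list <- sort(table(words), decreasing = TRUE)
head(freq.list, 10)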