The present study explores automatic classification of Swedish politicians and their speeches into classes based on personal traits (gender, age, and political affiliation) as a means of measuring and analyzing how these traits influence language use. Support vector machines classified 200-word passages represented by binary bag-of-word-forms vectors. Different feature selections were tried. The performance of the classifiers was assessed using test data from authors unseen in the training data. Author-level predictions derived from twenty-one text-level predictions reached an accuracy rate of 81.2% for gender, 89.4% for political affiliation, and 78.9% for age. Classification concerning each basic distinction was applied to general populations of politicians and to cohorts defined by the other classes. The outcomes suggest that the extent to which these personal traits are expressed in language use varies considerably among the different cohorts, and that different traits affect different layers of the vocabulary. The accuracy rates for gender classification were higher for the right-wing and older cohorts than for their opposites. Age prediction gave higher accuracy for the right-wing cohort. Political classification gave the highest accuracy rates when all forms were included in the feature sets, whereas feature sets restricted to verbs or function words gave the highest scores for gender prediction and the lowest for political classification.
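The abstract derives one author-level prediction from twenty-one text-level predictions but does not state the aggregation rule; a minimal sketch, assuming simple majority voting over the passage-level labels:

```python
from collections import Counter

def author_prediction(passage_predictions):
    """Aggregate passage-level class predictions (e.g. 21 of them)
    into a single author-level prediction by majority vote."""
    votes = Counter(passage_predictions)
    # most_common(1) returns [(label, count)] for the winning label
    return votes.most_common(1)[0][0]

# e.g. 21 passage-level gender predictions for one author
passages = ["F"] * 13 + ["M"] * 8
print(author_prediction(passages))  # -> F
```

Any tie-breaking scheme for an even split is likewise an assumption not covered by the abstract.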

We propose an automatic method for attributing manuscript pages to scribes. The system uses digital images as published by libraries. The attribution process involves extracting approximately letter-size components from each query page. This is done by means of binarization (ink-background separation), connected component labelling, and further segmentation guided by the estimated typical stroke width. Components are extracted in the same way from pages of known scribal origin. This allows us to assign a scribe to each query component by means of nearest-neighbour classification. Distance (dissimilarity) between components is modelled by simple features capturing the distribution of ink in the bounding box defined by the component, together with Euclidean distance. The set of component-level scribe attributions, which typically includes hundreds of components for a page, is then used to predict the page scribe by means of a voting procedure: the scribe who receives the largest number of votes from the 120 strongest component attributions is proposed as the page's scribe. The attribution process allows the argument behind an attribution to be visualized for a human reader: the writing components of the query page are exhibited along with the matching components of the known pages. The attribution is thus open to inspection and analysis using the methods and intuitions of traditional palaeography. The present system was evaluated on a dataset covering 46 medieval scribes writing in Carolingian minuscule, Bastarda, and a few other scripts. The system achieved a mean top-1 accuracy of 98.3% with respect to the first scribe proposed for each page, when the labelled data comprised one randomly selected page from each scribe and nine unseen pages per scribe were to be attributed in the validation procedure. The experiment was repeated 50 times to even out random variation effects.
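The nearest-neighbour-plus-voting pipeline can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the ink-distribution features are stood in for by plain numeric tuples, and the "strength" of an attribution is assumed to mean smallest distance.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def attribute_page(query_components, labelled_components, top_n=120):
    """Each query component votes for the scribe of its nearest
    labelled component; only the top_n strongest (smallest-distance)
    attributions are counted in the page-level vote."""
    hits = []
    for q in query_components:
        dist, scribe = min(
            (euclidean(q, feats), scribe)
            for scribe, feats in labelled_components
        )
        hits.append((dist, scribe))
    hits.sort(key=lambda h: h[0])            # strongest matches first
    votes = Counter(scribe for _, scribe in hits[:top_n])
    return votes.most_common(1)[0][0]

# toy example with 2-dimensional stand-in features
known = [("scribe_A", (0.10, 0.90)), ("scribe_B", (0.80, 0.20))]
page = [(0.15, 0.85), (0.20, 0.80), (0.90, 0.10)]
print(attribute_page(page, known, top_n=3))  # -> scribe_A
```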

This article explores a minimally supervised method for extracting components, mostly letters, from historical manuscripts and clustering them into classes capturing linguistic equivalence. The clustering uses the DBSCAN algorithm and an additional classification step. This pipeline, in combination with human annotation, gives us cheap, but partial, manuscript transcription. Experiments with different parameter settings suggest that a system like this should be tuned separately for different categories, rather than relying on one-pass application of algorithms partitioning the same components into non-overlapping clusters. The method could also be used to extract features for manuscript classification, e.g. dating and scribe attribution, as well as to extract data for further palaeographic analysis.
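For readers unfamiliar with DBSCAN, a minimal density-based clustering sketch follows. The article clusters image components by visual similarity; here 2-D points stand in for component feature vectors, and the additional classification step mentioned in the abstract is omitted.

```python
import math

def region_query(points, i, eps):
    # indices of all points within eps of points[i] (including itself)
    return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster id per point (-1 = noise)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = -1                    # noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbours)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster           # noise point becomes border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is a core point: expand
                seeds.extend(j_neighbours)
    return labels

points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 10)]
print(dbscan(points, eps=0.5, min_pts=2))  # -> [0, 0, 0, 1, 1, 1, -1]
```

The eps and min_pts parameters are exactly the settings the article suggests tuning separately per letter category.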

308. Code and Data for “Classification of Medieval Documents: Determining the Issuer, Place of Issue, and Decade for Old Swedish Charters”

Code and data for the article Classification of Medieval Documents: Determining the Issuer, Place of Issue, and Decade for Old Swedish Charters (to appear in DHN 2020, Digital Humanities in the Nordic Countries, Riga, 17–20 March 2020).

The study based on this code and dataset is a comparative exploration of different classification tasks for Swedish medieval charters (transcriptions from the SDHK collection) and different classifier setups. In particular, we explore the identification of the issuer, place of issue, and decade of production. The experiments used features based on lowercased words and character 3- and 4-grams. We evaluated the performance of two learning algorithms: linear discriminant analysis and decision trees. For evaluation, five-fold cross-validation was performed, and we report accuracy and macro-averaged F1 score. The validation made use of six labeled subsets of SDHK, combining the three tasks with Old Swedish and Latin. Issuer identification for the Latin dataset (595 charters from 12 issuers) reached the highest scores, above 0.9, for the decision tree classifier using word features. The best corresponding accuracy for Old Swedish was 0.81. Place and decade identification produced lower performance scores for both languages. Which classifier design is best seems to depend on peculiarities of the dataset and the classification task. The present study does, however, support the idea that text classification is useful also for medieval documents characterized by extreme spelling variation.
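The word and character n-gram features described above are straightforward to extract; a minimal sketch, noting that the study's exact tokenization and handling of whitespace inside n-grams are not specified in the abstract:

```python
def char_ngrams(text, n):
    """Overlapping character n-grams of a lowercased transcription."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def features(text):
    # lowercased word features plus character 3- and 4-grams
    return text.lower().split() + char_ngrams(text, 3) + char_ngrams(text, 4)

print(char_ngrams("Swerike", 3))  # -> ['swe', 'wer', 'eri', 'rik', 'ike']
```

Character n-grams are a natural choice here because they remain informative under the extreme spelling variation of medieval charters, where whole-word features fragment into many rare forms.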

The present study investigates a method for the attribution of scribal hands, inspired by traditional palaeography in being based on comparison of letter shapes. The system was developed for and evaluated on early medieval Caroline minuscule manuscripts. The generation of a prediction for a page image involves writing identification, letter segmentation, and letter classification. The system then uses the letter proposals to predict the scribal hand behind a page. Letters and sequences of connected letters are identified by means of connected component labeling and split into letter-size pieces. The hand (and character) prediction makes use of a dataset containing instances of the letters b, d, p, and q, cut out from manuscript pages whose scribal origin is known. Letters are represented by features capturing the distribution of foreground. Cosine similarity is used for nearest neighbor classification. The hand behind a page is finally predicted by means of a voting procedure taking the highest-scoring letter-level hits as its input. This hand prediction method was evaluated on pages from five different hands and reached an accuracy above 99% for four of them and 87% for a fifth, significantly more difficult, one. The hand behind single top-listed letters was correctly predicted in 83% of the cases.
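The cosine-similarity nearest-neighbor step can be sketched directly; the feature vectors below are illustrative stand-ins for the foreground-distribution features, which the abstract does not specify in detail:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_hand(query, labelled):
    """labelled: list of (hand, feature_vector) pairs for known letters;
    returns the hand of the most cosine-similar labelled letter."""
    return max(labelled, key=lambda hv: cosine(query, hv[1]))[0]

known = [("hand_1", (1.0, 0.0, 0.2)), ("hand_2", (0.0, 1.0, 0.2))]
print(nearest_hand((0.9, 0.1, 0.2), known))  # -> hand_1
```

Cosine similarity, unlike Euclidean distance, is insensitive to overall vector magnitude, which is convenient when letter images differ in size or ink density.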

This article introduces Token Dependency Semantics (TDS), a surface‐oriented and token‐based framework for compositional truth‐conditional semantics. It is motivated by Davidson's ‘paratactic’ analysis of semantic intensionality (‘On Saying That’, 1968, Synthèse 19: 130–146), which has been much discussed in philosophy. This is the first fully‐fledged formal implementation of Davidson's proposal. Operator‐argument structure and scope are captured by means of relations among tokens. Intensional constituent tokens represent ‘propositional’ contents directly. They serve as arguments to the words introducing intensional contexts, rather than being ‘ordinary’ constituents. The treatment of de re readings involves the use of functions (‘anchors’) assigning entities to argument positions of lexical tokens. Quantifiers are thereby allowed to bind argument places on content tokens. This gives us a simple underspecification‐based account of scope ambiguity. The TDS framework is applied to indirect speech reports, mental attitude sentences, control verbs, and modal and agent‐relative sentence adverbs in English. This semantics is compatible with a traditional view of syntax. Here, it is integrated into a Head‐driven Phrase Structure Grammar (HPSG). The result is a straightforward and ontologically parsimonious analysis of truth‐conditional meaning and semantic intensionality.

This paper explores topic modeling (TM) as a tool for “distant reading” of two Swedish literary corpora. We investigate what kinds of insight and knowledge a TM-based approach can provide to Swedish literary history, and which methodological difficulties are associated with this endeavour. The TM is based on 12- and 24-term chunks of selected verb and common noun lemmas. We generate models with 20, 40, and 100 topics. We also propose a method for quantitative and qualitative gendered thematic analysis by combining TM with a study of how the topics relate to gender in characters and authors. The two corpora contain, respectively, Swedish classics (1821–1941) and recent bestsellers (2004–2017). We find that most of the topics proposed by the TM are easy to interpret as conceptual themes, and that the “same” themes appear for the two corpora and for different TM settings. The study allows us to make interesting observations concerning different aspects of gender and topic distribution.

The purpose of this article is first to give a brief presentation of the functions of the web-based text processing tool Textin 1.2, and then to illustrate these functions by putting the program to use within an ongoing research project concerning developmental aspects of texts written by Swedish pupils during school years 5 to 9. The text begins with a brief description of Textin's main functions, and then moves on to previous research on school texts where computational linguistic methods either were used or could have been used had the technology been accessible at the time. The article then continues with a presentation of the results that Textin delivers, and ends with a discussion of these findings.

This study describes a rule-based pseudonymisation system for Swedish clinical text and its evaluation. The pseudonymisation system replaces already tagged Protected Health Information (PHI) with realistic surrogates. Eight types of PHI are manually annotated in the electronic patient records: personal first and last names, phone numbers, locations, dates, ages, and healthcare units. Two evaluators, both computer scientists, one junior and one senior, judged whether each record in a set of 98 electronic patient records was pseudonymised or not. Only 3.5 percent of the pseudonymised records were correctly judged as pseudonymised, and 1.5 percent of the real records were wrongly judged as pseudonymised, meaning that on average 91 percent of the pseudonymised records were judged as real.
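The core replacement step can be sketched as follows. The tag format and surrogate lists are hypothetical (the abstract does not specify the annotation scheme); a real system would also need consistency handling, e.g. replacing repeated mentions of the same person with the same surrogate.

```python
import random

# Hypothetical surrogate lists and <TAG> placeholder format --
# illustrative only, not the system's actual annotation scheme.
SURROGATES = {
    "FIRST_NAME": ["Anna", "Erik", "Maria"],
    "LAST_NAME": ["Lindberg", "Nilsson", "Berg"],
    "LOCATION": ["Uppsala", "Lund", "Umeå"],
}

def pseudonymise(text, seed=0):
    """Replace each tagged PHI placeholder with a realistic surrogate."""
    rng = random.Random(seed)
    for tag, options in SURROGATES.items():
        while f"<{tag}>" in text:
            # one occurrence at a time, so repeated tags can receive
            # different surrogates
            text = text.replace(f"<{tag}>", rng.choice(options), 1)
    return text

print(pseudonymise("Pat. <FIRST_NAME> <LAST_NAME> remitted to <LOCATION>."))
```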

The Swedish Social Insurance Agency (Försäkringskassan) receives 40 000 e-mails per month, as well as phone calls from citizens, which are handled by almost 500 handling officers. To begin making their work more efficient, we carried out two user-centered design workshops with the handling officers at Försäkringskassan, with the objective of finding out in what ways human language technology might facilitate their work. One outcome of the workshops was that the handling officers required a support tool for handling and answering e-mails from their customers. Three main requirements were identified: finding the correct template to use in the e-mail answers, support for automatically creating templates, and an automatic e-mail answering function. Over the coming two years we will focus on these design challenges within the IMAIL project.

333. Using Uplug and SiteSeeker to construct a cross language search engine for Scandinavian

This paper presents how we adapted a website search engine for cross-language information retrieval, using the Uplug word alignment tool for parallel corpora. We first studied the monolingual search queries posed by visitors to the website of the Nordic Council, which contains five different languages. In order to compare how well different types of bilingual dictionaries covered the most common queries and terms on the website, we tried a collection of ordinary bilingual dictionaries, a small manually constructed trilingual dictionary, and an automatically constructed trilingual dictionary built from the news corpus on the website using Uplug. The precision and recall of the automatically constructed Swedish-English dictionary using Uplug were 71 and 93 percent, respectively. We found that precision and recall increase significantly in samples with high word frequency, but we could not confirm that POS tags improve precision. The collection of ordinary dictionaries, consisting of about 200 000 words, covered only 41 of the top 100 search queries at the website. The automatically built trilingual dictionary, combined with the small manually built trilingual dictionary of about 2 300 words, covered 36 of the top search queries.
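The dictionary evaluation above reduces to standard precision and recall; a minimal sketch, assuming the evaluation unit is a source-target word pair (the abstract does not spell this out):

```python
def precision_recall(proposed_pairs, gold_pairs):
    """Precision and recall of a proposed dictionary (a set of
    source-target word pairs) against a gold-standard set."""
    proposed, gold = set(proposed_pairs), set(gold_pairs)
    correct = len(proposed & gold)
    return correct / len(proposed), correct / len(gold)

p, r = precision_recall(
    {("hus", "house"), ("bil", "truck")},   # proposed entries
    {("hus", "house"), ("bil", "car")},     # gold entries
)
print(p, r)  # -> 0.5 0.5
```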

This paper first describes an experiment to construct an English-Chinese parallel corpus, then applies the Uplug word alignment tool to the corpus, and finally produces and evaluates an English-Chinese word list. The Stockholm English-Chinese Parallel Corpus (SEC) was created by downloading English-Chinese parallel texts from a Chinese website containing law texts that have been manually translated from Chinese to English. The Chinese side of the parallel corpus contains 104 563 Chinese characters, equivalent to 59 918 Chinese words, and the corresponding English corpus contains 75 766 English words. However, since Chinese writing does not use delimiters to mark word boundaries, we had to carry out word segmentation as a preprocessing step on the Chinese corpus. Moreover, since the parallel corpus was downloaded from the Internet, it is noisy with regard to the alignment between corresponding translated sentences. We therefore spent 60 hours of manual work aligning the sentences in the English and Chinese parallel corpus before performing automatic word alignment with Uplug. The word alignment with Uplug was carried out from English to Chinese. Nine respondents evaluated the resulting English-Chinese word list, restricted to entries with frequency three or above, and we obtained an accuracy of 73.1 percent.

This Special Issue contains three papers that are extended versions of abstracts presented at the Seventh Swedish Language Technology Conference (SLTC 2018), held at Stockholm University 8–9 November 2018. SLTC 2018 received 34 submissions, of which 31 were accepted for presentation. The number of registered participants was 113, including attendees at both SLTC 2018 and the two co-located workshops that took place on 7 November. Thirty-two participants were internationally affiliated, of whom 14 were from outside the Nordic countries. Overall participation was thus on a par with previous editions of SLTC, but international participation was higher.

When it comes to the task of classification, the data used for training is the most crucial part. It follows that how this data is processed and presented to the classifier plays an equally important role. This thesis investigates the performance of multiple classifiers depending on the features used, the type of classes to classify, and the optimization of said classifiers. The classifiers of interest are support vector machines (SMO) and the multilayer perceptron (MLP); the features tested are word vector spaces and text complexity measures, along with principal component analysis (PCA) applied to the complexity measures. The features are created from the Stockholm-Umeå Corpus (SUC) and DigInclude, a dataset containing standard and easy-to-read sentences. For the SUC dataset, the classifiers attempted to classify texts into nine different text categories, while for the DigInclude dataset the sentences were classified as either standard or simplified. The classification tasks on the DigInclude dataset showed poor performance in all trials. The SUC dataset showed the best performance when using SMO in combination with word vector spaces. Comparing the SMO classifier on the text complexity measures with and without PCA showed that performance was largely unchanged between the two, although not using PCA performed slightly better.

In any software system or project, a continuous inflow of bug reports is an integral part of its upkeep and development. These bug reports, which can amount to a great number in a large system, are typically handled by several layers of human experts who assign the reports to the corresponding developers. With the advancement of machine learning techniques for document classification, this task could be done automatically with high enough accuracy that the amount of human expert time required would be vastly reduced.

In this thesis, we study automatic bug report assignment in the context of the telecom industry. In particular, we study current state-of-the-art document representation and classification methods applied to bug reports, with an emphasis on the use of word embeddings and multilevel recurrent neural networks (RNNs). The model we emphasize is a two-level RNN that incorporates document structure in its design: the first level consists of a word sequence representing a sentence, and the second level consists of a sequence of such sentence representations, constructing the document representation.

A bug report differs from a general text document in that it often contains boilerplate, software source code, error codes, or machine-generated output that can only be understood by the system developers or maintainers and does not conform to common English document rules. This unique vocabulary, with many unrelated symbols, could deteriorate the accuracy of the classifiers. Therefore, in addition to document classification, we develop a boilerplate removal system based on a stacked generalization ensemble classifier with shallow text features, to separate templates, human-generated text, and machine-generated text.

We conducted our automatic bug report assignment experiments on a sub-collection of eight years of bug reports from our industrial partner. Our experiments show that: (1) the multilevel RNN model performs better than the standard RNN model; (2) bug report assignment is currently best handled by the stacked generalization ensemble method; (3) using the boilerplate removal system to extract only the human-generated text from the bug report documents, various classifiers perform relatively well with only 1/10th of the data, in comparison to handcrafted preprocessing rules.
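"Shallow text features" for separating human-generated from machine-generated text might look like the following sketch. The specific features and the threshold are illustrative assumptions; the thesis's actual feature set and the stacked ensemble that consumes them are not reproduced here.

```python
def shallow_features(line):
    """Cheap surface features of one bug-report line, of the kind that
    could feed a boilerplate/human/machine-text classifier."""
    n = max(len(line), 1)                     # guard against empty lines
    alpha = sum(c.isalpha() for c in line)
    digits = sum(c.isdigit() for c in line)
    symbols = sum(not c.isalnum() and not c.isspace() for c in line)
    return {
        "length": len(line),
        "alpha_ratio": alpha / n,
        "digit_ratio": digits / n,
        "symbol_ratio": symbols / n,
        # hypothetical threshold: symbol-heavy lines look machine-made
        "looks_like_code": symbols / n > 0.15,
    }

print(shallow_features("ERR_CODE=0x7F3A at 0xdeadbeef"))
```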

Whereas the DELOS DRM and the 5S model of digital libraries (DL) address the formal side of DLs, we argue that a parallel 5M model is emerging as best practice worldwide, integrating multicultural, multilingual, multimodal digital objects with multivariate statistics-based document indexing, categorization, and retrieval methods. The fifth M stands for modeling the information-searching behavior of users and of collection development. We show how an extension of the 5S model to Hilbert space (a) points toward the integration of several Ms; (b) makes the tracking of evolving semantic content feasible; and (c) leads to a field interpretation of word and sentence semantics underlying language change. First experimental results from the Strathprints e-repository verify the mathematical foundations of the 5M model.

The AMICUS project was designed to promote scholarly networking in a topical area: motif recognition in texts, including its automation. Before doing so, however, it is necessary to show the theoretical underpinnings of the research idea. My argument is that evidence from different disciplines amounts to fragmented pieces of a bigger picture. By compiling them like pieces of a puzzle, one can see how the concept of formulaicity applies to folklore texts and scholarly communication alike. Regardless of the actual name of the concept (e.g. motif, function, canonical form), what matters is that document parts and whole documents can be characterized by standard sequences of content elements, such formulaic expressions enabling higher-level document indexing and classification by machine learning, as well as document retrieval. Information filtering plays a key role in the proposed technology.

Catalogs project subject-field experience onto a multidimensional map which is then converted to a hierarchical list. In the case of the Aarne-Thompson-Uther Tale Type Catalog (ATU), this subject field is the global pattern of tale content defining tale types as canonical motif sequences. To extract and visualize such a map, we considered the ATU as a corpus and analysed two segments of it: "Supernatural adversaries" (types 300-399) in particular and "Tales of magic" (types 300-749) in general. The two corpora were scrutinized for multiple motif co-occurrences and visualized by two-mode clustering of a bag-of-motifs co-occurrence matrix. Findings indicate the presence of canonical content units above the motif level as well. The organization scheme of folk narratives utilizing motif sequences is reminiscent of nucleotide sequences in the genetic code.

In cultural heritage objects, digitized or not, content indicators occurring above the word level are often called motifs or an equivalent term. Their recognition for document classification and retrieval is largely unresolved. Work on identifying rhetorical, narrative, and persuasive elements in scientific texts has been progressing in several, but largely unconnected, tracks. The AMICUS project (running between 2009 and 2012) set out to test a possible way to resolve these issues, starting with the identification of Proppian functions in folk tale corpora and adapting the solution to the identification of tale motifs or their functional counterparts. AMICUS devoted its first project year to listing the corpora, tools, methods, and contacts available to address these issues. The initiators of the project have identified a common need in the processing of texts from both the cultural heritage (CH) and scientific communication (SC) domains: to perform automated, large-scale, higher-order text analytics, i.e., to reach an advanced level of text understanding so that structured knowledge can be extracted from unstructured text. The four research groups propose to tackle an important aspect of this complex issue by investigating how linguistic elements convey motifs in texts from the CH and SC domains. Our shared working hypothesis is that the identity of higher-order content-bearing elements, i.e., textual units that are typically designated for e.g. document indexing, classification, enrichment, and the like, strongly depends on community perception.

Based on a computed toy example, we offer evidence that by plugging in similarity of word meaning as a force, plus a small modification of Newton's second law, one can acquire specific "mass" values for index terms in a Saltonesque dynamic library environment. The model can describe two types of change which affect the semantic composition of document collections: the expansion of a corpus due to its update, and fluctuations of the gravitational potential energy field generated by normative language use as an attractor, juxtaposed with actual language use yielding time-dependent term frequencies. Given the evolving semantic potential of a vocabulary, and by concatenating the respective term "mass" values, one can model sentences or longer strings of symbols as vector-valued functions. Since the line integral of such functions is used to express the work of a particle in a gravitational field, the work equivalent of strings can be calculated.
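The work analogy invoked at the end of the abstract is the standard line integral of a force field along a path; a sketch of the intended calculation, where the path is the assumed vector-valued function built from the term "mass" values:

```latex
% Work equivalent of a string s, modelled as a vector-valued
% function r(t) obtained by concatenating term "mass" values,
% moving through the semantic field F along the path C:
W \;=\; \int_{C} \mathbf{F} \cdot d\mathbf{r}
  \;=\; \int_{a}^{b} \mathbf{F}\bigl(\mathbf{r}(t)\bigr) \cdot \mathbf{r}'(t)\, dt
```

The identification of F with the "gravitational" field generated by normative language use is the abstract's own proposal; the parametrization r(t) is left unspecified there.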

344. Using wavelet analysis for text categorization in digital libraries

Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out by using standard test collections. In this paper we present a pilot experiment of replacing such test collections by a set of 6000 objects from a real-world digital repository, indexed by Library of Congress Subject Headings, and test support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions for document content representation in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels on classification reconstruction from abstracts, and vice versa for full-text documents, the latter outcome due to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case the thematic coverage of the collections. Representation of specific knowledge about digital collections is one of the basic elements of persistent archives, and the less studied one (compared to representations of digital objects and collections). Our research is an initial step in this direction, developing the methodological approach further and demonstrating that text categorisation can be applied to analyse the thematic coverage of digital repositories.
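Wavelet-based kernels build on the discrete wavelet transform of a document representation. As a sketch of the underlying machinery, one analysis step of the Haar wavelet transform is shown below; the paper's actual kernel construction over Hilbert-space document functions is not reproduced here.

```python
import math

def haar_step(signal):
    """One level of the Haar wavelet transform: pairwise averages
    (approximation) and pairwise differences (detail), each scaled
    by 1/sqrt(2) so the transform is orthonormal. Assumes the
    signal length is even."""
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

a, d = haar_step([4.0, 2.0, 5.0, 5.0])
print(a, d)  # detail of the constant pair (5.0, 5.0) is zero
```

Repeating the step on the approximation coefficients yields the full multiresolution decomposition that a wavelet kernel can compare documents in.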

Machine learning algorithms utilizing gradient descent to identify concepts, or more general learnables, hint at a so-far ignored possibility, namely that local and global minima represent any vocabulary as a landscape against which evaluation of the results can take place. A simple example to illustrate this idea would be a potential surface underlying gravitation. However, to construct a gravitation-based representation of, e.g., word meaning, only the distance between localized items is given in the vector space, whereas the equivalents of mass or charge are unknown in semantics. Clearly, the working hypothesis that physical fields could be a useful metaphor for studying word and sentence meaning is an option, but our current representations are incomplete in this respect.

As a starting point, consider that an RBF kernel has the capacity to generate a potential surface and hence create the impression of gravity, providing one with distance-based decay of interaction strength, plus a scalar scaling factor for the interaction, but of course no term masses. We are working on an experiment design to change that. Therefore, with certain mechanisms in neural networks that could host such quasi-physical fields, a novel approach to the modeling of mind content seems plausible, subject to scrutiny.

Work in progress in another direction of the same idea indicates that by using certain algorithms, already emerged vs. still emerging content is clearly distinguishable, in line with Aristotle's Metaphysics. The implications are that a model completed by "term mass" or "term charge" would enable the computation of the specific work equivalent of sentences or documents, and that by replacing semantics with other modalities, vector fields of more general symbolic content could exist as well.
Also, the perceived hypersurface generated by the dynamics of language use may be a step toward more advanced models, for example addressing the Hamiltonian of expanding semantic systems, or the relationship between reaction paths in quantum chemistry vs. sentence construction by gradient descent.

Previous work has suggested that parameter sharing between transition-based neural dependency parsers for related languages can lead to better performance, but there is no consensus on what parameters to share. We present an evaluation of 27 different parameter sharing strategies across 10 languages, representing five pairs of related languages, each pair from a different language family. We find that sharing transition classifier parameters always helps, whereas the usefulness of sharing word and/or character LSTM parameters varies. Based on this result, we propose an architecture where the transition classifier is shared, and the sharing of word and character parameters is controlled by a parameter that can be tuned on validation data. This model is linguistically motivated and obtains significant improvements over a monolingually trained baseline. We also find that sharing transition classifier parameters helps when training a parser on unrelated language pairs, although in that case sharing too many parameters does not help.

We extend the arc-hybrid transition system for dependency parsing with a SWAP transition that enables reordering of the words and construction of non-projective trees. Although this extension potentially breaks the arc-decomposability of the transition system, we show that the existing dynamic oracle can be modified and combined with a static oracle for the SWAP transition. Experiments on five languages with different degrees of non-projectivity show that the new system gives competitive accuracy and is significantly better than a system trained with a purely static oracle.
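The transition inventory described above can be sketched in a few lines. This is a generic arc-hybrid system extended with SWAP, following the textbook formulation rather than the authors' full implementation; the static and dynamic oracles and the scoring model are omitted.

```python
def shift(stack, buffer, arcs):
    # move the first buffer item onto the stack
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs):
    # arc from the front of the buffer to the top of the stack
    arcs.append((buffer[0], stack.pop()))

def right_arc(stack, buffer, arcs):
    # arc from the second-topmost stack item to the top of the stack
    dependent = stack.pop()
    arcs.append((stack[-1], dependent))

def swap(stack, buffer, arcs):
    # move the second-topmost stack item back to the buffer; repeated
    # SWAPs reorder the words, which is what enables non-projective trees
    top = stack.pop()
    second = stack.pop()
    stack.append(top)
    buffer.insert(0, second)

# a tiny demonstration of SWAP reordering (word ids, not real tokens)
stack, buffer, arcs = [0, 1, 2], [3], []
swap(stack, buffer, arcs)
print(stack, buffer)  # -> [0, 2] [1, 3]
```

Because SWAP moves a word back to the buffer, it can be processed again later in a different order, which is exactly the reordering the abstract refers to.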

We present the Uppsala submission to the CoNLL 2017 shared task on parsing from raw text to universal dependencies. Our system is a simple pipeline consisting of two components: the first performs joint word and sentence segmentation on raw text; the second predicts dependency trees from raw words. The parser bypasses the need for part-of-speech tagging, but uses word embeddings based on universal tag distributions. We achieved a macro-averaged LAS F1 of 65.11 in the official test run and obtained the 2nd best result for sentence segmentation with a score of 89.03. After fixing two bugs, we obtained an unofficial LAS F1 of 70.49.