Join the Department of Computer Science on Monday, May 8 at 4pm as we welcome Dr. David Bamman (UC Berkeley). With the rise of large-scale digitization, we now have access to large textual datasets. Bamman will outline the opportunities this data presents for distant reading research, and he will discuss the computational challenges involved in using it for historical analysis. Location: Room 120, New Computer Science. CSE 600 students, please sign in for credit.

Challenges in the Computational Analysis of Large Book Corpora

Abstract:
With the rise of large-scale digitization efforts over the past ten years (such as those by Google Books, the HathiTrust, and the Internet Archive), we now have access to large textual datasets preserving our cultural record in the form of printed books. These text collections have driven research at the intersection of computational methods and the humanities, exploiting advances made in natural language processing and machine learning over the past thirty years.

In this talk, I'll outline some of the opportunities this data presents for research in "distant reading" (such as modeling the changing portrayal of women as fictional characters over time), and focus on the computational challenges involved in using this data for historical analysis. While much research in NLP has been heavily optimized for the domain of contemporary newswire, far less has addressed historical and literary texts, which are written in a variety of languages and dialects and exhibit long, complex structure and noisy records of production. While these challenges inhibit out-of-the-box analysis, they present opportunities for collaborative research engaging scholars across disciplines; I'll discuss progress made to date toward solving them.

Bio:
David Bamman is an assistant professor in the School of Information at UC Berkeley, where he works on applying natural language processing and machine learning to empirical questions in the humanities and social sciences. His research often involves adding linguistic structure (e.g., syntax, semantics, coreference) to statistical models of text, and focuses on improving NLP for a variety of languages and domains (such as literary text and social media). Before coming to Berkeley, he received his PhD from the School of Computer Science at Carnegie Mellon University and was a senior researcher at the Perseus Project of Tufts University.