Abstract (Brunner's talk): Large astronomy surveys hold the potential to address fundamental questions about our Universe. But fully leveraging these data requires new approaches. One area where this has become especially clear is source classification, where we must cleanly separate stars from galaxies. In this presentation Brunner will review this classification challenge, discuss some recent work, and highlight how this work is relevant to NCSA.

Title (Underwood's talk): "Learning to recognize literary genre in a collection of documents that are internally heterogenous"?

Abstract (Underwood's talk): The problem of classifying text has received a great deal of attention in the field of machine learning. But classification algorithms are typically tested on collections of relatively short documents, such as articles in newspapers or journals. Historians and literary scholars confront a different problem domain. Documents come to us in volume-sized packages that contain heterogeneous parts: for instance, a collection of poems and plays may be prefaced by a prose introduction. Fortunately, segmenting volumes and classifying genres can be viewed as mutually supportive aspects of a single workflow. The models we develop in a first pass at genre classification can be used to divide volumes into segments that are generically distinct along simple axes (e.g. poetry/prose, fiction/nonfiction). Once that division is complete, we can train subtler classification models on the segments and learn to distinguish subgenres (e.g. lyric poetry or the detective story). The whole problem is interestingly complicated because literary texts can belong to more than one genre, and domain experts don't necessarily have an exhaustive list of genres. So at later stages of classification, the process may take on an unsupervised or semi-supervised character. I've completed the first stage of classification (poetry/drama/prose fiction/nonfiction) in a collection of 500,000 volumes; I'll present those models and describe a prototype segmentation algorithm that relies on them.
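
To make the idea of segmentation built on first-pass genre models concrete, here is a minimal sketch in Python. It is not the prototype algorithm described in the talk, whose details the abstract does not give. It assumes, purely for illustration, that a first-pass classifier has already assigned one coarse genre label to each page of a volume. The function names smooth_labels and segment_volume and the window parameter are hypothetical. The sketch smooths isolated page-level errors with a majority vote over neighboring pages, then groups consecutive pages that share a label into segments.

```python
from collections import Counter
from itertools import groupby


def smooth_labels(page_labels, window=3):
    """Replace each page label with the majority label in a centered window.

    Assumption: page_labels holds one genre label per page, produced by
    some first-pass classifier that is not shown here.
    """
    half = window // 2
    smoothed = []
    for i, own_label in enumerate(page_labels):
        neighborhood = page_labels[max(0, i - half): i + half + 1]
        counts = Counter(neighborhood)
        best = max(counts.values())
        if counts[own_label] == best:
            # On a tie, keep the page's own label rather than flip it.
            smoothed.append(own_label)
        else:
            smoothed.append(counts.most_common(1)[0][0])
    return smoothed


def segment_volume(page_labels, window=3):
    """Return (genre, first_page, last_page) runs, 0-based and inclusive."""
    segments = []
    start = 0
    for genre, run in groupby(smooth_labels(page_labels, window)):
        length = len(list(run))
        segments.append((genre, start, start + length - 1))
        start += length
    return segments


# Hypothetical volume: a prose introduction with one misclassified page,
# followed by poems.
pages = ["prose", "prose", "poetry", "prose", "prose",
         "poetry", "poetry", "poetry"]
print(segment_volume(pages))
# → [('prose', 0, 4), ('poetry', 5, 7)]
```

Each resulting segment could then be passed to a subtler subgenre classifier, as the abstract outlines. A real system would more plausibly work with per-page class probabilities than with hard labels; hard labels are used here only to keep the sketch short.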