Abstract

Natural language processing employs computational techniques for the purpose of learning, understanding, and producing human language content. Early computational approaches to language research focused on automating the analysis of the linguistic structure of language and developing basic technologies such as machine translation, speech recognition, and speech synthesis. Today's researchers refine and make use of such tools in real-world applications, creating spoken dialogue systems and speech-to-speech translation engines, mining social media for information about health or finance, and identifying sentiment and emotion toward products and services. We describe successes and challenges in this rapidly advancing area.

Abstract

To reliably extract two entity types, symptoms and conditions (SCs), and drugs and treatments (DTs), from patient-authored text (PAT) by learning lexico-syntactic patterns from data annotated with seed dictionaries. Despite the increasing quantity of PAT (eg, online discussion threads), tools for identifying medical entities in PAT are limited. When applied to PAT, existing tools either fail to identify specific entity types or perform poorly. Identification of SC and DT terms in PAT would enable exploration of efficacy and side effects for not only pharmaceutical drugs, but also for home remedies and components of daily care. We use SC and DT term dictionaries compiled from online sources to label several discussion forums from MedHelp (http://www.medhelp.org). We then iteratively induce lexico-syntactic patterns corresponding strongly to each entity type to extract new SC and DT terms. Our system is able to extract symptom descriptions and treatments absent from our original dictionaries, such as 'LADA', 'stabbing pain', and 'cinnamon pills'. Our system extracts DT terms with 58-70% F1 score and SC terms with 66-76% F1 score on two forums from MedHelp. We show improvements over MetaMap, OBA, a conditional random field-based classifier, and a previous pattern learning approach. Our entity extractor based on lexico-syntactic patterns is a successful and preferable technique for identifying specific entity types in PAT. To the best of our knowledge, this is the first paper to extract SC and DT entities from PAT. We exhibit learning of informal terms often used in PAT but missing from typical dictionaries.
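The iterative loop described in this abstract (label text with seed dictionaries, induce patterns that correlate with an entity type, extract new terms, repeat) can be illustrated with a minimal sketch. It is not the authors' system: the "pattern" here is simply the single word to the left of a term, whereas the paper uses lexico-syntactic patterns, and the function names, thresholds, and example sentences are all hypothetical.

```python
from collections import Counter


def induce_patterns(sentences, known_terms, min_precision=0.5, min_count=2):
    """Keep left-context words that often precede a known term.

    Assumption: a pattern is just the preceding word; real lexico-syntactic
    patterns would be richer (wider context, part-of-speech tags, and so on).
    """
    hits, total = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(1, len(tokens)):
            pattern = tokens[i - 1]
            total[pattern] += 1
            if tokens[i] in known_terms:
                hits[pattern] += 1
    return {
        p for p in total
        if hits[p] >= min_count and hits[p] / total[p] >= min_precision
    }


def extract_terms(sentences, patterns, known_terms):
    """Collect unknown words that appear right after an accepted pattern."""
    found = set()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(1, len(tokens)):
            if tokens[i - 1] in patterns and tokens[i] not in known_terms:
                found.add(tokens[i])
    return found


def bootstrap(sentences, seeds, rounds=2):
    """Alternate pattern induction and term extraction for a few rounds."""
    known = set(seeds)
    for _ in range(rounds):
        patterns = induce_patterns(sentences, known)
        found = extract_terms(sentences, patterns, known)
        if not found:
            break
        known |= found
    return known - set(seeds)


# Hypothetical toy data, not taken from MedHelp.
sentences = [
    "i took metformin daily",
    "she took insulin today",
    "he took cinnamon yesterday",
    "we tried insulin again",
]
seeds = {"metformin", "insulin"}
new_terms = bootstrap(sentences, seeds)
```

In this toy run the word "took" is the only context that precedes seed terms often enough to pass both thresholds, so the term it introduces that is missing from the seed dictionary is the one that gets added.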

Abstract

We explore techniques for performing model combination between the UMass and Stanford biomedical event extraction systems. Both sub-components address event extraction as a structured prediction problem, and use dual decomposition (UMass) and parsing algorithms (Stanford) to find the best scoring event structure. Our primary focus is on stacking where the predictions from the Stanford system are used as features in the UMass system. For comparison, we look at simpler model combination techniques such as intersection and union which require only the outputs from each system and combine them directly. First, we find that stacking substantially improves performance while intersection and union provide no significant benefits. Second, we investigate the graph properties of event structures and their impact on the combination of our systems. Finally, we trace the origins of events proposed by the stacked model to determine the role each system plays in different components of the output. We learn that, while stacking can propose novel event structures not seen in either base model, these events have extremely low precision. Removing these novel events improves our already state-of-the-art F1 to 56.6% on the test set of Genia (Task 1). Overall, the combined system formed via stacking ("FAUST") performed well in the BioNLP 2011 shared task. The FAUST system obtained 1st place in three out of four tasks: 1st place in Genia Task 1 (56.0% F1) and Task 2 (53.9%), 2nd place in the Epigenetics and Post-translational Modifications track (35.0%), and 1st place in the Infectious Diseases track (55.6%). We present a state-of-the-art event extraction system that relies on the strengths of structured prediction and model combination through stacking. Akin to results on other tasks, stacking outperforms intersection and union and leads to very strong results. The utility of model combination hinges on complementary views of the data, and we show that our sub-systems capture different graph properties of event structures. Finally, by removing low precision novel events, we show that performance from stacking can be further improved.
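The three combination strategies named in this abstract differ in what they need from each system. The sketch below is a rough illustration only: events are represented as hypothetical (trigger, type, argument) tuples, and the stacking step is reduced to adding one indicator feature, which is an assumption about the general technique rather than a description of the FAUST implementation.

```python
def combine_outputs(events_a, events_b, mode):
    """Combine two systems using only their predicted event sets."""
    a, b = set(events_a), set(events_b)
    if mode == "intersection":
        return a & b  # keep events both systems agree on
    if mode == "union":
        return a | b  # keep events proposed by either system
    raise ValueError(f"unknown mode: {mode}")


def stacking_features(candidate, base_features, other_system_events):
    """Augment one system's features with the other system's prediction.

    Assumption: a single binary indicator; a real stacked model could use
    many features derived from the other system's output.
    """
    features = dict(base_features)
    features["other_system_predicted"] = (
        1.0 if candidate in other_system_events else 0.0
    )
    return features


# Hypothetical event tuples: (trigger word, event type, argument).
system_a = {("binds", "Binding", "TRAF2"), ("expression", "Gene_expression", "IL-2")}
system_b = {("binds", "Binding", "TRAF2"), ("induces", "Positive_regulation", "IL-2")}

agreed = combine_outputs(system_a, system_b, "intersection")
pooled = combine_outputs(system_a, system_b, "union")
stacked = stacking_features(
    ("induces", "Positive_regulation", "IL-2"),
    {"trigger=induces": 1.0},
    system_a,
)
```

Intersection and union operate purely on outputs, while stacking lets the second model learn how much to trust the first, which is one plausible reason it can propose structures that neither base system produced.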

Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? 12th Annual Conference on Intelligent Text Processing and Computational Linguistics. Manning, C. D. SPRINGER-VERLAG BERLIN. 2011: 171–189

Abstract

Probabilistic methods are providing new explanatory approaches to fundamental cognitive science questions of how humans structure, process and acquire language. This review examines probabilistic models defined over traditional symbolic structures. Language comprehension and production involve probabilistic inference in such models; and acquisition involves choosing the best model, given innate constraints and linguistic and other input. Probabilistic models can account for the learning and processing of language, while maintaining the sophistication of symbolic models. A recent burgeoning of theoretical developments and online corpus creation has enabled large models to be tested, revealing probabilistic constraints in processing, undermining acquisition arguments based on a perceived poverty of the stimulus, and suggesting fruitful links with probabilistic theories of categorization and ambiguity resolution in perception.

Abstract

Good automatic information extraction tools offer hope for automatic processing of the exploding biomedical literature, and successful named entity recognition is a key component for such tools. We present a maximum-entropy based system incorporating a diverse set of features for identifying gene and protein names in biomedical abstracts. This system was entered in the BioCreative comparative evaluation and achieved a precision of 0.83 and recall of 0.84 in the "open" evaluation and a precision of 0.78 and recall of 0.85 in the "closed" evaluation. Central contributions are rich use of features derived from the training data at multiple levels of granularity, a focus on correctly identifying entity boundaries, and the innovative use of several external knowledge sources including full MEDLINE abstracts and web searches.

A system for identifying named entities in biomedical text: how results from two evaluations reflect on both the system and the evaluations. ISMB BioLink 2004 Meeting. Dingare, S., Nissim, M., Finkel, J., Manning, C., Grover, C. HINDAWI PUBLISHING CORPORATION. 2005: 77–85

Abstract

We present a maximum entropy-based system for identifying named entities (NEs) in biomedical abstracts and present its performance in the only two biomedical named entity recognition (NER) comparative evaluations that have been held to date, namely BioCreative and Coling BioNLP. Our system obtained an exact match F-score of 83.2% in the BioCreative evaluation and 70.1% in the BioNLP evaluation. We discuss our system in detail, including its rich use of local features, attention to correct boundary identification, innovative use of external knowledge resources, including parsing and web searches, and rapid adaptation to new NE sets. We also discuss in depth problems with data annotation in the evaluations which caused the final performance to be lower than optimal.
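Exact-match scoring is what makes boundary identification matter so much in these evaluations: a prediction with one wrong boundary counts as both a false positive and a false negative. The following sketch assumes entities are represented as (start, end, type) tuples over token offsets; that representation and the example spans are illustrative and not taken from the BioCreative or BioNLP scorers.

```python
def exact_match_scores(gold, predicted):
    """Return (precision, recall, F-score) counting only exact span matches."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical spans: the second prediction is off by one token at the end.
gold_entities = {(0, 2, "protein"), (5, 6, "protein")}
predicted_entities = {(0, 2, "protein"), (5, 7, "protein")}

precision, recall, f_score = exact_match_scores(gold_entities, predicted_entities)
```

Here one boundary slip on a single entity lowers precision and recall together, since the mismatched span is penalized twice.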

Using feature conjunctions across examples for learning pairwise classifiers. 15th European Conference on Machine Learning/8th European Conference on Principles and Practice of Knowledge Discovery in Databases. Oyama, S., Manning, C. D. SPRINGER-VERLAG BERLIN. 2004: 322–333

A System For Identifying Named Entities in Biomedical Text: How Results From Two Evaluations Reflect on Both the System and the Evaluations. Dingare, S., Finkel, J., Nissim, M., Manning, C., Grover, C. 2004

Abstract

Osteoclast differentiation factor (ODF; also known as osteoprotegerin ligand, receptor activator of nuclear factor kappaB ligand, and tumor necrosis factor-related activation-induced cytokine) is a recently described cytokine known to be critical in inducing the differentiation of cells of the monocyte/macrophage lineage into osteoclasts. The role of osteoclasts in bone erosion in rheumatoid arthritis (RA) has been demonstrated, but the exact mechanisms involved in the formation and activation of osteoclasts in RA are not known. These studies address the potential role of ODF and the bone and marrow microenvironment in the pathogenesis of osteoclast-mediated bone erosion in RA. Tissue sections from the bone-pannus interface at sites of bone erosion were examined for the presence of osteoclast precursors by the colocalization of messenger RNA (mRNA) for tartrate-resistant acid phosphatase (TRAP) and cathepsin K in mononuclear cells. Reverse transcriptase-polymerase chain reaction (RT-PCR) was used to identify mRNA for ODF in synovial tissues, adherent synovial fibroblasts, and activated T lymphocytes derived from patients with RA. Multinucleated cells expressing both TRAP and cathepsin K mRNA were identified in bone resorption lacunae in areas of pannus invasion into bone in RA patients. In addition, mononuclear cells expressing both TRAP and cathepsin K mRNA (preosteoclasts) were identified in bone marrow in and adjacent to areas of pannus invasion in RA erosions. ODF mRNA was detected by RT-PCR in whole synovial tissues from patients with RA but not in normal synovial tissues. In addition, ODF mRNA was detected in cultured adherent synovial fibroblasts and in activated T lymphocytes derived from RA synovial tissue, which were expanded by exposure to anti-CD3. TRAP-positive, cathepsin K-positive osteoclast precursor cells are identified in areas of pannus invasion into bone in RA. ODF is expressed by both synovial fibroblasts and by activated T lymphocytes derived from synovial tissues from patients with RA. These synovial cells may contribute directly to the expansion of osteoclast precursors and to the formation and activation of osteoclasts at sites of bone erosion in RA.

Abstract

We did formative evaluations of several variations to the computation of related articles for non-bibliographic resources in the medical domain. A binary model and several variations of the vector space model were used to measure similarity between documents. Two corpora were studied, using a human expert as the gold standard. Variations in term weights and stopword choices made little difference to performance. Performance was worse when documents were characterized by title words alone or by MeSH terms extracted from document references. Further studies are needed to evaluate these methods in medical information retrieval systems.
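The contrast between a binary model and a weighted vector space model can be made concrete with a small sketch. The abstract does not specify the exact formulas, so the choices below are assumptions: the binary model is taken to be cosine similarity over term presence, and the vector space variant uses raw term frequency times log inverse document frequency. The example documents are invented.

```python
import math
from collections import Counter


def binary_similarity(doc_a, doc_b):
    """Cosine similarity over binary term-presence vectors (assumed form)."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors; one of many possible weighting schemes."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical documents standing in for medical resources.
docs = [
    "asthma treatment in children",
    "asthma treatment in adults",
    "bone fracture repair",
]
vectors = tfidf_vectors(docs)
binary_01 = binary_similarity(docs[0], docs[1])
weighted_01 = cosine(vectors[0], vectors[1])
weighted_02 = cosine(vectors[0], vectors[2])
```

Both measures rank the second document as more related to the first than the third is, but the weighted score is lower than the binary one because the distinguishing terms "children" and "adults" carry more weight than the shared ones.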

Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora held in Conjunction with the 38th Annual Meeting of the Association for Computational Linguistics. Toutanova, K., Manning, C. D. ASSOCIATION COMPUTATIONAL LINGUISTICS. 2000: 63–70

Ergativity: Argument Structure and Grammatical Relations, PhD Thesis, Stanford. The revised version has been published by CSLI Publications (see 1996), and this version is not available on the web. Manning, Christopher D. 1994