{"title"=>"Getting started in text mining", "type"=>"generic", "authors"=>[{"first_name"=>"K. Bretonnel", "last_name"=>"Cohen", "scopus_author_id"=>"8314739900"}, {"first_name"=>"Lawrence", "last_name"=>"Hunter", "scopus_author_id"=>"12647593500"}], "year"=>2008, "source"=>"PLoS Computational Biology", "identifiers"=>{"pui"=>"351230849", "sgr"=>"38949105955", "issn"=>"1553734X", "pmid"=>"18225946", "scopus"=>"2-s2.0-38949105955", "doi"=>"10.1371/journal.pcbi.0040020", "isbn"=>"1553-7358"}, "id"=>"34d753e7-dda5-3490-b169-2c87df4bb032", "abstract"=>"INTRODUCTION: Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the biomedical literature. There are at least as many motivations for doing text mining work as there are types of bioscientists. Model organism database curators have been heavy participants in the development of the field due to their need to process large numbers of publications in order to populate the many data fields for every gene in their species of interest. Bench scientists have built biomedical text mining applications to aid in the development of tools for interpreting the output of high-throughput assays and to improve searches of sequence databases (see [1] for a review). Bioscientists of every stripe have built applications to deal with the dual issues of the double-exponential growth in the scientific literature over the past few years and of the unique issues in searching PubMed/MEDLINE for genomics-related publications. A surprising phenomenon can be noted in the recent history of biomedical text mining: although several systems have been built and deployed in the past few years (for example, Chilibot, Textpresso, and PreBIND; see Text S1 for these and most other citations), the ones that are seeing high usage rates and are making productive contributions to the working lives of bioscientists have been built not by text mining specialists, but by bioscientists. 
We speculate on why this might be so below. Three basic types of approaches to text mining have been prevalent in the biomedical domain. Co-occurrence-based methods do no more than look for concepts that occur in the same unit of text (typically a sentence, but sometimes as large as an abstract) and posit a relationship between them. (See [2] for an early co-occurrence-based system.) For example, if such a system saw that BRCA1 and breast cancer occurred in the same sentence, it might assume a relationship between breast cancer and the BRCA1 gene. Some early biomedical text mining systems were co-occurrence-based, but such systems are highly error prone, and are not commonly built today. In fact, many text mining practitioners would not consider them to be text mining systems at all. Co-occurrence of concepts in a text is sometimes used as a simple baseline when evaluating more sophisticated systems; as such, they are nontrivial, since even a co-occurrence-based system must deal with variability in the ways that concepts are expressed in human-produced texts. For example, BRCA1 could be referred to by any of its alternate symbols, IRIS, PSCP, BRCAI, BRCC1, or RNF53 (or by any of their many spelling variants, which include BRCA1, BRCA-1, and BRCA 1), or by any of the variants of its full name, viz. breast cancer 1, early onset (its official name per Entrez Gene and the Human Gene Nomenclature Committee), as breast cancer susceptibility gene 1, or as the latter's variant breast cancer susceptibility gene-1. Similarly, breast cancer could be referred to as breast cancer, carcinoma of the breast, or mammary neoplasm. 
These variability issues challenge more sophisticated systems, as well; we discuss ways of coping with them in Text S1.", "link"=>"http://www.mendeley.com/research/getting-started-text-mining-22", "reader_count"=>527, "reader_count_by_academic_status"=>{"Unspecified"=>5, "Professor > Associate Professor"=>28, "Librarian"=>8, "Researcher"=>106, "Student > Doctoral Student"=>19, "Student > Ph. D. Student"=>111, "Student > Postgraduate"=>32, "Student > Master"=>110, "Other"=>25, "Student > Bachelor"=>42, "Lecturer"=>11, "Lecturer > Senior Lecturer"=>6, "Professor"=>24}, "reader_count_by_user_role"=>{"Unspecified"=>5, "Professor > Associate Professor"=>28, "Librarian"=>8, "Researcher"=>106, "Student > Doctoral Student"=>19, "Student > Ph. D. Student"=>111, "Student > Postgraduate"=>32, "Student > Master"=>110, "Other"=>25, "Student > Bachelor"=>42, "Lecturer"=>11, "Lecturer > Senior Lecturer"=>6, "Professor"=>24}, "reader_count_by_subject_area"=>{"Unspecified"=>12, "Agricultural and Biological Sciences"=>136, "Philosophy"=>6, "Arts and Humanities"=>10, "Business, Management and Accounting"=>16, "Veterinary Science and Veterinary Medicine"=>1, "Chemistry"=>9, "Computer Science"=>164, "Earth and Planetary Sciences"=>5, "Economics, Econometrics and Finance"=>7, "Engineering"=>15, "Environmental Science"=>9, "Biochemistry, Genetics and Molecular Biology"=>18, "Nursing and Health Professions"=>1, "Materials Science"=>2, "Mathematics"=>6, "Medicine and Dentistry"=>47, "Sports and Recreations"=>1, "Pharmacology, Toxicology and Pharmaceutical Science"=>2, "Physics and Astronomy"=>5, "Psychology"=>9, "Social Sciences"=>41, "Linguistics"=>5}, "reader_count_by_subdiscipline"=>{"Materials Science"=>{"Materials Science"=>2}, "Medicine and Dentistry"=>{"Medicine and Dentistry"=>47}, "Social Sciences"=>{"Social Sciences"=>41}, "Sports and Recreations"=>{"Sports and Recreations"=>1}, "Physics and Astronomy"=>{"Physics and Astronomy"=>5}, "Psychology"=>{"Psychology"=>9}, 
"Mathematics"=>{"Mathematics"=>6}, "Unspecified"=>{"Unspecified"=>12}, "Environmental Science"=>{"Environmental Science"=>9}, "Pharmacology, Toxicology and Pharmaceutical Science"=>{"Pharmacology, Toxicology and Pharmaceutical Science"=>2}, "Arts and Humanities"=>{"Arts and Humanities"=>10}, "Engineering"=>{"Engineering"=>15}, "Chemistry"=>{"Chemistry"=>9}, "Earth and Planetary Sciences"=>{"Earth and Planetary Sciences"=>5}, "Economics, Econometrics and Finance"=>{"Economics, Econometrics and Finance"=>7}, "Agricultural and Biological Sciences"=>{"Agricultural and Biological Sciences"=>136}, "Computer Science"=>{"Computer Science"=>164}, "Business, Management and Accounting"=>{"Business, Management and Accounting"=>16}, "Nursing and Health Professions"=>{"Nursing and Health Professions"=>1}, "Linguistics"=>{"Linguistics"=>5}, "Biochemistry, Genetics and Molecular Biology"=>{"Biochemistry, Genetics and Molecular Biology"=>18}, "Philosophy"=>{"Philosophy"=>6}, "Veterinary Science and Veterinary Medicine"=>{"Veterinary Science and Veterinary Medicine"=>1}}, "reader_count_by_country"=>{"Hong Kong"=>1, "United States"=>30, "Portugal"=>4, "Greece"=>1, "Netherlands"=>3, "Korea (South)"=>2, "China"=>4, "Ireland"=>1, "Brazil"=>3, "Poland"=>1, "Slovenia"=>1, "France"=>2, "Colombia"=>1, "Argentina"=>1, "United Kingdom"=>11, "Kenya"=>1, "Belarus"=>1, "Switzerland"=>2, "Spain"=>5, "India"=>5, "Cuba"=>1, "Canada"=>6, "Venezuela"=>1, "Turkey"=>2, "Belgium"=>1, "Denmark"=>3, "Italy"=>5, "Mexico"=>3, "South Africa"=>1, "Israel"=>1, "Australia"=>3, "Germany"=>4, "Indonesia"=>2}, "group_count"=>26}