Introduction

"...text mining is the indexing of content. Words that are part of a fixed vocabulary are found within a text and extracted to create an index that shows where in the text each word was found. The index can be used in the traditional way to locate the parts of the text that contain those words. The index can also be used as a database and analysed to discover patterns: for example, how often certain words occur. In simple terms, text mining is the process that turns text into data that can be analysed." — Clark, 2013

Text-mining (also called text data mining or content-mining) is the process of discovering and extracting information from unstructured, miscellaneous text. It is often mentioned in the context of several information-age trends such as big data, bioinformatics, data curation, e-Science and the semantic web. Currently, a number of social media monitoring tools perform various types of text-mining activities. In 2013, the US Government acknowledged that it extracts data from the e-mails and telephone calls of American citizens, referring to this process (which includes text-mining) as its metadata program.

Typically, text-mining comprises three major activities: 1) information retrieval (IR) to gather relevant unstructured text from heterogeneous databases, documents and websites, 2) information extraction (IE) to identify and extract entities, facts and the relationships among those entities, and 3) data-mining to find associations among the pieces of information extracted from the texts. The goal of text-mining is to discover knowledge hidden in text by identifying concepts, extracting facts and relationships, discovering implicit links and generating hypotheses. One of the main reasons text-mining may be important is that it helps to deal with the information overload created by blogs, wikis, clinical data, surveys, heterogeneous databases and the web. Text-mining is especially useful in fields that hold large collections of documents. Scientific applications that have been developed because of text-mining include drug discovery, predictive toxicology, competitive intelligence and patent searching, among others.
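The three activities can be sketched in a few lines of Python. This is a minimal illustration and not a production method: the sample documents, the entity dictionary and the function names (retrieve, extract, mine) are all hypothetical, and real systems use far richer retrieval and extraction techniques than keyword and dictionary matching.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus; a real project would gather text from
# databases, documents or websites.
DOCUMENTS = {
    "doc1": "Aspirin reduces fever. Aspirin may cause bleeding in some patients.",
    "doc2": "Library opening hours change in summer.",
    "doc3": "Warfarin and aspirin together increase the risk of bleeding.",
}

# Hypothetical entity dictionary (term -> entity type), standing in
# for a real controlled vocabulary.
ENTITIES = {
    "aspirin": "drug",
    "warfarin": "drug",
    "bleeding": "effect",
    "fever": "effect",
}


def retrieve(documents, query_terms):
    """Step 1 (IR): keep only documents that mention a query term."""
    hits = {}
    for doc_id, text in documents.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        if words & set(query_terms):
            hits[doc_id] = text
    return hits


def extract(text):
    """Step 2 (IE): list the dictionary entities found in each sentence."""
    found = []
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z]+", sentence.lower())
        entities = {w for w in words if w in ENTITIES}
        if entities:
            found.append(entities)
    return found


def mine(documents):
    """Step 3 (data-mining): count entity pairs that share a sentence."""
    pairs = Counter()
    for text in documents.values():
        for entities in extract(text):
            for pair in combinations(sorted(entities), 2):
                pairs[pair] += 1
    return pairs


relevant = retrieve(DOCUMENTS, {"aspirin", "warfarin"})
associations = mine(relevant)
for pair, count in associations.most_common():
    print(pair, count)
```

In this toy corpus the pair that co-occurs most often is aspirin and bleeding, which is the kind of association a researcher might then examine as a hypothesis.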

Other reasons why text-mining may be important are:

Biomedical science is inundated with data, datasets and information of various kinds

Much of the information is in an unstructured format (text)

There are as many text types, genres and domains as there are documents

Some of the information is in a semi-structured format (XML + text)

Some of the information is in a structured format (databases)

Biomedical science researchers need to make sense of data

Biomedical researchers and health librarians need to manage this information and knowledge effectively

Text-mining can be used to improve indexing, which is essential for findability; because the indexing is machine-aided, indexes can also be created more efficiently
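The machine-aided indexing mentioned in the last point, and described in the Clark quotation at the start of this page, can be illustrated with a short Python sketch. The vocabulary, the sample text and the function name build_index are assumptions made up for this example; positions are counted in words, starting from zero.

```python
import re
from collections import defaultdict

# Hypothetical fixed vocabulary and sample text, for illustration only.
VOCABULARY = {"gene", "protein", "cell"}
TEXT = (
    "The gene encodes a protein. "
    "The protein enters the cell, and the gene is silenced."
)


def build_index(text, vocabulary):
    """Map each vocabulary word to the 0-based word positions where it occurs."""
    index = defaultdict(list)
    for position, match in enumerate(re.finditer(r"[A-Za-z]+", text)):
        word = match.group().lower()
        if word in vocabulary:
            index[word].append(position)
    return dict(index)


index = build_index(TEXT, VOCABULARY)

# Traditional use: locate the parts of the text that contain a word.
print(index["protein"])

# Database-style use: analyse the index, e.g. how often each word occurs.
frequencies = {word: len(positions) for word, positions in index.items()}
print(frequencies)
```

The same index serves both purposes named in the quotation: looking up where a word appears, and treating the index as data to count occurrences.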

Questions for librarians

The rise of data, together with its use, curation and management, is a growing trend in academic libraries. Rather than wait for your library to hire a data librarian or to create a data repository, why not try to introduce some data science skills (or exercises) into your library workshops?