Understanding Text Mining: 4 Need-to-Know Terms and Their Definitions

As the use of text mining becomes more widespread, now is the time for information managers to make sure they understand the basics.

Text mining, the process of deriving high-quality information from text materials using software, helps researchers identify patterns or relations between concepts that would otherwise be difficult to discern. The result is faster discovery and smarter decision-making.

Looking for a place to start? Here are four key text mining terms every information manager should know:

XML

Short for Extensible Markup Language, XML is an information exchange standard designed to improve usability, especially when the data is interpreted by software. In other words, a document encoded in XML is more readily machine-readable than the same document in a display-oriented format. XML tends to be the preferred input format for semantic technology, text and data mining tools, and other processing software.

When acquiring full-text articles, researchers can often access only the PDF format, which must then be converted into XML for text mining. This can be an arduous and error-prone process.
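
To make "machine-readable" concrete, here is a minimal Python sketch of software reading an article that is already in XML. The element names (article, sec, title, p) and the sample text are assumptions made up for illustration; real publisher XML is richer and follows its own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified article markup (not a real publisher schema).
ARTICLE_XML = """
<article>
  <title>Aspirin and inflammation</title>
  <body>
    <sec><title>Methods</title><p>We reviewed 40 trials.</p></sec>
    <sec><title>Results</title><p>Inflammation markers fell.</p></sec>
  </body>
</article>
"""

def extract_sections(xml_text):
    """Return the article title and a mapping of section title -> paragraph text."""
    root = ET.fromstring(xml_text.strip())
    title = root.findtext("title")
    sections = {}
    for sec in root.iter("sec"):
        # Join all paragraphs that sit directly inside this section.
        paragraphs = [p.text or "" for p in sec.findall("p")]
        sections[sec.findtext("title")] = " ".join(paragraphs)
    return title, sections

title, sections = extract_sections(ARTICLE_XML)
print(title)
print(sections["Results"])
```

Because the markup labels each part of the document, the program can ask for the Results section directly. With a PDF, the same request would first require guessing where sections begin and end from the page layout.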

Semantic enrichment

Semantic enrichment is the process of adding a layer of meaning to raw content. Enhancing content with information about its meaning adds structure to otherwise unstructured information, making the content easier to synthesize and process further. For example, a scientific article can be enriched by adding in-line annotations or tags describing the genotypes/phenotypes, diseases, drugs, mechanisms of action, and other biomedical concepts mentioned within. Semantic enrichment is a key enabler of the various strategic initiatives undertaken by informatics and information management professionals.
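
As a rough illustration of in-line annotation, the Python sketch below tags known terms using a small lookup table. The vocabulary, the concept tag name, and the sample sentence are all hypothetical; production enrichment relies on curated ontologies and far more robust matching than a keyword list.

```python
import re

# Hypothetical mini-vocabulary: surface term -> concept type.
VOCABULARY = {
    "aspirin": "drug",
    "ibuprofen": "drug",
    "asthma": "disease",
}

def enrich(text, vocabulary=VOCABULARY):
    """Wrap each known term in an in-line tag naming its concept type."""
    # Try longer terms first so a long entry wins over a shorter one it contains.
    terms = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    def tag(match):
        concept = vocabulary[match.group(0).lower()]
        return f'<concept type="{concept}">{match.group(0)}</concept>'

    return pattern.sub(tag, text)

print(enrich("Aspirin can trigger asthma in some patients."))
```

The output keeps the original sentence intact and wraps "Aspirin" and "asthma" in tags that state what kind of concept each one is, which is the added layer of meaning that downstream software can search and count.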

TDM rights

Content is associated with a variety of rights. Information management professionals and librarians will be familiar with copyright licensing, reproduction rights organizations, and other frameworks and organizations that enable content consumers to use, share, and disseminate information while respecting copyright.

As may be expected, there are a number of copyright-sensitive acts that go hand-in-hand with the text and data mining (TDM) process. Content may be copied, stored, annotated or enriched, and otherwise scanned to produce a usable research output. In most cases, commercial TDM rights are not included in standard subscription agreements. Publishers may make a standard or special set of ‘TDM rights’ available as part of their subscription agreements, or as additional incremental rights.

Machine learning

Machine learning is one approach to synthesizing raw or semantically enriched content to yield insights.

Machines can be instructed to process information in many ways. One way is to apply strict rules that attempt to cover every instance that is likely to come up. For instance, one rule might be: when A is the input, B is always the output. But while this approach is simple in theory and easy for humans to understand, in practice it can be difficult to maintain and scale, and it can be hard to capture value from.
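
The rule-based approach can be sketched in a few lines of Python. The keywords and labels below are invented for illustration and are not drawn from any real product.

```python
# Hypothetical hand-written rules: if the keyword appears, assign the label.
RULES = [
    ("clinical trial", "medicine"),
    ("patent", "intellectual property"),
    ("neural network", "computing"),
]

def classify_by_rules(text):
    """Return the label of the first matching rule, or None if no rule applies."""
    lowered = text.lower()
    for keyword, label in RULES:
        if keyword in lowered:
            return label
    return None

print(classify_by_rules("Results of a Phase II clinical trial"))
print(classify_by_rules("A randomized study of aspirin"))
```

The first title matches a rule and is labeled "medicine". The second is also medical, but no rule anticipates its wording, so it gets no label at all. Covering cases like this means writing and maintaining ever more rules by hand.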

Machine learning is another way for machines to process information. In this case, the system is ‘trained’ by way of example, rather than given rules. For example, a system that is meant to classify images into either pictures of humans or pictures of cats would be given a set of images and told they are humans, and another set and told they are cats. From there, the system can move on to classifying other images, with feedback being given continually. It is through this feedback that the system is able to constantly adjust to improve its classification ability and yield greater insights.
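
The idea of training by example and adjusting through feedback can be shown with a toy Python sketch. It uses a simple perceptron-style update, chosen here only for brevity, and short text descriptions stand in for the images; the examples and labels are made up for illustration.

```python
def tokens(text):
    return text.lower().split()

def predict(weights, text):
    """Sum the learned word weights; a positive score means 'cat'."""
    score = sum(weights.get(word, 0.0) for word in tokens(text))
    return "cat" if score > 0 else "human"

def train(examples, epochs=10):
    """Adjust word weights only when the current prediction is wrong."""
    weights = {}
    for _ in range(epochs):
        for text, label in examples:
            if predict(weights, text) != label:
                # Feedback: push the words of a misclassified example
                # toward the correct label.
                step = 1.0 if label == "cat" else -1.0
                for word in tokens(text):
                    weights[word] = weights.get(word, 0.0) + step
    return weights

# Hypothetical labeled examples (descriptions standing in for images).
EXAMPLES = [
    ("whiskers fur tail purring", "cat"),
    ("person wearing fur coat", "human"),
    ("fur paws tail meowing", "cat"),
    ("person smiling wearing glasses", "human"),
]

weights = train(EXAMPLES)
print(predict(weights, "tail and whiskers"))     # cat
print(predict(weights, "person wearing a hat"))  # human
```

No rule about cats or humans was written by hand. The system first guesses that "fur" signals a cat, is corrected by the example of a person wearing a fur coat, and ends up treating "fur" as uninformative while relying on words such as "whiskers" and "tail".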

Text mining and semantic enrichment are increasingly being used as data processing techniques to enable machine learning programs.

Author: Mike Iarrobino

Mike Iarrobino is CCC's product manager for content and rights workflow solutions RightFind® XML for Mining and RightFind Music. He has previously managed marketing technology and content discovery products at FreshAddress, Inc., and HCPro, Inc. He speaks at webinars and conferences on the topics of content discovery and data management, and loves to get into conversations about the nature of free will.