Demystifying “big data” part 2: Text mining

This piece is the second of four I will publish this spring in which I describe particular techniques used to make sense of or mine large data sets. This post covers text mining.

Our world is overrun with words. With the volume of email, text messages, social media posts and articles growing by the day, the quantity of available text is immense, and textual data sets can become enormous. This creates a growing challenge: textual data overload. One way to navigate this abundance of text is known as text mining.

When we want to examine a large amount of text, reading it and making sense of it by hand may not be an option due to its sheer size. Textual data sets today may include millions upon millions of characters. It is neither feasible nor effective to sort through them the way we might have decades ago, when we were limited in the amount of text we could access at one time. Text mining is a strategy for analyzing textual data archives that are too large to read and code by hand, and for identifying patterns within textual data that cannot be easily found using other methods.[i] It finds its roots in early electronic card catalogues in libraries[ii] and in search functions on computers that identified which documents contained a word of interest[iii].

Existing studies using text mining have tapped into many fascinating realms. One looked at tweets on Ebola to learn more about public concerns in order to inform communication strategies.[v] In another, researchers tackled coordination in natural disasters: crowdsourcing is seen as having potential, yet making sense of all the data that would pour in is a huge challenge, to which text mining is identified as a possible solution.[vi] There is also biomedical text mining, wherein researchers attempt to extract information about biological entities such as genes and proteins, phenotypes, or, even more broadly, biological pathways.[vii] Genes hold so much information within them that trying to identify certain bits and pieces by hand is nigh on impossible; text mining attempts to resolve this conundrum.

In terms of my own research, I travelled to Colombia and spent two weeks conducting interviews with three of my Colombian colleagues. Together, we conducted 45 interviews with coffee farmers, which amounted to approximately 60 total hours of recorded interviews. Three Colombian students spent three weeks transcribing these interviews, and now we have many hundreds of pages of interview transcripts. Our goal is to extract meaning from these interviews, and one way to do this could be through text mining. Rather than examining individual components of the interviews, we may use text mining to look at the data set as a whole. For instance, we could run all this text through a software program that analyzes which words (concepts) are used most frequently and what the relationships between those concepts are. Perhaps, when looked at in the aggregate, we may see things emerge that we hadn't thought to pay attention to, or that our fallible human brains had missed as we were reading through the many pages.
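To give a sense of what such an analysis involves, here is a minimal sketch of counting word frequencies and document-level co-occurrences using only Python's standard library. The example transcripts are invented for illustration; a real analysis would use the actual interview text and dedicated text-mining software.

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical stand-ins for interview transcripts (invented text).
transcripts = [
    "The harvest was good this year but coffee prices were low.",
    "Low prices make it hard to pay workers during the harvest.",
    "We hope coffee prices improve before the next harvest.",
]

def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Concept frequencies: how often each word appears across the corpus.
freq = Counter(word for doc in transcripts for word in tokenize(doc))

# Co-occurrence: how often two words appear in the same transcript,
# a crude proxy for a relationship between two concepts.
cooc = Counter()
for doc in transcripts:
    words = sorted(set(tokenize(doc)))
    cooc.update(combinations(words, 2))

print(freq["harvest"])               # appears in all three transcripts
print(cooc[("harvest", "prices")])   # the two concepts co-occur in all three
```

Real text-mining packages build far more sophisticated models on top of exactly these kinds of counts, but the intuition is the same: frequencies surface dominant concepts, and co-occurrences hint at relationships between them.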

While text mining is a very powerful tool that enables us to sift through data sets that would otherwise be impossibly large, it is still a developing field, and there is still some ambiguity around the different applications of this approach. While this is a challenge, it is also an opportunity and an invitation to lean into this area and start to build an understanding of the limitations and possibilities of such an approach. For now, there are a couple of points to keep in mind when it comes to this method for handling big data. First, in text mining, there are many decisions to be made during the research process, and the justification for each of these decisions matters – a lot. For instance, preprocessing alone comes with a wide array of decisions. Preprocessing is when you clean up and standardize text prior to analysis. During this process, there are many components to consider. Which words will you delete from the text and thereby exclude from analysis? Will you remove numbers from the text? Will you "stem" the words (identify the root of each word and convert all its forms to that root, such as "driving" becoming "drive")? A second matter to keep in mind is the following: a researcher may have to run several analyses, making tweaks to the preprocessing and analysis along the way. With such large data sets, it may not be possible to predict which decisions will be best until you have seen what emerges from 'the other side.' Trial and error is often just fine. However, beware: it can stray into the realm of coaxing the data to look like something it is not.

To conclude, text mining is an innovative approach for making sense of large textual data sets that would otherwise be impossible to analyze in their entirety. With growing corpora of text made available to researchers around the world, it is a valuable way to take advantage of these large repositories of data and to attempt to extract meaning from within them. It is still a young field, and there is much potential for growth and exploration within it.