As part of my internship at Interactive Things, I was offered the opportunity to work on a purely self-directed project. I decided to create a visualisation based on the interview collection of Substratum. I was involved in the redesign of the website itself, and the idea of providing a more quantitative, lexical access to the content was part of my initial concept exploration. You can read more about the design process of the website itself on the Interactive Things internship blog.

Step 1: Analysing the text

Honestly, I had never worked in the world of data mining and natural language processing before. This was the first challenge I had to confront.

The first step I took to dive into the subject was to look for tools that do natural language processing. I was actually quite surprised to see how many of them are out there: some are internet based, such as Wordle, TagCrowd, NGram Analyser or LIWC, which allow a first, basic analysis of the data, for example through the production of a word cloud. On the other hand, there are more sophisticated text-analysis solutions such as R or WMatrix. Finally, a JavaScript option exists: NaturalNode, based on Node.js.

With this first insight, I decided to produce the most obvious kind of visualisation: a word cloud. I was hoping to identify keywords, which I did, and therefore get a first overall idea of the themes and structure of the corpus. I quickly realised that some of those words didn’t make much sense, such as “just” or “lot”, to name a few. I actually think that a word cloud comparison, even done in this raw form, can be interesting for comparing different kinds of texts, such as an interview and a literary text, to spot differences in the vocabulary used.
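Under the hood, the keyword extraction behind a word cloud is just a token frequency count. A minimal sketch in plain JavaScript (this illustrates the general idea, not the exact pipeline any of the tools above use):

```javascript
// Count word frequencies in a text — the basis of any word cloud.
function wordFrequencies(text) {
  const counts = new Map();
  // Lowercase and keep only alphabetic tokens (with apostrophes).
  const tokens = text.toLowerCase().match(/[a-z']+/g) || [];
  for (const token of tokens) {
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  // Sort by descending frequency so the biggest cloud words come first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}
```

The most frequent entries become the largest words in the cloud — which is exactly why fillers like “just” float to the top when the corpus isn’t cleaned first.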

After this first analysis, I decided to work with R, to be able to clean the corpus and direct the analysis a bit. I made this choice because of the nature of the R project: it is open source, with a large community and a lot of packages ranging from probability analysis to text mining. The work done with R, mainly using the tm library, was a cleaning one: removing all stopwords (e.g. “a”, “this”, “is”…), uppercase letters and punctuation. After applying some algorithms to the cleaned corpus, I could analyse the newly emerging words: some were common to the ones found during the first word-cloud process, others were new. I worked in a loop, comparing the results with the expected ones and refining the corpus and the word-cloud visualisation.
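The cleaning steps done in R with the tm library — lowercasing, stripping punctuation, removing stopwords — can be sketched like this; the stopword list here is a tiny illustrative subset, not the full list tm ships with:

```javascript
// A minimal corpus-cleaning pass: lowercase, strip punctuation, drop stopwords.
const STOPWORDS = new Set(["a", "this", "is", "the", "and", "of", "to"]); // illustrative subset

function cleanCorpus(text) {
  return text
    .toLowerCase()
    .replace(/[^\w\s]/g, " ")                        // remove punctuation
    .split(/\s+/)                                     // tokenise on whitespace
    .filter(w => w.length > 0 && !STOPWORDS.has(w)); // drop stopwords and empties
}
```

Feeding the cleaned token list back into the frequency count is what makes the meaningful words surface in the second round of word clouds.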

The next step brought me to Gephi, to get a network visualisation of the corpus thus cleaned, using the same data set that produced the last word cloud. From this point, I worked in a loop between R and Gephi to refine the network. Gephi turned out to be an interesting tool to see who (interviewee) is linked to what (words), and therefore to visualise the proximity of two interviewees. On a visual level, the resulting network can be considered aesthetic, but on the level of meaning I didn’t find it really relevant. Indeed, the real meaning of the words is stripped away because they are taken out of their context. Two interviewees may have used the same words to express totally different ideas. In that case, is there still a link between the two of them? And if so, shouldn’t we differentiate between different kinds of links?
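The network Gephi draws here is essentially a bipartite edge list between interviewees and words, and the proximity of two interviewees can be read off as the size of their shared vocabulary. A sketch of that structure, with invented interviewee names:

```javascript
// Each interviewee maps to the cleaned words of their interview.
// An edge (interviewee, word) exists for every distinct word they used.
function bipartiteEdges(wordsByInterviewee) {
  const edges = [];
  for (const [who, words] of Object.entries(wordsByInterviewee)) {
    for (const word of new Set(words)) edges.push([who, word]);
  }
  return edges;
}

// Proximity of two interviewees = number of distinct words they share.
function proximity(wordsByInterviewee, a, b) {
  const setA = new Set(wordsByInterviewee[a]);
  return [...new Set(wordsByInterviewee[b])].filter(w => setA.has(w)).length;
}
```

This also makes the limitation concrete: `proximity` counts surface overlaps only, so two people using “craft” in opposite senses still score as close.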

Step 2: What happens next?

From this conclusion, and after discussing it with several people, I decided to reorient the project to bring the meaning of the words back into the loop. Therefore, I focused the analysis of the corpus on the words and their context of use.

One idea I developed was to use a sunburst representation. In the centre would be the main themes of Substratum. Each division then corresponds to one way its parent is defined. This creates a path of words revealing one way of thinking about a main theme. Ideally, the outermost words are linked to their interviewee(s) and to the questions in which they are used, allowing a full representation of the Substratum corpus.
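A sunburst layout consumes a nested hierarchy: the main themes at the root, defining words as descendants, so that each root-to-leaf path is one way of thinking about a theme. A sketch of that data shape (the theme and word names are invented for illustration, not taken from the actual corpus):

```javascript
// The nested `children` shape a sunburst layout (e.g. in D3) expects.
const hierarchy = {
  name: "substratum",
  children: [
    { name: "craft",   children: [{ name: "tools" }, { name: "practice" }] },
    { name: "process", children: [{ name: "iteration" }] },
  ],
};

// Each leaf ends one path of words; counting leaves counts those paths.
function countLeaves(node) {
  if (!node.children || node.children.length === 0) return 1;
  return node.children.reduce((sum, child) => sum + countLeaves(child), 0);
}
```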

This brought a technical challenge: it required a more sophisticated analysis. To do so, I used Cortext, which, instead of analysing the corpus word by word, analyses it in groups of words, putting those words back into their context. This resulted in a new data table which included some words that had been set aside during the first phase because of their relative lack of meaning. From this new data set, I did a manual analysis of the corpus to build a table of words connected to each other. Then I began a new prototyping phase using D3. I first stuck to the sunburst visualisation, and then did some experiments with a tree diagram layout, playing with the data sets.
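Analysing groups of words rather than single tokens can be approximated with simple n-gram extraction — the following only illustrates the idea, not Cortext’s actual algorithm:

```javascript
// Extract adjacent word pairs (bigrams), keeping each word in its local context.
function bigrams(tokens) {
  const pairs = [];
  for (let i = 0; i < tokens.length - 1; i++) {
    pairs.push(tokens[i] + " " + tokens[i + 1]);
  }
  return pairs;
}
```

Because a token is always paired with its neighbour, a word that is meaningless on its own (“lot”) can become meaningful inside a group (“a lot of iteration”).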

Step 3: Final Prototype

After three weeks of research and experimentation came the time to build the final prototype. For this I used the following concept: the visualisation would be composed of three areas separated by two circles. The inner circle corresponds to the different interviewees, the outer one to the words present in the Substratum corpus. The links in the inner area (yellow) would represent the connections between the different interviewees according to the strength of the relationships between their speeches. The links in the second area (red) would show the relations between the interviewees and the words. Finally, the links in the outer area would show the relations between the words, providing the context of each word. One of the interactions imagined for the visualisation would be the ability to filter by clicking on an interviewee or a word, revealing their specific connections. Other features can be imagined, such as a search engine that would let you play with the visualisation according to your interests. The possibility of navigating to the corresponding part of the full interview by clicking on a word could also be considered.
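The click-to-filter interaction described above boils down to selecting the subset of links that touch the chosen node. A sketch of that filtering step (the link shape and field names are assumptions, not the prototype’s actual data model):

```javascript
// Keep only the links that touch a selected node (an interviewee or a word).
function filterLinks(links, selected) {
  return links.filter(l => l.source === selected || l.target === selected);
}
```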

However, the final prototype differs slightly from this original concept. It doesn’t have any apparent links and is based on a more linear approach. First, no links have been implemented; instead, the inner circle, corresponding to the different interviewees, gains meaning through differences in the length and thickness of its segments. The link between an interviewee and the words he or she uses is made by clicking on that interviewee and letting an animation play, in which the words appear in chronological order. The user can explore those words manually once the animation is finished. I invite you to try the Substratum Visualisation yourself.

Final Thoughts

I think this project was really interesting in many ways. First, it really made me use the methodologies I learned during my internship at Interactive Things, especially in terms of project and time management. It was also an opportunity to dive into a field unknown to me: natural language processing and data mining. Thanks to this, I got a glimpse of the complexity associated with text analysis.
As a final thought, I would say that the project would have been more complete and relevant if a multidisciplinary team had been involved, including people working in the linguistic field, on the technical side or on the theoretical one.
