Writing a conference abstract the data science way

Conferences are an ideal platform to share your work with the wider community. However, as we all know, conferences require potential speakers to submit abstracts about their talks. And writing abstracts is not necessarily the most rewarding work out there. I had actually never written one, so when asked to prepare abstracts for this year’s conferences I didn’t really know where to start.

So, I did what any sane person would do: get data. As Mango has organised a number of EARL conferences, there is a good number of abstracts available, both accepted and not accepted. In this blogpost I’m going to use the tidytext package to analyse these abstracts and see what distinguishes the accepted abstracts from the rest.

Disclaimer: the objective of this blogpost is not to present a rigorous investigation into conference abstracts but rather an exploration of, and potential use for, the tidytext package.

The data

I don’t know what it’s like for other conferences, but for EARL all abstracts are submitted through an online form. I’m not sure if these forms are stored in a database, but I received them as a PDF. To convert the PDFs to text I made use of the pdftotext program, as outlined in this Stack Overflow thread.
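As a rough sketch of that conversion step (assuming the PDFs sit in an `abstracts/` folder and that `pdftotext` is on the PATH — the folder and column names below are my own invention for illustration), the read-in might look something like:

```r
# Hypothetical sketch: shell out to pdftotext for each PDF, then read the
# resulting .txt files into a single data frame.
library(purrr)
library(tibble)

pdf_files <- list.files("abstracts", pattern = "\\.pdf$", full.names = TRUE)
walk(pdf_files, function(f) system2("pdftotext", shQuote(f)))

txt_files <- sub("\\.pdf$", ".txt", pdf_files)
abstracts <- tibble(
  file = basename(pdf_files),
  text = map_chr(txt_files, function(f) paste(readLines(f), collapse = " "))
)
```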

Abstracts with a higher number of words seem to have a slight advantage, but I wouldn’t bet on it; there is something to be said for being succinct. What really matters, obviously, is content, so let’s have a look at which words are commonly used.
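A minimal way to get those word counts with tidytext (assuming an `abstracts` data frame with `file`, `accepted`, and `text` columns — my guess at the structure, not the actual one) would be:

```r
library(dplyr)
library(tidytext)

# One row per word, then count words per abstract and compare lengths
# across the accepted / not accepted groups.
abstract_lengths <- abstracts %>%
  unnest_tokens(word, text) %>%
  count(file, accepted, name = "n_words")

abstract_lengths %>%
  group_by(accepted) %>%
  summarise(mean_words = mean(n_words), median_words = median(n_words))
```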

Certainly an interesting graph! It may have been better to show proportions instead of counts, as the number of abstracts in each category is not equal; nevertheless, the conclusion remains the same. The words “r” and “data” are clearly the most common. What is more interesting, however, is that abstracts in the “yes” category use certain words noticeably more often than abstracts in the “no” category and vice versa (“more often” because a missing bar doesn’t necessarily mean a zero observation). For example, the words “science”, “production” and “performance” occur more often in the “yes” category, while the words “tools”, “product”, “package” and “company(ies)” occur more often in the “no” category. Also, the word “application” occurs in its singular form in the “no” category and in its plural form in the “yes” category. At EARL we certainly like our applications to be plural; it is in the name, after all.
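The kind of count behind a graph like that could be produced along these lines (assuming a hypothetical `abstracts` data frame with `accepted` and `text` columns; `stop_words` ships with tidytext):

```r
library(dplyr)
library(tidytext)
library(ggplot2)

common_words <- abstracts %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%  # drop "the", "and", etc.
  count(accepted, word, sort = TRUE)

# Top 15 words per category, plotted side by side.
common_words %>%
  group_by(accepted) %>%
  top_n(15, n) %>%
  ungroup() %>%
  ggplot(aes(reorder(word, n), n, fill = accepted)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(x = NULL, y = "count")
```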

There is one important caveat with the above analysis, and it has to do with the frequency of words within abstracts. The overall frequencies aren’t that high, and a single abstract’s heavy use of a particular word can make that word seem more important than it really is. Luckily, the tidytext package provides a solution: it makes it easy to calculate the TF-IDF score, which down-weights words that appear across many documents.
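A sketch of that calculation, treating each acceptance category as a “document” (and still assuming a hypothetical `abstracts` data frame with `accepted` and `text` columns):

```r
library(dplyr)
library(tidytext)

category_tf_idf <- abstracts %>%
  unnest_tokens(word, text) %>%
  count(accepted, word) %>%           # aggregate counts per category
  bind_tf_idf(word, accepted, n) %>%  # tf-idf with the category as the document
  arrange(desc(tf_idf))
```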

Note that I have aggregated the counts over the Acceptance category, as I’m interested in which words are important within a category rather than within a particular abstract. There isn’t an obvious pattern visible in the results, but I can certainly hypothesise. Words like “algorithm”, “effects”, “visualize”, “ml” and “optimization” point strongly towards the application side of things, whereas words like “concept”, “objects” and “statement” are softer and more generic. XBRL is the odd one out here, but interesting in its own right; whoever submitted that abstract should perhaps consider re-submitting, as it’s quite unique.

Next Steps

That’s it for this blogpost, but here are some next steps I would take if I had more time:

Add more abstracts from previous years / other conferences

Analyse combinations of words (n-grams) to work towards what kind of sentences should go into an abstract

The content isn’t the only thing that matters: adding more metadata (time of submission, previously presented, etc.) could make a predictive model more accurate

Try out topic modelling on the accepted abstracts to help decide which streams would make sense

Train a neural network with all abstracts and generate a winning abstract [insert evil laugh]
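As a taste of the n-gram idea from the list above, tidytext can tokenise bigrams directly (same hypothetical `abstracts` data frame as elsewhere in this post — its structure is an assumption on my part):

```r
library(dplyr)
library(tidytext)

# Count two-word phrases per category instead of single words.
bigrams <- abstracts %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(accepted, bigram, sort = TRUE)
```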

Conclusion

In this blogpost I have explored text data taken from abstract submissions to the EARL conference using the fabulous tidytext package. I analysed words from abstracts that were accepted versus those that weren’t, and also compared their TF-IDF scores. If you want to know more about the tidytext package, come to the Web Scraping and Text Mining workshop my colleagues Nic Crane and Beth Ashlee will be giving before the LondonR meetup this Tuesday, the 28th of March. And if this blogpost has made you want to write an abstract, we are still accepting submissions for EARL London and EARL San Francisco (I promise I won’t use it for a blogpost).