Exercise

More than words: tokenization (2)

The tidytext package lets you analyze text data using "tidyverse" packages such as dplyr and sparklyr. A full treatment of sentiment analysis is beyond the scope of this course; the forthcoming Sentiment Analysis and Sentiment Analysis: The Tidy Way courses (due Summer 2017) cover it in more detail. This exercise gives you a quick taste of how to do it on Spark.

Sentiment analysis essentially lets you assign a score or emotion to each word. For example, in the AFINN lexicon, the word "outstanding" has a score of +5, since it is almost always used in a positive context. "grace" is a slightly positive word, and has a score of +1. "fraud" is usually used in a negative context, and has a score of -4. The AFINN scores dataset is returned by get_sentiments("afinn"). For convenience, the unnested word data and the sentiment lexicon have been copied to Spark.

Typically, you want to compare the sentiment of several groups of data. To do this, the code pattern is as follows.
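The pattern can be sketched on a small local example. This is a minimal illustration, not the exercise solution: `title_words` and `lexicon` are hypothetical local tibbles with made-up artist names, and the lexicon holds only the three AFINN scores quoted above. The same dplyr verbs should apply to tibbles backed by Spark.

```r
library(dplyr)

# Hypothetical word-per-row data: one row per (artist, word) pair.
title_words <- tibble(
  artist_name = c("Artist A", "Artist A", "Artist B", "Artist B"),
  word        = c("outstanding", "the", "fraud", "grace")
)

# Tiny stand-in lexicon holding the three AFINN scores mentioned in the text.
lexicon <- tibble(
  word  = c("outstanding", "grace", "fraud"),
  score = c(5, 1, -4)
)

positivity_by_artist <- title_words %>%
  inner_join(lexicon, by = "word") %>%  # keep only words that have a score
  group_by(artist_name) %>%             # one group per artist
  summarize(positivity = sum(score))    # total sentiment score per group

print(positivity_by_artist)
```

Words without a lexicon entry (here "the") drop out at the join, so they contribute nothing to the group totals.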

An inner join takes each row of the first table and looks for matches in the second table. Where it finds a match, it adds the columns from the second table. Unlike a left join, it drops any rows for which it finds no match. The principle is shown in this diagram.
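The difference from a left join can be seen on a few rows of local data. This is a small sketch under stated assumptions: `words` and `scores` are hypothetical tibbles invented for illustration, and the joins run locally with dplyr rather than on Spark.

```r
library(dplyr)

# Hypothetical first table: three words, one of which has no sentiment score.
words <- tibble(word = c("outstanding", "the", "fraud"))

# Hypothetical second table: scores for two of the three words.
scores <- tibble(word = c("outstanding", "fraud"), score = c(5, -4))

# Inner join: the unmatched row ("the") is dropped.
inner <- inner_join(words, scores, by = "word")

# Left join: every row of `words` is kept; unmatched rows get NA for score.
left <- left_join(words, scores, by = "word")

print(inner)
print(left)
```

The inner join returns two rows, while the left join returns all three, with a missing score for "the".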

Like left joins, inner joins are a type of mutating join, since they add columns to the first table. See if you can guess which function to use for inner joins, and how to use it. (Hint: the usage is really similar to left_join(), anti_join(), and semi_join()!)

Instructions

A Spark connection has been created for you as spark_conn. Tibbles attached to the title words and sentiment lexicon stored in Spark have been pre-defined as title_text_tbl and afinn_sentiments_tbl respectively.

Create a variable named sentimental_artists from title_text_tbl.

Use inner_join() to join afinn_sentiments_tbl to title_text_tbl by "word".

Group by the artist_name.

Summarize to define a variable positivity, equal to the sum of the score field.