Lab 10: Text Mining

YOUR NAME

YOUR PARTNER'S NAME

2020-03-11 17:45:11

Step 1: Read in the positive and negative word files

# Create two vectors of words, one for the positive words and one for the negative words. The positive words can be found at "https://cjacks04.github.io/687/Datasets/positive-words.txt" and the negative words at "https://cjacks04.github.io/687/Datasets/negative-words.txt". Use the scan() function, which reads data into a vector or list from the console or a file. You'll need three arguments: (1) the file name or path; (2) character(0), which tells scan() to read each item as a character string (as opposed to an integer or some other data type); (3) the sep argument, which tells R how the data are separated, e.g., "\n".
# Note that when reading in the files, there might be lines at the start and/or the end that need to be removed (i.e., clean your data if needed).
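
As an illustration of the scan() call and the clean-up described above, here is a minimal sketch. It reads from an inline stand-in (sampleFile) instead of the URL so that it runs without a network connection. The ";" header lines are an assumption about the file layout, so inspect the real files before relying on that filter.

```r
# Inline stand-in for a word-list file. The ";" header lines are an ASSUMPTION
# about the layout of the real files; check them before using this filter.
sampleFile <- c("; example header line", ";", "", "abound", "accessible", "acclaim")

# what = character(0) reads each item as text; sep = "\n" makes each line one item.
# For the real data, replace text = sampleFile with file = "<URL from the instructions>".
posWords <- scan(text = sampleFile, what = character(0), sep = "\n", quiet = TRUE)

# Drop header lines that start with ";" and any empty strings.
posWords <- posWords[!grepl("^;", posWords) & posWords != ""]
print(posWords)
```

The same pattern would apply to the negative word file.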

Step 2: Read in and process the MLK speech

# Read the MLK text file using the readLines() function. Only the URL is required.
# Inspect the vector above. Some lines are blank "". Remove these.
# Create a term matrix. There are several steps here, beginning with creating a vector source and applying text transformations. (See chapter 14, where sba is transformed.)
# Create a vector of counts, one per word, and store it in "wordCounts"
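
The processing steps above could look roughly like the sketch below. It assumes the tm package is installed and uses a short made-up vector (mlk) in place of the readLines() output, so the counts are illustrative only. The particular set of transformations is one reasonable choice, not necessarily the one used in chapter 14.

```r
library(tm)  # assumes the tm package is installed

# Stand-in for the output of readLines(<speech URL>); NOT the real speech text.
mlk <- c("We hold hope and faith.", "", "Hope will overcome despair and fear.", "")

# Remove blank lines.
mlk <- mlk[mlk != ""]

# Build a corpus from the character vector and normalize the text.
corpus <- VCorpus(VectorSource(mlk))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Term-document matrix: rows are terms, columns are documents (here, lines).
tdm <- TermDocumentMatrix(corpus)

# Sum across documents to get one count per word, largest first.
wordCounts <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
print(wordCounts)
```

Note that TermDocumentMatrix() drops terms shorter than three characters by default, which affects the total word count.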

Step 3: Determine how many positive words were in the speech

# Hint: one way to do this is to use the ‘match’ function on the words from Step 2 and the list of positive words imported in Step 1.
# sum the total number of words and store the value to "totalWords"
# create a vector "words" that contains all the words in "wordCounts"
# locate which words in "mlk" were positive (appeared in positive-word list)
# calculate the total number of positive words in the "mlk" speech (in wordCounts) and assign the number to the variable "pTotal". Applying the which() function to the vector above will give you the index numbers of the matched words.
# view the total number of positive words (95 positive words in the speech)
# view the percentage of positive words (11.29608% of the speech words are positive)
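
A minimal sketch of the matching logic in this step, using small made-up inputs in place of the real wordCounts and positive word list. The numbers here are hypothetical and will not reproduce the 95 words or 11.29608% quoted above.

```r
# Hypothetical stand-ins for the Step 2 counts and the Step 1 positive list.
wordCounts <- c(hope = 3, freedom = 2, despair = 1, nation = 4)
posWords <- c("freedom", "hope", "joy")

totalWords <- sum(wordCounts)      # total word occurrences: 10
words <- names(wordCounts)         # the distinct words

# match() returns each word's position in posWords, or NA when it is absent.
matched <- match(words, posWords, nomatch = NA)

# which() converts the non-NA entries into index numbers into wordCounts.
pIndex <- which(!is.na(matched))
pTotal <- sum(wordCounts[pIndex])  # 3 + 2 = 5

pTotal
pTotal / totalWords * 100          # percentage of positive words
```

Summing wordCounts at the matched indices counts every occurrence of a positive word, not just the number of distinct positive words.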

Step 4: Determine how many negative words were in the speech

# Hint: one way to do this is to use the ‘match’ function on the words from Step 2 and the list of negative words imported in Step 1.
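
For the negative count, the same idea can be written more compactly with %in%, which is equivalent to testing match() for non-NA results. The sketch below uses made-up inputs, and nTotal is a hypothetical name chosen to mirror "pTotal".

```r
# Hypothetical stand-ins for the Step 2 counts and the negative word list.
wordCounts <- c(hope = 3, freedom = 2, despair = 1, fear = 2, nation = 4)
negWords <- c("despair", "fear", "hate")

# %in% gives TRUE where the word appears in negWords.
isNegative <- names(wordCounts) %in% negWords

nTotal <- sum(wordCounts[isNegative])  # 1 + 2 = 3
nTotal
nTotal / sum(wordCounts) * 100         # percentage of negative words
```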

Step 5: Redo the ‘positive’ and ‘negative’ calculations for each 25% of the speech

# Compare the results (e.g., a simple bar chart of the 4 numbers). I recommend extracting quarters of the speech, storing each quarter in a vector, and then running the calculations on each quarter.
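
One possible way to organize the per-quarter comparison is sketched below. It assumes the speech has already been reduced to a vector of cleaned tokens and splits that vector by position. Splitting the lines of the speech into quarters first would work equally well. The helper pctInList and all inputs are hypothetical.

```r
# Hypothetical helper: percentage of tokens that appear in a word list.
pctInList <- function(tokens, wordList) {
  100 * sum(tokens %in% wordList) / length(tokens)
}

# Stand-in tokens (already lowercased, punctuation removed) and word lists.
tokens <- c("hope", "fear", "nation", "freedom", "despair", "hope", "joy", "land")
posWords <- c("hope", "freedom", "joy")
negWords <- c("fear", "despair")

# Assign each token position to one of four equal-width bins, then split.
quarter <- cut(seq_along(tokens), breaks = 4, labels = FALSE)
quarters <- split(tokens, quarter)

posPct <- sapply(quarters, pctInList, wordList = posWords)
negPct <- sapply(quarters, pctInList, wordList = negWords)

# Side-by-side bars: one positive/negative pair per quarter.
barplot(rbind(positive = posPct, negative = negPct), beside = TRUE,
        names.arg = paste0("Q", 1:4), legend.text = TRUE,
        ylab = "% of words")
```

Keeping the calculation in one helper means each quarter is measured the same way, which makes the four bars directly comparable.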