Grabbing Some Data

Enter a search term. In this case we are trying "Mills College" (with the quotation marks to see if this will eliminate accidental juxtapositions of the words "mills" and "college" — but note that it also means we may miss references just to "Mills"). This yields the following.

Then we export the data to Excel (as a txt — tab delimited text — file) and start Excel to import that data.

And it looks like this (after a little bit of column resizing).

We scan down the data to see it all looks in order and notice one entry at the bottom that doesn't look right.

On closer inspection we see this is the result of a clever twitter user who added line breaks to her/his tweet. Hopefully, this doesn't happen too often. Later we might want to go into the raw TXT file and see whether these are ordinary <CR><LF> or another character.

Data Cleaning

After fixing that one row, the data is in pretty good shape. Suppose we want to do a simple word count. Here's how we can proceed.

Select column containing tweets and copy.

Open MS Word and paste.

Select all (ctrl-A) and replace all spaces with carriage returns (in the replace box we just type "caret p" or ^p. This puts every word on separate line.

Next we want to do a little clean up. Replace all commas, colons, periods, exclamation marks, and hyphens with nothing (not blank, nothing).

Select all, copy.

Simple Frequency Analysis

Go back to Excel and paste starting in cell A2 in a new worksheet. Label the column (in A1) "words."

Select the column, INSERT>PIVOT TABLE. Then drag "words" to both the row labels and the data values boxes as shown by the pink arrows.

Then we select the pivot table, copy, paste-special values.

Then sort the two columns of data on the count column from largest to smallest.

Then draw histogram of top 25 or so (here we have omitted "Mills College")