```{r setup, cache=FALSE, echo=FALSE, global.par=TRUE}
library("RColorBrewer") # brewer.pal
library("knitr") # opts_chunk
# terminal output
options(width = 100)
# color palette
palette(brewer.pal(6, "Set1"))
# code chunk options
opts_chunk$set(cache=TRUE, fig.align="center", comment=NA, echo=TRUE,
highlight=FALSE, tidy=FALSE, warning=FALSE, message=FALSE)
```
Part-of-Speech Tagging
======================
*Patrick O. Perry, NYU Stern School of Business*
### Computing environment
We will use the following R packages.
```{r}
library("jsonlite")
library("coreNLP")
library("Matrix")
library("NLP")
library("openNLP")
library("stringi")
```
To make the results reproducible across runs, we set the random seed before
performing any analysis.
```{r}
set.seed(0)
```
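As a brief illustration of why the seed matters, the sketch below resets the seed and repeats a random draw. The variable names `first_draw` and `second_draw` are illustrative and do not appear elsewhere in the analysis.

```r
# Illustrative only: resetting the seed reproduces the same random draws.
set.seed(0)
first_draw <- sample(10, 3)

set.seed(0)
second_draw <- sample(10, 3)

# Both draws used the same seed, so they are identical.
identical(first_draw, second_draw)  # TRUE
```

This block is shown without evaluation so that it does not advance the random-number stream used by the rest of the analysis.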
### Data
We will analyze a subset of the [Yelp Academic Dataset][yelp-academic]
corresponding to reviews of the 500 restaurants nearest to Columbia University
(as of October 15, 2012). To obtain the data, take the following steps:
1. Visit [Yelp's developer page][manage_api_keys] to create a Yelp API
account and log in to your account.
2. Visit [Yelp's academic dataset page][yelp-academic], then click on the
"Download the dataset" button (in the "Access" section). The button
will only be visible after logging in to your Yelp API account.
At this point, you should have a file called `yelp_academic_dataset.json.gz`.
Run the `01_make_json.py` and `02_subset_nyc.py` scripts, available from the
course webpage, to generate `yelp-nyc-business.json` and
`yelp-nyc-review.json`. You will need Python version 3.4 or later.
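
Before loading the data, it can help to confirm that both generated files are present. The following is a minimal sketch, assuming the files sit in the current R working directory; the variable names `expected_files` and `present` are illustrative and are not part of the course scripts.

```r
# File names are taken from the text above; the check itself is an
# illustrative addition and assumes the files are in the working directory.
expected_files <- c("yelp-nyc-business.json", "yelp-nyc-review.json")
present <- file.exists(expected_files)

if (all(present)) {
    message("Both input files were found.")
} else {
    message("Missing file(s): ",
            paste(expected_files[!present], collapse = ", "))
}
```

If a file is reported missing, rerun the corresponding Python script or adjust the working directory with `setwd()`.
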
After downloading and pre-processing the data, you can load it into R. First,
we load a random sample of 50 businesses.
```{r}
nbus