Plotting Spatial data in R

Visualize neighborhoods with high concentration of businesses in San Francisco

Aditya Tandel, Mar 8

I recently got an opportunity to work on spatial data and wanted to share my analysis of one such dataset.

The data consisted of registered businesses in the San Francisco Bay Area and can be found here.

An updated version can be found here.

Spatial data is data associated with locations.

Typically it's described by a coordinate reference system, such as latitude and longitude.

The goal of this exercise was to find pockets of neighborhoods in San Francisco with a high concentration of businesses.

You would need to get a key from Google’s Geolocation API to use their maps.

I used the ggmap package in R to plot this data.

Then I narrowed my analysis to one particular high-concentration neighborhood to see how businesses were dispersed within that area.

First… Quick scan of the dataset

    str(biz)
    head(biz, 25)
    summary(biz)

For the purpose of this exercise, I was only concerned with the neighborhoods, address, dates and, most importantly, the location columns, which contained latitude and longitude data for each business.

Names of the businesses and their codes (which are assigned by the city for registered businesses) were not considered for now.

After basic data cleaning, such as eliminating duplicates and nulls, I extracted only the records for the city of SF and eliminated records for adjoining cities in the Bay Area.
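A minimal sketch of this kind of cleanup, using a small made-up data frame in place of the real dataset. The column names and the choice to drop rows with a missing location are assumptions for illustration, not necessarily the exact steps used in this analysis.

```r
# Toy stand-in for the business data; names and values are made up.
biz_toy <- data.frame(
  Name     = c("A", "A", "B", "C"),
  Location = c("(37.79, -122.40)", "(37.79, -122.40)", NA, "(37.77, -122.42)"),
  stringsAsFactors = FALSE
)

# Drop exact duplicate rows first.
biz_clean <- unique(biz_toy)

# Then drop rows whose location is missing, since they cannot be mapped.
biz_clean <- biz_clean[!is.na(biz_clean$Location), ]

print(biz_clean)
```

On real data it is worth checking how many rows each step removes before moving on.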

Identify data pertaining to San Francisco only

There were a few ways I could go about this: filter the dataset by city, by business.location, or by zip code.

I chose the zip code approach because the other two fields spelled the San Francisco city name inconsistently, so some records could easily be missed.

I have however included commands for all three methods of filtering this data.
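The three filtering approaches can be sketched roughly as follows on a toy data frame. The column names (City, Business.Location, Zip) and the two-entry zip list are assumptions for illustration; the real dataset's field names and the full list of San Francisco zip codes will differ.

```r
# Toy data: the second row abbreviates the city name, which is exactly
# the kind of inconsistency that makes text matching unreliable.
biz_toy <- data.frame(
  City              = c("San Francisco", "SF", "Oakland"),
  Business.Location = c("1 Market St, San Francisco, CA 94105",
                        "2 Pine St, S.F., CA 94104",
                        "3 Broadway, Oakland, CA 94607"),
  Zip               = c("94105", "94104", "94607"),
  stringsAsFactors  = FALSE
)

# Method 1: exact match on the city field (misses the "SF" row).
by_city <- biz_toy[biz_toy$City == "San Francisco", ]

# Method 2: search for the city name inside the location string
# (also misses the abbreviated row).
by_location <- biz_toy[grepl("San Francisco", biz_toy$Business.Location,
                             fixed = TRUE), ]

# Method 3: match against a list of San Francisco zip codes.
# Only two zip codes are listed here to keep the example short.
sf_zips <- c("94104", "94105")
by_zip <- biz_toy[biz_toy$Zip %in% sf_zips, ]

print(nrow(by_city))
print(nrow(by_location))
print(nrow(by_zip))
```

In this toy example only the zip code filter keeps both San Francisco rows, which mirrors the reason given above for preferring it.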

    # Remove any ")" characters from the Longitude strings
    sf_biz_active_zip$Longitude <- gsub(pattern = "[)]", replacement = "", x = sf_biz_active_zip$Longitude)

I then converted the latitude and longitude variables from text to numeric values, as this helps when plotting/visualizing the data and avoids errors.

    sf_biz_active_zip$Latitude <- as.numeric(sf_biz_active_zip$Latitude)
    sf_biz_active_zip$Longitude <- as.numeric(sf_biz_active_zip$Longitude)

Now the fun part… Visualizing the data

The resulting dataset had 88,785 records that needed to be plotted on a Google map.

Interpreting this many records on a map would be overwhelming, to say the least! Although sampling would be one way to proceed, I instead found the top 10 neighborhoods with the largest number of businesses and plotted one such neighborhood on the map.

    library(dplyr)

    # Count businesses per neighborhood, largest first
    viz <- sf_biz_active_zip %>%
      group_by(Neighborhoods.Analysis.Boundaries) %>%
      tally() %>%
      arrange(desc(n))
    colnames(viz)[2] <- "Total_businesses"
    viz <- viz[1:10, ]  # keep the top 10

I then created a bar chart of these top 10 neighborhoods.
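A chart of this kind can be sketched with ggplot2 roughly as follows. The neighborhood names and counts below are made up, and the object name fin_plot_toy is hypothetical; this is one plausible way to draw such a chart, not necessarily the exact code used here.

```r
library(ggplot2)

# Made-up counts standing in for the top-neighborhood table.
viz_toy <- data.frame(
  Neighborhood     = c("Neighborhood A", "Neighborhood B", "Neighborhood C"),
  Total_businesses = c(300, 200, 100)
)

# Bars ordered from most to fewest businesses, with a centered title.
fin_plot_toy <- ggplot(viz_toy,
                       aes(x = reorder(Neighborhood, -Total_businesses),
                           y = Total_businesses)) +
  geom_col() +
  xlab("Neighborhood") +
  ylab("Total businesses") +
  ggtitle("Top neighborhoods by business count") +
  theme(plot.title = element_text(hjust = 0.5))

print(fin_plot_toy)
```

With ten long neighborhood names, rotating the axis labels or adding coord_flip() may make the chart easier to read.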

    …5))
    fin_plot <- fin_plot + ggtitle("Top 10 neighborhoods by business count")

Let’s look at the Financial District/South Beach neighborhood in more detail, since it has the largest number of active businesses.

Registering Google Maps key

I installed the “ggmap”, “digest” and “glue” packages, then registered with the Google API to get the Geolocation API key.
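Registering a key and drawing points on a Google basemap can be sketched roughly as follows. The environment variable name GOOGLE_MAPS_API_KEY, the toy coordinates and the zoom level are all assumptions for illustration; the map is only requested if a key is actually registered, so the snippet does nothing harmful without one.

```r
library(ggmap)  # also attaches ggplot2

# Hypothetical environment variable holding the key; avoid hard-coding keys.
api_key <- Sys.getenv("GOOGLE_MAPS_API_KEY")
if (nzchar(api_key)) register_google(key = api_key)

# Made-up points roughly in downtown San Francisco.
pts <- data.frame(
  Longitude = c(-122.401, -122.396, -122.392),
  Latitude  = c(37.790, 37.792, 37.786)
)

# Center the map on the mean of the points.
center <- c(lon = mean(pts$Longitude), lat = mean(pts$Latitude))

# Only call the Google API when a key has been registered.
if (has_google_key()) {
  base_map <- get_googlemap(center = center, zoom = 15)
  p <- ggmap(base_map) +
    geom_point(data = pts, aes(x = Longitude, y = Latitude), alpha = 0.5) +
    xlab("Longitude") +
    ylab("Latitude")
  print(p)
}
```

A semi-transparent point (alpha below 1) helps when many businesses share nearly the same location.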

    …District and South Beach") + xlab("Longitude") + ylab("Latitude") + theme(plot.title = element_text(hjust = 0.5))

Conclusion

Areas around Powell Street BART station, Union Square and Embarcadero BART station have a relatively large number of businesses, whereas areas around South Beach and Lincoln Hill are sparsely populated.

Similarly, other individual neighborhoods can be plotted to understand the distribution of businesses there.