Back in 2013, David did an analysis of bicycle trips across Seattle’s Fremont Bridge. More recently, Jake Vanderplas (creator of Python’s
very popular Scikit-learn package) wrote a nice blog post on
“Learning Seattle Work habits from bicycle counts” at the Fremont Bridge.

I wanted to work through Jake’s analysis using R, since I am learning R. Please read Jake’s original article to get the full context and thinking behind the analysis. For folks interested in Python, Jake has provided a link in the blog post to the IPython notebook, where you can work through the analysis in Python (and learn some key Python modules: pandas, matplotlib, and sklearn along the way).

The R code that I used to work through the analysis is in the following link.

Below are some key results/graphs.
1. Doing PCA on the bicycle count data (where each row is a day and the columns are 24-hour bicycle counts from the East and West sides) shows that 2 components can explain 90% of the variance. The scores plot of the first 2 principal components indicates 2 clusters. Coloring the scores by day of week suggests a weekday cluster and a weekend cluster.
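A minimal base R sketch of this kind of PCA is below. The data here is simulated (Poisson counts with commute-shaped weekday peaks and a midday weekend bump stand in for the actual Fremont Bridge counts), so the shapes and numbers are assumptions for illustration only:

```r
# Simulated stand-in for the real data: one row per day,
# 48 columns of hourly counts (24 East + 24 West)
set.seed(42)
n_days <- 200
hours <- 1:24
# Weekdays get morning/evening commute peaks; weekends a midday bump
weekday_shape <- dnorm(hours, 8, 1.5) + dnorm(hours, 18, 1.5)
weekend_shape <- dnorm(hours, 14, 3)
is_weekend <- (seq_len(n_days) %% 7) %in% c(0, 6)
shapes <- lapply(is_weekend, function(w) if (w) weekend_shape else weekday_shape)
counts <- t(sapply(shapes, function(s) rpois(48, 5 + 100 * rep(s, 2))))

# PCA on the day-by-hour matrix
pca <- prcomp(counts, scale. = TRUE)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Scores plot of the first two components, colored weekday vs weekend
plot(pca$x[, 1], pca$x[, 2],
     col = ifelse(is_weekend, "red", "blue"),
     xlab = "PC1", ylab = "PC2")
```

With structure this strong, the first component separates the weekday and weekend days, which is the same two-cluster picture described above.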

The average bike counts for each cluster and side (East/West) better show the weekday and weekend commute patterns (the weekday commute peaks in the morning and evening).

While this was not in the original post, looking at the loadings of the first 2 principal components also supports the weekday vs. weekend interpretation of the clusters.

Thanks again to Jake Vanderplas for the analysis and for illustrating how many insights can be gathered from data.

The plot below shows the number of teams vs. prize for competitions that offered prizes. There is not much of a trend, indicating that prize money is not a key motivator for participants. This is probably not a surprise, since participants are motivated by the thought of tackling challenging problems.
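A base R sketch of this kind of plot is below; the team counts, prize amounts, and column names are made up for illustration, not the actual Kaggle data:

```r
# Hypothetical competition data; the real Kaggle fields and values
# are assumptions for illustration
comps <- data.frame(
  Teams = c(120, 800, 450, 1500, 300, 950),
  Prize = c(500, 25000, 10000, 100000, 5000, 16000)
)

# Log scale on the x axis keeps the wide prize range readable
plot(comps$Prize, comps$Teams, log = "x",
     xlab = "Prize (USD, log scale)", ylab = "Number of teams")

# A rank correlation is one way to quantify the (lack of) trend
rho <- cor(comps$Prize, comps$Teams, method = "spearman")
```

A Spearman correlation near zero on the real data would support the "no trend" reading without assuming a linear relationship.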

(You can access a zoomable plot with tooltips here. Use the left mouse button and drag to zoom. Right-click to reset the zoom.)

Is the prize money set based on the perceived difficulty of the problem?

Here I used the competition duration as a surrogate for the sponsor’s perceived difficulty of the problem. The plot below of prize vs. duration does not show a trend, indicating that the prize is not related to the duration (if you ignore the Heritage Health Prize competition, which is the point at the far top right). It is possible that duration is not the right surrogate for problem difficulty, and so the previous conclusion may not be correct. Another hypothesis could be that the prize is set based on the estimated value of solving the problem, and not necessarily on how hard the problem is.

(You can access a zoomable plot with tooltips here. Use the left mouse button and drag to zoom. Right-click to reset the zoom.)

Which knowledge competitions are popular?

The bar graph of the number of teams in knowledge competitions indicates that the two most popular competitions for learning are “Titanic Machine Learning” and “Bike Sharing Demand”.
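A quick base R sketch of such a bar graph is below; the competition names beyond the two mentioned above, and all the team counts, are invented placeholders:

```r
# Hypothetical team counts for a few knowledge competitions
teams <- c("Titanic Machine Learning" = 2800,
           "Bike Sharing Demand"      = 1900,
           "Digit Recognizer"         = 700)

# Sort so the most popular competition comes first
teams <- sort(teams, decreasing = TRUE)

# Horizontal bars with upright labels are easier to read for long names
barplot(teams, horiz = TRUE, las = 1, xlab = "Number of teams")
```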

What is the best rank that a user has achieved in a country (among top 500 ranks)?

The googleVis geo chart below shows the best rank of a user by country.

(You can access a zoomable plot with tooltips here. Use the left mouse button and drag to zoom. Right-click to reset the zoom.)

How are the users (among top 500 ranks) distributed across countries?

The googleVis geo chart below shows the number of users by country.

(You can access a zoomable plot with tooltips here. Use the left mouse button and drag to zoom. Right-click to reset the zoom.)

I had some fun doing this post and I hope you have some fun reading it. Happy Holidays.

Mangalyaan is the spacecraft of the Indian Space Research Organization’s Mars Orbiter Mission that entered the orbit of Mars last week. There were several tweets on Twitter with the hashtag #Mangalyaan about it last week. I wanted to use R to explore those tweets. Tiger Analytics did an interesting post on this topic last year when Mangalyaan launched. I found their analysis to infer topics particularly interesting. I do hope they repeat their analysis with the latest tweets. My goals and methods of analysis here are much more basic. I wanted to do the following:

The full code and explanation are in the following location. I was able to extract about 1000 tweets spanning 4 days, from Sep 23, 2014 to Sep 26, 2014, and used them for the analysis below. All the analysis below should be viewed in the context that it is based on a small sample size. The word cloud of the frequent terms in the tweets is:
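The frequency table behind a word cloud like this can be built in base R; the toy tweets below are stand-ins for the real sample, and the final rendering step via the wordcloud package is an assumption about one way to draw it:

```r
# Toy tweet sample standing in for the ~1000 extracted tweets
tweets <- c("Mangalyaan enters Mars orbit",
            "ISRO Mangalyaan reaches Mars",
            "proud of ISRO Mangalyaan mission")

# Lowercase, split on whitespace, and count term frequencies
words <- tolower(unlist(strsplit(tweets, "\\s+")))
freq <- sort(table(words), decreasing = TRUE)

# wordcloud::wordcloud(names(freq), freq) would then render the cloud,
# typically after removing stop words like "of"
```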

Next, I used the R package topicmodels with the number of topics set to 5 (no particular reason) and got the following result for the top 10 words in each topic.
I did only basic preprocessing and ran the model with default parameters. Better preprocessing and model parameter tuning might give better results.

Applying hierarchical clustering to the frequent terms gives the following grouping:
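The core of this step is base R: a distance matrix over term vectors fed to hclust. The tiny term-document matrix below is made up; the real one comes from the tweet corpus:

```r
# Toy term-document matrix: rows are terms, columns are documents
tdm <- matrix(c(3, 0, 2, 1,
                0, 4, 1, 0,
                2, 1, 0, 3),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("mars", "isro", "orbit"), NULL))

# Cluster terms by Euclidean distance between their document profiles
hc <- hclust(dist(tdm), method = "ward.D")

plot(hc)  # dendrogram of the term groupings
```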

I found that the igraph package has some easy-to-use functions for community detection and plotting. Here the co-occurrence of words across tweets is used to construct a graph, and a community detection algorithm is applied to that graph. The results are plotted both as a dendrogram and as a graph plot.
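The co-occurrence matrix itself can be built in base R with a cross-product of the tweet-term incidence matrix; feeding it to igraph (e.g. graph_from_adjacency_matrix followed by a community detection function such as cluster_walktrap) is the assumed next step, and the toy tweets are placeholders:

```r
# Each element is the set of terms in one (toy) tweet
docs <- list(c("mars", "orbit"),
             c("mars", "isro"),
             c("isro", "launch"))
terms <- sort(unique(unlist(docs)))

# Incidence matrix: tweets x terms (1 if the term appears in the tweet)
inc <- t(sapply(docs, function(d) as.integer(terms %in% d)))
colnames(inc) <- terms

# Term x term co-occurrence counts; drop self-co-occurrence
cooc <- crossprod(inc)
diag(cooc) <- 0
```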

I wanted to use R to explore hotel review data. I chose to explore reviews for 3 hotels from Trip Advisor. First, I had to scrape the review data. I have described how I scraped the data here. I used the extracted review data and did the following exploratory analysis:
* Check ratings over time
* Check the frequent words in the top quotes for each review grouped by star rating
* Check if I can find any themes in reviews with simple k-means clustering
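The k-means step in the last bullet can be sketched in base R; the simulated term frequencies below stand in for the actual TripAdvisor review data, and the choice of k = 2 is illustrative:

```r
# Simulated review-by-term frequency matrix: two kinds of reviews
set.seed(1)
reviews <- rbind(matrix(rpois(30, 4), ncol = 3),   # frequent-term reviews
                 matrix(rpois(30, 1), ncol = 3))   # sparse-term reviews

# Scale the columns so no single term dominates, then cluster
km <- kmeans(scale(reviews), centers = 2, nstart = 20)

table(km$cluster)  # cluster sizes
```

Inspecting the cluster centers (km$centers) is one way to look for themes: terms with large positive center values characterize that cluster.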

I think my analysis was probably a bit simplistic, and I didn’t find anything non-obvious from this exploratory analysis. But it was still a fun exercise. In the future, I will explore how topic model packages work with this data.

I recently came across a short tutorial by Peter Norvig on natural language processing using some interesting examples. I found it very interesting and really liked Peter Norvig’s explanations, as well as the fact that it was all in an IPython notebook so somebody can work alongside it while reading the tutorial. I wanted to see how an R version would work, so this is my attempt at an R version of the first part of his tutorial, the spelling corrector. The knitted document is in the following location.
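A condensed base R sketch of the Norvig-style corrector is below. It only handles distance-1 edits (the full version also tries distance-2), and the tiny corpus stands in for the big.txt word counts used in the original:

```r
# Tiny stand-in corpus; the real corrector counts words from big.txt
corpus <- c("the", "the", "quick", "brown", "fox", "spelling", "spelling")
counts <- table(corpus)

# All strings one edit away: deletions, transpositions, replacements, insertions
edits1 <- function(word) {
  s <- strsplit(word, "")[[1]]
  n <- length(s)
  out <- character(0)
  for (i in seq_len(n))                       # deletions
    out <- c(out, paste(s[-i], collapse = ""))
  for (i in seq_len(n - 1))                   # transpositions
    out <- c(out, paste(c(s[seq_len(i - 1)], s[i + 1], s[i],
                          if (i + 1 < n) s[(i + 2):n]), collapse = ""))
  for (i in seq_len(n)) for (l in letters)    # replacements
    out <- c(out, paste(replace(s, i, l), collapse = ""))
  for (i in 0:n) for (l in letters)           # insertions
    out <- c(out, paste(append(s, l, after = i), collapse = ""))
  unique(out)
}

# Known word wins; else the most frequent known word one edit away;
# else give the word back unchanged
correct <- function(word, counts) {
  if (word %in% names(counts)) return(word)
  cands <- intersect(edits1(word), names(counts))
  if (length(cands) == 0) return(word)
  cands[which.max(counts[cands])]
}
```

For example, correct("teh", counts) finds "the" via a transposition, and correct("speling", counts) finds "spelling" via an insertion.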

I wanted to use R to explore how to access and visualize census data, home value data and school rating data using Indianapolis metro area as an example. My learning objectives here are to learn how to use R to:

Make choropleth maps at zip code level

Process data in XML or JSON format that is returned when using the APIs provided by websites (here I chose Zillow and education.com as examples)

The data that I wanted to overlay over the Indianapolis map was median age, median income, median home listing price/value per square foot, and school rating at the zip code level. I got the census data using the US Census data API, zip-level home demographic data using the Zillow demographics API, and school rating data using the education.com API. (Disclaimer: This is only a one-off use of the Zillow and education.com APIs. I believe I am in compliance with the terms of use of these APIs. However, if somebody notices any violation of the terms of use of these APIs, please bring it to my attention and I will remove this material.) I should also note here that there are several nice interactive tools and widgets on the web that provide the same information. My purpose here is just to play with R to extract and visualize this data.

Recently I was looking to buy a used car, and I was using the Edmunds website to estimate the true market value of a used car. It has a nice click-through interface that guides you to input make, model, year, etc. and gives an estimated value. While the interface is nice, it would be helpful to get a broader view across years, models, etc. Luckily, Edmunds also has a developer API, which is nicely documented. Here I am listing my basic attempts at using the developer API to get some data into R and generate exploratory plots of car value by year. To use this API, you need to get an API key (specifically, the vehicle API key is used here).
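A sketch of building one such API request in base R is below. The endpoint path and parameter names are assumptions from memory of the Edmunds docs (check the official documentation before using them), and EDMUNDS_KEY is a placeholder for your own vehicle API key:

```r
# Assumed Edmunds vehicle API endpoint shape; verify against the docs.
# EDMUNDS_KEY is a placeholder, not a real key.
make  <- "honda"
model <- "accord"
year  <- 2008

url <- sprintf(
  "https://api.edmunds.com/api/vehicle/v2/%s/%s/%d/styles?fmt=json&api_key=%s",
  make, model, year, "EDMUNDS_KEY")

# httr::GET(url) followed by httr::content() would then return the
# styles list to loop over when plotting TMV by year
```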

The plots below show the market value across styles for a given model/make of car by year. The y-axis in these plots is the Edmunds-estimated True Market Value for a typically equipped vehicle (I am not sure how they define it). The bars in the graph span the range of market value.

TMV for Honda Accord

TMV for Toyota Corolla

Edmunds also makes several adjustments for car condition, car mileage, and additional car options. The plot below shows the TMV adjustment for each option for a 2008 Toyota Corolla (LE 4dr Sedan (1.8L 4cyl 4A)).