May 2016: Scripts of the Week

With several new datasets uploaded to Datasets this month, we saw a great number of exceptional scripts created. In this month's blog featuring the May 2016 Scripts of the Week, you'll hear about four that the team selected for their quality, insight, and analysis, including:

What motivated you to create it?

I was reviewing the data in the competition and feeling a bit stuck on how to approach it. I have a background in earth observation/remote sensing using satellite and aerial data, but the goals of this contest are pretty unusual and the imagery is different from what I prefer to work with (multi- or hyper-spectral imagery). Given that I didn't have a clear approach or purpose in mind, I decided to just explore the data iteratively as a brainstorming process.

My process has been pretty simple: I'm literally just setting one pomodoro aside here and there to play with the data in some interesting way I haven't yet, then saving the results. I opted to make it public because I have fairly limited time to spend and didn't want any of my brainstorming to vanish into the aether if I didn't end up in a place where I had a viable approach to tackling the competition problem. I know a lot of Kagglers aren't familiar with image data or are intimidated by it, so I thought it might give them a chance to get their feet wet. More motivated people with a little more time to tackle the problem could perhaps feed it into their manual labeling or feature engineering process.

What have you learned from the code/output?

I think it's likely a lot of what I've done may not generalize across the dataset, but the more promising approach I've included is using the HSV color space transform to extract information about shadows and variation in illumination. I still think trying to track illumination changes between images might be an important cue for an automated approach. Of course, sun position depends on a whole lot of other information (correct orientation and referencing of the image, influence of sensor angle, the period of changes in sun angle within a day or between days as described by the analemma, etc.), and with a typical sensor you usually already have that data in hand, so there aren't many methods out there for reverse engineering it that I'm aware of or that I've been able to dig up through basic research. That approach may just turn out not to be viable in the end.
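The HSV idea above can be sketched in a few lines. This is a minimal, self-contained illustration (not the author's actual script): the pixel values are synthetic and the brightness threshold is an invented placeholder, but it shows why HSV helps — shadowed surfaces keep roughly the same hue as their sunlit neighbours while dropping sharply in V (value/brightness).

```python
import colorsys

# A tiny synthetic RGB "image" (values in 0-1): three sunlit pixels of
# one hue, plus a darker pixel of the same hue standing in for shadow.
image = [
    [(0.8, 0.6, 0.4), (0.8, 0.6, 0.4)],   # sunlit ground
    [(0.8, 0.6, 0.4), (0.2, 0.15, 0.1)],  # bottom-right pixel is shadowed
]

def to_hsv(img):
    """Convert an RGB image to HSV pixel by pixel."""
    return [[colorsys.rgb_to_hsv(*px) for px in row] for row in img]

def shadow_mask(img, v_threshold=0.4):
    """Flag pixels whose V channel falls below a (hypothetical) threshold.

    Hue is largely preserved under shading, so thresholding V alone is a
    crude first cut at separating shadow from sunlit surface.
    """
    hsv = to_hsv(img)
    return [[v < v_threshold for (_h, _s, v) in row] for row in hsv]

mask = shadow_mask(image)
```

In a real pipeline the threshold would be estimated from the image histogram rather than hard-coded, and the mask compared between overlapping images to track illumination change.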

What other questions would you love to see answered or explored in this dataset?

The coregistration/mosaicking/stitching work is probably the most important other thing to tackle, as you want to compare differences between features that you can identify in multiple images (whether you're manually generating filters or extracting features like I am, or if you're taking a feature learning approach like in a convolutional neural network). There are a few scripts showing up where people are trying this out and I'm likely to add some material influenced by their approaches, or based on other research literature I come across.

What motivated you to create it?

I picked the CFPB dataset because it was fairly new and I wanted to be one of the first to take a crack at it. When looking at the data, I remembered a radio program where they talked about payday loans and whether they deserved their bad reputation. With that in mind, I decided to explore that part of the data to see what I could uncover.

What did you learn from the code/output?

I had not really worked with non-numerical data before, so this was a great opportunity to think about what I could do beyond time series and summary statistics. Looking into the content of the consumer complaints by searching for keywords was new for me.
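Keyword search over free-text complaints can be sketched very simply. The complaint strings and keywords below are invented for illustration (the real CFPB data has a "Consumer complaint narrative" column); the point is just the pattern of filtering text rows on a case-insensitive keyword set.

```python
# Invented complaint narratives standing in for the CFPB text column.
complaints = [
    "I took out a payday loan and was charged hidden fees",
    "My mortgage servicer lost my payment",
    "The payday lender rolled over my loan without consent",
    "Incorrect information on my credit report",
]

# Hypothetical keyword set for the payday-loan question.
keywords = {"payday", "rolled over"}

def matches(text, terms):
    """True if any keyword appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)

payday_complaints = [c for c in complaints if matches(c, keywords)]
```

From here, counting matches per company or per month gives the kind of volume comparison discussed above.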

As far as the actual results, it looks like payday loan users don't complain to the CFPB too much; whether that is because they don't have complaints or don't know where to lodge them, though, I can't say.

What other questions would you love to see answered or explored in this dataset?

I plan on looking at how individual companies respond to consumer complaints and whether there is any indication of improvement over time, in terms of a decreased volume of complaints for a specific issue/sub-issue.

D3.js is a popular and powerful JavaScript library for producing dynamic, interactive visualisations in web browsers. There are a number of excellent packages that seamlessly implement D3 in R. The purpose of this script is to demonstrate the power of some of the most popular packages, in merely a couple of lines of code, including:

The dataset is straightforward, with pretty much all dummy-like variables describing the tools used by researchers (for data analysis, manuscripts, etc.) in different disciplines. A natural question would be: do researchers from different disciplines have different preferences in their research tools? Hence the title of the script: Swordsman and their swords.
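The tool-preference question boils down to a cross-tabulation of discipline against the dummy tool columns. A minimal sketch, with invented rows and column names mimicking the dataset's layout (the real columns will differ):

```python
from collections import defaultdict

# Invented rows: one per researcher, one 0/1 entry per tool.
rows = [
    {"discipline": "Social Science", "Word": 1, "Excel": 1, "R": 0, "LaTeX": 0},
    {"discipline": "Physics",        "Word": 0, "Excel": 0, "R": 1, "LaTeX": 1},
    {"discipline": "Life Science",   "Word": 1, "Excel": 0, "R": 1, "LaTeX": 0},
]

def tool_counts_by_discipline(data):
    """Tally, per discipline, how many researchers tick each tool."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in data:
        for tool, used in row.items():
            if tool != "discipline" and used:
                counts[row["discipline"]][tool] += 1
    return counts

counts = tool_counts_by_discipline(rows)
```

These per-discipline tallies are exactly what feeds a grouped bar chart or the network chart described below.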

Inspired by the Kaggle script from Georgi, I also included a static network chart to illustrate the relative number of people using different tools in the dataset. It is not surprising to see how dominant Word and Excel are, especially for social science researchers. It is also interesting to see the popularity of LaTeX/MATLAB/GitHub among physicists and engineers, and Dryad/Figsdata/R among life scientists, etc.

Visualisation has an important role in data analysis because the human brain is much more comfortable reading shapes, colours, and sizes than boring tables. Interactivity from these packages has raised visualisation to the next level by adding a fourth dimension, and people are comfortable with this level of interactivity because it is already everywhere on the internet. A number of brilliant developers are working on bringing more of these htmlwidget-type tools to R and Python, which has made our lives a lot easier and much more fun. Thanks to them.

What motivated you to create it?

I was motivated by an assignment from a course on Graphs and Big Data, though the assignment was merely to identify a complex network from real life. I had been doing some data science for humanitarian aid and disaster relief (Nepal earthquake, Ebola virus) and wanted to explore the recent Syrian migration throughout Europe, but I found some excellent analysis on that already existed. So I looked up some datasets from Kaggle competitions; the ISIS Tweet data had just been posted and I thought it would be interesting to explore. It is especially suited to social network analysis.

What did you learn from the code/output?

What I learned from the code was the limitations that graph analytics has as far as scalability is concerned, as well as the depth of knowledge we can extract from Twitter and the inter-relationships of the data there.
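The core of such an analysis is building a mention graph from tweet text. This is a hedged, stdlib-only sketch (not the author's code): the tweets and usernames are invented, and the graph is a simple nested counter rather than a full graph library, but it shows the directed author-to-mention structure and a crude weighted in-degree centrality.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (author, text) pairs standing in for the tweet dataset.
tweets = [
    ("user_a", "Reporting from the ground @user_b @user_c"),
    ("user_b", "Replying to @user_a"),
    ("user_c", "Sharing @user_a's thread again @user_a"),
]

MENTION = re.compile(r"@(\w+)")

def build_mention_graph(rows):
    """Directed graph: edge author -> mentioned user, weighted by count."""
    graph = defaultdict(Counter)
    for author, text in rows:
        for mentioned in MENTION.findall(text):
            graph[author][mentioned] += 1
    return graph

graph = build_mention_graph(tweets)

# Weighted in-degree: who gets mentioned most is a crude centrality cue.
in_degree = Counter()
for author, targets in graph.items():
    for target, count in targets.items():
        in_degree[target] += count
```

Adding other node classes (tweet time, URLs) as discussed below would mean typing the nodes and adding further edge kinds to the same structure.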

What other questions would you love to see answered or explored in this dataset?

I want to generalize the model to include other classes of nodes, such as tweet time, URLs, and @mentions, and to make the model directional.