IPython

This was a hackathon + workshop conducted by Analytics Vidhya in which I took part and made it to the #1 on the leaderboard. The data set was straight-forward and quite clean with only a minor need for missing value treatment. This post will might be useful for people who want a walk-through on the steps involving data munging and developing machine-learned models.

The workshop ended with a basic hackathon with data given on age, education, working class, occupation, marital status and gender of individuals and one had to predict the income bracket of these individuals.

I’ve posted the data and my code and solutions in this GitHub repo. An IPython Notebook has also been shared.

I approached the problem first by attempting some feature engineering (other than missing value treatment) on the data, and then ran a basic logistic classifier and a random forest classifier. However it turned out that these models performed better without feature engineering, which shows the dataset was already quite clean and informative to begin with for this competition.

Here’s a quick example case for implementing one of the simplest of learning algorithms in any machine learning toolbox – Linear Regression. You can download the IPython / Jupyter notebook here so as to play around with the code and try things out yourself.

I’m doing a series of posts on scikit-learn. Its documentation is vast, so unless you’re willing to search for a needle in a haystack, you’re better off NOT jumping into the documentation right away. Instead, knowing chunks of code that do the job might help.

Like this:

Deep learning became a hot topic in machine learning in the last 3-4 years (see inset below) and recently, Google released TensorFlow (a Python based deep learning toolkit) as an open source project to bring deep learning to everyone.￼

Interest in the Google search term Deep Learning over time

If you have wanted to get your hands dirty with TensorFlow or needed more direction with that, here’s some good news – Google is offering an open MOOC on deep learning methods using TensorFlow here. This course has been developed with Vincent Vanhoucke, Principal Scientist at Google, and technical lead in the Google Brain team. However, this is an intermediate to advanced level course and assumes you have taken a first course in machine learning, or that you are at least familiar with supervised learning methods.

Google’s overall goal in designing this course is to provide the machine learning enthusiast a rapid and direct path to solving real and interesting problems with deep learning techniques.

I had downloaded Canopy at the insistence of the instructors of MIT’s introductory course on computer science using Python. That said, I rarely ever used it. I’ve all along been working on Python using a text editor and command line only. I also downloaded Anaconda and started working on IPython since I began working on a new machine learning MOOC offered by the University of Washington via Coursera. Anaconda is awesome! It has all the best scientific libraries and I love IPython compared to PyCharm or Canopy, which pale in comparison to IPython, especially if you’re using Python for Machine Learning.

Anyway, I was working on IPython, trying to importmatplotlib, when I got the following ImportError:

I noticed that the matplotlib library was trying to be accessed in Canopy’s Enthought directory. Since I never used or liked Canopy anyway, I decided to uninstall, bitch!

Step by step process of uninstalling Canopy from Linux:

1) From the Canopy preferences option in the Edit menu, mark off Canopy as your default Python (this step is not available on very early versions of Canopy).

5) Delete the file “locations.cfg” from each user’s Canopy configuration / preferences directory. For complete Canopy removal, delete this directory entirely; if you do so, the user will lose individual preferences such as fonts, bookmarks, and recent file list.

cd ~/.canopy
cd ..
rm -rf .canopy

6) If you are uninstalling completely, edit the following files to delete any lines which reference Canopy (usually, the Canopy-related lines will have been commented out by step 1 but on some system configurations the lines might remain):

I have finally embarked on my first machine learning MOOC / Specialization. I love Python, and this course uses Python as the language of choice. Also, the instructors assert that Python is widely used in industry, and is becoming the de facto language for data science in industry. They use IPython Notebook in their assignments and videos.

The specialization offered by the University of Washington consists of 5 courses and a capstone project spread across about 8 months (September through April). The specialization’s first iteration kicked off yesterday.