
Tuesday, January 28, 2014

While the exact definition of "Data Scientist" continues to elude us, the job requirements seem to lean heavily on machine learning skills. They also include a wide range of other skills, from specific languages, frameworks and databases to data cleaning, web scraping, visualizations, mathematical modeling and subject matter expertise. (This breakdown will be the subject of a future post, as I was having some trouble with my web scraper ;))

So for the typical "Data Scientist" role, many organizations want PhD-level academic training plus an assortment of nuts-and-bolts programming or database skills. Most of these job requirements are a rich and complex mix that adds up to "can't find the right candidate" (aka Unicorn). So, as an extension to the Data Science Venn Diagram V2.0, I thought it would be helpful to clarify and make some important distinctions regarding Machine Learning skills.

Back in the 2002-2003 time frame, I spent a bunch of time trying to code my own Neural Networks. This was a very frustrating experience because bugs in these algorithms can be especially difficult to find and it took time away from what I really wanted to do, which is building applications using machine learning. So I decided back then to use well tested and fully debugged library algorithms over clunky home grown algorithms whenever possible. These days there are so many powerful and well tested ML libraries, why would anyone write one from scratch? The answer is, sometimes a new algorithm is needed.

First, some definitions will help clarify:

ML Algorithm: A well defined, mathematically based tool for learning from inputs. Typically found in ML libraries. Take the example of sorting algorithms: BubbleSort, HeapSort, InsertionSort, etc. As a software developer, you do not want or need to create a new type of sort. You should know which works best for your situation and use it. The same applies to Machine Learning: Random Forests, Support Vector Machines, Logistic Regression, Backprop Neural Networks etc. are all algorithms which are well known, have certain strengths and limitations, and are available in many ML libraries and languages. These are a bit more complicated than sorting, so there is more skill required to use them effectively.

ML Solution: An application which uses one or more ML Algorithms to solve a business problem for an organization (business, government etc).

ML Researcher/Scientist: PhDs are at the top of the heap. They have been trained to work on leading edge problems in Machine Learning, Robotics etc. These skills are hard won and are well suited for tackling problems with no known solution. When you have a new class of problems which requires insight and new mathematics to solve, you need an ML Researcher. When they solve a problem, a new ML Algorithm will likely emerge.

ML Engineer: A sharp software engineer with experience in building ML Solutions (or solving Kaggle problems). The ML Engineer's skills are different from the ML Researcher's. There is less abstract mathematics and more programming, database and business acumen involved. An ML Engineer analyzes the data available, the organizational objectives and the ML Algorithms known to operate on this type of problem and this type of data. You can't just feed any data into any ML Algorithm and expect a good result. Specialized skills are required in order to create high scoring ML Solutions. These include: Data Analysis, Algorithm Selection, Feature Engineering, Cross Validation, appropriate scoring and troubleshooting the solution.

Data Engineer: A software engineer with platform and language specific skills. The Data Engineer is a vital part of the ML Solution team. This person or group does the heavy lifting when it comes to building data driven systems. There are so many languages, databases, scripting tools and operating systems, each with its own set of quirks, secret incantations and performance gotchas. A Data Engineer needs to know a broad set of tools and be effective in getting the data extracted, scraped, cleaned, joined, merged and sliced for input to the ML Solution. Many of the skills needed to manage Big Data belong in the Data Engineer category.
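
To make the Cross Validation skill from the ML Engineer definition a little more concrete, here is a rough sketch in plain Python. The function names, the toy "majority label" model and the made-up data are all illustrative assumptions, not part of any particular library; in a real ML Solution a well tested library routine would normally do this job.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Split row indices 0..n_rows-1 into k shuffled, near-equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, labels, k, fit, score, seed=0):
    """Hold out each fold once, train on the rest, return one score per fold."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        model = fit([rows[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        scores.append(score(model,
                            [rows[i] for i in held_out],
                            [labels[i] for i in held_out]))
    return scores


# Hypothetical stand-in model: always predict the most common training label.
def fit_majority(train_rows, train_labels):
    return max(set(train_labels), key=train_labels.count)


def accuracy(model, test_rows, test_labels):
    return sum(1 for y in test_labels if y == model) / len(test_labels)


# Made-up data: ten rows, seven of class 0 and three of class 1.
rows = [[x] for x in range(10)]
labels = [0] * 7 + [1] * 3
fold_scores = cross_validate(rows, labels, k=5,
                             fit=fit_majority, score=accuracy)
print(fold_scores)
```

The point of the exercise is that every row is scored exactly once by a model that never saw it during training, which is what keeps the score honest.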

Sunday, January 26, 2014

Almost everyone would love to predict the Stock Market for obvious reasons. People have tried everything from Fundamental Analysis, Technical Analysis, and Sentiment Analysis to Moon Phases, Solar Storms and Astrology.

However, unless you are in a position to front run other people's trades, like High Frequency Trading, there is no such thing as a guaranteed profit in the markets. The problem with human stock analysis is that there is so much data and so many variables that it is easy for the average human to become overwhelmed, get sucked down the rabbit hole and continue to make sub-optimal choices.

Sounds like a job for Machine Learning, and there is no shortage of people and companies trying this as well. One major pitfall is that most ML algorithms do not work well with stock market type data. This results in a lot of people wasting a lot of time. But in order to share some of the concepts and get the conversation started, I am posting some of my findings regarding Financial and Stock Forecasting using Machine Learning.

I trained 8000 machine learning algorithms to develop a probabilistic future map of the stock market in the short term (5-30 days) and have compiled a list of the stocks most likely to bounce in this time frame. There is no single future prediction. Instead there is a large set of future probabilities which someone can use to evaluate their game plan and portfolio. My exact methods remain proprietary at this time (but I might consider institutional licensing).

Here are the "Stock Picks", based on each stock's individual trading behavior and how it closed on Friday (Jan 24, 2014):

Disclaimer: This is not trading or investing advice. It is simply the output of my ML system. If you lose money, do not come crying. Trade at your own risk!

Since the market got pummelled this week, there are a lot of stocks that look like 'buys' right now. But the overall (US) market is coming off a very prolonged euphoric period and it has not had a significant correction for over two years. So, it is possible that the current downswing is either a minor pullback a.k.a. "dip", or it is the start of a major correction.

Here are the charts. For the most part they look like a big sell-off in a larger uptrend. It is always interesting to see how the future unfolds, especially with respect to these predictions. Also, keep in mind, even if a stock does bounce, it could then run out of steam and drop again. Ah...life in the uncertainty zone ;).

Below is my take on the subject. The center is marked "Unicorn". This is a reference to the recent discussions in the press and blogosphere indicating that Data Scientists are as hard to find as unicorns. Finally the mindset is changing: a team of people with complementary skills is the right course of action for most data driven organizations. Certainly some individuals might possess Computer Science, Statistics and Subject Matter Expertise. They are just very hard to find. Many Data Scientist job descriptions don't reflect this reality and so these positions go unfilled for six months or more.

1. Rusty C language skills
2. Learn another language: Julia, which is supposed to be very fast
(still on my todo list).
3. Try Cython, a form of Python that "sort of" compiles to C.
4. What else was there???

PyPy got its start as a version of Python written in Python. At first,
this seemed kind of interesting for compiler people but not what I
needed. Then I learned that the PyPy team has been putting a lot of
effort into their JIT Compiler. A Just-In-Time (JIT) compiler converts
your code to machine language the first time it touches your code.
After that, it runs at machine speeds. The result is blazingly fast
Python! See http://speed.pypy.org/
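
One hedged way to see the JIT effect for yourself is to time a small, loop-heavy function and run the same file under both interpreters. The sketch below uses only the standard library; the function names and the loop size are arbitrary choices of mine, and no timings are claimed here since they depend entirely on your machine and interpreter versions.

```python
import time


def sum_of_squares(n):
    """Pure-Python hot loop: the kind of code a JIT compiler can speed up."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_call(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


result, seconds = time_call(sum_of_squares, 1000000)
print("result=%d elapsed=%.4f s" % (result, seconds))
```

Save it to a file and run it once with your regular Python and once with PyPy, then compare the elapsed times.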

There is a drawback: Many Machine Learning libraries do not run on it.
I had to remove all Pandas, Numpy, Scikit. So I broke my problem into
two steps: Feature generation in PyPy and Machine Learning in
Python/Pandas/SciKit. After that I was slicing and dicing
accelerometer readings like crazy. More importantly, I was iterating my
solution faster, which allowed me to finish 26th out of 633 teams (top 4%)!
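
Here is a minimal sketch of what the first step of such a two-step split could look like: feature generation using only the standard library (so it can run under PyPy), written out as CSV for the second step to load with Pandas. The specific features (mean, standard deviation, min and max of the acceleration magnitude per window), the window size, the column names and the sample readings are all assumptions made up for illustration; they are not the features I used in the competition.

```python
import csv
import io
import math


def window_features(readings, window_size):
    """Yield (mean, std, min, max) of acceleration magnitude
    for each non-overlapping window of (x, y, z) readings."""
    for start in range(0, len(readings) - window_size + 1, window_size):
        window = readings[start:start + window_size]
        mags = [math.sqrt(x * x + y * y + z * z) for x, y, z in window]
        mean = sum(mags) / len(mags)
        var = sum((m - mean) ** 2 for m in mags) / len(mags)
        yield mean, math.sqrt(var), min(mags), max(mags)


def write_features(readings, window_size, out_file):
    """Write one CSV row per window; return the number of rows written."""
    writer = csv.writer(out_file)
    writer.writerow(["mean_mag", "std_mag", "min_mag", "max_mag"])
    count = 0
    for row in window_features(readings, window_size):
        writer.writerow(row)
        count += 1
    return count


# Made-up (x, y, z) accelerometer readings, for illustration only.
readings = [(3.0, 4.0, 0.0), (0.0, 0.0, 5.0),
            (1.0, 2.0, 2.0), (0.0, 3.0, 4.0)]
buffer = io.StringIO()  # stand-in for a real output file
rows_written = write_features(readings, window_size=2, out_file=buffer)
print(buffer.getvalue())
```

The CSV file is the hand-off point: everything before it avoids compiled extension libraries, and everything after it is free to use them.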

Python is a great programming language. It has a clean syntax, tremendous user community support, and excellent machine learning libraries. Unfortunately it is SLOW! So, when the situation calls for it, I prefer to drop down to machine code to run the actual machine learning algorithm.

One fast and amazing Machine Learning tool that I have used on a number of projects is Vowpal Wabbit. It was developed by researchers at Yahoo! Research and later at Microsoft Research. It has support for many types of learning problems, automatically consumes/vectorizes text, can do recommendations, predictions and classifications (single and multi-class), supports namespaces and instance weighting, and the list goes on.

The problem with wrappers is that they don't always expose all the features you want to use. Vowpal has a lot of features. So, after a bit of hemming and hawing, I did a "slash and burn" then wrote what I needed. This is how I currently use Vowpal Wabbit with Python. Instead of a wrapper, I offer you code snippets which can be tailored to your specific needs.

This code assumes you know how to use Python and Pandas. It runs on Linux and uses the matrix factorization feature (recommendation engine) of Vowpal.

Performance: With over 43 million rows, it took about 16 minutes to generate the inputs in the Pandas DataFrame, but only 9 minutes to train with 20 passes. (I7-2600K)
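
For readers who have not seen Vowpal's text input format, here is a small sketch of what "generating the inputs" can look like for a recommendation problem. Each line carries a label followed by one namespace for the user and one for the item. The helper names, the namespace letters `u` and `i`, and the sample triples are my illustrative assumptions; the sketch uses plain tuples rather than a DataFrame so it stays self-contained.

```python
import io


def to_vw_line(rating, user_id, item_id):
    """Format one (rating, user, item) triple as a Vowpal Wabbit example.

    The label comes first, then one namespace per side of the interaction:
    'u' holds the user feature and 'i' holds the item feature.
    """
    return "{} |u {} |i {}".format(rating, user_id, item_id)


def write_vw_file(triples, out_file):
    """Write one example per line; return the number of lines written."""
    count = 0
    for rating, user_id, item_id in triples:
        out_file.write(to_vw_line(rating, user_id, item_id) + "\n")
        count += 1
    return count


# Made-up ratings, for illustration only.
triples = [(4, 196, 242), (3, 186, 302), (1, 22, 377)]
buffer = io.StringIO()  # stand-in for a real training file
lines_written = write_vw_file(triples, buffer)
print(buffer.getvalue())
```

Vowpal's own matrix factorization example pairs the two namespaces with `-q ui` and sets the number of latent factors with `--rank`; treat the exact options as something to tune for your data.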

Enjoy!

Steve Geringer

##########################################################################
# Here are the essential ingredients. You'll have to fill in the
# rest...;)
##########################################################################

import os
from time import asctime, time
import subprocess
import csv
import numpy as np
import pandas as pd
.
.
.
#############################################################
# Parameters and Globals
#############################################################
environmentDict = dict(os.environ,
                       LD_LIBRARY_PATH='/usr/local/lib')
# Hat Tip to shrikant-sharat for this secret incantation
# Note: only needed if you rebuilt vowpal and the new libvw.so is in
# /usr/local/lib
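
As a hedged illustration of how an environment dictionary like the one above is typically passed to a child process, here is a small self-contained sketch. It uses `subprocess.run`, which needs Python 3.5 or later, and it launches a harmless Python one-liner as a stand-in for the real command, since the actual Vowpal command line is not shown in this post. The helper names `build_environment` and `run_command` are hypothetical.

```python
import os
import subprocess
import sys


def build_environment(lib_dir="/usr/local/lib"):
    """Copy the current environment and point LD_LIBRARY_PATH at lib_dir."""
    return dict(os.environ, LD_LIBRARY_PATH=lib_dir)


def run_command(args, env):
    """Run a command with the given environment.

    Returns (exit code, combined stdout/stderr text).
    """
    completed = subprocess.run(args, env=env,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT,
                               universal_newlines=True)
    return completed.returncode, completed.stdout


environment = build_environment()

# Stand-in child process (assumption): it just echoes the variable back,
# which shows that the child really sees the modified environment.
exit_code, output = run_command(
    [sys.executable, "-c",
     "import os; print(os.environ['LD_LIBRARY_PATH'])"],
    environment)
print(exit_code, output.strip())
```

Swapping the stand-in argument list for your own training command is the part you have to fill in.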

I am a software consultant and have been involved with Machine Learning since
2002. A friend of mine and fellow Machine Learning enthusiast, Rohit
Sivaprasad of http://www.DataTau.com, suggested I start a blog to share some of my
ideas and tips with the data science community.