Our Method and where we will be getting our Data

Getting the data you need for your testing can prove quite daunting. We're entering the age of big data, and, just as you want data, so does everyone else.

Data is a commodity with a market value. If you've never received a quote for something like Twitter data, you'll most likely be astonished to find out how much it costs. Do you want the firehose? Hope you have millions. Streaming 100K tweets a day? Tens of thousands of dollars a month.

The above is only true, of course, if someone has the data to sell you in the first place. You'll likely find that a lot of the data you seek doesn't even exist. Even if it does, it may not exist in a form you can use. Even if it is nicely organized, is it all in numerical form? Is it normalized?

It can be a massive pain. For us, our question is:

Can we use machine learning to analyze public company (stock) fundamentals (things like price/book ratio, P/E ratio, debt/equity ratio, and so on), and then classify the stocks as either out-performers compared to the market (labeled as 1s) or under-performers (labeled as 0s)?

With this question, we need fundamental company data, and we need it spanning years. You may find that the data you want is simply not easy to obtain: much of it is only available online, not in a format that can be easily downloaded and used. We're going to simulate that scenario, only without requiring you to actually parse from some web server.
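The labeling rule from the question above can be sketched in a few lines. This is a hypothetical helper, not the series' actual code (the real labeling logic comes in the later labeling tutorials), and it assumes we already have percent-change figures for the stock and the index:

```python
# Hypothetical sketch of the labeling rule: a stock is an out-performer (1)
# if its return beats the S&P 500's return over the same period,
# otherwise it is an under-performer (0).

def label_performance(stock_pct_change, sp500_pct_change):
    """Return 1 if the stock out-performed the index, else 0."""
    return 1 if stock_pct_change > sp500_pct_change else 0

# Example: stock up 12% while the S&P 500 is up 8% over the same window
print(label_performance(12.0, 8.0))  # 1
print(label_performance(3.0, 8.0))   # 0
```

The later tutorials refine this comparison (for example, requiring out-performance by some threshold rather than any margin at all).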

The download for the data is: , which contains over a decade's worth of S&P 500 company fundamentals.

This data is the straight HTML source code for the S&P 500 index of companies, covering a bit over a decade, saved from Yahoo Finance.

Another place to find this kind of data is the SEC's EDGAR system. To navigate the SEC.gov website, go to "company filings" near the top right, then use the "fast search" by typing the company's ticker symbol, like AAPL for Apple. Examples of forms you may be interested in here are the 10-K and 10-Q forms: the 10-K is the annual report, and the 10-Q is a quarterly report.
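Those same filing searches can be reached programmatically. As a rough sketch, the following builds the URL for EDGAR's public company-browse endpoint; the parameter names here are based on that endpoint's long-standing query string, but you should verify them against the SEC's current documentation before relying on them:

```python
from urllib.parse import urlencode

def edgar_filing_search_url(ticker, form_type):
    """Build an EDGAR company-filing search URL for a ticker and a form type
    (e.g. '10-K' for the annual report, '10-Q' for the quarterly report)."""
    base = "https://www.sec.gov/cgi-bin/browse-edgar"
    params = {
        "action": "getcompany",  # browse a single company's filings
        "CIK": ticker,           # EDGAR accepts ticker symbols here too
        "type": form_type,       # filter to one form type
        "count": 40,             # results per page
    }
    return base + "?" + urlencode(params)

print(edgar_filing_search_url("AAPL", "10-K"))
```

Opening the printed URL in a browser shows Apple's 10-K filings; we'll look at EDGAR-specific options later in the series.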

Yahoo Finance has a bunch of nicely organized data points all in a table. This isn't ideal for us, but we can work with it. It turns out there are some options for connecting to EDGAR via an API, so later we will cover using EDGAR specifically.
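Since the dataset is raw HTML source, the parsing tutorials coming up pull individual fundamentals out of it with plain string splitting. As a minimal illustration (the snippet below is made up for the example; real Yahoo Finance markup differs and has changed over time):

```python
# Made-up fragment of saved Yahoo Finance HTML source for illustration.
source = 'Total Debt/Equity (mrq):</td><td class="yfnc_tabledata1">35.60</td>'

# Split on the text immediately before the value, then cut at the closing tag.
value = source.split(
    'Total Debt/Equity (mrq):</td><td class="yfnc_tabledata1">'
)[1].split('</td>')[0]

print(value)  # 35.60
```

This approach is brittle (any change to the surrounding markup breaks it), which is part of why the series later moves to more structured data sources.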

Once you download the data, extract the files. The structure is:

intraQuarter
-_AnnualEarnings
--stock files (organized by YYYYMMDDHHMMSS.html)
-_KeyStats
--stock files (organized by YYYYMMDDHHMMSS.html)
-_QuarterlyEarnings
--stock files (organized by YYYYMMDDHHMMSS.html)
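Once extracted, you can sanity-check the structure by walking one of the subdirectories. A minimal sketch, assuming each ticker has its own folder of timestamped HTML snapshots as described above (`list_snapshots` is a hypothetical helper name):

```python
import os

def list_snapshots(path):
    """Map each ticker directory under `path` to its sorted list of
    timestamped snapshot files (YYYYMMDDHHMMSS.html)."""
    snapshots = {}
    for ticker in sorted(os.listdir(path)):
        ticker_dir = os.path.join(path, ticker)
        if os.path.isdir(ticker_dir):
            snapshots[ticker] = sorted(os.listdir(ticker_dir))
    return snapshots

# Example (once intraQuarter is extracted next to this script):
# snapshots = list_snapshots("intraQuarter/_KeyStats")
# print(len(snapshots), "tickers found")
```

Sorting the file names works as a chronological sort here because the timestamp format puts the most significant units first.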
