<h1>Columbia Applied Data Science</h1>
<p>Ian Langmore, Daniel Krasner, Chang She (applied.data.science@gmail.com)</p>
<h1>Notes on higher performance Python (2013-06-04)</h1>
<p>The newest version of the <a href="/appdatasci.pdf">lecture notes</a> includes a section on high(er) performance Python.</p>
<p>Sections include</p>
<ul>
<li>Memory hierarchy</li>
<li>Parallelism</li>
<li>Profiling</li>
<li>Standard Python rules of thumb</li>
<li>For loops versus BLAS</li>
<li>Stream processing of text</li>
<li>Multiprocessing</li>
</ul>
<h1>Profiling and performance basics (2013-05-12)</h1>
<h2>Note</h2>
<p>The newest version of the <a href="/appdatasci.pdf">lecture notes</a> includes a section on high(er) performance Python. Read that rather than this short post.</p>
<h2>Original post</h2>
<p>Python code can be very slow or very fast. For loops are slower than list comprehensions, which in turn are much slower than numpy calls or built-in Python functions. The latter two use optimized Fortran and C libraries. So the first rule of thumb is: <strong>whenever you find yourself writing a for loop or list comprehension, check whether a built-in Python or numpy function does the same thing</strong>.</p>
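<p>As an illustration (the array and numbers below are made up for the example), compare a hand-written loop with the numpy built-in:</p>

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)

def loop_sum(arr):
    # Every element passes through the Python interpreter.
    total = 0.0
    for val in arr:
        total += val
    return total

# np.sum dispatches to optimized C code; on arrays this size it is
# typically orders of magnitude faster than the loop above.
assert loop_sum(x) == np.sum(x)
```

<p>Timing both (e.g. with <code>%timeit</code> in IPython) makes the gap obvious.</p>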
<p>The above rule always holds, since built-in functions lead to simpler code (remember the importance of simplicity). However, there are often other optimizations that lead to slightly harder-to-read or more complicated code. To address this, first consider the following quote, credited to Donald Knuth: &quot;We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.&quot; The solution is to <em>profile</em> your code first to determine where the slow spots are, and then re-write only the parts that are slowing your code down.</p>
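<p>Before rewriting anything, a coarse first pass can come from the standard library's <code>cProfile</code>. A small sketch with made-up functions:</p>

```python
import cProfile

def slow_part():
    # Deliberately heavy: this should dominate the profile.
    return sum(i * i for i in range(200_000))

def fast_part():
    return 42

def main():
    slow_part()
    fast_part()

# Prints a per-function breakdown of call counts and time,
# showing where re-writing effort should go.
profiler = cProfile.Profile()
profiler.runcall(main)
profiler.print_stats()
```

<p>For a finer, line-by-line breakdown, use <em>line_profiler</em> as described below.</p>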
<p>To profile scientific code, you need a line-by-line readout of the time taken in different function calls in your code. This can be had by using the <em>line_profiler</em> in conjunction with the <em>kernprof</em> script. To install, simply type </p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">pip install line_profiler
</code></pre></div>
<p>Then, at the top of a <strong>function</strong> that you want to profile (you can only profile functions), put <code>@profile</code>. For example:</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">@profile
def myfun(x):
    y = 2 * x
    return y
</code></pre></div>
<p>Then you need some way to call <code>myfun</code> from the command line. This could be for example a script <code>run_myfun.py</code>, which could be as simple as:</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">from mymodule import myfun
myfun(10)
</code></pre></div>
<p>Then, at the command line, type</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">kernprof.py -l run_myfun.py
</code></pre></div>
<p>If you are using <em>anaconda</em> and have installed <em>line_profiler</em>, then <em>kernprof.py</em> will be in your <em>PATH</em>, and the above line will work. It will produce the file <em>run_myfun.py.lprof</em>, which is the profiler output. You need to use the module <em>line_profiler</em> to read it. To do this, type</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">python -m line_profiler run_myfun.py.lprof | less
</code></pre></div>
<p>You should see a line-by-line breakdown of time taken to run your code.</p>
<h1>Announcements: May 6 (2013-04-30)</h1>
<ul>
<li>A new version of the <a href="/appdatasci.pdf">lecture notes</a> has been posted.</li>
</ul>
<h1>Homework 08: Stackoverflow questions (2013-04-29)</h1>
<p><strong>Due:</strong> May 13, in class presentation during the final exam slot (7:10 - 11pm). Also, a pdf of your slides must be emailed to <a href="mailto:applied.data.science@gmail.com">applied.data.science@gmail.com</a> before 7:00pm.</p>
<p>You will use logistic regression to predict whether a Stackoverflow question will be closed or not. This assignment is similar to this <a href="http://www.kaggle.com/c/predict-closed-questions-on-stack-overflow">Kaggle competition</a>.</p>
<hr>
<h2>Guidelines</h2>
<hr>
<h3>General</h3>
<ul>
<li>You must use logistic regression to give the probability that a Stackoverflow question will be closed.</li>
<li>For modeling, you can use any existing Python (not R) packages such as <em>statsmodels</em> or <em>sklearn</em>. You can use any Unix utility.</li>
<li>You must attempt to use both the numeric data (e.g. <em>ReputationAtPostCreation</em>) as well as the text data <em>Title</em> and <em>BodyMarkdown</em>.</li>
<li>For the test data, you must report:
<ul>
<li>The ROC AUC (Area Under the Curve)</li>
<li>How well your <em>predicted average closed rate</em> matches <em>reality</em> for users in the bottom/middle/top third in terms of reputation</li>
</ul></li>
<li>You must build a classifier that uses your logistic regression model. Pick a cutoff that makes sense for this problem and explain why you chose it.</li>
<li>You must also build a classifier that uses the exact same variables as your logistic classifier, but uses some other technique such as a <em>random forest</em>, <em>SVM</em>, or <em>nearest neighbors</em>. You must compare this classifier with the logistic classifier and explain which worked better and why.</li>
</ul>
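<p>A minimal sketch of the logistic-probability-plus-cutoff idea, using synthetic stand-in data (the features, labels, and the 0.3 cutoff are all invented for illustration; justify your own cutoff):</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X = rng.randn(500, 3)                                  # stand-in numeric features
y = (X[:, 0] + 0.5 * rng.randn(500) > 0).astype(int)   # stand-in closed/open labels

model = LogisticRegression().fit(X, y)
probs = model.predict_proba(X)[:, 1]   # P(question is closed)

auc = roc_auc_score(y, probs)

# Closed questions are relatively rare, so a cutoff below the default
# 0.5 may make sense for this problem; 0.3 is purely illustrative.
cutoff = 0.3
predicted_closed = (probs >= cutoff).astype(int)
```

<p>Lowering the cutoff trades precision for recall on the closed class; explain that trade-off when you justify your choice.</p>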
<h3>Presentation</h3>
<ul>
<li>20-minute presentation, 10 minutes of questions. At least two group members must talk.</li>
<li>Your slides must be in pdf format</li>
<li>Email your slides to <a href="mailto:applied.data.science@gmail.com">applied.data.science@gmail.com</a> before 7pm on the day of the final. Put your group name in the title of the pdf.</li>
<li>Your presentation should describe why you chose to keep/create/throw-away variables</li>
<li>Your presentation should describe how you evaluated your model, the results of the evaluation, and why this evaluation was or was not sufficient</li>
<li>Your presentation should not describe your data-munging. EDA should be described only insomuch as it relates to the above tasks.</li>
</ul>
<h3>Data</h3>
<ul>
<li>Your training data should be the file named <em>train</em> taken from <a href="http://www.kaggle.com/c/predict-closed-questions-on-stack-overflow/data">this website</a>.</li>
<li>Your test data should be current data that you obtain by using the <a href="https://api.stackexchange.com/">Stackoverflow API</a>. Note that you can get lots and lots of variables using the API. Only get the ones that are also in the training set. There are no hard requirements on the amount of test data you must obtain. You must explain why you used the number of samples that you did, and why it makes the &quot;prediction vs. reality&quot; test statistically significant. You are permitted to use your test set as a cross validation set (usually this is not good practice, but (i) there is no way to stop you from doing this, and (ii) this will give you experience with out-of-time errors).</li>
</ul>
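<p>One hedged sketch of pulling test data: the Stack Exchange API (version 2.2) serves question data as JSON, and a request is just a URL with encoded parameters. The parameter names below should be checked against the API documentation before relying on them; note the API gzip-compresses its responses.</p>

```python
import gzip
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def questions_url(page=1, pagesize=100):
    # Build the request URL; every search parameter is just URL-encoded.
    params = {
        "site": "stackoverflow",
        "order": "desc",
        "sort": "creation",
        "page": page,
        "pagesize": pagesize,
    }
    return "https://api.stackexchange.com/2.2/questions?" + urlencode(params)

def fetch_questions(page=1, pagesize=100):
    # Network call; the API compresses responses, hence the decompress step.
    with urlopen(questions_url(page, pagesize)) as resp:
        return json.loads(gzip.decompress(resp.read()))["items"]
```

<p>Loop over pages until you have enough samples to make the &quot;prediction vs. reality&quot; comparison statistically meaningful.</p>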
<hr>
<h2>About the starting repo</h2>
<hr>
<p>This is meant to give you a decent starting point. You can modify it as you wish.</p>
<h3>Dependencies</h3>
<p>You will probably use code from previous homeworks, especially <em>cut.py</em> and <em>subsample.py</em>.</p>
<h3>Directories</h3>
<h4>data</h4>
<ul>
<li>Don&#39;t version data.</li>
<li>To avoid excess sharing of processed data (which changes often), it is preferable to share raw data and the scripts and notebooks that transform <em>raw</em> into <em>processed</em>.</li>
<li>Contents of any <em>raw</em> folder should never be modified or deleted. This way, your script will create the same output as everyone else&#39;s script.</li>
<li>Shell scripts and notebooks should assume the existence of the <strong>local</strong> folders <code>data/raw</code> and <code>data/processed</code>. They already exist in the repo.</li>
</ul>
<h4>Notebooks</h4>
<p>For ipython notebooks. Put your name in the notebook name to avoid naming conflicts.</p>
<h4>src</h4>
<p>Source code.</p>
<h4>tests</h4>
<p>Unit and integration tests. Add these if you want.</p>
<h4>scripts</h4>
<p>Shell scripts.</p>
<h1>Final exam (2013-04-29)</h1>
<p>We estimate that the May 6 final exam will have:</p>
<ul>
<li>1 git question</li>
<li>2-3 unix (incl. regex)</li>
<li>2 nltk</li>
<li>2 dataflow (read/write/IO/stdin/stdout/API)</li>
<li>2 numpy/pandas</li>
<li>1 linear</li>
<li>2-3 logistic</li>
<li>1 naive bayes</li>
<li>1 decision trees/random forest</li>
<li>2 ROC/R2/PseudoR2</li>
<li>1 nonlinear optimization</li>
</ul>
<h1>Homework 07: Hints (2013-04-28)</h1>
<ul>
<li>Exercise 6.3.1. Here I&#39;m looking for you to say how mislabeled data can be re-phrased as an error in your model. There are probably many correct answers.</li>
<li>Exercise 6.5.2. The key point (that I did not explicitly state...sorry!) is that the model is trained with normal, non-truncated linear regression.</li>
<li>Exercise 6.5.3. In part 1, assume the epsilon are iid.</li>
</ul>
<h1>Starting projects that involve lots of file I/O (2013-04-26)</h1>
<p>Often I work on a project where the general goal is:</p>
<ol>
<li>Read lots of files from disk</li>
<li>Modify and extract information from the files</li>
<li>Write results to disk</li>
</ol>
<p>Steps 1 and 3 provide the <em>interface</em> (in this case, the plumbing that talks to the OS and the disk) and step 2 is the <em>implementation</em> (the logic that you want to implement). The point of this post is that the interface should be separated from the implementation, because the two tend to change at different times. To illustrate, imagine you write the following script:</p>
<div class="highlight"><pre><code class="python language-python" data-lang="python"><span class="n">infilename</span> <span class="o">=</span> <span class="s">&#39;data/smallfile.csv&#39;</span>
<span class="n">outfilename</span> <span class="o">=</span> <span class="s">&#39;data/my_outfile.csv&#39;</span>
<span class="n">modify_and_write</span><span class="p">(</span><span class="n">infilename</span><span class="p">,</span> <span class="n">outfilename</span><span class="p">)</span>
</code></pre></div>
<p>This would work fine during an initial development phase where you want to test your script on one single file. It however doesn&#39;t work if you want to modify many files. You could change this with:</p>
<div class="highlight"><pre><code class="python language-python" data-lang="python"><span class="n">indir</span> <span class="o">=</span> <span class="s">&#39;data/&#39;</span>
<span class="n">outfilename</span> <span class="o">=</span> <span class="s">&#39;data/my_outfile.csv&#39;</span>
<span class="k">for</span> <span class="n">infilename</span> <span class="ow">in</span> <span class="n">get_filenames</span><span class="p">(</span><span class="n">indir</span><span class="p">):</span>
    <span class="n">modify_and_write</span><span class="p">(</span><span class="n">infilename</span><span class="p">,</span> <span class="n">outfilename</span><span class="p">,</span> <span class="n">append</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div>
<p>This solves the first problem, but what if you want to read from standard input and/or write to standard output (this would be helpful since then you could tie <code>modify_and_write</code> together with other utilities)? To do this, you could pass open <em>file objects</em> rather than file names.</p>
<div class="highlight"><pre><code class="python language-python" data-lang="python"># Use with hardcoded files in a script. Open the output once so each
# input file's results accumulate, and pass open file objects along.
indir = 'data/'
outfilename = 'data/my_outfile.csv'
with open(outfilename, 'w') as g:
    for infilename in get_filenames(indir):
        with open(infilename, 'r') as f:
            modify_and_write(f, g)
# Use with stdin/stdout as part of a larger program.
modify_and_write(sys.stdin, sys.stdout)
</code></pre></div>
<p>Here we have pushed the file-opening part of the interface away from the modification part. This allows us to tie <code>modify_and_write</code> together with other programs or use it by itself. An example can be found <a href="https://github.com/langmore/utils/blob/master/src/generic_filter.py">here</a>.</p>
<p>This sort of setup is good if you know ahead of time that you will be reading/writing from files or stdin/stdout only. Although this is a good way to tie programs together, unix pipelines can be restrictive. Suppose all the files are small. Then it is possible to read them all in at once. In this case you can write:</p>
<div class="highlight"><pre><code class="python language-python" data-lang="python"><span class="c"># Simple script to write as you develop modify_lines</span>
<span class="n">infilename</span> <span class="o">=</span> <span class="s">&#39;data/smallfile.csv&#39;</span>
<span class="n">outfilename</span> <span class="o">=</span> <span class="s">&#39;data/my_outfile.csv&#39;</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">infilename</span><span class="p">,</span> <span class="s">&#39;r&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
    <span class="n">lines</span> <span class="o">=</span> <span class="n">f</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="n">newlines</span> <span class="o">=</span> <span class="n">modify_lines</span><span class="p">(</span><span class="n">lines</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">outfilename</span><span class="p">,</span> <span class="s">&#39;w&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">g</span><span class="p">:</span>
    <span class="n">g</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">newlines</span><span class="p">)</span>
</code></pre></div>
<p>Above, <code>modify_lines</code> takes in the bare minimum that it needs in order to modify the lines in the file: the string returned by <code>f.read()</code>. Later, when we decide exactly how <code>modify_lines</code> will be used, we can build the interface. If that interface changes over time, that is fine, because the implementation (<code>modify_lines</code>) doesn&#39;t need to change. For example, we can decide to read <code>lines</code> from stdin, or a file, or another function.</p>
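<p>For concreteness, here is one hypothetical <code>modify_lines</code> (the transformation itself is invented): a pure string-to-string function, so every interface above can reuse it unchanged.</p>

```python
import sys

def modify_lines(text):
    # Example transformation: upper-case every line, keeping newlines.
    return "".join(line.upper() for line in text.splitlines(True))

# The same implementation serves any interface:
result = modify_lines("a,b\nc,d\n")                  # from a string literal
# modify_lines(sys.stdin.read())                     # from stdin
# modify_lines(open('data/smallfile.csv').read())    # from a file
```
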
<h1>Announcements: April 29 (2013-04-24)</h1>
<ul>
<li>A couple corrections were made on April 24 at 8pm to exercise 6.5.3.</li>
<li><a href="/Homework/2013/04/28/homework-07-hints/">Hints</a> were given for HW 7</li>
<li>See <a href="/Exams/2013/04/29/final-exam/">this</a> estimate of material on the final exam (May 6)</li>
<li>Homework 8 has been <a href="/Homework/2013/04/29/homework-08-stackoverflow-questions/">posted</a>. Team assignments will be mailed out soon.</li>
</ul>
<h1>Announcements: April 24 (2013-04-22)</h1>
<ul>
<li>The written traditional final exam will take place on the last day of normal class, May 6. It will be worth the same as a homework. Details to follow.</li>
<li>During the final exam slot, May 13, 7:10 - 11:00 pm, you will be doing presentations as part of homework 8. Details to follow.</li>
<li>A new version of the <a href="/appdatasci.pdf">lecture notes</a> has been posted. A small addition to the end of section 6.5 was made.</li>
</ul>
<h1>Homework 07 (2013-04-21)</h1>
<p><strong>Due:</strong> Monday April 29. Hand in your write-up in class as a written or printed piece of paper.</p>
<p><strong>What it is:</strong>
Do every exercise in the logistic regression chapter of the <a href="/appdatasci.pdf">lecture notes</a>. This is not group work; every person must turn in their own solutions. You are, however, allowed to work with others.</p>
<h1>Announcements: April 22 (2013-04-18)</h1>
<ul>
<li>Some updates were made to the <a href="/appdatasci.pdf">lecture notes</a>. In particular, the theorem on L1 variable selection was re-worded and the proof fixed.</li>
<li><a href="/Homework/2013/04/21/homework-07/">Homework 7</a> has been posted.</li>
</ul>
<h1>Announcements April 17 (2013-04-17)</h1>
<ul>
<li>The logistic regression <a href="/appdatasci.pdf">lecture notes</a> have been posted.</li>
</ul>
<h1>Crash Course on APIs, pandas/statsmodels timeseries API (2013-04-11)</h1>
<p>The slides for the Crash Course on APIs, the Github API example notebook, and the pandas timeseries API notebook all live in the public repository <code>git@github.com:columbia-applied-data-science/lecture_timeseries.git</code>.</p>
<p>The slides are in PDF format, and you can see the static versions of the notebooks here:</p>
<p>Pandas Timeseries:</p>
<p>http://nbviewer.ipython.org/urls/raw.github.com/columbia-applied-data-science/lecture_timeseries/master/time%2520series%2520python.ipynb</p>
<p>Github API:</p>
<p>http://nbviewer.ipython.org/urls/raw.github.com/columbia-applied-data-science/lecture_timeseries/master/Github%2520API.ipynb</p>
<p>You can find the GitHub API reference here: http://developer.github.com/v3/</p>
<p>Note that web APIs live independently of the programming language; your Python code just needs to construct the right URL that encodes all the search parameters. In addition, for many web APIs, you must go through an authentication step in order to obtain certain types of data.</p>
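<p>A small sketch of that point, using the GitHub search endpoint (the token header is a common authentication pattern; check the v3 docs, and note the token value below is a placeholder):</p>

```python
from urllib.parse import urlencode
from urllib.request import Request

# The whole "API call" is just a URL with encoded search parameters.
params = {"q": "language:python", "sort": "stars"}
url = "https://api.github.com/search/repositories?" + urlencode(params)

# Authenticated requests attach credentials, e.g. as a header.
req = Request(url, headers={"Authorization": "token YOUR_TOKEN_HERE"})
```
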
<h1>Notes on your GSS models (2013-03-22)</h1>
<p>Here are some general comments that applied to many people&#39;s homework.</p>
<ul>
<li>The best presentations told a &quot;story.&quot; They told the steps you used, as well as the results and why the results were good or bad.</li>
<li>If the direct inversion method fails on a small data set such as this, then there is often a problem with your data. E.g. you have linearly dependent (i.e. redundant) variables. The best approach is to figure out why there are issues and then fix them.</li>
<li>The pandas function <code>get_dummies</code> can be used to get indicators for categories, e.g. <em>is_married</em>.</li>
<li>I didn&#39;t see anyone building new variables from (nonlinear) combinations of more than one old variable. That would have been nice.</li>
<li>Many of the NaN values were follow-up questions that could have been used to build new variables. For example, if someone answers <em>yes</em> to &quot;have you ever been a smoker&quot;, then they also get to answer the follow-up, &quot;have you ever tried to quit smoking.&quot; This could be used to create two new variables, <em>smoking_tried_quitting</em> and <em>smoking_never_tried_quitting</em>. Or, you could figure that people who never tried to quit were more severe smokers, and therefore they get a 2, people who tried quitting get a 1, and people who never smoked get a 0. This way you create one single new variable with three levels.</li>
</ul>
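<p>A sketch of that last three-level encoding, with invented column names and toy rows:</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ever_smoked":   ["yes", "yes", "no", "yes"],
    "tried_to_quit": ["yes", "no", np.nan, "no"],
})

def smoking_level(row):
    # 0 = never smoked, 1 = tried to quit, 2 = never tried to quit.
    if row["ever_smoked"] == "no":
        return 0
    return 1 if row["tried_to_quit"] == "yes" else 2

df["smoking_level"] = df.apply(smoking_level, axis=1)
```
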
<h1>Announcements March 13 (2013-03-12)</h1>
<ol>
<li>There was a bug in the cross validator module from HW 3. In the docstring of <code>cross_validator._get_xy_traincv()</code> I had switched the usage of the &quot;cv set&quot; and the &quot;training set.&quot; The correct docstring reads:</li>
</ol>
<div class="highlight"><pre><code class="text language-text" data-lang="text">Returns slices of X and Y used for training and cv. The cv
set should be e.g.: X[istart: istop, :], and the training set should
be everything else.
</code></pre></div>
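<p>In code, the corrected convention looks like this (the array and slice indices are invented for illustration):</p>

```python
import numpy as np

X = np.arange(20).reshape(10, 2)
istart, istop = 3, 6

# The cv set is the contiguous slice...
X_cv = X[istart:istop, :]
# ...and the training set is everything else.
X_train = np.vstack([X[:istart, :], X[istop:, :]])
```
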
<p>The updated unit tests are <a href="/misc/testlinear.py">here</a></p>
<h1>Announcements March 11 (2013-03-06)</h1>
<ol>
<li>For HW 4, email a pdf presentation to the TA, and be prepared to present in class. Do both of these on March 13. Nothing else is due at any time.</li>
<li>Suggestions for your presentation:</li>
</ol>
<ul>
<li>List the variables your models use, along with the values of the corresponding coefficients. Do the coefficient signs make sense?</li>
<li>Show your error and how you evaluated error. Remember that you must train/cross-validate using 2006 data, and then test (i.e. measure your error) using the 2010 data.</li>
</ul>
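<p>A minimal sketch of that evaluation discipline, with synthetic stand-ins for the 2006 and 2010 data (the coefficients and noise level are made up):</p>

```python
import numpy as np

rng = np.random.RandomState(0)
beta_true = np.array([1.0, -2.0, 0.5])

X06 = rng.randn(200, 3)                       # stand-in for 2006 features
y06 = X06 @ beta_true + 0.1 * rng.randn(200)
X10 = rng.randn(100, 3)                       # stand-in for 2010 features
y10 = X10 @ beta_true + 0.1 * rng.randn(100)

# Fit coefficients on the 2006 data only...
beta, *_ = np.linalg.lstsq(X06, y06, rcond=None)
# ...then measure error on the 2010 data, which the fit never saw.
test_mse = np.mean((X10 @ beta - y10) ** 2)
```
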
<ol start="3">
<li>I noticed that there are quite a few variables that are actually the same as income (up to some constant). If you notice this, then it&#39;s ok to point it out. However, please spend your time building models that don&#39;t use them.</li>
<li>Tuesday&#39;s office hours are moved to Wednesday 11-12.</li>
</ol>
<h1>IDSE Symposium Call for Student Volunteers (2013-03-01)</h1>
<p>Columbia&#39;s Institute for Data Science and Engineering has an upcoming <a href="http://idse.columbia.edu/institute-data-sciences-and-engineering-symposium">symposium</a>. It is an &quot;invite only&quot; event and doing some grunt work may be your only way in. See below:</p>
<p>See this <a href="https://docs.google.com/spreadsheet/ccc?key=0Apka-zOhcb_FdGo0R0dLTmdkeWlTVzFoeDl4azZEWlE&amp;usp=sharing#gid=0">google doc</a> for a breakdown of the shifts and tasks available – volunteers may sign up for any/all available shifts (business attire advised).</p>
<h1>Homework 4: Linear Regression with GSS Data (2013-02-28)</h1>
<p>You will use your linear regression module from last week to analyze the <a href="http://www3.norc.org/gss+website/">General Social Survey</a> (GSS) data. This is a yearly social science survey that &quot;takes the pulse of America.&quot;</p>
<p><strong>Presentation Wednesday March 13</strong> Email a copy to the TA and be prepared to present in class.</p>
<hr>
<h2>Data directory layouts</h2>
<p>You will have to transform your data by cleaning/cutting out certain columns. This can lead to a mess of different data files. Here are some suggestions.</p>
<p>You have the following layout by default:</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">notebooks/
scripts/
src/
data/
raw/
processed/
</code></pre></div>
<ul>
<li>Remember not to commit any data to the repository.</li>
<li>I like to keep the data in <code>raw</code> as completely untouched copies from websites, or common transformations of them (e.g. the csv files that result from <code>Getting the data</code> above). The key point is that once data goes into <code>raw</code> I <em>never</em> change it.</li>
<li>The <code>processed</code> directory is for the altered versions of the <code>raw</code> directory.</li>
<li>There is a <code>scripts</code> directory that can be used to store scripts (Python or Bash) that transform data from the type in <code>raw</code> to the type in <code>processed</code>.</li>
<li>I also use ipython notebooks (stored in <code>notebooks/</code>) to transform data from <code>raw</code> to <code>processed</code>. I commit these scripts and notebooks to the repo so other people can run them to get copies of my processed data.</li>
<li>For longer projects I create snapshots of the <code>processed</code> directory that have a timestamp on their name. E.g. <code>processed-2013-02-11</code>.</li>
</ul>
<hr>
<h2>Project deliverables</h2>
<ol>
<li>Use the 2006 data to train models, test those models on 2010 data.</li>
<li>Predict <code>income06</code> as a function of other variables. For simplicity, this should be one single model that includes everyone...in other words, don&#39;t segment your data. Note that <code>income06</code> is missing in about 15% of responses. You don&#39;t have to predict <code>income06</code> for these people. This model should work, even in the presence of missing data (with the exception of missing <code>income06</code>)! So, you should probably fill the missing values with something.</li>
<li>Find one other relation to predict. Make sure it is appropriate for linear regression. Segment or do whatever you want. The model can work for the whole population or subpopulations.</li>
<li>Make a 15 minute slide-show presentation documenting your work. This is what you turn in. Two randomly chosen groups will present this in class on Wednesday March 13. The intended audience is the class...so present at the appropriate technical level.</li>
</ol>
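<p>A tiny sketch of the missing-value handling in deliverable 2 (toy columns; filling with the median is one choice among many and should be justified):</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":      [25.0, np.nan, 40.0, 30.0],
    "income06": [50.0, 60.0, np.nan, 55.0],
})

# Fill missing predictors (here with the column median)...
df["age"] = df["age"].fillna(df["age"].median())
# ...but don't predict income06 where it is missing; just drop those rows.
train = df.dropna(subset=["income06"])
```
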
<hr>
<h2>Getting started</h2>
<h3>Useful links</h3>
<ul>
<li>The <a href="/misc/2008_GSS_Codebook.pdf">2008 GSS Codebook</a> will be useful for variable definitions.</li>
<li>The <a href="http://publicdata.norc.org/GSS/DOCUMENTS/OTHR/GSS_NESSTAR_Guide.pdf">GSS User&#39;s Guide</a> shows you how to search for variable descriptions using the website. The website is very very very slow.</li>
</ul>
<h3>Basic workflow</h3>
<ol>
<li>Get data</li>
<li>Inspect data</li>
<li>Clean data</li>
<li>Explore relationships (EDA)</li>
<li>Fit model</li>
<li>Inspect results</li>
<li>Repeat 2-6</li>
</ol>
<h4>Get data</h4>
<ol>
<li>Download the 2006 and 2010 datasets from <a href="http://www3.norc.org/GSS+Website/Download/STATA+v8.0+Format/">this site</a>. Get the individual years, which are under &quot;<em>Download Individual Year Data Sets</em>&quot;</li>
<li>Convert these STATA datasets into Pandas DataFrames using <a href="/Extras/2013/02/15/convert-stata/">these instructions</a></li>
<li><p>Store them as csv files using (assuming the DataFrame is named <code>df</code>):</p>
<p><code>df.to_csv(&#39;filename&#39;, index=False)</code></p></li>
</ol>
<h4>Inspect data</h4>
<ul>
<li>Inspect the csv files with <code>less</code> and see what they look like. Remember <code>Ctrl-f</code>, <code>Ctrl-b</code> to move forward and backward.</li>
<li>Use <code>head</code> to create a file (probably located in <code>/tmp/</code>) containing the first 100 lines. Look at this file in excel (or <code>libreoffice</code> in ubuntu). You may get an error about too many columns...that&#39;s ok, just look at what you can!</li>
</ul>
<h4>Clean data</h4>
<ul>
<li>Start your ipython notebook with <code>ipython notebook --pylab inline</code>.</li>
<li>Create a new notebook named <code>cleaning-your-name</code>. This will be used for cleaning data.</li>
<li>See <code>notebooks/HW4_cleaning_EDA</code> for an example.</li>
<li>Read the 2006 and 2010 datasets into DataFrames named <code>df2006</code>, <code>df2010</code>.</li>
<li>We will probably have little use for columns that are mostly NaN. Use <code>df.count().order()</code> to figure out which columns have lots of missing values. Chop off these columns by creating a boolean mask that will be true if a column has enough good entries and then using <code>df.ix[:, mask]</code>.</li>
<li><p>We are only interested in variables that are in both datasets, so use pandas reindex to modify and align the columns like so</p>
<p><code>col = df2006.columns.intersection(df2010.columns)</code></p>
<p><code>df2006 = df2006.reindex(columns=col)</code></p>
<p><code>df2010 = df2010.reindex(columns=col)</code></p></li>
</ul>
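<p>The column-chopping step above, sketched with toy data (written with the modern <code>df.loc</code>; older pandas used <code>df.ix</code> and <code>.order()</code> as in the text):</p>

```python
import numpy as np
import pandas as pd

df2006 = pd.DataFrame({
    "age":    [25, 30, 35, 40],
    "sparse": [np.nan, np.nan, np.nan, 1.0],   # mostly missing
})

# Keep columns where at least half the entries are non-missing.
mask = df2006.count() >= 0.5 * len(df2006)
df2006 = df2006.loc[:, mask]
```
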
<h4>EDA</h4>
<p>See <code>notebooks/HW4_cleaning_EDA</code> or go to <a href="http://nbviewer.ipython.org/url/columbia-applied-data-science.github.com/misc/HW4_cleaning_EDA.ipynb">this link</a> for an example.</p>
<h4>Build your model</h4>
<p>See <code>notebooks/HW4_regression_example</code> or go to <a href="http://nbviewer.ipython.org/url/columbia-applied-data-science.github.com/misc/HW4_regression_example.ipynb">this link</a> for an example.</p>
<p>You will have to add variables and see if it improves your fit. Make sure your variables make sense intuitively. Do EDA and read about the data to gain intuition.</p>
<h1>Announcements March 4 (2013-02-28)</h1>
<ol>
<li>The unit test <code>TestLinearReg.test_solve_pinv_4</code> had an issue. It used 0 as a cutoff. On some machines, <code>svd</code> does not return a zero singular value, instead it returns a small nonzero value (due to roundoff error). Please use the revised unittests available <a href="/misc/testlinear.py">here</a>.</li>
<li>The <a href="http://nbviewer.ipython.org/url/columbia-applied-data-science.github.com/misc/HW4_cleaning_EDA.ipynb">EDA/cleaning notebook</a> and the <a href="http://nbviewer.ipython.org/url/columbia-applied-data-science.github.com/misc/HW4_regression_example.ipynb">regression notebook</a> are now available.</li>
<li>For those who are interested, there is an intermediate git workshop happening on campus (thanks to
Michael Discenza for letting us know). Here are the details:</li>
</ol>
<p>Intermediate Git Workshop<br>
Tue, March 12, 9pm - 10pm<br>
Hamilton 603<br>
Hosted by the Application Development Initiative (created by znewman01@gmail.com)</p>
<p>Know how to use Git but don&#39;t know anything about its internals? Want to learn how to rebase, cherry-pick, and fix merge conflicts like a champ? Learn to use git effectively for collaboration and development using the techniques in this workshop.</p>
<p>Presupposes only a basic knowledge of Git/VCS.</p>
<ol start="4">
<li><p>Don&#39;t forget there will be presentations next week, March 12th. Two teams will be randomly chosen to present for 15 minutes.</p></li>
<li><p><a href="/regex.png">Here</a> is a nice list of regular expression wild cards, what they do and where they are supported. </p></li>
</ol>
Announcements Feb 272013-02-26T00:00:00-08:00http://columbia-applied-data-science.github.com/announcements/2013/02/26/announcements-feb-27<ol>
<li>I made a change in the lecture notes that ended up affecting the homework numbering. I undid that change and now the old numbering is back.</li>
</ol>
Learning Numpy and Pandas2013-02-21T00:00:00-08:00http://columbia-applied-data-science.github.com/extras/2013/02/21/learning-numpy-and-pandas<p>The <a href="http://nbviewer.ipython.org/url/stat.columbia.edu/%7Elangmore/numpy_intro.ipynb">numpy notebook</a> and <a href="http://nbviewer.ipython.org/url/stat.columbia.edu/%7Elangmore/pandas_basics.ipynb">pandas notebook</a> that Chang used in class are now available. </p>
<p>I also recommend the following for more Numpy information:</p>
<ul>
<li>Chapter 4 from <a href="http://oreilly.com/shop/product/0636920023784.html">Wes&#39;s book</a></li>
<li>If you&#39;re a MATLAB user, then <a href="http://www.scipy.org/NumPy_for_Matlab_Users">this</a> is useful. Note: Don&#39;t use the <em>matrix</em> class. Just use normal numpy ndarrays.</li>
<li>The <a href="http://docs.scipy.org/doc/numpy/user/basics.html">official docs</a> are also useful.</li>
</ul>
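<p>To see why the advice above favors plain ndarrays, here is a tiny generic numpy sketch (not from the notebooks): with ndarrays there is no operator ambiguity — <code>*</code> is always elementwise and <code>np.dot</code> is the matrix product.</p>

```python
import numpy as np

# With plain ndarrays there is no operator ambiguity:
# * is elementwise, np.dot is the matrix product.
A = np.array([[1., 2.], [3., 4.]])
I = np.eye(2)

elementwise = A * I        # zeroes the off-diagonal entries
matmul = np.dot(A, I)      # leaves A unchanged
```

With the <em>matrix</em> class, <code>*</code> silently becomes matrix multiplication, which is exactly the kind of surprise you want to avoid.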
Announcements: Feb 252013-02-21T00:00:00-08:00http://columbia-applied-data-science.github.com/announcements/2013/02/21/announcements-feb-25<ol>
<li>See <a href="/Extras/2013/02/21/learning-numpy-and-pandas/">this post</a> for tutorials/docs/notebooks to help you learn Numpy.</li>
<li>My previous announcement said that all exercises from the linear algebra chapter are due. I changed my mind and only some are due.</li>
<li>Homework 3 is now due March 4th at 6pm. </li>
</ol>
<p>A new version of the notes with some changes has been posted. Some of these changes affect the homework. See below:</p>
<ul>
<li>To avoid confusion with standard deviation, I changed the symbol for singular values to lambda</li>
<li>Problem 6.12.1 has a typo. It should read, &quot;we will have at least one singular value sigma_k = 0 for <code>k &lt;= K</code>&quot;, not &quot;<code>k&lt;K</code>.&quot;</li>
<li>Exercise 6.14.1 had errors in the w estimate. This same exercise had some confusing wording regarding the error model. See the updated lecture notes for an improvement.</li>
<li> Exercise 6.6.1 will be easier to answer after you read remark 6.11. In other words, this question can be rigorously answered by using the SVD solution to the least squares problem.</li>
<li><p>In <code>homework_03/src/simulator.py</code>, in the function <code>gaussian_samples()</code>, there was a comment that should not be there. The comment was</p>
<p><code># Start with an identity then populate off diagonal entries</code></p></li>
</ul>
Homework 032013-02-19T00:00:00-08:00http://columbia-applied-data-science.github.com/homework/2013/02/19/homework-03<h1>homework_03</h1>
<p><strong>Due Mar 4th, 6pm</strong> </p>
<p><strong>Code</strong> To receive full credit all unit tests must pass and one copy of the exercises must be completed.</p>
<p><strong>Exercises</strong>: 6.4.2, 6.4.3, 6.6.1, 6.7, 6.9.1, 6.12.1, 6.12.2, 6.14.1, 6.19.1</p>
<hr>
<h2>To start</h2>
<p>Clone the repo into a local directory named <code>homework_03</code>. Do not use the original repo name. Replace <code>X</code> below with your team name.</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">git clone https://github.com/columbia-applied-data-science/homework_03_team_X.git \
homework_03
</code></pre></div>
<p>See <code>demo.py</code> and the tests to get an idea of how things work.</p>
<hr>
<h2>Numerical techniques</h2>
<p>See the corresponding section in the lecture notes for information about pseudo-inverses.</p>
<hr>
<h2>5-fold cross validation</h2>
<p>You will make a 5-fold cross validation module. This is used as a way to pick
out your regularization parameter delta. Our 5-fold cross validation is:</p>
<p>For every delta:</p>
<ol>
<li>Divide the data up into 5 equal chunks</li>
<li>Pick out the first chunk as a cross-validation set, and group the other 4
together as training data.</li>
<li>Fit the model using the training data and use the cross validation set to
measure both the training and cross-validation squared error |Xw - Y|^2</li>
<li>Repeat 5 times, each time using a different chunk as the cross validation
set.</li>
<li>Average the training and cross-validation errors across the 5 folds.</li>
</ol>
<p>Compare the average cross-validation errors and use this to choose delta.
Note that the training error should not be used to choose delta. It is
there to serve as a reality check and to diagnose the degree of over/under
fitting.</p>
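<p>The recipe above can be sketched in a few lines of numpy. This is only an illustration of the procedure — the function name, the ridge solution <code>w = (X'X + delta*I)^{-1} X'Y</code>, and the contiguous chunking are assumptions here, not the interface your homework module must expose.</p>

```python
import numpy as np

def cross_val_error(X, Y, delta, n_folds=5):
    """Average training / cross-validation squared error |Xw - Y|^2 over folds.

    A sketch only: names and the ridge formula are illustrative, not the
    required homework interface.
    """
    N, D = X.shape
    chunks = np.array_split(np.arange(N), n_folds)  # 5 (roughly) equal chunks
    train_errs, cv_errs = [], []
    for k in range(n_folds):
        cv_idx = chunks[k]                                   # held-out chunk
        train_idx = np.concatenate(chunks[:k] + chunks[k + 1:])  # other 4
        Xt, Yt = X[train_idx], Y[train_idx]
        # Ridge-regularized least squares on the training chunks
        w = np.linalg.solve(Xt.T.dot(Xt) + delta * np.eye(D), Xt.T.dot(Yt))
        train_errs.append(((Xt.dot(w) - Yt) ** 2).sum())
        cv_errs.append(((X[cv_idx].dot(w) - Y[cv_idx]) ** 2).sum())
    return np.mean(train_errs), np.mean(cv_errs)
```

You would call this once per candidate delta and keep the delta with the smallest average cross-validation error.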
<hr>
<h2>Caution!</h2>
<p>These routines are very picky about array shape. Some functions, e.g. np.dot,
return arrays that have shape = (N,) (a tuple with only one element). In that
case, you will often have to reshape this into a proper two dimensional array.
The docstring for linear_reg.fit() tells you when to do this.</p>
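<p>A minimal illustration of the shape issue (generic numpy, not the homework code):</p>

```python
import numpy as np

X = np.ones((4, 3))
w = np.ones(3)

y = X.dot(w)          # dotting with a 1-D array returns shape (4,)
Y = y.reshape(-1, 1)  # reshape into a proper 2-D column array, shape (4, 1)
```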
<p>Two functions, linear_reg.fit() and cross_validator.cross_val(), can handle
pandas objects as their input. The others may or may not. However, these
are the only public methods in their modules, so this is ok.</p>
sed oddities2013-02-18T00:00:00-08:00http://columbia-applied-data-science.github.com/extras/2013/02/18/sed-oddities<p>The instructions for HW 02 told you to use a <code>sed</code> command:</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">sed -e &#39;s/|Open/&amp;\n/g&#39;
</code></pre></div>
<p>This should put a newline after every occurrence of the string <code>|Open</code>. Some older versions of <code>sed</code> don&#39;t work this way however. In these versions, instead of a newline, the letter <code>n</code> will be inserted. If this happens to you, change your command to:</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">sed -e &#39;s/|Open/&amp;\
/g&#39;
</code></pre></div>
<p>Note that I have actually hit the <code>Enter</code> key on my keyboard, which put a newline into the script. This should work on all versions of <code>sed</code>. You can test this by writing:</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">echo -e &#39;abcabcabc&#39; | sed &#39;s/c/&amp;\n/g&#39;
echo -e &#39;abcabcabc&#39; | sed &#39;s/c/&amp;\
/g&#39;
</code></pre></div>
<p>and seeing which of the two works. Both scripts are trying to insert a newline after every <code>c</code>.</p>
Converting datasets from STATA2013-02-15T00:00:00-08:00http://columbia-applied-data-science.github.com/extras/2013/02/15/convert-stata<p>It&#39;s very important in practical data science to know how to convert datasets into the right format and structure. Being able to import data from language- or tool-specific formats like Stata is very useful, especially for a lot of social science data. Fortunately, that&#39;s already available as part of the statsmodels library in Python.</p>
<p>Here&#39;s how you do it:</p>
<ol>
<li><p>In the terminal, execute the command: <code>pip install -U statsmodels</code>. This should upgrade you to 0.5.0+. If you already have the latest version of statsmodels, you can skip this step.</p></li>
<li><p>In ipython:
<pre>
import statsmodels.iolib.foreign as smio
from pandas import DataFrame
arr = smio.genfromdta(&#39;~/path/to/stata/data.dta&#39;)
frame = DataFrame.from_records(arr)
</pre></p></li>
</ol>
<p>The <code>genfromdta</code> function in <code>statsmodels.iolib.foreign</code> converts a dta file to a NumPy record array (a special numpy array type). The last line above shows how to convert the record array into a pandas DataFrame so the data can live happily ever after.</p>
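<p>As an aside (worth checking against your installed version): newer releases of pandas also ship <code>read_stata</code> and <code>to_stata</code>, which skip the record-array step entirely. A minimal round-trip sketch:</p>

```python
import os
import tempfile

import pandas as pd

# Write a tiny Stata file, then read it straight back into a DataFrame.
df = pd.DataFrame({'score': [1.0, 2.0, 3.0]})
path = os.path.join(tempfile.mkdtemp(), 'example.dta')
df.to_stata(path)
frame = pd.read_stata(path)
```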
Announcements: Feb 182013-02-13T00:00:00-08:00http://columbia-applied-data-science.github.com/announcements/2013/02/13/announcements-feb-18<ol>
<li>The linear regression notes are now posted as part of the <a href="/appdatasci.pdf">lecture notes</a>.</li>
<li>Every <em>exercise</em> from the linear regression chapter will be due as part of the next homework (due Feb 27). Every team hands in one written solution set to these exercises.</li>
<li>Here are some resources for learning Python
<ul>
<li><a href="http://software-carpentry.org/4_0/python/intro.html">Software Carpentry</a></li>
<li><a href="http://www.codecademy.com/tracks/python">Codecademy</a></li>
</ul></li>
<li>See <a href="/Extras/2013/02/18/sed-oddities/">this post</a> about possible problems using the prescribed <code>sed</code> command on a mac.</li>
<li>Feb 18 lecture is on numpy/pandas</li>
<li>Feb 20 lecture is on cross-validation (necessary for HW 03)</li>
</ol>
Homework 022013-02-10T00:00:00-08:00http://columbia-applied-data-science.github.com/homework/2013/02/10/homework-02<p>Homework 2 has been handed out via email and github notifications. It is due Feb 18.</p>
<hr>
<h1>The README.md handed out with the hw</h1>
<p>This homework will have you write shell scripts that use unix utilities and python utilities that you build. This is done in the name of analyzing (an altered version of) the <a href="https://data.sfgov.org/Service-Requests-311-/Case-Data-from-San-Francisco-311/vw6y-z8j6">SF 311 Dataset</a>. This altered version is available <a href="http://stat.columbia.edu/%7Elangmore/Case_Data_from_San_Francisco_311.csv">here</a>.</p>
<p><strong>Due:</strong> Monday Feb 18, 6pm.</p>
<p>To receive full credit, you must commit and push code that passes all unit tests, and shell scripts that give the correct output.</p>
<hr>
<h2>Setup</h2>
<p>Clone the repo and save it in a local directory called <code>homework_02</code> by typing</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">git clone https://github.com/columbia-applied-data-science/homework_02_team_XX.git \
homework_02
</code></pre></div>
<h2>Utilities</h2>
<p>Note: To use the python utilities, your PYTHONPATH must be modified. In your <code>~/.bashrc</code> (or <code>~/.bash_profile</code> on macs), put</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">export PYTHONPATH=path-to-directory-above-homework_02:$PYTHONPATH
</code></pre></div>
<p>Then source it with <code>source ~/.bashrc</code> or open a new terminal.</p>
<p>To see how the utilities <em>should</em> work:</p>
<ul>
<li><p>Create a comma delimited file with a header and run the utilities on it. Set a breakpoint and step through, reading the comments and code fragments provided. You can view the documentation for each utility by typing <code>python utilityname -h</code>.</p></li>
<li><p>Go to <code>test/</code> and view the unit tests in <code>test/testutils.py</code>.</p></li>
<li><p>Look at the comments in the utilities. These are only hints. Any utility that passes tests is acceptable.</p></li>
</ul>
<h3>body</h3>
<p>Note: This utility will not be tested, it is just given to you.</p>
<p>In your <code>.bashrc</code>, put</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">body() {
    IFS= read -r header
    printf &#39;%s\n&#39; &quot;$header&quot;
    &quot;$@&quot;
}
export -f body
</code></pre></div>
<p>then source the bashrc.</p>
<p>This allows you to run a command on the body of a file, skipping the header (but still printing the header). For example,</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">cat filewithheader | body sort -k1,1
</code></pre></div>
<p>will sort <code>filewithheader</code>, using the first field, but leave the header at the top of the file.</p>
<h3>cut.py</h3>
<p>Acts like the unix cut utility, except...</p>
<ul>
<li>Takes field names rather than numbers</li>
<li>Uses the python csv module for more automatic handling of stuff like quoted delimiters</li>
</ul>
<h3>reformat.py</h3>
<p>Reformats stuff like delimiters and capitalization</p>
<h3>common.py</h3>
<p>Common files for all utilities</p>
<h3>averager.py</h3>
<p>Gets the average of different groups of a sorted file</p>
<h3>timeopen.py</h3>
<p>Reads a SF 311 case file, appends a &#39;timeopen&#39; column giving the time (in minutes) a case was open.</p>
<h3>subsample.py</h3>
<p>Subsamples in the space of rows.</p>
<hr>
<h2>Shell Scripts</h2>
<p>These are simple shell scripts: they define variables and pipe together some commands. The input file is written into the script, and the script writes to stdout and stderr. An example of a script like this (one that counts duplicate lines) would be:</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">DATA=../data
cat $DATA/infile.csv \
| sort \
| uniq -c \
&gt; outfile.csv
</code></pre></div>
<p>Use the hints inside of these shell scripts to complete them. &quot;Complete&quot; means that they reproduce the sample input/output inside <code>data/</code>. For example, </p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">cd scripts
./count_categories.sh &gt; /tmp/stdout 2&gt; /tmp/stderr
diff /tmp/stderr ../data/count_categories_stderr
diff /tmp/stdout ../data/count_categories_stdout
</code></pre></div>
<p>will produce two files, <code>/tmp/stdout</code> and <code>/tmp/stderr</code> and then compare them to the files in <code>data</code>. If everything is working, then <code>diff</code> should print nothing.</p>
<h3>count_categories.sh</h3>
<p>Count the number of tickets in each category</p>
<h3>count_categories_openclosed.sh</h3>
<p>Count the number of tickets in each category that are Open or Closed</p>
<h3>compute_averages.sh</h3>
<p>Compute the average time tickets in different categories remain open. </p>
<ul>
<li>For closed tickets, compute the average time it was open before being closed.</li>
<li>For open tickets, compute the time it has been left open.</li>
</ul>
<hr>
<h2>Unit Tests</h2>
<p>To run tests, cd to <em>tests/</em> and do</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">python -m unittest -v testutils
</code></pre></div>
<p>Once you are done, you should see that all the tests pass.</p>
Midterm2013-02-06T00:00:00-08:00http://columbia-applied-data-science.github.com/exams/2013/02/06/midterm<p>We will have an in-class midterm Feb 25. It will be worth approximately the same as one homework.</p>
<p>The exam will be written (NO COMPUTERS ALLOWED!!!!) and will be designed to test:</p>
<ul>
<li>Your understanding of basic unix/python/git skills - if you have been doing the homework and following lecture you should have no issues with any of the questions.</li>
<li>Linear regression theory. It will be similar to the linear regression lecture notes and the exercises.</li>
</ul>
Announcements: Feb 062013-02-05T00:00:00-08:00http://columbia-applied-data-science.github.com/announcements/2013/02/05/announcements-feb-06<ol>
<li>The next homework assignment will be handed out (via emails) tomorrow. We will discuss this today, along with a short discussion of linear regression.</li>
<li>A visualization of the multiple levels of Git that I talked about Monday is available <a href="http://osteele.com/posts/2008/05/commit-policies">here</a></li>
<li>A visual reference to Git that goes into multiple commands is available <a href="http://marklodato.github.com/visual-git-guide/index-en.html">here</a></li>
<li>The <a href="/Exams/2013/02/06/midterm/">midterm</a> date has been set to Feb 25.</li>
</ol>
debugging2013-02-04T00:00:00-08:00http://columbia-applied-data-science.github.com/extras/2013/02/04/debugging<h2>Description</h2>
<p>A debugger is a program that allows you to follow your code as it runs. You run your code line-by-line and see exactly what is going on. This is useful for fixing bugs. It is also useful for understanding what is going on with code.</p>
<hr>
<h2>Installation</h2>
<p>Install <code>pdb++</code> using</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">pip install pdbpp
</code></pre></div>
<hr>
<h2>Trying it out</h2>
<p>Create a file called <code>test.py</code> that looks like:</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">import pdb

def modify_number(num):
    return 3 * num

pdb.set_trace()

numbers = range(5)
for num in numbers:
    newnumber = modify_number(num)
    print newnumber
</code></pre></div>
<p>Then, from the command line type <code>python test.py</code>. Python will then start interpreting this file (as it always does). When it gets to the line <code>pdb.set_trace()</code> the debugger will &quot;hook&quot; (stop execution of your program and display the position you are at). You should see a syntax-highlighted snapshot of your code. Type <code>sticky</code> and you will see a display of all your code. Type <code>next</code> or <code>n</code> to go to the next line. Type <code>step</code> or <code>s</code> to step into the function <code>modify_number</code> (do this when you are over that line). At any point you can print out the contents of a variable by typing the name of the variable. You can quit with <code>q</code> (unless you have a variable named <code>q</code>, in which case use <code>!!q</code>). To see a full display of commands type <code>help</code>. Also, check out <a href="http://pypi.python.org/pypi/pdbpp/">this website</a>.</p>
<hr>
<h2>Customization</h2>
<p>Finally, you can customize <code>pdb++</code> by creating a <code>.pdbrc.py</code> file in your home directory. Mine looks like:</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">import readline
import pdb

class Config(pdb.DefaultConfig):
    stdin_paste = &#39;epaste&#39;
    sticky_by_default = True

    def __init__(self):
        readline.parse_and_bind(&#39;set convert-meta on&#39;)
        readline.parse_and_bind(&#39;Meta-/: complete&#39;)

    def setup(self, pdb):
        Pdb = pdb.__class__
        Pdb.do_l = Pdb.do_longlist
        Pdb.do_st = Pdb.do_sticky
</code></pre></div>Material from Software Carpentry Bootcamp2013-02-02T00:00:00-08:00http://columbia-applied-data-science.github.com/extras/2013/02/02/material-from-software-carpentry-bootcamp<p>Some material from the workshops is available <a href="https://swc-nyc-session-1.readthedocs.org/en/latest/index.html">here</a>. More will be added to this post as it becomes available.</p>
Editors2013-02-02T00:00:00-08:00http://columbia-applied-data-science.github.com/extras/2013/02/02/editors<p>For simplicity and multi-platform compatibility we have been asking you to use &quot;nano&quot; when editing files or code in the terminal. As you have probably noticed, this is not a great editor and, of course, there are many better options. Here are some editors we like. Note: when coding in Python, spaces and indentation matter, because this is how the interpreter delimits logical statements. Make sure that when you press tab your editor inserts some number of spaces; otherwise you will end up with a mess of Python indentation errors for yourself and everyone you collaborate with.</p>
<h2>For mac:</h2>
<ol>
<li><p><a href="http://www.sublimetext.com/">Sublime</a> which you can download <a href="http://www.sublimetext.com/2">here</a>. Once you install it, click on the editor icon and do the following:</p>
<p>Open Preferences, under the Sublime Text 2 tab, and select settings-default. This should open up a bunch of code in your Sublime editor window. Search for &#39;translate_tabs_to_spaces&#39; and change the &#39;false&#39; to &#39;true&#39;, then search for &#39;tab_size&#39; and make sure that is set to 4. Save and that&#39;s it. You can see some details about these settings <a href="http://www.sublimetext.com/docs/2/indentation.html">here</a>.</p></li>
<li><p><a href="http://code.google.com/p/macvim/">MacVim</a>, on whose page you will see download options; choose the one appropriate for your mac. Note: this is a more powerful editor, but you should be familiar with its basic use. You don&#39;t actually need to download MacVim and can just use Vim, which you can invoke in the terminal (type: vim or vim filename), but the standalone editor is nice. To edit settings for vim/MacVim: </p>
<ol>
<li>open ~/.vimrc (you can do this by typing: vim ~/.vimrc in terminal)</li>
<li>paste in
<pre>
set tabstop=4
set shiftwidth=4
set expandtab
</pre></li>
<li>Save</li>
</ol></li>
</ol>
<h2>For Ubuntu:</h2>
<ol>
<li><a href="http://projects.gnome.org/gedit/">GEdit</a>: download the latest version. Then go to Preferences, click on the editor tab, and check the boxes &quot;Insert spaces instead of tabs&quot; and &quot;Enable automatic indentation&quot;; also, set the tab width to 4.<br></li>
<li><p><strong>vim</strong> is very powerful but has a steep learning curve. You can install it with:</p>
<p>sudo apt-get install vim-gnome</p></li>
</ol>
<p>Then modify your <code>.vimrc</code> as shown in the MacVim instructions.</p>
Announcements: Feb 042013-02-02T00:00:00-08:00http://columbia-applied-data-science.github.com/announcements/2013/02/02/announcements-feb-04<ol>
<li>We will start posting announcements rather than sending email for every little thing. You are responsible for checking these.</li>
<li>Material from the Software Carpentry bootcamps will be posted <a href="/Extras/2013/02/02/material-from-software-carpentry-bootcamp/">here</a></li>
<li>You should have received emails from the TA and GitHub regarding homework 1.5. If you didn&#39;t, please send your name, github username, uni, and email to the TA at <a href="mailto:zss2101@columbia.edu">zss2101@columbia.edu</a></li>
<li>See <a href="/Extras/2013/02/02/editors/">this post</a> about editors. You are expected to install a decent text editor.</li>
<li>We posted about <a href="/Extras/2013/02/04/debugging/">debugging</a> with the pdb++ debugger.</li>
<li>In homework 1.5, you should modify the top line of <code>test/testscripts.py</code> to reference the name of your particular repo. In other words, if your repo is <code>homework_1p5_team_1</code>, then change <code>homework_1p5</code> to <code>homework_1p5_team_1</code>.</li>
</ol>
Fixing your VM2013-01-31T00:00:00-08:00http://columbia-applied-data-science.github.com/extras/2013/01/31/fixing-your-vm<hr>
<h2>The most important thing</h2>
<ol>
<li>Turn on your VM and open a terminal</li>
<li>In the terminal, type <code>sudo apt-get install gnome-session-fallback</code>
<ul>
<li>This will install a new graphics manager for your desktop</li>
<li>Click <code>Y</code> when asked</li>
</ul></li>
<li>Log out (or restart)</li>
<li>When you log in, there will be a &quot;gear shaped&quot; icon near your login name. Click it and select <em>GNOME Classic (No Effects)</em>
<img src="/images/gnome-classic-login-screen.png" alt="gnome-classic"></li>
</ol>
<hr>
<h2>Memory</h2>
<p>By default, the memory allocated for the VM is only 512 MB. This is too little. Make sure Windows and your VM each have a decent amount of memory allocated.</p>
<ul>
<li>If Windows has less than 3GB, it isn&#39;t happy</li>
<li>If Ubuntu has less than 3GB, it isn&#39;t happy</li>
<li>32 Bit Ubuntu uses less memory</li>
</ul>
<hr>
<h2>Guest Additions</h2>
<p>If guest additions is not installed, then your display will be very small. Install it.</p>
Lecture notes2013-01-29T00:00:00-08:00http://columbia-applied-data-science.github.com/about-logistics/2013/01/29/lecture-notes<p>I added a <a href="/appdatasci.pdf">link to the lecture notes</a> on the <a href="/index.html">home page</a> of this website. I also posted the unix notes (they are chapter 1).</p>
head tail2013-01-29T00:00:00-08:00http://columbia-applied-data-science.github.com/extras/2013/01/29/head-tail<p>In class, someone asked me how to extract the second row of a file. I said that a Python script would be the simplest way. How wrong I was! I can&#39;t believe I missed it, given the topic of yesterday&#39;s lecture, but there is a very simple way to do this using <code>head</code>, <code>tail</code>, and a pipe <code>|</code>. </p>
<p>Suppose <code>data.csv</code> looks like:</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">name,score
ian,100000
daniel,1
mike-tyson,10
</code></pre></div>
<p>Then <code>head</code> works like:</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">$ head -n 3 data.csv
name,score
ian,100000
daniel,1
</code></pre></div>
<p>In other words, if you give it the <em>option</em> <code>-n 3</code> then it returns the first three lines of the file. <code>tail</code> works like <code>head</code>, but gives the last n lines. Now...try using a pipe to tie them together and demonstrate how to extract the second line. Post your answer as a comment on this site.</p>
Office Hours2013-01-29T00:00:00-08:00http://columbia-applied-data-science.github.com/about-logistics/2013/01/29/Office_Hours<p>A short post about office hours... We will be hosting office hours
online via google hangouts. The procedure to &quot;attend&quot; office hours is
the following:</p>
<ol>
<li><p>Signup for a google/google+ account</p></li>
<li><p>During an allotted time send a message to
applied.data.science@gmail.com via Gchat and request to be added
to office hours.</p></li>
<li><p>The instructor will then invite you to the google hangout. Note:
there is a limited number of users who can be in a hangout at the
same time, so like in &quot;real&quot; office hours you might have to wait to gain
access. The instructor will notify you if this is the case. </p></li>
</ol>
<p>The office hour times are listed below: </p>
<p><strong>Online Office Hours</strong></p>
<p>Via Google+ Hang Outs: <a href="mailto:applied.data.science@gmail.com">applied.data.science@gmail.com</a></p>
<ul>
<li>Sunday 1:00 - 2:30</li>
<li>Tuesday 9:30am - 11:00am</li>
<li>Thursday 5:30 - 7:00</li>
</ul>
<p>There is also an in-person office hour with our TA Zach Shahn
<a href="mailto:zss2101@columbia.edu">zss2101@columbia.edu</a>:</p>
<ul>
<li>Friday 1:00 - 3:00, stat dept. lounge, 10th floor</li>
</ul>
Lecture Notes2013-01-28T00:00:00-08:00http://columbia-applied-data-science.github.com/about-logistics/2013/01/28/lecture-notes-preface-published<p>We are writing notes that are meant to complement our lectures. The first installment, the preface, is now available <a href="/appdatasci.pdf">here</a>.</p>
<p>Since the lecture notes will be published along with homework, look for posts about them in the &quot;Homework XX&quot; categories.</p>
Auditing the Course2013-01-26T00:00:00-08:00http://columbia-applied-data-science.github.com/about-logistics/2013/01/26/auditing<p>If you are thinking of auditing the course.... We will allow auditors as
long as there are seats in the class. Please setup your computer and
create a github account as described on this site. Then email the github
username to applied.data.science@gmail.com so that you can have access
to the repositories. Note: you will not be able to submit homework or be
graded on the code you write. Also, try to attend the Software Carpentry
workshops. </p>
Software Setup (Homework 01) Questions2013-01-24T00:00:00-08:00http://columbia-applied-data-science.github.com/extras/2013/01/24/setup-questions<hr>
<p>Be sure to also read the latest-and-greatest solution to VM problems <a href="/Homework%2001/2013/01/31/fixing-your-vm/">here</a></p>
<hr>
<p>There were many common questions that came up during the setup process. For those of you who have yet to set things up, hopefully this post will be able to save you some time and headaches:</p>
<h3>I&#39;m on OSX, how do I install Xcode?</h3>
<p>Go to the App Store and install it. Once it&#39;s installed, launch it and go to &#39;Xcode -&gt; Preferences -&gt; Downloads&#39; and make sure to install &#39;command line tools&#39;</p>
<h3>The Ubuntu VM password doesn&#39;t work.</h3>
<p>The password on the USB sticks was incorrect. It should be &quot;reverse&quot;.</p>
<h3>Where is this &quot;Terminal&quot; thing?</h3>
<p>OSX: Terminal should be in Applications -&gt; Utilities (or Other).<br>
Ubuntu: Click on the top left ubuntu icon. This is your &quot;start menu&quot;. Type in the word &quot;terminal&quot;, and click on the search result.<br>
Hint: you can right click on the icon in both OSes and lock Terminal to the launcher.</p>
<h3>OK, I downloaded Anaconda, what do I do?</h3>
<p>Now that you know where Terminal is, we can install Anaconda.</p>
<ol>
<li>Open up a Terminal instance.</li>
<li>Type <code>cd ~/Downloads</code> and hit Enter.</li>
<li>Type <code>bash Anaco</code> and hit Tab. At this point the entire file name for the Anaconda file you downloaded should appear. Now hit Enter.</li>
<li>Follow the instructions and install using default options. If you run into a problem where a single letter y scrolls down infinitely, do not panic. Just hit Ctrl-C to interrupt the process and start again by pressing Up-arrow (this is the previous command) then Enter.</li>
</ol>
<h3>OK, Anaconda finished installing. Am I done?</h3>
<p>Almost. You need to configure your environment.</p>
<h5>On OSX:</h5>
<ol>
<li>Type <code>cd</code> into the Terminal and hit Enter.</li>
<li>Type <code>nano .bash_profile</code> into the Terminal and hit Enter. This brings up a text editor window in the Terminal.</li>
<li>In Nano, type <code>export PATH=$HOME/anaconda/bin:$PATH</code></li>
<li>Hit Ctrl+X, then Y, then Enter. Now you should be back in the Terminal</li>
<li>Type <code>source .bash_profile</code> and hit Enter</li>
<li>Now type <code>which python</code> and hit Enter. If the printed output includes the anaconda installation directory then you&#39;re all set.</li>
</ol>
<h5>On Ubuntu:</h5>
<ol>
<li>Type <code>cd</code> into the Terminal and hit Enter.</li>
<li>Type <code>nano .bashrc</code> into the Terminal and hit Enter. This brings up a text editor window in the Terminal.</li>
<li>In Nano, hit Enter then Up-arrow, this creates a blank line at the top of the file.</li>
<li>On the blank line, type <code>export PATH=$HOME/anaconda/bin:$PATH</code></li>
<li>Hit Ctrl+X, then Y, then Enter. Now you should be back in the Terminal</li>
<li>Type <code>source .bashrc</code> and hit Enter</li>
<li>Now type <code>which python</code> and hit Enter. If the printed output includes the anaconda installation directory then you&#39;re all set.</li>
</ol>
<h3>My VM is very very slow...</h3>
<p>Try the 32 bit version</p>
<h3>I can&#39;t get a VM to work...what should I do?</h3>
<ul>
<li>Try <a href="http://www2.epcc.ed.ac.uk/%7Emichaelj/SoftwareCarpentry/">this VM</a></li>
<li>Try a <a href="https://help.ubuntu.com/community/WindowsDualBoot">dual-boot</a> setup</li>
</ul>
Software Carpentry Bootcamps2013-01-23T00:00:00-08:00http://columbia-applied-data-science.github.com/extras/2013/01/23/software-carpentry-workshops<p>Unless you are already proficient at unix/git/python/unit-tests, this course will be VERY difficult.</p>
<p>To help ease you into things, we have organized some <a href="http://software-carpentry.org/bootcamps/2013-01-columbia.html">software carpentry
bootcamps</a>.
Attendance is highly encouraged, as without a level of comfort with the
material presented the course cannot be completed. Sign up <a href="https://docs.google.com/spreadsheet/ccc?key=0Ah-Zcg2-sH_kdFJKOGthcG9KVEFyOXpoTjBJMk11UlE#gid=0">here</a></p>
<hr>
<p>Dates: Attend either Jan 30, 31 OR Feb 1,2 (slots may fill up)</p>
<p>Times: 9am - 4:30pm</p>
<p>Location: 414 and 750 <a href="http://facilities.columbia.edu/building-information/1066">Shapiro CEPSR</a> (Interschool lab). See the <a href="https://docs.google.com/spreadsheet/ccc?key=0Ah-Zcg2-sH_kdFJKOGthcG9KVEFyOXpoTjBJMk11UlE">signup sheet</a>.</p>
Class waitlist2013-01-23T00:00:00-08:00http://columbia-applied-data-science.github.com/about-logistics/2013/01/23/class-waitlist<p>There is a waitlist for this course. It is <a href="https://docs.google.com/spreadsheet/viewform?formkey=dG1NLUI4emRtVHNxUHpETktlc095VXc6MA..#gid=0">here</a>.</p>
<p>We cannot admit more than the maximum (100) people. So this is the only way to get added.</p>
DevFest2013-01-21T00:00:00-08:00http://columbia-applied-data-science.github.com/data%20science%20activities/2013/01/21/data-event-devfest<p>Here&#39;s a great chance to work on developing some new products. They&#39;re looking for hackers or people with data know-how.</p>
<p>DevFest is a week-long development festival during which students build applications, experiment with new technologies, and compete for awesome prizes. DevFest will kick off with a pitchfest and team formation on Saturday, February 2, followed by workshops and hacking time. The week will continue with a technical workshop and hacker office hours every night. DevFest will finish strong with an all-night hackathon from Friday, February 8th - Saturday 9th, after which the apps will be demoed to a panel of judges and prizes will be awarded. For more details see http://adicu.com/devfest</p>
New course: Computational Social Science2013-01-19T00:00:00-08:00http://columbia-applied-data-science.github.com/data%20science%20activities/2013/01/19/new-course-computational-social-science<p>I just received word of an exciting new course offering through the Applied Math Department. The course, <a href="http://compsocialscience.org/">Computational Social Science</a> is being taught by <a href="http://5harad.com/">Sharad Goel</a>, <a href="http://jakehofman.com/">Jake Hofman</a>, and <a href="http://theory.stanford.edu/%7Esergei/">Sergei Vassilvitskii</a>. I have heard Jake lecture before and in addition to being technically very strong, he is quite a good speaker.</p>
Bicoastal Datafest: Analyzing money&#39;s influence in politics2013-01-06T00:00:00-08:00http://columbia-applied-data-science.github.com/data%20science%20activities/2013/01/06/bicoastal-datafest<p>Meet journalists, scientists, engineers, data experts, and developers for a cross-disciplinary and bicoastal weekend of brainstorming, data-diving, storytelling and civic action, not to mention prizes, food, and fellowship.</p>
<p>For more information, see <a href="http://www.bdatafest.computationalreporting.com/">http://www.bdatafest.computationalreporting.com/</a></p>
merry christmas linux laptop2012-12-21T00:00:00-08:00http://columbia-applied-data-science.github.com/extras/2012/12/21/merry-christmas-linux-laptop<p>Here&#39;s an idea for those of you stuck with a crappy Windows machine. For Christmas, get a new Linux laptop from <a href="https://www.system76.com/laptops/">System76</a>.</p>
<p>I got the 14 inch <a href="https://www.system76.com/laptops/model/lemu4">Lemur</a> with 16GB memory, 4 cores, 512 GB solid state drive and all the accessories...it set my new employer back $1400 (less than half what a comparable Macbook would cost). A cheaper, but still acceptable machine for this class (remember to get at least 8GB memory!) can be put together for $700.</p>
<p>The advantage of Linux is that it is the absolute easiest OS for installing hacking/programming/scientific-computing software. The latest and greatest stuff is often made first for Linux, then ported to OSX. The downside is that it can be difficult to get all the hardware working correctly (especially on a laptop). That&#39;s the reason to buy from a vendor who pre-installs Linux. They guarantee hardware compatibility. Note that some things, like YouTube, will still run into glitches every now and then. The glitches get fixed...but it&#39;s not just plug and play like a Macbook.</p>
<p>Here are some other places you can get pre-installed Ubuntu Linux.</p>
<ul>
<li><a href="http://zareason.com/shop/Laptops/">ZaReason</a> is similar to System76. Their <a href="http://zareason.com/shop/UltraLap-430.html">UltraLap 430</a> almost won me over with its small size and weight.</li>
<li>Dell&#39;s <a href="http://content.dell.com/us/en/gen/d/campaigns/xps-linux-laptop">Project Sputnik</a> campaign has put together a high end, super thin, super light machine.</li>
<li>Thinkmate&#39;s <a href="http://www.thinkmate.com/Computer_Systems/Workstations/Workstations/Series/HPX">HPX series workstations</a> are complete overkill and unnecessary for this class, but fun to look at.</li>
</ul>
<hr>
<h3>Update</h3>
<p>After owning the computer for two months, I have this report:</p>
<ul>
<li>I LOVE it</li>
<li>For scientific computing, it works better &quot;right out of the box&quot; than my co-workers&#39; Macs</li>
<li>The keyboard/trackpad isn&#39;t that good...but I use an external keyboard/mouse</li>
<li>After installing some extra &quot;32-bit enabling&quot; libraries, Skype works perfectly</li>
<li>YouTube works perfectly</li>
<li>Amazon&#39;s streaming movies don&#39;t work</li>
</ul>
winter break fun2012-12-20T00:00:00-08:00http://columbia-applied-data-science.github.com/extras/2012/12/20/winter-break-fun<p>Here are some things you can do to get a jumpstart on the class.</p>
<h2>Everyone</h2>
<ul>
<li>Motivate yourself to use unix
<ul>
<li>Hole Hawg <a href="http://www.team.net/mjb/hawg.html">story</a> (this is what converted me)</li>
<li>The <a href="http://en.wikipedia.org/wiki/Unix_philosophy">unix philosophy</a></li>
</ul></li>
<li>Set up your computer for this course. Instructions <a href="/Lessons/2012/12/20/computer-setup">here</a>. WARNING! This may be difficult. If you are stuck, then get help from a friend (I can&#39;t help now), or wait until class starts.</li>
</ul>
<h2>Beginner</h2>
<ul>
<li>Software Carpentry <a href="http://software-carpentry.org/4_0/index.html">Lessons</a>
<ul>
<li><a href="http://software-carpentry.org/4_0/python/intro.html">Python</a></li>
<li>The unix <a href="http://software-carpentry.org/4_0/shell/index.html">shell</a></li>
</ul></li>
<li>Python introductory <a href="http://www.codecademy.com/tracks/python">course</a></li>
<li>Unix shell <a href="http://www.ee.surrey.ac.uk/Teaching/Unix/">tutorial</a></li>
</ul>
<h2>Intermediate</h2>
<ul>
<li><a href="http://learn.github.com/p/intro.html">Git</a>
<ul>
<li>Clone the <a href="https://github.com/columbia-applied-data-science/homework-01">first homework repo</a> and try to understand it</li>
<li>The first homework will ask you to modify this utility to do other useful things</li>
</ul></li>
<li><a href="http://software-carpentry.org/4_0/softeng/index.html">Software engineering</a></li>
<li><a href="http://software-carpentry.org/4_0/regexp/index.html">Regular expressions</a></li>
</ul>
Homework 01: Computer Setup for Applied Data Science Course2012-12-20T00:00:00-08:00http://columbia-applied-data-science.github.com/homework/2012/12/20/computer-setup<p>Note: People not in this Course, but who <em>are</em> participating in the <a href="http://software-carpentry.org/bootcamps/2013-01-columbia.html">software carpentry bootcamp</a>, should instead follow <a href="/swcsetup.html">these instructions</a></p>
<p><a id="beforeclass"></a></p>
<h1>Before First Class</h1>
<p>Bring your computer to class so we can help you set things up.</p>
<p>You should download the following before coming to the first class on <em>Wednesday, January 23rd, 2013</em>:</p>
<ul>
<li>A version of <a href="https://store.continuum.io/cshop/anaconda">Anaconda</a> appropriate for your machine</li>
<li>If you have Windows, download and unzip either the 32 or 64 bit <a href="http://virtualboxes.org/images/ubuntu/#ubuntu1210">VM image</a>. See <a href="#death">this explanation</a> about 32 vs. 64 bit.
<ul>
<li>If you download the versions with <em>guest-additions</em> pre-installed you can save yourself a little bit of work</li>
</ul></li>
<li>If you have a mac, download Xcode
<ul>
<li>First try getting it from the app store</li>
<li>If this doesn&#39;t work (due to an older OSX), you have to <a href="https://developer.apple.com/programs/register/">register as a developer</a></li>
</ul></li>
</ul>
<hr>
<h1>Software to install</h1>
<h2>Overview</h2>
<ul>
<li>Python distribution
<ul>
<li>Anaconda or EPD</li>
<li>For <a href="#linux">Linux</a> users</li>
<li>For <a href="#mac">Mac</a> users</li>
<li>For <a href="#death">Windows</a> users</li>
</ul></li>
<li><a href="#lib">Editor</a>
<ul>
<li>vim-gnome or macvim</li>
</ul></li>
<li>Version control
<ul>
<li><a href="#git">git</a></li>
</ul></li>
<li>Additional <a href="#lib">libraries</a>
<ul>
<li>pdbpp</li>
<li>pep8</li>
<li>line_profiler</li>
</ul></li>
</ul>
<h2>Motivation</h2>
<p>Installing software and setting up your system for this class can be quite easy,
or very very difficult, depending on your OS, your existing environment, and
random chance. During the first week of class, we will have dates/times
dedicated to helping you set up your system. After these dates, you are almost
on your own. Although you can find instructions on the Internet, they often
don&#39;t work exactly as stated.</p>
<h2>Supported Configurations</h2>
<ul>
<li>To the best of our ability, we will support <a href="https://www.ubuntu.com">Ubuntu</a> Linux and Mac OSX operating systems, along with the <a href="https://store.continuum.io/cshop/anaconda">Anaconda</a> Python distribution (Anaconda handles all of your Python package needs).</li>
<li>This class will require use of Linux utilities. Standard Microsoft Windows will not work.</li>
</ul>
<hr>
<h2>Installing Supported Python Configurations</h2>
<p><a id="linux"></a></p>
<h4>If you have Linux</h4>
<ul>
<li>Install <a href="https://store.continuum.io/cshop/anaconda">Anaconda CE</a>.</li>
<li>Hints
<ul>
<li>Try to install in your HOME directory (default) so you don&#39;t need sudo</li>
<li>Don&#39;t invoke the installer shell using sudo if installing into HOME directory</li>
<li>Remember to configure your <a href="#env">environment</a></li>
<li>Remember to read the <a href="http://docs.continuum.io/anaconda/install.html">documentation</a></li>
</ul></li>
<li>Install <a href="#git">Git version control</a>, and additional packages including <a href="#lib">VIM and other Python packages</a></li>
</ul>
<p><a id="mac"></a></p>
<h4>If you have a Mac</h4>
<ul>
<li>You need to first install Xcode.
<ul>
<li>Xcode can be installed from the App Store. After installing, go to the top left of your screen and click Xcode -&gt; Preferences -&gt; Downloads, find &quot;command line tools&quot;, and click Install.</li>
<li>Xcode is a 1GB+ download, so you will not have time to download it in class on Wednesday.</li>
<li>Note that the current version of Xcode is only supported by OSX 10.7.4+, so we highly recommend you upgrade your operating system. If for whatever reason you absolutely cannot upgrade your OS, you need to register for a free Apple developer account and download the appropriate version of Xcode.</li>
</ul></li>
<li>For 64 bit OSX, install <a href="https://store.continuum.io/cshop/anaconda">Anaconda CE</a>
<ul>
<li>By default this is installed in your home directory. Unless you know what you&#39;re doing, don&#39;t change it.</li>
</ul></li>
<li>For 32 bit OSX, install <a href="https://www2.enthought.com/accounts/register/?next=/licenses/academic">EPD academic</a></li>
<li>Remember to read the <a href="http://docs.continuum.io/anaconda/install.html">documentation</a></li>
<li>Remember to configure your <a href="#env">environment</a></li>
<li>Install <a href="#git">Git version control</a>, and additional packages including <a href="#lib">VIM and other Python packages</a></li>
</ul>
<p><a id="death"></a></p>
<h4>If you have Windows</h4>
<p>We will set you up with a Linux virtual machine. You can then follow the Linux instructions.</p>
<ul>
<li>Use a 32 bit Ubuntu Linux VM if you have 4-6GB of memory</li>
<li><p>Use a 64 bit Ubuntu Linux VM if you have &gt; 6GB of memory</p></li>
<li><p>Download and unzip either the 32 or 64 bit <a href="http://virtualboxes.org/images/ubuntu/#ubuntu1210">VM image</a>. </p>
<ul>
<li>If you download the versions with <em>guest-additions</em> pre-installed you can save yourself a little bit of work</li>
</ul></li>
<li><p>Download <a href="https://www.virtualbox.org/wiki/Downloads">VirtualBox</a></p></li>
<li><p>Run the installer</p></li>
<li><p>Open VirtualBox Manager</p></li>
<li><p>Click &quot;New&quot; to create new virtual machine</p></li>
<li><p>Select &quot;Linux&quot; for Type and &quot;Ubuntu&quot; or &quot;Ubuntu (64-bit)&quot; for Version</p></li>
<li><p>Next, allocate memory for your VM. If you have X total GB of RAM, and you allocate Y to your VM, then Windows has X - Y left over for itself. You must balance the needs of both Ubuntu and Windows. Here are some hints.</p>
<ul>
<li>64 bit Windows is unhappy with less than 3 GB</li>
<li>64 bit Ubuntu is unhappy with less than 3 GB</li>
</ul></li>
<li><p>Next, select &quot;Use an existing virtual hard drive file&quot; and select the VDI file you downloaded</p></li>
<li><p>Once the VM has been created, select it and click &quot;Start&quot;</p></li>
<li><p>If the VM image you downloaded already has <em>guest-additions</em> installed then you can skip this step. Otherwise once the setup is complete you need to install <a href="http://virtualboxes.org/doc/installing-guest-additions-on-ubuntu/">guest additions</a>.</p></li>
<li><p>The default keyboard layout is Italian. To change this</p>
<ul>
<li>Click System Settings</li>
<li>Keyboard Layout</li>
<li>Hit the &quot;+&quot; button</li>
<li>Select a new layout from the list</li>
</ul></li>
</ul>
<p><strong>IMPORTANT!</strong> Now change the window manager:</p>
<ol>
<li>Turn on your VM and open a terminal</li>
<li>In the terminal, type <code>sudo apt-get install gnome-session-fallback</code>
<ul>
<li>This will install a new graphics manager for your desktop</li>
<li>Click <code>Y</code> when asked</li>
</ul></li>
<li>Log out (or restart)</li>
<li>When you log in, there will be a &quot;gear shaped&quot; icon near your login name. Click it and select <em>GNOME Classic (No Effects)</em>
<img src="/images/gnome-classic-login-screen.png" alt="gnome-classic"></li>
</ol>
<p><a id="env"></a></p>
<h3>Extra Help: Configuring environment variables</h3>
<ul>
<li>Modify your shell configuration file, henceforth referred to as your <em>bashrc file</em>.
<ul>
<li>Mac OSX: From your home directory (i.e., ~/), open either <code>.bash_profile</code> or <code>.bash_aliases</code> (create one if neither exists) and add <code>export PATH=$HOME/anaconda/bin:$PATH</code></li>
<li>Linux: From your home directory (i.e., ~/), open <code>.bashrc</code> (create one if it doesn&#39;t exist) and add <code>export PATH=/path/to/python:$PATH</code></li>
</ul></li>
<li>Refresh your terminal by typing
<code>
source ~/.bashrc
</code>
or just opening a new terminal.</li>
</ul>
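The reason prepending works: the shell scans the PATH entries left to right and runs the first match it finds. A small illustration of that ordering (the install location shown is hypothetical):

```python
# Simulate what `export PATH=$HOME/anaconda/bin:$PATH` does: the new
# directory goes on the front, so its executables are found first.
import os

anaconda_bin = os.path.expanduser("~/anaconda/bin")  # hypothetical install location
old_path = os.environ.get("PATH", "")
new_path = os.pathsep.join([anaconda_bin, old_path])

# The first entry is searched first, so Anaconda's python wins.
print(new_path.split(os.pathsep)[0])
```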
<p><a id="check"></a></p>
<h3>Verify Things</h3>
<ul>
<li><p>Open a terminal and start IPython with:
<code>
ipython --pylab
</code></p>
<ul>
<li>Verify numpy with
<code>
import numpy
</code></li>
<li>To check pandas and matplotlib, from IPython, type</li>
</ul>
<div class="highlight"><pre><code class="text language-text" data-lang="text">from pandas import Series
Series(randn(10)).plot()
</code></pre></div></li>
<li><p>Verify the notebook:</p>
<ul>
<li><code>ipython notebook --pylab=inline</code> should pop up a browser window and show the notebook dashboard</li>
</ul></li>
<li><p>Verify your PATH setting:</p>
<ul>
<li><code>which python</code> should show the directory in which you installed Anaconda/EPD as the first entry</li>
</ul></li>
</ul>
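For a non-interactive version of these checks, a short script (standard library only; the package names are the ones this course uses) can report what is importable:

```python
# Report whether the course's core packages can be found by this Python,
# without importing them (so a broken install won't crash the check).
import importlib.util

for pkg in ("numpy", "pandas", "matplotlib"):
    found = importlib.util.find_spec(pkg) is not None
    print(pkg, "OK" if found else "MISSING")
```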
<p><a id="lib"></a></p>
<h3>Install additional software</h3>
<ul>
<li>Ubuntu users should use the <code>apt-get</code> command to install software packages. The syntax is <code>sudo apt-get install &lt;package1&gt; &lt;package2&gt; ...</code></li>
<li>VIM
<ul>
<li>Linux: <code>sudo apt-get install vim vim-gnome</code></li>
<li>Mac: <a href="http://macvim.org/OSX/index.php">download</a> MacVim and follow the installation instructions</li>
</ul></li>
<li>Other (easier/weaker) editors
<ul>
<li>Linux: <code>sudo apt-get install gedit-plugins</code></li>
<li>Mac: Download and install Sublime Text</li>
</ul></li>
<li><p>Python libraries</p>
<div class="highlight"><pre><code class="text language-text" data-lang="text">pip install pdbpp line_profiler pep8
</code></pre></div>
<ul>
<li>If you don&#39;t have pip, install it first using <code>easy_install pip</code></li>
</ul></li>
</ul>
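Of these, pdbpp is the one you&#39;ll likely use most: once installed, it replaces the standard <code>pdb</code> module in place, so the usual breakpoint idiom gains tab completion and syntax highlighting with no code changes. A sketch (the function here is made up for illustration):

```python
# Once pdbpp is installed, `import pdb` transparently gives you pdb++.
def running_total(n):
    total = 0
    for i in range(n):
        total += i
    # import pdb; pdb.set_trace()  # uncomment to inspect `total` interactively
    return total

print(running_total(5))  # 0 + 1 + 2 + 3 + 4
```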
<hr>
<p><a id="git"></a></p>
<h2>Set up version control</h2>
<ul>
<li>Sign-up for free Github account
<ul>
<li>Send username, email address, and uni to Zach Shahn <a href="mailto:zss2101@columbia.edu">zss2101@columbia.edu</a></li>
</ul></li>
<li>Install git
<ul>
<li>Mac: download from mac.github.com</li>
<li>Linux: type <code>sudo apt-get install git</code></li>
</ul></li>
</ul>
<hr>
A few words about what this class intends to teach you2012-12-20T00:00:00-08:00http://columbia-applied-data-science.github.com/about-logistics/2012/12/20/a-few-words-about-what-this-class-indends-to-teach-you<hr>
<h2>What this class is not</h2>
<p>This class is not a traditional statistics course, although much of the material
will be rooted in statistical analysis. This class is not a computer
science course, although you will program a lot and hopefully become
better at doing so in different environments. And, this class is not a
machine learning course, although ML techniques will be foundational
material for the lectures. You will not have clean data sets to start
with all the time; you will get data sets as we have seen them working
in data science space for the last few years. The data sets might be
messy and unstructured, and it might not always be clear how to extract
the relevant signals for the problem at hand. However, this is part of
the fun and we hope you will agree by the time May rolls around. </p>
<hr>
<h2>What this class is</h2>
<p>This class is an introduction to the collection of techniques we have
found indispensable when working in the data science space. There will
be significant emphasis on understanding the relevant statistics and
proper application thereof. It will
teach you to write good code and to use collaborative tools, because if you
intend to build things for other people to use, there is no other option.
We will talk about staple machine learning algorithms
and techniques, going into some depth about the background mathematics,
but will always come back to implementing those techniques in Python
libraries to be used for subsequent data analysis. Sometimes you will have to
find, get, process and clean data before taking initial steps in any
kind of statistical modeling. In short, you will have a taste of the
day-to-day in the data science world, and walk away with the foundational
knowledge and toolkit that will allow you to build solutions in this
relatively new and exciting area.</p>
Course. Data Science and Technology Entrepreneurship2012-11-30T00:00:00-08:00http://columbia-applied-data-science.github.com/data%20science%20activities/2012/11/30/course-data-science-and-technology-entrepreneurship<p><a href="http://www.columbia.edu/%7Echw2/">Chris Wiggins</a> informed me of a course that may be of interest to some of you. See <a href="https://boss.gsb.columbia.edu/registrar-student/SnippetPage.tap;jsessionid=F1595A34B784F05F3ED2C7521C460E86?sp=CourseSchedule">this page</a> for updated information including the <a href="http://angel.gsb.columbia.edu/AngelUploads/Content/8848-001-20131/_syllabus/Maskey.pdf">syllabus</a>. A brief description is below.</p>
<hr>
<h2>Data Science And Technology Entrepreneurship</h2>
<p>&quot;Offered jointly between Columbia Business School and Computer Science Department&quot;</p>
<p>This course will pair up MBA students from Columbia Business School with Master’s/PhD students from Computer Science department to form teams of two (or more) who will be guided through an entrepreneurial experience of building a technology startup. The course will be very hands on! The course will also have a team of 12 Industry Advisors/Mentors (CEOs, CTOs and VC Partners of various firms) who will engage with students to help them convert their idea into a sustainable technology business.</p>
<p>Data Science is an emerging interdisciplinary field across statistics, computer science and business. The course will not only focus on theoretical aspects of data sciences but also on applying them in building products and improving business processes. Student teams (composed of CS/Engineering and Business students) will use data driven methods to test feasibility of the idea/innovation, build the product, develop customers, study sales channels and try to raise capital during the span of 4 months. Industry mentors will critique the student teams and their ideas through various stages of the startup implementation addressing such questions related to feasibility, market attractiveness, customer acquisition, metrics, launch strategy and more. The students will be able to interact with CEOs for business mentorship, CTOs for technical mentorship and VC firm partners for advice on the capital raising process.</p>
Course Proposal2012-11-15T00:00:00-08:00http://columbia-applied-data-science.github.com/about-logistics/2012/11/15/course-proposal<h1>This post is no longer current</h1>
<h2>Please see the <a href="/description.html">course description</a></h2>
<h2>Course basics</h2>
<ul>
<li>Number: STAT 4249</li>
<li>Class times: MW 6:10 - 7:25</li>
<li>Lead instructor: Ian Langmore, <a href="mailto:ianlangmore@gmail.com">ianlangmore@gmail.com</a></li>
</ul>
<h2>This class is...</h2>
<ul>
<li>for people with an understanding of statistics at the first-year graduate level or beyond</li>
<li>a way to learn/write basic algorithms for statistical inference and predictive analytics</li>
<li>a chance to apply algorithms to real data sets and gain data science intuition</li>
<li>a way to learn solid programming skills (beginning through intermediate)
<ul>
<li>Python</li>
<li>Linux</li>
<li>Github</li>
<li>Collaborative development</li>
<li>Object Oriented Design</li>
</ul></li>
</ul>
<h2>This class is not...</h2>
<ul>
<li>for people who don&#39;t know any stats or linear algebra</li>
<li>for people who have never programmed before</li>
<li>an overview of advanced methods in machine learning</li>
</ul>
<h2>Full description</h2>
<p>The explosion of available data coinciding with the continued evolution of statistical and computational methods has resulted in a new breed of specialist. These data scientists use rigorous statistical methods to find meaning in data. Minimizing a loss function is not enough: Business and societal decisions hinge on the interpretation of these insights. The world of scientific computation is rapidly evolving. Quick-and-dirty scripts are not enough: A maintainable code base and collaborative development environment allows projects to productionalize and scale. A data scientist must wear many caps; we present two of them here.</p>
<p>Maintainable coding techniques will be taught using test-driven-development, version control, and collaboration. Code will be of the type found in the <a href="http://scikit-learn.org/stable/">scikit-learn</a> and <a href="http://statsmodels.sourceforge.net">statsmodels</a> packages. Students finish the class having created a library on <a href="https://github.com">GitHub</a>, and an understanding of several core statistical/machine-learning algorithms.</p>
<p>Case studies give students the opportunity to use their own software on real world data sets. Here they develop intuition for extracting meaning from data. Students finish the class with a website/blog/portfolio, and experience with the translation:</p>
<p>Real world --&gt; data --&gt; scientist --&gt; collaborators/coworkers --&gt; policy-decision/data-product</p>
<h2>Lecture structure</h2>
<ol>
<li>An algorithm is presented. Students are randomly assigned to groups and together write a productionalizable implementation.</li>
<li>The class is presented with a data-driven business/scientific problem that a company/institution has, which they must solve (using the algorithm from step 1).
<ul>
<li> Each step takes one week.</li>
<li> Step 1 demands that a GitHub repo be created. The repo is maintained with the imaginary goal of being later productionalized for a client. This problem is very clearly defined. The goals here are to learn algorithms and scientific computing skills in a collaborative environment.</li>
<li> Step 2 demands the creation of a presentation and a written report. One group is randomly chosen to present their pitch/solution to the class. The problem will not necessarily be clearly defined. Students must find where they can add value, then convince us that they can. Students use software developed in step 1, along with other packages.</li>
<li> The data for step 2 will come from NYC start-ups and non-profits.</li>
<li> We will use the book <a href="http://shop.oreilly.com/product/0636920023784.do">Python for Data Analysis</a>.</li>
<li> We may use the book <a href="http://www.manning.com/pharrington/">Machine Learning in Action</a>. If so, we will require modifications of the algorithms presented there.</li>
</ul></li>
</ol>
<h2>Prerequisites</h2>
<ul>
<li>Stats (4109 or 4105+4107) or equivalent</li>
<li>Some proficiency in programming</li>
<li>Computer:<br>
<ul>
<li>A mac or Linux is fine.</li>
<li>If you have Windows, we will assist you in setting up a Linux dual-boot or virtual machine.</li>
<li>An 8GB machine will help you tremendously. A 2GB machine will cause headaches. Spend the $60 and upgrade...you want to analyze data, right?</li>
</ul></li>
</ul>
<h2>Lectures/algorithms/HW</h2>
<ol>
<li>Introduction
<ul>
<li>Course introduction</li>
<li>Software setup workshops
<ul>
<li>If you have Linux, then we will do a quick check of your system</li>
<li>If you have a Mac, we will transform it into a <em>real</em> mac</li>
<li>If you have a Windows machine, we will set up Linux with either a virtual machine or dual-boot.</li>
</ul></li>
</ul></li>
<li>Programming introduction
<ul>
<li>Python introduction</li>
<li>Unix introduction</li>
<li><a href="http://software-carpentry.org/">Software carpentry</a> workshops</li>
</ul></li>
<li>Data Tools
<ul>
<li>Git/Github introduction</li>
<li>Teams build a suite of data tools
<ul>
<li>Cleaning filters</li>
<li>Subsampling</li>
<li>SQL scripts</li>
</ul></li>
</ul></li>
<li>Exploratory data analysis
<ul>
<li>Pandas</li>
<li>Numpy, scipy, matplotlib</li>
<li>Build an EDA suite </li>
</ul></li>
<li>Linear regression
<ul>
<li>The singular value decomposition (SVD)</li>
<li>Maximum likelihood</li>
<li>Regularization and Bayesian estimators</li>
<li>Memory hierarchy, stability, and why you never explicitly invert a medium or large matrix</li>
<li>Teams build a linear regression module</li>
<li>Teams work on case study (topic TBD)</li>
</ul></li>
</ol>
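The &quot;never explicitly invert a matrix&quot; point in the linear regression unit can be previewed with numpy (an illustrative sketch, assuming numpy is installed):

```python
# To solve A x = b, use a linear solve rather than forming inv(A) and
# multiplying: the solve is both faster and more numerically stable.
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(100, 100)
b = rng.randn(100)

x_good = np.linalg.solve(A, b)     # preferred
x_bad = np.linalg.inv(A).dot(b)    # avoid for medium or large matrices

print(np.allclose(x_good, x_bad))  # expect True for a well-conditioned matrix
```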
<p>Other algorithms presented will follow the same structure as &quot;Linear regression&quot; above, and could include:</p>
<ul>
<li>Logistic regression/classification</li>
<li>K-nearest neighbors</li>
<li>Kernel density estimation</li>
<li>Decision trees, random forests</li>
<li>Monte Carlo simulation</li>
<li>Recommendation systems</li>
</ul>
<p>Possible additional topics</p>
<ul>
<li>Web scraping</li>
<li>Typing and compilers. Could be taught by using Cython.</li>
</ul>