I recently predicted grocery sales for a Kaggle competition.
In this competition, we were responsible for using data from six tables to predict how many units of different items
would sell on future dates. This competition presented several challenges, including merging multiple tables, working with
a data frame that was larger than RAM, and working with categorical variables that had many classes. This is part 1, where
I discuss how I dealt with the large data frame. I will discuss my handling of categorical variables with H2O in
part 2. I will update this post with a link when part 2 is available.

If you have a Twitter feed like mine (i.e., nerdy), you can hardly go a day without seeing some mention of “deep learning.” In fact, a quick glance at Google Analytics
shows that searches for deep learning have been rising over the past 5 years. I included “linear regression” to have a point of comparison. (You’ll note the famous
“people search for this more when school is in session” trend associated with linear regression.)

Everyone knows that MATLAB is terrible, and I never want to use it again once I get out of this rattrap. But to do serious data work in the serious world, you need to use a combination of Python and SQL. On the third hand, I couldn’t just throw my grad school life away and break free (I tried that, and it didn’t work).

About us

We are a collection of Psychology and Neuroscience graduate students
from UC Davis who are interested in data science, user experience, and local beer. Our shared
goal is to help each other prepare for a life (i.e., job) outside of academia,
or perhaps to take a more modern approach to a life inside it. You can read the latest
blog post to the left, find older posts in the archive, or check
out some of our projects.