Research & Ramblings

Let me start with a spoiler: the correct way to mow your lawn is to pay someone else to do it. Period. But I'm a data scientist, so I'm going to show you how to spend all your time planning how to do it instead. You won't actually get to start mowing, because this method takes more time than you'll spend mowing your lawn all summer. Possibly literally.

I hate mowing, so I sat down to plan out how to minimize the time I spent mowing. The planning, of course, was a thinly veiled attempt at procrastination, which makes this a genuinely multi-objective optimization problem: minimize mowing time while maximizing time spent not mowing.

What interested me in this problem was finding a real-world use case for solving some small instances of the Traveling Salesperson Problem. I also thought the wrinkle of minimizing turns was an interesting modification to figure out.
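To make "small instances" concrete: with n stops there are only (n-1)!/2 distinct closed tours, so brute force is perfectly viable up to roughly ten stops. A minimal sketch, with made-up waypoint coordinates:

```python
import numpy as np
from itertools import permutations

# Five made-up waypoints in the yard, as (x, y) coordinates.
points = np.array([(0, 0), (2, 0), (2, 2), (0, 2), (1, 3)], dtype=float)

def tour_length(order):
    """Total distance of the closed tour visiting points in `order`."""
    route = points[list(order) + [order[0]]]  # return to the start
    return float(np.sum(np.linalg.norm(np.diff(route, axis=0), axis=1)))

# Brute force: every permutation of the stops is a candidate tour.
best = min(permutations(range(len(points))), key=tour_length)
print(best, round(tour_length(best), 3))  # optimal length is 6 + 2*sqrt(2)
```

Past about ten stops the factorial blows up and you need heuristics or an ILP solver, but for a lawn this naive version is enough to get going.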

Chris Albon tweeted an image of an undirected graph that he used to justify to his partner the number of cables he needed. She, of course, responded that he was a nerd (subtext). I can only assume that her nerd-shaming of Albon and his lack of time are the reasons he didn't take this to the next level: a linear programming problem to minimize the amount of money spent on cables!

Once you’re done reading this post, show it to your
significant other and watch them roll their eyes in exasperation at your
ridiculous need to turn even the simplest of everyday tasks into a complex
problem worthy of that chip on your shoulder!


If you've read Thomas Wiecki's awesome introduction to hierarchical models (here: http://twiecki.github.io/blog/2014/03/17/bayesian-glms-3/), you probably started working out how to apply the technique to your own problems. A challenge you may run into pretty quickly: how do you go about modeling N different levels? Even more challenging: how do you model N levels while keeping the model vectorized? This post will be fairly terse and will illustrate how to actually set this up in PyMC3.

I've generated a small dataset to act as an example here. Let's pretend we have data on how much money the average person in different states and counties made over time, along with information on whether or not they got a degree. Using hierarchical modeling, we can relate these different factors to understand their impact on income. As I said, this data was all generated solely to illustrate a technical concept, so for one blog post try not to think critically about the data (or do, and expect to have some good laughs).

If you want to see how to generate the example data, I have included that at the end of this post.

WARNING: My main goal is to put an example online. This post assumes an understanding of PyMC3, Hierarchical modeling, pandas, etc.
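PyMC3 details aside, the trick that keeps an N-level model vectorized is precomputing one integer index array per level, so every prior can be written as a single fancy-indexing expression (e.g. `mu_state[county_to_state]` for the county-level means, and `mu_county[county_codes]` in the likelihood). A sketch of that bookkeeping with pandas, using made-up states and counties:

```python
import pandas as pd

# Toy frame: each row is an observation tagged with its state and county.
df = pd.DataFrame({
    "state":  ["WA", "WA", "WA", "OR", "OR", "OR"],
    "county": ["King", "King", "Pierce", "Lane", "Lane", "Polk"],
    "income": [52.0, 61.0, 48.0, 45.0, 50.0, 43.0],
})

# Level-1 index: one integer per observation, pointing at its county.
county_codes, county_names = pd.factorize(df["county"])

# Level-2 index: one integer per *county*, pointing at its state.
# This array is what lets the county-level prior be written as one
# vectorized expression like mu_state[county_to_state] in PyMC3,
# instead of a Python loop over counties.
county_to_state_name = df.groupby("county", sort=False)["state"].first()
county_to_state, state_names = pd.factorize(
    county_to_state_name.loc[county_names]
)

print(list(county_codes))     # one county code per row
print(list(county_to_state))  # one state code per county
```

The same pattern nests to any depth: each level only needs an index array mapping it into the level above, so the model stays a fixed handful of vectorized statements no matter how many groups you have.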


You've probably come across the Dirichlet distribution if you've done some work in Bayesian nonparametrics, clustering, or perhaps even statistical testing. If you have, and you're like I was, you may have wondered what this magical thing is and why it gets so much attention. Maybe you saw that a Dirichlet process is actually infinite-dimensional and wondered how that could possibly be useful.

I think I've found a very intuitive approach... it's at least quite different from any other I've read. This post requires that you already be familiar with Beta distributions, PDFs, and Python to follow along. If you meet that requirement, then grab a bag of nuts and let's jump right in.
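As a warm-up, here's the Beta connection in code: a Dirichlet sample is just a point on the simplex (k non-negative weights summing to 1), and with two components it collapses to a Beta. The alpha values here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# A Dirichlet sample: think "a random probability distribution over
# k categories". Here k = 3.
alpha = np.array([2.0, 3.0, 5.0])
sample = rng.dirichlet(alpha)
print(sample, sample.sum())  # three non-negative weights summing to 1

# With k = 2 the Dirichlet collapses to the Beta distribution: the
# first coordinate of Dirichlet([a, b]) is distributed as Beta(a, b).
two = rng.dirichlet([2.0, 5.0], size=100_000)[:, 0]
beta = rng.beta(2.0, 5.0, size=100_000)
print(round(two.mean(), 2), round(beta.mean(), 2))  # both near 2/7
```

Holding that "random distribution over k categories" picture in mind is what makes the jump to the infinite-dimensional Dirichlet process feel less mysterious.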

You've seen the articles that say "MCMC is easy! Read this!" and by the end of the article you're still left scratching your head. Maybe after reading that article you get what MCMC is doing... but you're still left scratching your head. "Why?"

"Why do I need to do MCMC to do Bayes?"

"Why are there so many different types of MCMC?"

"Why does MCMC take so long?"

Why, why, why, and so on.

Here's the point: you don't need MCMC to do Bayes. We're going to do Bayesian modeling in a very naive way, and it will lead us naturally to want to do MCMC. We'll understand the motivation, and then we'll be much better equipped to work with the different MCMC frameworks (PyMC3, Emcee, etc.) because we'll understand where they're coming from.

We'll assume that you have a solid enough background in probability distributions and mathematical modeling that you can Google and teach yourself what you don't understand in this post.
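To preview the "naive way": just evaluate the unnormalized posterior everywhere and normalize by brute force. A sketch with a made-up one-parameter model (Normal data with known scale, Normal prior on the mean):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=1.5, scale=1.0, size=50)  # toy observations

# Grid approximation: evaluate log prior + log likelihood at every
# grid point, then normalize. No MCMC anywhere.
mu_grid = np.linspace(-5, 5, 1001)
log_prior = -0.5 * (mu_grid / 10.0) ** 2                  # mu ~ Normal(0, 10)
log_like = np.array(                                      # data ~ Normal(mu, 1)
    [np.sum(-0.5 * (data - m) ** 2) for m in mu_grid]
)
log_post = log_prior + log_like
post = np.exp(log_post - log_post.max())  # subtract max for stability
post /= post.sum()

print(mu_grid[np.argmax(post)])  # posterior mode, near the data mean
```

This works beautifully in one dimension. But 1001 grid points per dimension means 1001**d evaluations for d parameters, around 1e30 at d = 10. That curse of dimensionality is exactly the wall that motivates MCMC.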


Here's the thing

I'm going to make this quick. You do a carefully thought-through analysis. You present it to all the movers and shakers at your company. Everyone loves it. Six months later, someone asks you a question you didn't cover, so you need to reproduce your analysis...

But you can't remember where the hell you saved the damn thing on your computer.

If you're a data scientist (especially the decision-sciences/analysis-focused kind), this has happened to you. A bunch. You might laugh it off, shrug, whatever, but it's a real problem, because now you have to spend hours, if not days, recreating work you've already done. It's a waste of time and money.

Fix it.

I used to be this person too, so I get it. I decided to experiment with a new method that sounds so simplistic and stupid you'll think it won't work.

In the previous post, we gathered data from an uncontrolled test that a store owner ran. We could do that because we assumed that every day was perfectly comparable. If you've ever looked at how business metrics move, though, you know that days are not directly comparable at all: each day of the week consistently brings in different numbers of visitors, some times of the year are busier than others, and so on. In this post, we're going to examine what it would take to test a change to a business without having a true control.

Controlled testing in modern systems is fairly straightforward. There are many tools to handle statistical analysis, random population sampling, data collection, and so on. With web visitors routinely numbering in the millions, the statistical techniques are also greatly simplified by the sheer abundance of data. What do we do, though, when data is very costly?

In this post, we'll take an overly simplified model and solve it using response surface modeling to find an approximate optimum with very little data. At the end of the post we'll discuss some possible caveats and some ideas for getting around those caveats.
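A toy version of the idea, with everything made up: assume one business knob and a hidden, roughly quadratic response. Probe the knob a handful of times, fit a quadratic surrogate to those few noisy observations, and read the approximate optimum off the surrogate rather than off the expensive real system:

```python
import numpy as np

rng = np.random.default_rng(1)

def observe(x):
    # Hypothetical true response: peaks at x = 3. Hidden from the analyst;
    # each call stands in for one costly real-world experiment.
    return -((x - 3.0) ** 2) + rng.normal(scale=0.5)

# Very little data: five deliberate probes of the knob.
xs = np.array([0.0, 1.5, 3.0, 4.5, 6.0])
ys = np.array([observe(x) for x in xs])

# Quadratic surrogate y ~ a*x**2 + b*x + c; its optimum sits at -b / (2a).
a, b, c = np.polyfit(xs, ys, deg=2)
x_star = -b / (2 * a)
print(round(x_star, 2))  # close to the true optimum, 3.0
```

Five observations instead of millions. The caveats discussed at the end of the post (a non-quadratic response, more knobs, noise swamping curvature) all attack this same surrogate step.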


Problem

Say you run a video store and you want to understand how customer rental frequencies are changing over time. You can just plot the numbers, but you want some help identifying when a customer's usage really went up and when it came down. If you just eyeball the data, you'll see that happening every day. What are the real changes, though?

If you charted the number of rentals per month for a given customer over many months, you might see this:

You can probably see an uptick somewhere around a year into their history... but where exactly? And is it safe to call the new level constant even though we see a fall afterwards? We could certainly give a pretty good guess, but we can't automate gut feel. How do we answer these questions quantitatively, without needing a human?
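To make "quantitatively" concrete, here's the simplest possible version of the idea on simulated data shaped like the chart described (a jump around month 12, easing off later; all rates and months are made up): score every candidate split point by the likelihood of a two-rate Poisson model and keep the best one.

```python
import numpy as np

# Simulated monthly rental counts for one customer: rate 3/month for a
# year, jumping to 7/month, then easing back to 5/month.
rng = np.random.default_rng(7)
rates = np.concatenate([np.full(12, 3.0), np.full(8, 7.0), np.full(10, 5.0)])
rentals = rng.poisson(rates)

def poisson_loglike(counts):
    """Poisson log likelihood of a segment at its MLE rate.

    The log(k!) terms are dropped: summed over the whole series they
    are the same for every split, so they never change the argmax.
    """
    lam = counts.mean()
    if lam == 0:
        return 0.0
    return float(np.sum(counts * np.log(lam) - lam))

# Try every split point t: model months [0, t) and [t, end) as two
# constant-rate segments, and keep the split with the best total score.
candidates = range(2, len(rentals) - 2)
scores = [
    poisson_loglike(rentals[:t]) + poisson_loglike(rentals[t:])
    for t in candidates
]
best_split = candidates[int(np.argmax(scores))]
print(best_split)  # should land near month 12, where the rate jumped
```

This brute-force version already answers "where exactly?" without a human, and it points straight at the fancier tools (multiple change points, Bayesian switchpoint models) the rest of the post builds toward.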