Many people are uncomfortable with numbers, and even more don’t really understand statistics. It’s very, very easy to overwhelm people with numbers, charts, and tables – and yet numbers are more important than ever. The trend toward running companies in a data-driven way is only growing…which means more programmers will be spending time building data products. These might be internal reporting tools (like the dashboards that your CEO will use to run the company) or, like Mixpanel, you might be building external-facing data analysis products for your customers.

Either way, the question is: how do you build usable interfaces to data that still give deep insights?

We’ve spent the last 6 years at Mixpanel working on this problem. In that time, we’ve come up with a few simple rules that apply to almost everyone:

On Monday we shipped distinct_id aliasing, a service that makes it possible for our customers to link multiple unique identifiers to the same person. It’s running smoothly now, but we ran into some interesting performance problems during development. I’ve been fairly liberal with my keywords; hopefully this will show up in Google if you encounter the same problem.

The operation we’re doing is conceptually simple: for each event we receive, we make a single MySQL SELECT query to see if the distinct_id is an alias for another ID. If it is, we replace it. This means we get the benefits of multiple IDs without having to change our sharding scheme or move data between machines.
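To make the lookup concrete, here is a minimal sketch of the per-event alias check. It is an illustration under stated assumptions, not our production code: the `aliases` table, its column names, and the helper functions are all hypothetical, and it uses Python's built-in `sqlite3` as a stand-in for MySQL so it can run anywhere.

```python
import sqlite3


def make_alias_table(conn):
    # Hypothetical schema: one row per alias, pointing at the canonical ID.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS aliases ("
        "alias TEXT PRIMARY KEY, distinct_id TEXT NOT NULL)"
    )


def resolve_distinct_id(conn, distinct_id):
    """Return the canonical ID if distinct_id is an alias, else return it unchanged."""
    row = conn.execute(
        "SELECT distinct_id FROM aliases WHERE alias = ?", (distinct_id,)
    ).fetchone()
    return row[0] if row else distinct_id


def rewrite_event(conn, event):
    """Return a copy of the event with its distinct_id replaced by the canonical ID."""
    return {**event, "distinct_id": resolve_distinct_id(conn, event["distinct_id"])}


# In-memory database standing in for the real MySQL server.
conn = sqlite3.connect(":memory:")
make_alias_table(conn)
conn.execute("INSERT INTO aliases VALUES (?, ?)", ("anon-123", "user-42"))
```

Calling `rewrite_event(conn, {"event": "signup", "distinct_id": "anon-123"})` swaps in `user-42`, while an ID with no alias row passes through untouched. Because the rewrite happens before the event is stored, nothing downstream needs to know aliases exist.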

A single SELECT would not normally be a big deal – but we’re doing a lot more of them than most people. Combined, our customers have many millions of end users, and they send Mixpanel events whenever those users do stuff. We did a little back-of-the-envelope math and determined that we would have to handle at least 50,000 queries per second right out of the gate.

When we started Mixpanel, we used amCharts, a pretty full-featured Flash-based charting library. This wasn’t ideal though – it’s closed-source and, well, it’s Flash. We ultimately switched over to pure JavaScript charts and it was a great decision.

Now if something wonky happens, I can easily modify the library code. We also get the added benefit of broader platform support – you can use mixpanel.com on your mobile device and it works perfectly.

Actually picking the library was a little tricky. We were lucky – Highcharts was released right when we started looking and it has performed admirably. There are a few other good choices though, and I will go into all of them in some depth.

In my previous post covering OpenVPN, I said that we needed to restrict access to most of our servers – they will only be accessible to each other, rather than open to the outside world.

How do we do this? iptables. You can add iptables rules that explicitly state the IP addresses that are allowed through the firewall, and then disallow everything else.

If our network were static – meaning we would never have to add more machines – then this would be really simple. All you’d need to do is update your iptables file once with the IP of every server you own, and you’re done. No worries.

In the real world, the network isn’t static. We’re adding new machines all the time, and if we don’t update iptables at the same time, the new machines won’t be able to communicate with the old ones. To solve this problem, I dynamically generate iptables files and deploy them with Fabric.
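As a rough idea of what "dynamically generate iptables files" can look like, here is a small sketch that renders a rules file in `iptables-restore` format from a list of server IPs. The function name and the example addresses are made up for illustration, it assumes IPv4 only, and it leaves out the Fabric step that pushes the file to each machine. The extra loopback and established-connection rules are my assumption about what a usable whitelist needs, not something spelled out above.

```python
import ipaddress


def render_iptables_rules(server_ips):
    """Render an iptables-restore file that accepts traffic only from server_ips."""
    # Validate and de-duplicate; IPv4Address raises ValueError on bad input.
    allowed = sorted({ipaddress.IPv4Address(ip) for ip in server_ips})

    lines = [
        "*filter",
        ":INPUT DROP [0:0]",      # default policy: disallow everything else
        ":FORWARD DROP [0:0]",
        ":OUTPUT ACCEPT [0:0]",
        # Assumed extras: keep loopback and replies to outbound connections working.
        "-A INPUT -i lo -j ACCEPT",
        "-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT",
    ]
    # One explicit ACCEPT rule per known server.
    lines += [f"-A INPUT -s {ip} -j ACCEPT" for ip in allowed]
    lines.append("COMMIT")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical inventory; in practice this would come from wherever
    # the list of machines is tracked.
    print(render_iptables_rules(["10.0.0.12", "10.0.0.2"]))
```

Regenerating the whole file from the current server list each time, rather than appending rules by hand, means adding a machine is just a matter of updating the list and redeploying.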

When I want to deploy code to http://mixpanel.com, I open a new terminal window and type fab deploy. Even though we have quite a few servers these days, our deploy process is really streamlined – we push code multiple times per day.

When you’re first starting a new web project, deployment is easy. All you have to do is log in to your server and do a git pull, and probably restart your web server. No problem.

If you grow beyond a single machine, though, this technique is rife with problems: the time it takes to deploy code grows linearly with the number of servers you have, it’s difficult to synchronize deployment, and it’s simply error-prone. Any point in your deployment process that requires you to log in to a server and type multiple commands is just asking for trouble.

We use a tool called Fabric to automate this process. Fabric makes it really easy to run commands across sets of machines. It’s similar to Capistrano (Ruby) but it’s written in Python so it was an easy choice for us.
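To show the underlying idea of running the same commands on every server at once, here is a standard-library-only sketch. To be clear, this is not Fabric's API and not our actual fabfile: the host names, paths, and commands are hypothetical, and it shells out to the `ssh` binary where Fabric would manage connections for you.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical deploy steps; a real project would have its own.
DEPLOY_COMMANDS = [
    "cd /srv/app && git pull",
    "sudo service nginx restart",
]


def ssh_runner(host, command):
    """Run one command on one host via the ssh binary and return its stdout."""
    result = subprocess.run(
        ["ssh", host, command], check=True, capture_output=True, text=True
    )
    return result.stdout


def deploy(hosts, commands=DEPLOY_COMMANDS, runner=ssh_runner):
    """Run every command on every host, with hosts handled in parallel.

    Commands run in order on a given host; a failure raises and stops
    that host's sequence.
    """
    def deploy_one(host):
        return [runner(host, command) for command in commands]

    # One worker per host, so wall-clock time no longer grows with host count.
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        outputs = list(pool.map(deploy_one, hosts))
    return dict(zip(hosts, outputs))
```

The `runner` argument is there so the fan-out logic can be exercised without touching a real server, for example `deploy(["web1", "web2"], runner=lambda host, cmd: "ok")`.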

Imagine this scenario: your company is growing rapidly and you’re hiring tons of engineers. The passwords to many of your servers are stored in plaintext in configuration files — everyone has access to them. Your main database – the one with all the user data – is available to anyone who has that file.

You fire an engineer. Now what? He knows the passwords to everything. Do you trust him? What if you’re firing him because he’s just a bad egg? What can you do?

Well, you could change the passwords on every single server… but that’s a huge pain in the ass for everyone involved. Luckily, there’s a solution for this problem, one that comes with quite a few other benefits as well: you set up a VPN.