Building a Statistics Dashboard for Mozilla Marketplace

21 Jun 2012

It’s been a busy month-and-a-half since my last update filled with coding,
basketball, some Saint’s Row 3 and Mass Effect, and more recently tennis.
There hasn’t been quite a dull enough stretch enough time where I thought ‘hm
nothing to do, I should sit down for a couple of hours and write a nice old blog
post’. But I have been burning through my list of bugs with Marketplace Beta
launching today, and my body is sore from straight days of basketball,
lifting, and tennis. So now is as good of a time as any to write.

I’ll be explaining how the statistics dashboard for Mozilla Marketplace.
Marketplace is Mozilla’s app store meant to support their mobile operating
system, Boot2Gecko. What’s unique about Marketplace is that the apps are
simple webapps (HTML/CSS/JS) and is thus platform and device independent. Apps
can be installed on any OS, any browser, any device. Developers are no longer
tied to a specific programming language (Java/Obj-C) and app store
(Android/iOS). I’ll be explaining how it currently works and be talking about
what it was like to work on it.

Statistics Dashboard

A statistics dashboard is a page that displays data in the form of graphs,
charts, and tables with controls to change what data is displayed. An example
is Google Analytics which gives information about a website’s hits. This helps
users know how well their site is doing, what kinds of people are visiting,
and what they do during their visit. The landing dashboard shows hits over
time along with a table with aggregated stats.

For Marketplace, the statistics dashboard should let developers know how well
their apps are doing, where their purchases are coming from, and how much bank
they are making. For the dashboard, we want to precalculate the data so it
doesn’t have to get calculated every time a user pulls up the dashboard since
tht would be slow. Thus we store the data in a data store beforehand,
calculating it daily.

An Analogy

Kevin is a boy who likes to play with toys. To get toys, Kevin has to go to
the toy store, but the toy store is far away. It doesn’t make sense to have to
go to the toy store every time Kevin wants to play with toys. Not only that, he
has to spend time in the toy store looking for what he wants. So instead he
goes to buy toys every day after school and puts it in his drawer. He
categorizes different types of toys into different drawers so he can quickly
locate his desired toys. Now every time he wants to play with a certain toy,
he doesn’t have to go all the way to the toy store, find the toy, and bring it
home. He can just go to his drawer.

Here’s the part where I explain the analogy. Kevin represents a user of
Marketplace who wants to pull up certain data (toys) from the statistics
dashboard. The slow way to do it would be to ask the database (toy store) for
data which is represented by the time it takes to get to the toy store. Once
the data is handed over from the database, calculations has to be done to it
to get the desired metric of data we want, which is represented by the time it
takes Kevin to find his toy in the toy store. Instead, we do everything at
once beforehand (every day after school) by grabbing data from the database
(toy store), performing aggregations (finding the toy), and storing it into
the datastore (a certain drawer). Then whenever a user hits the page, time
isn’t wasted going to the database every time. Because the task of calculating
data beforehand is done asynchronously (in the background), users don’t
experience any slowness. Kevin is free to play with his toys on demand.

How It Really Works

The stack consists of Python in the backend (and Django as the web framework),
highcharts.js for the frontend, and ElasticSearch
acting as a key-value store (drawer). The inital codebase was copied over from
Mozilla’s add-ons site.
The backend’s duty is to pull objects from the database (toy store), perform aggregations, and store the aggregations into ElasticSearch, which
is all done asynchronously in the backend as a cron job (task set to
run at a regular interval, like an alarm clock). The frontend’s duty is to
query (ask for data) ElasticSearch for the already aggregated data and
display it with highcharts.js, a graphing library (set of tools to create
charts).

Let’s say we want to display the number of an app’s sales per day as a line
graph. Our desired metric is sales per day. We don’t want to calculate
data for already calculated days since that would be a waste of computing
power. So we grab the newest object from ElasticSearch, check its date and
time, and then we know where to start from.

First we have to write an indexer. It first pulls payment objects, or
purchase records, from the database that are later than the newest object
from ElasticSearch. We check a purchase record’s date, say ‘oh, we need to
calculate the sales per day for that date’, and count the number of
purchase records during that day. This counting functionality is provided
by the database (or rather Django’s ORM). We package some data together
into a ‘document’ and store it into ElasticSearch. The document needs more
than just the number of sales for that date we just calculated, we need to
be able to associate it with certain values (categorize the toy) to know
how to pull back out later. Values we need to attach to the count include
the app’s id (like a social security number) and the for which date the
count represents.

So the data is stored, the toy is in the drawer ready to be played with.
The data sits there until it is needed when a user visits an app’s
statistics page. This is where the frontend comes in. When a user visits
the statistics page, a template page (HTML/CSS) is loaded with no graphs or
tables yet. It just contains things like a header and links. Django, the
web framework, handles this part, and the rest is up to Javascript. Space
is left in the page for a graph and table, which Javascript populates.

The Javascript code makes a request to the web server with a standard URL
(like how you would request Facebook’s page with facebook.com). The URL has
parameters attached to it which tells the server what kind of data it
wants. The server catches the request, queries ElasticSearch for the data,
and responds with data (in the form of JSON or XML). This process is called
AJAX, where Javascript makes a request by itself client-side (from the
browser), a request separate from the loading of the initial page. So the
Javascript grabs the data from the server’s response, passes it to our
graphing library (highcharts) which creates pretty graphs for us. The data
is cached into local storage so reloading the page doesn’t make another
redundant request to the web server. And that’s it from a heuristic level,
we have pretty data!

Thoughts on Working on It

I’ll start with the basic workflow. My mentor or
web QA breaks down the whole statistics project into small bug tickets or
tasks to make it easy to manage. I choose a bug to work on, ordered by
priority, create a Git branch just for that bug, and then write code (and
tests for that code). I make a lot of local Git commits, which I squash
into one commit for the bug. I ask for a code review, usually from my
mentor, then I make revisions to my code based on comments from the code
review until the code is cleared for take-off. I merge the code into my
master branch and push it to the central Mozilla repository for the
Marketplace project.

I grokked the codebase pretty quickly, but there were things that gave me a
lot of trouble and have eaten up many of my hours. The main culprit was
ElasticSearch. How the hell does ElasticSearch work? Who knows. Data comes
in, data goes out,you can’t
explain that.
The documentation isn’t very helpful and there wasn’t documentation for
ElasticSearch for the Marketplace project. It’s very difficult to debug
since it’s a big black box. There was a difficult hurdle where
ElasticSearch stored my data as lower-case and tokenizes strings on
hyphens. I spent a solid day figuring out where I put in ‘Cat’, ask for
‘Cat’, and get nothing without any traceback. I had to keep putting in
different inputs to see what worked. Only recently did I learn about
setting up ElasticSearch mappings, which is similar to setting up Django
models. ElasticSearch runs analyzers and tokenizers apparently, and I had
to specify fields to not be analyzed. I think I developed a rash on my head
from so much head-scratching. 0.0?

Other difficulties…Working on fifteen different files at once, all being
touched across five different active branches can get somewhat confusing.
Debugging Django reverse URLs is a !@#$% when you got 50 URLs each with
five or six parameters, and the order at which you define them can mess
things up.

Takeaways. Experience on grokking and working on another codebase, some Git
skills, things to add to my inner coding style guide, yet another reminder how
useful tests are, Django reverse URLs skills, working in a different workflow,
and some pride being given the go-ahead to write something to be seen by tens
of millions.

Final Thoughts

After this, I’ll be working on porting Mozilla themes (previously called
Personas) to Marketplace. Upcoming blog posts in mind include talking about
what the open web is and Mozilla’s genuine mission to forward the web. A few
hackathons coming up, expect some webapps to be churned out and featured here.

Apart from that, life is awesome. Learning lots. Everyday, I get to do what I
enjoy all day surrounded by many cool people and go home to do more things I
enjoy. I live next to basketball/tennis courts, pool/hot tub, gym, and I’m
feeling healthy. Food (and Vitamin waters) are plentiful and free. Hell, I’m
pretty much saving up for retirement now since I got everything I need. Watched
the NBA Finals and play Halo on a giant 110-inch screen at the office. Former
boss just got hired to Mozilla, I wouldn’t be here if I hadn’t been working at
NET. Just got off a little
vacation with my family and
many more fun things to expect for the summer (woo, kayaking). And Silicon
Valley is nice and sunny, clear skies ahead.