3 Reasons Counting is the Hardest Thing in Data Science

Counting is hard. You might be surprised to hear me say that, but it's true. As a data scientist, I've done it all - everything from simple regression analysis all the way to coding Hadoop MapReduce jobs that process hundreds of billions of data points each month. And, with all that experience, I've found that counting often involves far more time and effort.

He makes it look so easy!

So why do I think counting is so hard? Let me give you three reasons.

1) Counting requires numerous, often arbitrary decisions

When I was working on NC Tower at the NC Department of Commerce, I needed to answer a bunch of really basic questions about North Carolina's higher education system. Questions like How many computer science students were there at UNC Charlotte last year? or How many graduates from North Carolina's public universities find employment within one year of graduation? Seems simple, right?

Unfortunately, answering those questions required defining a whole host of terms. Just for counting students, you have to decide...

Do part-time students count?

What about non-degree-seeking students?

Undergraduate only maybe?

Are we counting total unique individuals enrolled over the course of a year, or something else?

School year, fiscal year, or calendar year?

How do we count students that enrolled in multiple programs? Is it OK that enrollment for the university is lower than the sum of the enrollments in its consituent programs?

Of course, depending on the purpose of the data, there are right answers to many of these questions. For budgetary purposes, it probably makes sense to go with the fiscal year, for example. But somebody has to make those decisions, which means somebody has to take ownership of setting the business rule.

Worse still, there may often be different stakeholders for whom the "right answer" is different. Maybe fiscal year works for the state budget folks, but federal reporting is done based on the school year. Now you have to either find a compromise, convince a higher authority to take ownership of the situation and decide, or produce two counts. Complexity (technical or relational) increases accordingly.

As you can imagine, all of these issues become all the more challenging as the decisions become increasingly arbitrary. If Bobby Smith and Robert Smith share the same suspicious-looking SSN and are enrolled in 2 different universities, do they count as one unique individual, or two? What if it was Bobby and Barbara? There's probably no right answer here, but you still have to get buy-in on some business rule to handle those discrepancies.

2) Counting is easy to understand

This one isn't necessarily unique to counting, but it does apply to any sort of basic statistical research. The simpler a statistic or a model is to understand, the easier it is for stakeholders to articulate an opinion about.

Say you go to a PM or a middle manager in your company and tell them "We've just finished work on a machine learning model that can detect 90% of fraudulent orders with very few false positives." The response you're likely to get is something along the lines of "Great work! What would it take to put this into production?" It's very unlikely that they're going to dicker about the features or hyper-parameters of your model. The relative complexity involved means it's effectively a black box.

Not so with simpler models. Most people have a pretty good intuitive understanding of things like correlation or multiple regression, even if they don't know all the details about how they work. And everybody can understand counting. This means that instead of getting a quick "Good Job" in response to your work, you're much more likely to get a host of questions about how your research was done.

Of course, there are upsides to this - all of our work could probably benefit from the added scrutiny of stakeholder review. Nevertheless, it adds a significant amount of relational and political overhead to the actual analytics.

3) Counting is often high-stakes

Again, this isn't exclusive to counting... but it does greatly magnify the effects of reasons 1 and 2. Counting is very frequently high-stakes for people other than the analyst. Consider:

How many sales did Jones make last year? Her bonus likely depends on it.

How many people live in Austin, TX? Getting this wrong could alter re-districting and change the balance of political power.

How many students are there at UNC Greensboro? That'll be a big factor in budget appropriations next year.

You get the point.

Counting requires making a lot of (often arbitrary) decisions. It's simple enough that everybody can articulate an opinion about those decisions. And it's often important enough that folks will have a very strong incentive to form and articulate opinions. This is a recipe for fierce political battles involving stakeholders with entrenched, often conflicting, interests.

In the end, the counting itself may be unbelievably easy... a simple COUNT DISTINCT query with a carefully crafted WHERE clause is a pretty trivial task for any data scientist worth his salt. But making all the decisions necessary to actually start doing the counting is frequently a long, frustrating, relational-not-technical process.

Some Examples

Still not convinced that counting is hard? Consider these examples...

Florida, 2000 - Who got more votes in Florida in 2000 - Bush or Gore? The answer to that question depended on a host of relatively arbitrary decisions (Does a dimpled chad count as a vote?) which were simple enough for the average Joe to form an opinion about and which determined nothing less than who would hold the most powerful office in the world. Counting votes is hard.

Unemployment rate - Since the "Great Recession," politicians and commentators have spilled an unbelievable amount of ink about the "real unemployment rate," which seems like it should be as simple as a count of the number of people that want a job and don't have one divided by a count of the number of people that want a job. Counting employment is hard.

Active users on Twitter - How many users did Twitter have at the time of its IPO? Another seemingly simple question that required answering dozens of not-so-simple questions. Unique individuals or unique accounts (and how do you identify individuals)? What counts as a spam or bot account? What determines an active user? Literally billions of dollars hung in the balance, and probably nobody knew what the actual count was. Counting users is hard.

Conclusion

In the end, this post is not meant to be a rant or a complaint, but rather a statement of reality. In your career as a data scientist, you will not only need to manage significant technical complexity and communicate the findings of various analytical projects clearly and effectively. You will also need to manage the relationships and politics that surround those analytical projects - which is often no easy task, especially for counting.

Of course, once everybody's on the same page, you get to go write your code, crunch your numbers, and get your results. And that, to me, is worth navigating all the other stuff.

10 Responses

I’d totally agree that counting is the hardest part of my job. When apis timeout all we have are truncated counts. Most user / business events are point processes. We have durations: http://statwonk.com/parametric-survival.html Everywhere, counts! Most of our counts are well-defined, so we’re able to move into more general statistical machinery. This of course presents new challenges: windows? lag-effects? updating priors? decision rules?

I’d love to chat a bit more over video. It seems like we’re working in a similar area.

Hi,
Christopher is right about to point out the issue of missed counts and the difficult corrections.
Indeed, even when you know what to count, the actual task of counting can be difficult.
I had many headaches understanding probabilistic data structures for counting (see probabilistic counters, hyperloglog, count-min sketch…).
Those let you answer questions like the number of distinct elements in a set or quantile estimation with limited memory / streaming. Of course most of the time it is overkill, but those tools are getting mainstream (eg native integration into databases).
Thanks for the read!
++

couldn’t agree more. counting things right is the hardest part. because it’s important and because it requires making the right assumption about whom/what are we looking at, during what timeframe, under what condition etc. those are not just arbitrary decisions but critical decisions by someone who understands the overall context of the data. oftentimes those counts are used to drive business decisions. so getting those rights is super important.

It can be so hard to resist disengaging (read: doing the slow blink) when people repeatedly ask you why it’s so hard to count things but don’t really want to spend the time to understand the myriad decisions that have to take pace to do that kind of data work well. It’s important to lean into this work though, even though it can be such a minefield sometimes! Thank you for writing this. I’m sharing it!