June 23, 2020

3 Reasons Counting is the Hardest Thing in Data Science

Counting is hard. You might be surprised to hear me say that, but it's true. As a data scientist, I've done it all - everything from simple regression analysis all the way to coding Hadoop MapReduce jobs that process hundreds of billions of data points each month. And, with all that experience, I've found that counting often takes far more time and effort than the supposedly harder work.

1) Counting requires numerous, often arbitrary decisions

Consider questions like "How many computer science students were there at UNC Charlotte last year?" or "How many graduates from North Carolina's public universities find employment within one year of graduation?" They seem simple, right?

Unfortunately, answering those questions requires defining a whole host of terms. Just to count students, you have to decide:

Do part-time students count?
What about non-degree-seeking students?
Undergraduates only, maybe?
Are we counting total unique individuals enrolled over the course of a year, or something else?
School year, fiscal year, or calendar year?
How do we count students who enrolled in multiple programs? Is it OK that enrollment for the university is lower than the sum of the enrollments in its constituent programs?

Of course, depending on the purpose of the data, there are right answers to many of these questions. For budgetary purposes, it probably makes sense to go with the fiscal year, for example. But somebody has to make those decisions, which means somebody has to take ownership of setting the business rule.
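
To make the point concrete, here is a minimal Python sketch of how the same records produce different "student counts" depending on which rules you pick. Everything in it is hypothetical: the records, the field layout, and the function names are invented for illustration and are not drawn from any real university's data.

```python
# Hypothetical enrollment records, invented for illustration:
# (student_id, program, status, degree_seeking)
enrollments = [
    ("s1", "Computer Science", "full-time", True),
    ("s1", "Data Science", "full-time", True),       # one student, two programs
    ("s2", "Computer Science", "part-time", True),
    ("s3", "Computer Science", "full-time", False),  # non-degree-seeking
    ("s4", "Data Science", "part-time", True),
]


def count_students(records, include_part_time=True, include_non_degree=True):
    """Count unique students under one explicit set of business rules."""
    ids = {
        student_id
        for student_id, _program, status, degree_seeking in records
        if (include_part_time or status == "full-time")
        and (include_non_degree or degree_seeking)
    }
    return len(ids)


def count_by_program(records):
    """Count unique students per program; a student in two programs counts in each."""
    per_program = {}
    for student_id, program, _status, _degree_seeking in records:
        per_program.setdefault(program, set()).add(student_id)
    return {program: len(ids) for program, ids in per_program.items()}


everyone = count_students(enrollments)
full_time_degree = count_students(
    enrollments, include_part_time=False, include_non_degree=False
)
by_program = count_by_program(enrollments)

print(everyone)                  # 4
print(full_time_degree)          # 1
print(by_program)                # {'Computer Science': 3, 'Data Science': 2}
print(sum(by_program.values()))  # 5, higher than the university-wide count of 4
```

Five rows, and the answer is 4, 1, or 5 depending on who is asking. None of those numbers is wrong; each one just answers a slightly different question.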

2) Counting is easy to understand

This one isn't necessarily unique to counting, but it does apply to any sort of basic statistical research. The simpler a statistic or a model is to understand, the easier it is for stakeholders to articulate an opinion about it.

Say you go to a PM or a middle manager in your company and tell them "We've just finished work on a machine learning model that can detect 90% of fraudulent orders with very few false positives." The response you're likely to get is something along the lines of "Great work! What would it take to put this into production?" It's very unlikely that they're going to dicker about the features or hyper-parameters of your model. The relative complexity involved means it's effectively a black box.

Not so with simpler models. Most people have a pretty good intuitive understanding of things like correlation or multiple regression, even if they don't know all the details about how they work. And everybody can understand counting. This means that instead of getting a quick "Good Job" in response to your work, you're much more likely to get a host of questions about how your research was done.

Of course, there are upsides to this - all of our work could probably benefit from the added scrutiny of stakeholder review. Nevertheless, it adds a significant amount of relational and political overhead to the actual analytics.

3) Counting is often high-stakes

Again, this isn't exclusive to counting... but it does greatly magnify the effects of reasons 1 and 2. Counting is very frequently high-stakes for people other than the analyst. Consider:

How many sales did Jones make last year? Her bonus likely depends on it.
How many people live in Austin, TX? Getting this wrong could alter redistricting and change the balance of political power.

Counting requires making a lot of (often arbitrary) decisions. It's simple enough that everybody can articulate an opinion about those decisions. And it's often important enough that folks will have a very strong incentive to form and articulate opinions. This is a recipe for fierce political battles involving stakeholders with entrenched, often conflicting, interests.

In the end, the counting itself may be unbelievably easy... a simple COUNT DISTINCT query with a carefully crafted WHERE clause is a pretty trivial task for any data scientist worth their salt. But making all the decisions necessary to actually start doing the counting is frequently a long, frustrating process that is relational, not technical.
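
For what it's worth, here is roughly what that trivial part can look like, sketched with Python's built-in sqlite3 module. The table, its columns, the sample rows, and the July-to-June fiscal year are all assumptions made up for this example. Notice that every condition in the WHERE clause is one of the decisions somebody had to argue about first.

```python
import sqlite3

# In-memory database with a hypothetical enrollments table (invented schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enrollments "
    "(student_id TEXT, program TEXT, status TEXT, term_start TEXT)"
)
conn.executemany(
    "INSERT INTO enrollments VALUES (?, ?, ?, ?)",
    [
        ("s1", "Computer Science", "full-time", "2019-08-19"),
        ("s1", "Computer Science", "full-time", "2020-01-08"),  # same student, second term
        ("s2", "Computer Science", "part-time", "2019-08-19"),
        ("s3", "Computer Science", "full-time", "2018-08-20"),  # outside the chosen year
        ("s4", "Mathematics", "full-time", "2019-08-19"),       # different program
    ],
)

# Assumed business rule: a fiscal year running from July 1 to June 30.
year_start, year_end = "2019-07-01", "2020-07-01"

# Rule set A: all students in the program during the year.
(all_cs,) = conn.execute(
    """
    SELECT COUNT(DISTINCT student_id)
    FROM enrollments
    WHERE program = ?
      AND term_start >= ? AND term_start < ?
    """,
    ("Computer Science", year_start, year_end),
).fetchone()

# Rule set B: same question, but full-time students only.
(full_time_cs,) = conn.execute(
    """
    SELECT COUNT(DISTINCT student_id)
    FROM enrollments
    WHERE program = ?
      AND term_start >= ? AND term_start < ?
      AND status = 'full-time'
    """,
    ("Computer Science", year_start, year_end),
).fetchone()

print(all_cs)        # 2
print(full_time_cs)  # 1
conn.close()
```

The queries take a few minutes to write. Agreeing on which of them is "the" number of computer science students is the part that takes weeks.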