Disclaimer:

These are my personal views and are meant for informational purposes only. Please verify the information with professional help or official references before acting on anything provided in this blog.

Big Data

This was asked on Reddit: Any advice for moving into data science from business intelligence?

Here’s my answer:

I come from a “Business Intelligence” background and currently work as a Sr. Data Scientist. I found that you need two things to transition into data science:

Data Culture: You need a company whose data culture is such that managers and executives ask big questions that need a data science approach to solve. If your end-consumers are still asking a bunch of “what” questions, then your company might NOT be ready for data science. But if your CEO comes to you and says “hey, I got the customer list with the info I asked for, but can you help me understand which of these customers might churn next quarter?”, then you have a data science problem at hand. So, try to find companies that have this culture.

Skills: You also need to upgrade your skills to be able to solve data science problems. BI is focused heavily on technology and automation, so you may need to unlearn a few things. For example, automation is not always important, since you might work on problems where a model is needed to make a prediction just a couple of times; trying to automate wouldn’t be optimal in that case. Also, BI relies heavily on tools, but in data science you’ll need deeper domain knowledge and a problem-solving approach along with technical skills.

Also, I personally moved from BI (as a consultant) -> Analytics (as Analytics Manager) -> Data science (Sr. Data Scientist), and this path has been super helpful for me. I recommend transitioning into Analytics first and then eventually breaking into data science.

Reading about Big Data usually starts with the definition by Gartner analyst Doug Laney (the 3 V’s). IBM often uses 4 dimensions by adding veracity. Some people use 6 or even up to 12 dimensions. I am wondering: what is the most frequently used definition?

You can add as many V’s as you want, but it all ties back to the notion that you need bigger and better tools and processes to support your data analysis needs as you grow.

Example:

#1: Social Media Data is BIG! It’s text (variety), it’s much bigger in size (volume), and it’s all coming in very fast (velocity), AND the business wants to analyze customer sentiment on social. OK, so we have a 3 V’s problem and need a solution to support this. Maybe Hadoop is the answer. Maybe not. But you do have a “Big Data” problem.

#2: Your Customer Database is broken. It doesn’t have the right addresses. Google and Alphabet are showing up as two separate companies when they should be just one. The employee counts are outdated, and all of these problems are confusing your business users, who don’t TRUST the data anymore. You have a veracity problem, and so you have a BIG Data problem.
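As a rough illustration of example #2, here is a minimal Python sketch of the kind of checks that can surface a veracity problem. The records, field names, alias table, and one-year staleness threshold are all made up for illustration; a real customer database would need its own rules.

```python
from datetime import date

# Hypothetical customer records; field names and values are invented for illustration.
customers = [
    {"id": 1, "company": "Google", "address": "1600 Amphitheatre Pkwy",
     "headcount_as_of": date(2015, 1, 1)},
    {"id": 2, "company": "Alphabet", "address": "",
     "headcount_as_of": date(2024, 6, 1)},
    {"id": 3, "company": "Acme Corp", "address": "12 Main St",
     "headcount_as_of": date(2024, 3, 1)},
]

# Hypothetical alias table: names that should be treated as the same company.
ALIASES = {"google": "alphabet"}


def canonical_name(name):
    """Normalize a company name and resolve known aliases."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def find_issues(records, today, max_age_days=365):
    """Return a list of (record id, issue description) tuples."""
    issues = []
    seen = {}  # canonical company name -> id of the first record with that name
    for rec in records:
        if not rec["address"].strip():
            issues.append((rec["id"], "missing address"))
        if (today - rec["headcount_as_of"]).days > max_age_days:
            issues.append((rec["id"], "stale employee count"))
        name = canonical_name(rec["company"])
        if name in seen:
            issues.append((rec["id"], f"possible duplicate of record {seen[name]}"))
        else:
            seen[name] = rec["id"]
    return issues


issues = find_issues(customers, today=date(2025, 1, 1))
for rec_id, issue in issues:
    print(rec_id, issue)
```

A report like this only finds the problems. Deciding who owns the fixes, and how they get made, is still up to people and process.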

Everyone has a BIG DATA problem. It just depends on what their “V’s” are, AND in most cases “tools” alone will not solve the issue. You need PEOPLE and PROCESS to solve that. Here’s my ranking of the ingredients that are key to solving BIG Data problems: 1) PEOPLE 2) PROCESS 3) PLATFORM (tools).

Someone on Quora asked how to learn and explore the field of Big Data algorithms. They also mentioned having some background in Python already and wanted ideas for a good project to work on. With that context, here is my reply:

There are two broad roles available in the Data/Big Data world:

Engineering-oriented: Data engineers, Data Warehousing specialists, Big Data engineers, Business Intelligence engineers. All of these roles are focused on building the data pipeline, using code and tools to get the data into some centralized location.

Business-oriented: Data Analyst, Data scientist — all of these roles involve using data (from those centralized sources) and helping business leaders make better decisions. *

*Smaller companies (or startups) tend to have roles where small teams (or just one person) do it all, so the distinction is not that apparent.

Now, given your background in Python and programming, you might be a great fit for “Data engineer” roles. I would recommend learning about Apache Spark (since you can use Python code) and starting to build data pipelines. As you work with it a little more, you can then learn how to build and deploy end-to-end machine learning projects with Python and Apache Spark. If you acquire these skills and keep learning, then I am sure you will end up with a good project.

Yes, you can: a computer science (CS) degree is not a must-have to work as a Data Analyst. In fact, a lot of people come from a non-CS background and succeed in this role!

Let’s look at the pros and cons of having a computer science (CS) degree and this should help you evaluate where you fall:

Pros of having a CS-degree:

If the data analyst position requires a CS degree, then you qualify! Fortunately, this requirement is not that common. Usually the posting asks for a bachelor’s in CS, business administration, or a related field, so as long as you have a bachelor’s for the positions that require one, you should be fine.

You might already have the basic tech skills that are needed for data analysis jobs, and the CS degree might be used to validate that.

You can pick up new tech concepts and tools fast(er): with a CS background, it’s easier to pick up new concepts and tools, and you need to do that continuously to stay relevant.

Cons of having a CS-degree:

You may not have enough business problem-solving experience and/or may lack depth in business knowledge, so if you have a degree in business then you come out ahead, especially if your background aligns with the role. For example, if you focused on marketing in your bachelor’s and the role is focused on marketing analytics, then you might have an edge.

I have a CS degree and followed it up with a master’s from a “business school”, so this is just based on my experience, but some CS students (without real-world experience) are inclined to focus on “automation” and the “bleeding edge” instead of focusing on what the problem needs. A lot of data analysis doesn’t need to be automated or shouldn’t be automated, and not every company needs <<insert the latest tech trend here: big data, deep learning>>, but CS students tend to reach for those anyway because that’s what they feel most comfortable with. So while that doesn’t stop them from getting the job, it can impede their growth as a data analyst within the org.

Conclusion:

So, as you can see, even if you don’t have a CS degree, you can still find roles that align with your other skills. In fact, you might be able to come out ahead if you can prove that you have the basic quantitative and tech skills needed to get the job done.

Question: What data are data scientists at startups actually analyzing? How is it collected?
(Coming from a web analytics background, I’m wondering what data data scientists at IT companies are actually analyzing. Is it server-side or client-side? Is it collected internally or using some external tool?)

Answer:

Part 1: What are startups analyzing?

It depends on the Business Model and the Stage that they are at.

Business Models: Marketplace, Ecom, SaaS, Media, etc.

Stage: Early, Mid, Late

So let’s say you have a SaaS model and you’re in mid-stage (post product-market fit); then you would tend to be focused on things like engagement, churn, etc. Ideally, you should focus on measuring what aligns best with your strategy (instead of capturing everything!).

Let’s take another example. Let’s say you are a Marketplace in late-stage. So you would tend to be focused more on the “money” and so you can measure things like: transactions, commissions, etc…
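To make the two examples concrete, here is a small Python sketch of how such metrics might be computed. The numbers and function names are invented for illustration, and real definitions of churn or commission vary from company to company.

```python
def churn_rate(customers_at_start, customers_lost):
    """Share of customers at the start of a period who left during it."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


def total_commission(transaction_amounts, commission_pct):
    """Commission a marketplace earns at a flat percentage of each transaction."""
    return sum(transaction_amounts) * commission_pct / 100


# SaaS, mid-stage: 200 customers at the start of the quarter, 10 cancelled.
saas_churn = churn_rate(200, 10)

# Marketplace, late-stage: three transactions at a flat 10% commission.
marketplace_commission = total_commission([100, 250, 50], 10)

print(f"Quarterly churn: {saas_churn:.1%}")
print(f"Commission earned: {marketplace_commission:.2f}")
```

The calculations are simple on purpose. The harder part is agreeing on which few metrics match the current stage and strategy.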

I recommend reading the book “Lean Analytics”, as it goes much deeper and is a great starting point for anyone who wants to understand how analytics could help a startup.

Part 2: How is it collected?

Now, this also depends on your product. Assuming you’re a tech startup, you would have a web app and/or desktop app and/or mobile app. The “how” part is then determined by your delivery approach plus your measurement needs. It would invariably be a combination of your transactions data source, your web/mobile events stack (like Google Analytics, another vendor, or a custom one), and your finance data source, among others.

What is the title these days for a person that assures data quality?
(I need to hire a person to make sure my data is as good as it can be. They need to inspect the data for issues, create logic for how it can be found and fixed, and finally, court the project through application development for a robust solution to stop it from occurring in the first place.)

Answer:

Quality of the data shouldn’t be the responsibility of just one person. Ideally, you want all members of the team (and the broader business community) to care about it and own some part of it. But I like the idea of one person owning the “co-ordination” of how this gets done. It might not be a full-time gig in a small org, but I can see this as a full-time role in bigger orgs and enterprises. Some titles:

It depends on how the Analytics & Data Science team is structured in an org, but usually you will see the following trend:

“Big Data Developer” usually rolls up under the Engineering org. They are responsible for building the data pipelines that feed data to the “data platform”. They use things like Hadoop, Spark, custom code, ETL tools, etc. to build data pipelines, and they are responsible for developing and maintaining the data platform. To succeed in this role you need to have deep technical chops. Other titles for this role: Data engineer, Software engineer, etc.

“Data Analyst” usually rolls up under some “business” team like strategy, operations, growth, product, marketing, sales, etc. Data Analysts are the link between the “data platform” and the “business”: they are the primary consumers of the “data platform” (sometimes you might see shared ownership of the data platform between engineering and analytics). They help solve business problems using data, which they pull from the “data platform”. They need a good balance of business and technical skills to be successful in this role.