How to Get Started in Data Science

A lot of people ask me: how do I become a data scientist? I think the short answer is: as with any technical role, it isn’t necessarily easy or quick, but if you’re smart, committed and willing to invest in learning and experimentation, then of course you can do it.

In a previous post, I described my view on “What is a data scientist?”: it’s a hybrid role that combines the “applied scientist” with the “data engineer”. Many developers, statisticians, analysts and IT professionals have some partial background and are looking to make the transition into data science.

So how does one go about that? Your approach will likely depend on your previous experience. Below are some perspectives for different backgrounds, from developers to business analysts.

Java Developers

If you’re a Java developer, you are familiar with software engineering principles and thrive on crafting software systems that perform complex tasks. Data science is all about building “data products”, essentially software systems that are based on data and algorithms.

A good first step is to understand the various algorithms in machine learning: which algorithms exist, which problems they solve and how they are implemented. It is also useful to learn how to use a modeling tool like R or Matlab. Libraries like WEKA, Vowpal Wabbit, and OpenNLP provide well-tested implementations of many common algorithms. If you’re not already familiar with Hadoop, learning MapReduce, Pig, Hive and Mahout will be valuable.
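As one concrete illustration of "how they are implemented", here is a minimal sketch of k-nearest-neighbors classification in plain Python. The function name, the toy data, and the choice of Python (rather than Java, R or Matlab) are assumptions made for this example; for real work, a well-tested library is the better choice.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; features are numeric tuples.
    """
    # Order the training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy data, made up for illustration: two well-separated groups.
train = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((2.0, 1.5), "small"),
    ((8.0, 8.0), "large"),
    ((8.5, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

print(knn_predict(train, (1.2, 1.4)))
print(knn_predict(train, (8.7, 8.8)))
```

Writing an algorithm out once like this makes it easier to judge what a library is doing for you, for example how it handles ties, scaling of features, or large training sets.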

Python Developers

If you’re a Python developer, you are familiar with software development and scripting, and may have already used some Python libraries that are often used in data science such as NumPy and SciPy.

To deal with large datasets, learn more about Hadoop and its integration with Python via streaming.
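To give a feel for the streaming model, here is a minimal word-count sketch. With Hadoop Streaming, the mapper and reducer are ordinary programs that read lines and write tab-separated key/value records, and the framework sorts the mapper output by key before the reducer sees it. The function names and the local `run_locally` helper are hypothetical, written only so the example can run in one process without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "word<TAB>1" record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts for each word; the input must already be sorted by key."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in a single process (no Hadoop involved)."""
    return list(reducer(sorted(mapper(lines))))


for record in run_locally(["big data big plans", "data science"]):
    print(record)
```

In an actual job, the mapper and reducer would typically be two separate scripts that read from standard input and print to standard output, and Hadoop would take care of the sorting step between them.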

Statisticians and applied scientists

If you’re coming from a statistics or machine-learning background, it’s likely you’ve already been using tools like R, Matlab or SAS for years to perform regression analysis, clustering analysis, classification or similar machine learning tasks.

R, Matlab and SAS are amazing tools for statistical analysis and visualization, with mature implementations for many machine learning algorithms.

However, these tools are typically used for data exploration and model development, and rarely used in isolation to build production-grade data products. In most cases, when building end-to-end data products you need to mix in other software components written in languages like Java or Python, and integrate with data platforms like Hadoop.

Naturally, becoming familiar with one or more modern programming languages such as Python or Java is your first step. I found it very helpful to work closely with experienced data engineers to better understand the mindset and tools they use to build production-quality data products.
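For someone used to fitting a regression with a single call in R or SAS, it can help to see the same calculation written out in a general-purpose language. The following is a minimal sketch of simple ordinary least squares in plain Python; the function name and the data are made up for illustration, and in practice you would more likely reach for a numerical library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = (sum of cross-deviations) / (sum of squared x-deviations)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Illustrative data lying exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

intercept, slope = fit_line(xs, ys)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```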

Business analysts

If your background is SQL, you have been using data for many years already and understand full well how to use data to gain business insights. Using Hive, which gives you access to large datasets on Hadoop with familiar SQL primitives, is likely to be an easy first step for you into the world of big data.
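As a rough illustration of the kind of query that carries over, here is a small aggregate written in standard SQL and run through Python's built-in `sqlite3` module. SQLite is only a stand-in so the example can run anywhere; it is not Hive, and HiveQL has its own dialect differences. The table, column names and data are all invented for this sketch.

```python
import sqlite3

# In-memory database used purely as a stand-in for a real data platform.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 30.0), ("west", 20.0), ("north", 50.0)],
)

# A familiar aggregate: order count and revenue per region, largest first.
query = """
    SELECT region, COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
    ORDER BY revenue DESC
"""
rows = conn.execute(query).fetchall()
for region, n_orders, revenue in rows:
    print(region, n_orders, revenue)
```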

Data science often entails developing data products that utilize machine learning and statistics at a level that SQL cannot describe well or implement efficiently. Therefore, the next important step towards data science is to understand these types of algorithms (such as recommendation engines, decision trees, NLP) at a deeper theoretical level, and become familiar with current implementations in tools such as Mahout, WEKA, or Python’s scikit-learn.
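To make the idea of a decision tree less abstract, here is a minimal sketch of its simplest form, a single split on one numeric feature (often called a decision stump). The function name and the data are hypothetical; real decision tree implementations search over many features and split recursively.

```python
def fit_stump(xs, labels):
    """Find the threshold on one numeric feature that misclassifies the fewest points.

    The stump predicts True when x > threshold.
    Returns (best_threshold, number_of_errors).
    """
    best_threshold, best_errors = None, len(xs) + 1
    values = sorted(set(xs))
    # Candidate thresholds are the midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        errors = sum((x > threshold) != label for x, label in zip(xs, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


# Invented example: site visits per month vs. whether the customer purchased.
visits = [1, 2, 3, 7, 8, 9]
purchased = [False, False, False, True, True, True]

threshold, errors = fit_stump(visits, purchased)
print(threshold, errors)
```

The search over candidate thresholds is the part that is awkward to express in SQL, which is one reason these algorithms usually live in a programming language or a dedicated library.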

Hadoop developers

If you’re a Hadoop developer, you already know the complexities of large datasets and cluster computing. You are probably also familiar with Pig, Hive, and HBase and experienced in Java.

A good first step is to gain a deep understanding of machine learning and statistics, and of how these algorithms can be implemented efficiently for large datasets. A good place to start is Mahout, which implements many of these algorithms over Hadoop.

Another area to look into is “data cleanup”. Many algorithms assume a certain basic structure to the data before modeling begins. Unfortunately, real-life data is quite “dirty”, and making it ready for modeling tends to take up the bulk of the work in data science. Hadoop is often a tool of choice for large-scale data cleanup and pre-processing, prior to modeling.
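To show what "dirty" data tends to look like in practice, here is a minimal cleanup sketch in plain Python. The record layout (a name and an age, both as raw strings), the validity rules and the function name are assumptions made for this example; at scale, the same per-record logic would typically run inside a distributed job rather than a single loop.

```python
def clean_records(rows):
    """Normalize raw (name, age) string pairs and drop rows that cannot be repaired."""
    cleaned = []
    seen = set()
    for name, age in rows:
        # Collapse repeated whitespace and unify the casing of the name.
        name = " ".join(name.split()).title()
        try:
            age_value = int(age.strip())
        except ValueError:
            continue  # non-numeric or empty age: skip the row
        if not name or not 0 <= age_value <= 120:
            continue  # missing name or implausible age
        key = (name, age_value)
        if key in seen:
            continue  # duplicate once the fields are normalized
        seen.add(key)
        cleaned.append(key)
    return cleaned


# Invented raw input showing common problems.
raw = [
    ("  alice   SMITH ", "34"),
    ("Bob Jones", "n/a"),
    ("alice smith", " 34"),
    ("", "27"),
    ("Carol Wu", "290"),
    ("dave lee", "41"),
]

print(clean_records(raw))
```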

Final thoughts

The road to data science is not a walk in the park. You have to learn a lot of new disciplines and programming languages and, most importantly, gain real-world experience. This takes time, effort and a personal investment. But what you find at the end of the road is quite rewarding.

Hi there.. couple of questions.. I am pursuing MBA in E commerce, but my bachelors is in management only.. thus, no technical background.. .I want to become a data scientist.. what will be ur suggestion to me… plz suggest fields that can be a good career option for me. and also high paying. ..

I would consider learning some programming languages: python, Scala, and Java. There may be some graduate or post-graduate courses offered by universities where you reside. In fact, UC Berkeley offers a masters program in Data Science.

Hi. I am a MBA and I am consumer market researcher by profession. I am really interested in data analysis and data science. I dont have a computer science background. How do you suggest I should start?

I am an Oracle Certified DBA (OCP) in 11g R2 version. I also have knowledge of Linux and the C language. Now I really want to become a data scientist. What should I do? What technologies do I have to learn from now on? Please tell me the first step of this learning process.

I would start downloading our Sandbox and play around with some basic tutorials that deal with ETL processing: extracting, transforming and loading data. With DB background and SQL knowledge you will be able to pick up Apache Hive easily. Second, start playing with Spark on YARN and Spark SQL. For DBA, it’s a good transition. But biggest challenge is dealing with NoSQL concepts of Hadoop and the repositories.

I’m doing my master’s in Mechanical Engineering and quite familiar with Matlab, SQL, Statistics and SAS. What should be the other thing for me to learn to become a data scientist. In other sense how should I utilize my knowledge to get a tag of data scientist.

Can you please let me know what role a system admin / infrastructure person can play in the Hadoop ecosystem? I am interested to know beyond just setting up a Hadoop / HDFS / HBase cluster and using Sqoop for data transfers.

Hey, thanks for the info, but I have often seen people coming from diverse backgrounds like social science, management and economics opting for data science. What are their chances of doing well in this field?

Chetan, of course there is hope for them. If you think about it, those folks have strong data engineering skills which are paramount to the role of data scientist. Further, if a person is skilled in BI then I would contend that they ought to have a solid foundation in basic statistics and math.
