A lot of career experts recommend creating a portfolio of relevant work when trying to land your first job in a field. In my opinion, it is not necessary to build a formal portfolio in order to get a data science job. However, there is a lot of benefit to having an easily accessible list of relevant projects that you are capable of discussing in an interview. In this post, I will explain how to build a useful “portfolio” without adding a lot of extra work to your job search.

The Github Portfolio

In my opinion, it is a bit foolish to invest a lot of time into building a formal portfolio with pretty pictures and three ring binders. However, there is a better approach to take that will save you time while also building technical skills and preparing you for interviews. You could call it the Github Portfolio.

What I recommend is to take a few technical projects (preferably data science projects) and push code for them up to one or more publicly available repositories in Github. I talk about the benefits of having a Github page in my Creating a Great Data Science Resume post. Github is the most commonly used SaaS tool for version control, and simply having a Github page will set you apart. Also, using Github allows you to put the version control system git on your resume!

Here’s what you need to do:

Create a Github username if you don’t already have one.

Do a basic git and Github tutorial. (You don’t need to understand git to make this portfolio, but a little bit of knowledge helps)

Select 2 or 3 relevant projects that you are comfortable sharing. NOTE: It is not important that they be perfect or even “clean.” Just having them there is already a boost.

Write a basic README for each project. This does not need to be anything more than “This project was about _.”

Push the code up to Github. You can do this using command line git, using the Github desktop app, or even copying and pasting manually on the Github webpage.

Now, put the link to your Github page on the top of your resume. There, you officially have a portfolio.

Selecting Projects to Include

You might be having trouble selecting which projects to include on Github.

My recommendation is not to be overly choosy about what you put there. Putting it on Github with a README might help you figure out how to clean it up a bit. Remember: just having a Github page is a boost for your resume. “Do you have a Github username?” is a common question people ask in internship interviews, because having a Github page shows that you are the right kind of nerdy.

If you do not have any projects to post on Github, then you need to get cracking. Putting something on Github does not have to be a major ordeal. You can even adapt some code from a tutorial like my tutorial on machine learning with R and make it your own.

Preparing for Interviews

One of the best things about having a Github portfolio is that you can reference the projects there during interviews. The key is to practice a few basic talking points about the project and anticipate common questions. For example,

Interviewer: “Can you tell us about a project you did using machine learning?”
You: “Yes. I wanted to build a model to predict cryptocurrency price drops. I collected the data from a crypto price API using Python, and then I built a model using random forests using pandas and scikit-learn. You can check out the code on Github.”
Interviewer: “Oh, interesting. What made you choose a random forest?”
You: “Well, I love random forests because of their flexibility, so I tried a random forest first. I tested logistic regression and a multilayer perceptron, too, but random forest provided the best accuracy with decent performance.”
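
If you want a concrete picture of what the modeling piece of a project like this might look like, here is a minimal sketch in Python using pandas and scikit-learn. The CSV file, column names, and label are made up for illustration; the point is simply to show the fit-and-evaluate pattern you would be talking through in the interview.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical dataset collected from a crypto price API and saved to CSV
prices = pd.read_csv("crypto_prices.csv")

# Hypothetical features and a binary label marking whether the price dropped
features = prices[["volume", "pct_change_1d", "pct_change_7d"]]
label = prices["price_dropped"]

X_train, X_test, y_train, y_test = train_test_split(
    features, label, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))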

You will want to be ready to discuss a few key aspects of your project. Doing well at this part of the interview is a huge plus, since it shows some depth and good communication skills.

A short description of the project. What you did, why you did it. This should be short (like 10-30 seconds), because rambling is the single greatest interviewing sin.

You should be able to explain some of the key decisions you made. Why did you choose a specific model? How did you evaluate the performance of the model or the recommendations you made? How did you get the data? Did you have to clean the data? How would you deploy the model at large scale?

What would you do differently next time?

What was unexpectedly difficult about the project?

What did you discover that was surprising?

Don’t sweat it

It is not a requirement to have a portfolio of any sort. Don’t let not having a portfolio keep you from pushing into the job market. I definitely recommend creating a Github page, writing a few READMEs, and practicing a few talking points. But do not let building a fancy portfolio keep you from the challenging and important parts of job hunting. Actually knowing a bit of data science, building a great resume, actively networking, and practicing your interview skills will make a bigger impact for most people.

Google BigQuery is a “Data Warehouse as a Service” on Google Cloud Platform. Google Cloud Platform (GCP) is Google’s answer to Amazon Web Services (AWS). AWS is still the market leader in the cloud infrastructure market, but GCP continues to grow. Google BigQuery is a serverless data warehouse with a simple web interface, which makes it a great fit for beginning data scientists.

One problem I recently encountered was collecting the results from a RESTful API in JSON format and converting them into a CSV. If you are not familiar with RESTful APIs, the basic gist is this: you send a request payload, often a JSON document, to a web URL, and the API returns a response, typically another JSON document. I needed to repeatedly query the API to page through thousands of results, and then I needed to figure out how to combine those results and convert them to a CSV. One of the challenging things about this problem is that JSON documents can have complex nested structures, so they usually can’t be simply “converted” to a flat format like CSV. Enter BigQuery.
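
Here is a rough sketch of the collection step in Python with the requests library. The endpoint, payload fields, and paging scheme are invented for illustration (every API does paging a little differently); the idea is just to loop over pages and write each result as one line of newline-delimited JSON, which BigQuery can load directly (more on that below).

import json
import requests

# Hypothetical endpoint and request payload
url = "https://api.example.com/v1/search"
payload = {"query": "data science", "page": 1}

with open("results.json", "w") as out:
    while True:
        response = requests.post(url, json=payload)
        response.raise_for_status()
        results = response.json()["results"]  # field name is made up
        if not results:
            break
        for record in results:
            out.write(json.dumps(record) + "\n")  # one JSON document per line
        payload["page"] += 1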

Getting Started with BigQuery

Anyone can get started with a 300 dollar credit on Google Cloud Platform. This gives you a lot of wiggle room to mess around with the technology, since BigQuery charges 5 dollars per TB scanned. Just be careful messing around with the demo datasets, since some of them are pretty big and could eat up your 300 dollar credit. The first step is to sign up for a free trial of the Google Cloud Platform.

Creating a JSON Table in BigQuery

It is simple to create a table in BigQuery out of a collection of JSON documents. (It’s just as easy to create tables from CSV or Avro files.) BigQuery requires you to submit the JSON documents in a format called newline-delimited JSON. A newline-delimited JSON file is a text file in which each line is its own JSON document. That is, each JSON document is separated by a “\n” newline character.
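
For example, a regular list of JSON documents can be rewritten as newline-delimited JSON with a few lines of Python; the documents here are made up:

import json

documents = [
    {"repo_name": "example/repo-one", "language": [{"name": "R", "bytes": 1024}]},
    {"repo_name": "example/repo-two", "language": [{"name": "Python", "bytes": 2048}]},
]

# Write one JSON document per line (newline-delimited JSON)
with open("documents.json", "w") as f:
    for doc in documents:
        f.write(json.dumps(doc) + "\n")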

To create a table in BigQuery, you can start with a newline-delimited JSON file and ask BigQuery to autodetect the schema based on the file contents. Creating the table is easy in the web UI for BigQuery. Simply go to “Create New Table” and upload the newline-delimited JSON file containing the data. Be sure to select “autodetect schema”. BigQuery will automatically pick up any nested or complex JSON structure and gracefully put it in a table that can be queried with SQL.

Once you have loaded the data, take a look at the data in the BigQuery table. You will see that a single row sometimes contains multiple lines (notice that the row numbers are the same). These multi-line rows are the way that BigQuery represents nested and repeated structures in a flat tabular format.

Querying the Data Using Standard SQL

Now that you have the JSON data in BigQuery, you can use SQL to create “flat” data that can be exported to CSV. For this example, we will use the Github languages public dataset. This dataset describes the programming languages used in each Github repository.

As you can see, each row of the dataset contains a “RECORD” object. Each RECORD object represents a single JSON document. BigQuery Standard SQL has an expressive but slightly tricky syntax for querying these RECORD objects. Google gives a nice explanation here. (Note: BigQuery also supports a SQL dialect called “Legacy SQL,” which is a nonstandard dialect unique to BigQuery.)

Here’s an example query that lets us answer a pretty interesting question: “Which repositories contain the most R code?”

#standardSQL
SELECT repo_name,
  COALESCE((SELECT language_unnested.bytes
            FROM UNNEST(language) AS language_unnested
            WHERE language_unnested.name = "R"), 0) AS r_bytes
FROM `bigquery-public-data`.github_repos.languages
ORDER BY r_bytes DESC

As you can see, the public repo with the most bytes of R code is hmorzaria/speciesdistributions. If you actually look at the code in this repo, you will see that there are a few huge files of R code (50MB or more!). The 7th result is hxfeng/R-3.1.2, which must just be a copy of the entire source code of R itself!

Let’s talk through this query, because it contains a great deal of what you need to know to handle arrays and nested structures in BigQuery. The key part of the query is the UNNEST function in the subquery. UNNEST takes an array and splits it into multiple rows, possibly with multiple columns, one row for each entry of the array. The UNNEST function allows you to treat each language array as its own little table. So the statement

SELECT language_unnested.bytes FROM UNNEST(language) AS language_unnested

is actually selecting the column bytes from the table UNNEST(language), which we have renamed language_unnested. The clause

WHERE language_unnested.name = "R"

is filtering the language_unnested table down to only the rows where the language name is “R”.

The subquery in the SELECT statement of the overall query will return the number of bytes that use the R language for each entry in the table. However, the result of this subquery will be NULL for each row where R does not appear in the language array. Therefore, we use a COALESCE statement to fill in the value 0 if the subquery returns NULL.

Here’s one more example of a query using UNNEST. Suppose we wanted to answer the question: “Which repos use R and Python?”

#standardSQL
SELECT repo_name
FROM `bigquery-public-data`.github_repos.languages
WHERE (SELECT language_unnested.bytes
       FROM UNNEST(language) AS language_unnested
       WHERE language_unnested.name = "R") > 0
  AND (SELECT language_unnested.bytes
       FROM UNNEST(language) AS language_unnested
       WHERE language_unnested.name = "Python") > 0

Converting the Data to a CSV

BigQuery makes it easy to convert data to a CSV or a Google Sheet. Most query results can be exported by clicking “Download as CSV” or “Save to Google Sheets.”

If your query result contains nested structures like arrays, BigQuery typically will not allow you to export the data. In those cases, you must use the UNNEST function to “flatten” the data before exporting to a CSV.

Automating Everything

How could we automate this kind of process? The best way is to use the Google Cloud client libraries. I have used the Python client library in the past with some success.
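
As a sketch of what that might look like with the google-cloud-bigquery Python package (the dataset, table, and file names below are placeholders, and you need application credentials configured for your project; the CSV step also requires pandas):

from google.cloud import bigquery

client = bigquery.Client()  # picks up your project and credentials from the environment

# Load a newline-delimited JSON file into a table, letting BigQuery autodetect the schema
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.autodetect = True

with open("results.json", "rb") as f:
    load_job = client.load_table_from_file(f, "my_dataset.api_results", job_config=job_config)
load_job.result()  # wait for the load to finish

# Run a flattening query and write the result to a local CSV
query = """
#standardSQL
SELECT repo_name,
  COALESCE((SELECT l.bytes FROM UNNEST(language) AS l WHERE l.name = "R"), 0) AS r_bytes
FROM `bigquery-public-data`.github_repos.languages
ORDER BY r_bytes DESC
"""
client.query(query).to_dataframe().to_csv("r_bytes_by_repo.csv", index=False)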

Other Ways to Approach the JSON to CSV Problem

There are countless ways to handle JSON documents, and here are a couple extra:
* A flatten package written in Python. This works but is not as clever with field names as BigQuery.
* The jq command line utility can be used to process JSON documents.

About a year ago, I said that the best place to start for most aspiring data scientists is to learn R and Hadoop. Hadoop is still useful and is a very marketable skill to have, but Spark is quickly emerging as the new Big Data framework of choice. In this post, I will talk through some new developments that make it a great career choice to spend some time learning Spark.

Don’t lots of people still use Hadoop?

Absolutely. Hadoop is still extremely popular and useful. You should still understand the basics of Hadoop. But in the last couple years, Spark has become probably the trendiest (and most lucrative) big data technology in the world. Add in the fact that Spark is way faster than Hadoop for many things and that you can write Spark programs in Python (and R, but less completely), and it is a no brainer to focus on Spark.

What is Spark?

Spark is a big data computation framework like Hadoop. In fact, Spark can use HDFS (the Hadoop Distributed Filesystem) as its filesystem. The main reason that Spark is so much faster than Hadoop is that Hadoop repeatedly reads and writes the intermediate steps of MapReduce jobs to disk, whereas Spark caches most of its computations in memory. The reads and writes to disk are slow, which is why even simple Hive queries can take minutes or more to complete.

Spark’s main abstraction is the Resilient Distributed Dataset (RDD). As indicated by its name, an RDD is a dataset that is distributed across the Spark compute nodes. When an RDD is created, it is stored in memory, which allows you to query or transform it repeatedly without writing to or reading from disk. This in-memory caching also makes Spark ideal for training machine-learning models, since training ML models typically involves iterative computations on a single dataset (e.g. repeatedly adjusting weights in an artificial neural network via gradient descent).
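
As a tiny taste of what this looks like in practice, here is a minimal PySpark sketch (assuming you have Spark installed locally). The cache() call is what tells Spark to keep the RDD in memory, so the repeated computations don’t have to re-read anything from disk.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# An RDD distributed across the available cores, cached in memory
numbers = sc.parallelize(range(1, 1_000_001)).cache()

# Repeated computations reuse the in-memory data
total = numbers.reduce(lambda a, b: a + b)
sum_of_squares = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)

print(total, sum_of_squares)
sc.stop()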

What about Hive or “SQL on Hadoop”?

One of the coolest things about Spark is that it has built-in data connectors to many different kinds of data sources, including Hive. But Spark takes it one step further with SparkSQL, which can be much faster than Hive.
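
If you already have Hive tables, a sketch like this shows how little code it takes to point SparkSQL at them (this assumes a Hive-enabled Spark build that can reach your metastore; the table name is made up):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sparksql-hive-demo")
    .enableHiveSupport()  # requires Spark built with Hive support and a reachable metastore
    .getOrCreate()
)

# Query an existing Hive table with SparkSQL; "web_logs" is a hypothetical table
top_pages = spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM web_logs
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()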

How to run Spark

Option A: Install it locally

Like Hadoop, Spark is meant to be run on a cluster of machines. Nevertheless, you can run Spark on a single machine for learning and testing purposes. There are several steps involved, but I was able to get Spark up and running on my Windows laptop in an hour or so. If trickier dependencies don’t install cleanly, you can always try the containerization software Docker. It allows you to use a completely fresh and standard Linux instance on your Windows or Mac PC (and you could take the same thing you build locally and deploy it easily to the cloud).

Option B: Use AWS EMR (or another cloud computing service)

I really like this option, because it allows you to get practice with another extremely marketable technology, and you don’t have to install a bunch of stuff on your computer. Amazon Web Services (AWS) Elastic MapReduce (EMR) is a web service that allows you to spin up your own Hadoop and Spark clusters using a point and click web interface. This is the way that I started. There are a few steps you have to do first, including signing up for an AWS account and setting up SSH keys to connect to EMR. However, spending a few hours getting started on AWS will help you over and over again if you are trying to learn data science. And you can put EMR on your resume, too. Best of all, there is a limited free tier for first time users.

Resources for learning Spark

I always recommend learning by doing as much as possible. With Spark, there are some good online tutorials to help you get started. But first, it might help to spend an hour learning about how Spark works, so that the tutorials make a little more sense. Another tip: start with things you know, and try to learn one thing at a time. For example, if you already know Hive, try using Spark to query Hive. If you already know Python, use PySpark instead of trying to also learn Scala.

Tutorials and Examples

Most people learn best by example, so I have included a few good tutorials with plenty of examples.
– Getting Started with Apache Spark: Includes a few nice, simple examples of using Spark (requires an email address to access)
– Quick Start: A very light quick start that you can use after you install Spark on your computer
– Spark Examples: A collection of several useful examples to get the basics of Spark

Conclusion

Spark is the trendiest technology out there for data scientists. And compared to Hadoop, it is not too hard to get started. I highly recommend that you give it a try.

Action steps

Figure out how to run Spark, either locally or in AWS

Spend one hour (but not more than that) reading about Spark’s architecture and its execution model. Don’t worry if you don’t understand everything right away.

Pick one (and only one) Spark tutorial to try out, and follow every step. This is not the time to get too creative. Just stay focused and stick to one thing at a time.

Try a simple project of your own. Keep it simple to start with, so you can boost your confidence and motivation.

Why Python?

For most people, I recommend getting started with R, because the tools in R for exploratory data analysis and visualization are easier and more comprehensive than the tools in Python. However, if you have a computer science background or if you want to jump on the fast track to high-performance machine-learning, then you might want to start with Python. Python is an awesome programming language, because it is easy to write, readable, and well-documented, and it is very fast if you do it right. It also has extensive libraries for scientific computing, stats, and machine learning.

The Python scientific stack

Python has a comprehensive and integrated scientific computing stack that has an incredible combination of performance, ease of use, and depth. It is made up of several libraries and utilities, including:

numpy: Fast and easy array computations and manipulations. Includes “broadcasting” and “fancy” indexing, which give Python arrays some of the simple syntax of R vectors (there is a short sketch of this after the list).

Jupyter (previously IPython Notebook): A browser-based notebook for scientific computing in Python and other languages.

Matplotlib: A nice plotting library. Many people use Seaborn, which is built on top of Matplotlib, for a more user-friendly interface.
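
Here is the short numpy sketch promised above. The numbers are arbitrary; the point is that broadcasting and boolean (“fancy”) indexing give you R-style vectorized syntax:

import numpy as np

temps_f = np.array([68.0, 71.5, 73.2, 59.9])

# Broadcasting: the scalar expression applies element-wise, like an R vector
temps_c = (temps_f - 32) * 5 / 9

# Boolean ("fancy") indexing: select elements with a mask in one expression
warm_days = temps_c[temps_c > 20]
print(temps_c)
print(warm_days)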

A couple of quick tutorials to get started

Kaggle has a great tutorial series on getting started with Python. It takes you through the basics of loading data, manipulating data, transforming data, and building a random forest machine learning model. It uses the Titanic survival dataset to walk you through all of these skills in a practical case study. It is probably best to start at part I, although it is OK to skip it if you are impatient.

The Kaggle tutorials should get you started and build your confidence. After these, I recommend the scikit-learn quick start tutorial. This gives a bit of the bigger picture on scikit-learn and the concepts of machine learning. The scikit-learn documentation taught me so much, and I highly recommend it. The scikit-learn website also contains lots of code examples, although the examples can seem a bit complex at first.
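
The core pattern the quick start teaches fits in a few lines. Here is a sketch using one of scikit-learn’s built-in toy datasets:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a built-in toy dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Every scikit-learn estimator follows the same fit / predict / score pattern
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))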

Where to go from here

If you really sit down and work through these tutorials, you will be ready to try some more examples on your own. I recommend checking out a couple more straightforward Kaggle competitions like Give Me Some Credit. Machine learning is a craft, and it takes practice to get good at it, but the payoff can be huge.

“I haven’t heard back from any companies”

I hear a familiar story from a lot of aspiring data scientists: “I have sent out my resume to 25 companies, and I haven’t heard back from any of them! I have pretty good skills, and I think I have a pretty good resume. I don’t know what’s going on!”

Your resume probably sucks

My immediate conclusion after hearing your story: your resume probably sucks. If you are not getting any responses from any companies, and your skills are a reasonable match for the job description, then it almost certainly means that you are getting sabotaged by a bad resume.

What is the purpose of a resume?

The only real purpose of a resume is to get job interviews. That’s it. The purpose of a resume is not to:

list all of your job experience

list all of your technical skills

show off your great educational background

Your resume should explicitly include only the exact items that will help you get a job interview.

What makes a good resume?

A good resume tells a story that is targeted to the job description and company. And furthermore, someone reading the resume should be able to understand that story in less than 20 seconds. If you keep these principles in mind, it is actually not too hard to write a decent resume.

Crafting your story

The first thing you need to do when creating your resume is to come up with your story. This part is a little tricky, but it is extremely important and even a little fun. Your story should be simple and compelling, and it should be a good fit for the job description. A good strategy for coming up with a great resume story is to think of two or three things that are interesting about yourself and tie them together. You should start out with a simple story in plain English (something I learned in Ramit Sethi’s excellent Dream Job course).

For example, a story that works for me is, “I am an experienced data scientist, I have a great math background, and I am good at explaining complicated stuff.” If I were applying to a more software development-focused data science job, a possible story for me could be, “I have experience building really fast and accurate machine-learning models in Python. I also understand big data technology like Hadoop.” For a more business-focused role, a story could be, “I have experience using stats and machine-learning to find useful insights in data. I also have experience presenting those insights with dashboards and automated reports, and I am good at public speaking.” When you come up with your story, don’t be afraid to try some different ones on for size. All of the three stories I just wrote are true about me. It’s all about positioning yourself the right way for the company.

What skills and technologies should I list?

People often ask me what skills and technologies they should have on their resume. There are really three main questions here.

How proficient do I have to be before I put a skill or technology on my resume?

Which things should I emphasize?

Which things should I not include?

Question 1: What am I allowed to include?

My general rule of thumb is that you should not put something on your resume unless you have actually used it. Just having read about it does not count. Generally, you don’t have to have used it in a massive scale production environment, but you should have at least used it in a personal project.

Question 2: What should I emphasize?

In order to decide what to emphasize, you have two great sources of information. One is the job description itself. If the job description is all about R, you should obviously emphasize R. Another, more subtle, source is the collection of skills that current employees list on LinkedIn. If someone is part of your network or has a public profile, you can see their LinkedIn profile (if you can’t see their profile, it might be worth getting a free trial for LinkedIn premium). If all of the team members have 30 endorsements for Hive, then they probably use Hive at work. You should definitely list Hive if you know it.

Question 3: What should I not include?

Because your resume is there to tell a targeted story in order to get an interview, you really should not have any skills or technologies listed that do not fit with that story. For example, if your story is all about being a “PhD in Computer Science with deep understanding of neural networks and the ability to explain technical topics,” you probably should not include your experience with WordPress. Including general skills like HTML and CSS is probably good, but you probably do not need to list that you are an expert in Knockout.JS and elastiCSS. This advice is doubly true for non-technical skills like “customer service” or “phone direct sales.” Including things like that actually makes the rest of your resume look worse, because it emphasizes that you have been focused on a lot of things other than data science, and — worse — that you do not really understand what the team is looking for. If you want to include something like that to add color to your resume, you should add it in the “Additional Info” section at the end of the resume, not in the “Skills and Technologies” section.

What if I have no experience?

If you have no working experience as a data scientist, then you have to figure out how to signal that you can do the job anyway. There are three main ways to do this: independent projects, education, and competence triggers.

Independent projects

If you don’t have any experience as a data scientist, then you absolutely have to do independent projects. Luckily, it is very easy to get started. The simplest way to get started is do a Kaggle competition. Kaggle is a competition site for data science problems, and there are lots of great problems with clean datasets. I wrote a step-by-step tutorial for trying your first competition using R. I recommend working through a couple of Kaggle tutorials and posting your code on Github. Posting your code is extremely important. In fact, having a Github repository posted online is a powerful signal that you are a competent data scientist (it is a competence trigger, which we will discuss in a moment).

Kaggle is the simplest way to complete independent projects, but there are many other ways. There are three parts to completing an independent data science project:

Coming up with an idea

Acquiring the data

Analyzing the data and/or building a model

Kaggle is great, because steps 1 and 2 are completed for you. But a huge amount of data science is exactly those parts, so Kaggle can’t fully prepare you for a job as a data scientist. I will help you now with steps 1 and 2 by giving you a list of a few ideas for independent data science projects. I encourage you to steal these.

Use Latent Semantic Analysis to extract topics from tweets. Pull the data using the Twitter API. (There is a small code sketch of this idea after the list.)

Use a bag of words model to cluster the top questions on /r/AskReddit. Pull the data using the Reddit API.
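
To make the first idea more concrete, here is a minimal sketch of the analysis step using scikit-learn (LSA is essentially TF-IDF followed by a truncated SVD). The example documents below stand in for tweets you would pull from the Twitter API:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-ins for tweets pulled from the Twitter API
tweets = [
    "learning pandas and scikit-learn for my first model",
    "random forests are my favorite machine learning model",
    "great hike this weekend, the weather was perfect",
    "camping and hiking plans for the long weekend",
]

# Latent Semantic Analysis: TF-IDF vectors reduced to a small number of topics
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(tweets)
lsa = TruncatedSVD(n_components=2, random_state=0)
topics = lsa.fit_transform(X)

# Show the top words for each extracted topic
terms = tfidf.get_feature_names_out()
for i, component in enumerate(lsa.components_):
    top_terms = [terms[j] for j in component.argsort()[::-1][:3]]
    print(f"Topic {i}: {', '.join(top_terms)}")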

Education

Another way to prove your ability is through your educational background. If you have a Master’s or a PhD in a relevant field, you should absolutely list relevant coursework and brag about your thesis. Make sure that you put your thesis work in the context of data science as much as possible. Be creative! If you really can’t think of any way that your thesis is relevant to data science, then you probably should not make a big deal out of it on your resume.

Competence triggers and social proof

Competence triggers are usually discussed in the context of interviews, but they play a particularly important role in data science resumes. Competence triggers are behaviors or attributes of a person that “trigger” others to see them as competent. In an interview, a typical competence trigger is having a strong, firm handshake or being appropriately dressed. There are a few key competence triggers that will really boost your resume:

A Github page

A Kaggle profile

A StackExchange or Quora profile

A technical blog

Why do these boost your resume so much? The reason is that data scientists use these tools to share their own work and find answers to questions. If you use these tools, then you are signaling to data scientists that you are one of them, even if you haven’t ever worked as a data scientist. Even better, a good reputation on sites like StackExchange or Quora gives you social proof.

Don’t worry about doing all of these at once. I absolutely think you should have a Github page, and you should post code from your independent projects there. If you have performed decently well in a couple of Kaggle competitions, then your Kaggle profile will be impressive, too. Answering questions on StackExchange or Quora can be a bit of a distraction from your real work, so it should not be a priority. And starting your own blog is great, but probably not necessary. As an alternative to a blog, you can focus on writing good documentation in a README in your Github repositories.

Resume rules of thumb

As you write your resume, there are a few basic rules of thumb to keep in mind.

Keep it to one side of one page: Most recruiters only look at a resume for a few seconds. They should be able to see that you are a good candidate immediately, without turning the page.

Use simple formatting: Don’t do anything too fancy. It should not be hard to parse what your resume says.

Use appropriate industry lingo, but otherwise keep it simple: Again, this goes to readability.

Don’t use weird file types: PDF is good, but you should probably also attach a DOCX file. You basically should not use any other file formats, because your resume is useless if people can’t open it.

Do I need to include a cover letter?

A lot of job applications say that a cover letter is optional. Typically, you should include a cover letter anyway. Make sure the cover letter is not too generic. Actually explain why you would be a good fit for the role and the company. Do a little research, and be positive. Remember the rule about resumes, though: don’t make the cover letter too long.

If you are just “casually” sending your resume to a current employee, it is okay to skip the cover letter. This is another example of how networking is critical. The best way to get an interview is to be recommended by a current employee. If you can do this, then your resume will float to the top of the pile automatically.

My annotated resume

If you want to see what my resume looks like, here’s a link to it in Google Drive. It’s not perfect, but it doesn’t have to be. Keep that in mind. Remember, your resume is there to get you an interview. It is not your magnum opus. Happy hunting!

PS: Be sure to sign up for my email list if you want more content like this, and please leave any questions in the Comments.