Welcome to the Coursera specialization From Data to Insights with Google Cloud Platform, brought to you by the Google Cloud team. I’m Evan Jones (a data enthusiast) and I’m going to be your guide.
The first course in this specialization is Exploring and Preparing your Data with BigQuery. Here we will look at the common challenges faced by data analysts and how to solve them with the big data tools on Google Cloud Platform. You’ll pick up some SQL along the way and become very familiar with using BigQuery and Cloud Dataprep to analyze and transform your datasets.
This course should take about one week to complete, 5-7 total hours of work. By the end of this course, you’ll be able to query and draw insight from millions of records in our BigQuery public datasets. You’ll learn how to assess the quality of your datasets and develop an automated data cleansing pipeline that will output to BigQuery. Lastly, you’ll get to practice writing and troubleshooting SQL on a real Google Analytics e-commerce dataset to drive marketing insights.
>>> By enrolling in this specialization you agree to the Qwiklabs Terms of Service as set out in the FAQ and located at: https://qwiklabs.com/terms_of_service <<<

Instructor

Google Cloud Training

Transcript

Again, I'm Evan Jones, one of the course designers for Data to Insights. I've been teaching data analysis for over ten years. My life at Google before developing courses like this one was in Google Finance, where we built pretty fun machine learning models to predict and optimize expenses here at Google. And I'm thrilled that Google has made its internal petabyte-scale data analysis tools available to the world through the Google Cloud Platform. It's that platform, and its big data tools, that we're going to be using to explore data and derive insights.
Let's take a quick look at the agenda of topics we're going to cover. First, we'll start with the basics of Google Cloud Platform, and why letting the cloud handle your compute and storage needs enables massive scalability. After the fundamentals of cloud, we'll go into the big data tools available to you as a data analyst. We're going to focus on BigQuery, Google Data Studio, and Cloud Dataprep to start. Third is where we'll start coding in SQL, or the Structured Query Language. With BigQuery, we're going to go through a few demos and interactive labs. And this is where we introduce our real-world financial dataset of over 130 million US charities. Fourth, we'll explore the BigQuery pricing model for query processing and data storage. Next up is a discussion of dirty data and how we can clean it up with SQL or a new UI tool. Sixth and seventh on this list are how you can create and store your own datasets in BigQuery, from your queries or from external data sources. We'll close here with an introduction to visualization and how to create reports from your data within Data Studio.
Moving on to some of the more advanced topics we're going to cover, you're going to look at joining and unioning your datasets together in BigQuery, as well as some of the more advanced statistical functions and user-defined functions you may not have seen before.
Afterwards is one of my favorite sections, on how repeated fields and arrays work within BigQuery's nested data structures. Again, we'll close with some more advanced data visualization tips within Data Studio. In the last section, we'll walk through one of the most popular topics, which is troubleshooting query and dataset performance. Next after that is a new topic for many, which is running queries and exploring datasets outside of the BigQuery UI and inside of Cloud Notebooks, for more collaborative and advanced analysis like machine learning. Lastly, we'll close this specialization with the critical topic of data security and access control before we wrap up.
This class is targeted primarily at data analysts who query their business datasets using SQL to create reports and dashboards. To start off, I like to lead with this: if you're going to fall asleep at any point during this course, wake back up when you see the golden key or the person falling into the washing machine. Those icons mark key message slides or key pitfalls that you want to avoid. The slides with those images on them are the ones you don't want to miss. But hopefully I keep you entertained and awake throughout the rest of this specialization, so let's dive right in.
First and foremost, we're going to take a look at the challenges that are faced by data analysts, so let's just jump right into those. If you've run any queries in your life, the first one is predictable. When I was learning database processing in school, my instructors and teachers would say, hey, run this one query and then you can go to the bathroom or do whatever you need to do while your query is running, all right? So in the upper left you see that queries are taking too long, potentially stalling your analysis. Or what if I wanted to combine 15 data sources and query all of them, and I wanted to do that within a reasonable amount of time?
A lot of times, that was hard to do. And in the middle, say it wasn't a querying problem; it was actually an infrastructure problem. I'm a data analyst, a data scientist. I'm not a hardware purchasing department. I don't know about buying servers and keeping redundant hard drives in case one of them fails. And I would have to maintain the network that connects where my data is stored to where my queries are processed. I don't want to deal with any of that kind of infrastructure, right? But I have to, as a necessary evil, if I want to be a big data shop, right? Or say you're using something like Hadoop on your own clusters: you're managing your clusters, and you've made this amazing capital outlay to get this awesome processing cluster. But now you're punished by your own success, because now your clusters can't scale. Your organization says, you did such an amazing job, now we have ten times the data. Can your clusters handle it, or do you need to buy more and keep expanding your ever-growing infrastructure empire? And again, the question is how much of the business of building infrastructure you want to be in, given the opportunity cost: time spent on infrastructure is time not spent writing those amazing queries or those machine learning models to get those insights.
Lastly, there's a pretty apparent one, which is just cost. Maybe you have a ton of data, a torrent of data, but you literally can't afford to process all of it. Either it's prohibitive performance-wise on your machines and you can only query a few columns, or it's the monetary cost: processing and storing that much data is just prohibitive. And last but not least, if you have no central place where you can dump all this data, like a staging area or an analytics warehouse, that could be a problem as well.
And as we go into these, they are a lot of the same exact problems that Google had growing up, right? Faced with a torrent of search indexing data and ads volume data, these were the problems that Google, as a big data organization, had to solve. And we'll see exactly how they did that, and how the technology has evolved over time to create a lot of these cool Google Cloud Platform tools, the big data tools like BigQuery.
But what I would love to start this course out with is a general look at where we're going to get to by the end of this specialization. So if it doesn't make sense now, don't worry about it, but I want to show you some of the coolness factor first. So, flipping over to BigQuery. BigQuery is web browser based; in this particular case, we're using the web UI. You can also access it through APIs and through the terminal if you want. And this query that I have, again, a query is just a question that we're asking our database. In this particular case, don't be overwhelmed by all the things that are going on on the screen. All you've got to know is that there are 146 million records of New York City taxi cab trips, which is a little over four gigabytes of data. And the sophisticated query here, which was written by the Google BigQuery Public Datasets team, who publish these prebuilt queries and datasets (we'll get into that a little bit more in a minute), pretty much says: let me know how fast New York City cabs have gone in 2015, over 146 million rows. So, with all the terrors of recording a live demo, let's see how fast BigQuery can process this much data. So it's operating over 18 gigabytes, performing the calculation, and for each of the 24 hours of the day, you can see the speed, in miles per hour, that the cabs went. So you can see that earlier in the morning, right around 5 AM, they got 21 miles an hour.
And 21 miles an hour seems to be the fastest, because in New York City everyone is only starting to wake up, and then right around early afternoon things start to get clogged, I'm imagining because of traffic, right? But as you can see, the meta point is that 4.35 gigabytes was processed in just over 6 seconds. And for this query that we wrote and ran, we, being the data analyst community, like you and me, didn't have to worry about creating all the underlying infrastructure and building the server boxes. We just wanted to write cool queries on fun datasets and get those results.
Now that's BigQuery. We're going to go over BigQuery as one of the big data tools, one of the many that we're going to cover in this specialization. Don't worry about the different buttons and the things that you see here; we're going to walk through every single thing. And even the query itself, if it looks like Greek to you now, give it a few hours into this course. It'll start making a lot more sense as we go through SQL, which is the coding language for asking questions of databases, in the exercises that are part of the third module of this course. Anyway, that's the end state: processing massive amounts of data at scale and at speed. Let's talk a little bit about the history of the Google Cloud and how that led to processing really fast queries.
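As a footnote to the taxi demo, here is a minimal Python sketch of the kind of calculation that query performs: group trips by pickup hour and average their speed in miles per hour. It is not the SQL shown on screen, and the exact logic of the published query may differ. It assumes each trip record carries a pickup time, a dropoff time, and a distance in miles. The sample trips, the `TRIPS` name, and the `average_speed_by_hour` helper are hypothetical, made up purely for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample trips: (pickup time, dropoff time, distance in miles).
# These are invented values, not rows from the real taxi dataset.
TRIPS = [
    (datetime(2015, 1, 1, 5, 0), datetime(2015, 1, 1, 5, 30), 10.5),
    (datetime(2015, 1, 1, 5, 10), datetime(2015, 1, 1, 5, 40), 10.5),
    (datetime(2015, 1, 1, 14, 0), datetime(2015, 1, 1, 15, 0), 9.0),
]


def average_speed_by_hour(trips):
    """Return {pickup hour: mean of per-trip speeds in mph}."""
    speeds = defaultdict(list)
    for pickup, dropoff, miles in trips:
        hours = (dropoff - pickup).total_seconds() / 3600
        if hours <= 0:
            # Skip trips with zero or negative duration (dirty data).
            continue
        speeds[pickup.hour].append(miles / hours)
    # Average the per-trip speeds within each pickup hour, ordered by hour.
    return {hour: sum(v) / len(v) for hour, v in sorted(speeds.items())}


for hour, mph in average_speed_by_hour(TRIPS).items():
    print(f"{hour:02d}:00  {mph:.1f} mph")
```

On a laptop this loop is fine for a handful of rows. The point of the demo is that BigQuery runs an aggregation of this general shape over 146 million rows in seconds, without you provisioning any servers.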