Instructor:

Google Cloud Training

Transcript

So we've mentioned it a few times, but let's talk a little bit about the main tool in your toolkit when it comes to diagnosing performance, and that is the Query Explanation map. Let's talk a little bit about what you're going to see here. It's a series of colored horizontal bars, and these represent the amount of time that each of your workers is spending in each of the key stages. Now, because the data in your query is massively parallel and distributed, you could have multiple stages as your workers communicate with each other to get the job done. You can see how many rows are being processed in each stage all the way on the right. The top shows the input, and the lower right ultimately shows your output, which is the result of your query.

Diving deeper into what each of those stages represents, let's take a look at some of the key differences. First up are the yellow bars, which represent the amount of time your workers spend waiting to actually receive work. One of the potential pitfalls here is that you have a highly skewed data set and one of your workers is being bottlenecked, while the rest just sit around idly, waiting for the work to be performed on that one worker before the rest of the work can be distributed out. Notice that you have two different yellow bars. The first represents the average wait time across all workers. The lighter yellow bar stacked on top of it represents the max that a single worker has waited, the longest time for it to be scheduled. You're looking for clear differences between the max and the overall average across all of your workers. Second is read time, which represents the amount of time it takes your workers to actually read your data from disk. Third, in orange, is the amount of computational work that is done as part of your query. Fourth is the amount of time it takes to write your data to disk, in the case where you're creating something like a permanent table.

Now let's walk through a live example. As we've mentioned before, having a high cardinality, a high number of distinct Wikipedia editors, makes this particular case not the smartest one for a large GROUP BY. So hopping over into BigQuery, let's take a look at that query we looked at earlier, with those big GROUP BY statements on the Wikipedia contributors, and diagnose what is potentially wrong, performance-wise, with the SQL that you see in front of you. First and foremost, let's understand a little bit about what the query is doing. We have two queries: the outer query grabs something called log_edits, which is a calculated field that we're going to go into, and a count of individual Wikipedia contributors as contributors. So broadly speaking, we want to get the total number of contributors, and then a bucket number, based on a log function we'll go into, for how many pages they've actually edited on Wikipedia. The inner query, this subquery, is where we're going to dedicate most of our time. So here we can actually see the meat.
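To make that concrete, here is a minimal sketch of the kind of query being described, assuming the public `bigquery-public-data.samples.wikipedia` table and a `contributor_id` column; the exact table and field names shown on screen in the lesson may differ:

```sql
-- A sketch of the bucketing query described above (table and column
-- names are assumptions, not necessarily the ones shown on screen).
SELECT
  log_edits,
  COUNT(contributor_id) AS contributors
FROM (
  -- Inner query: one row per contributor, bucketed by the order of
  -- magnitude of their edit count via LOG10.
  SELECT
    contributor_id,
    CAST(FLOOR(LOG10(COUNT(*))) AS INT64) AS log_edits
  FROM `bigquery-public-data.samples.wikipedia`
  WHERE contributor_id IS NOT NULL
  GROUP BY contributor_id
)
GROUP BY log_edits
ORDER BY log_edits;
```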
We're pulling every single Wikipedia contributor ID, and then we're performing this LOG10 function on the number of actual edits they've made in Wikipedia. The reason we're doing that in log10 is so we can bucket the number of edits they've made. For example, if you had 100 edits or 600 edits, you pretty much just count the zeros, since it's log base 10, and that will put you into bucket number two. If you don't believe me, I have this handy-dandy table here, so you can see that 100 edits is going to put us in bucket number two. If you are prolific on Wikipedia and you have a million or more edits, that is going to put you into bucket six. So you can pretty much guess how this distribution is going to look: maybe a lot of people in the one-to-a-hundred-edit range, and then very few, if any at all, with over one million edits to Wikipedia pages. But let's let the query decide that.

We're going to go ahead and run the entire query, and then we're going to break it apart into sub-components so we can diagnose potential issues. Now, it ran really fast because I cached this earlier. So what I'm going to do is click Show Options and uncheck Use Cached Results. Especially when you're diagnosing performance issues, make sure that you're not running against cache, or else everything seems like it runs fast, and that would be a really short lecture. So we're going to run again. You can also run just a selection; I have this table down here and I don't want to run all of that as well, so I'm just selecting and running lines 1 through 10, which is our main query. About six seconds, so it's still pretty fast by BigQuery standards. And here we have the amazing insight that bucket six, a million-plus edits, is actually 24 people within Wikipedia: individual contributors that have written or edited Wikipedia articles more than a million times. That's an incredible insight in and of itself. But let's go ahead and diagnose what we could potentially do better with this query.

So, in the explanation map you see four stages. Each stage boundary represents a shuffling of information and data between those workers. If you have many, many stages, one, two, three, four, five, six stages, that represents a lot of communication, or passing of data, between the workers. It normally indicates there's a lot going on behind the scenes that we could potentially optimize. And you can read it almost like a heat map of the colors here. You see very little time needed to wait and pull in the data, but a large amount of time spent actually computing the GROUP BYs and the counts we're performing for each individual Wikipedia contributor. You see a lot of time in compute, maybe some time in wait, waiting for some workers to finish their computations, and then a little bit of writing, actually. That is probably writing to temporary tables, for your subquery to be accessible by your outer query. But it might be a little more clear from the results here: you have just a very simple table, 8 records, and then we have this beautiful chart that explains that 24 people have made a million or more edits. The real area that we can optimize is the inner query. So I'm just going to run just that inner query, and then here you can see a lot of the performance improvements that we can make.
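For reference, running just the inner portion on its own would look something like this (the same assumed table and column names as in the sketch above):

```sql
-- The inner subquery in isolation, to profile the expensive GROUP BY.
SELECT
  contributor_id,
  CAST(FLOOR(LOG10(COUNT(*))) AS INT64) AS log_edits
FROM `bigquery-public-data.samples.wikipedia`
WHERE contributor_id IS NOT NULL
GROUP BY contributor_id;
```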
So you can see it's already been about six seconds of running, and what it's actually doing is pulling every single contributor ID to Wikipedia. Keep in mind you have upwards of 1,000 to 2,000 individual workers, or slots, inside of BigQuery, and what each of them likes is to have all of the data for one particular ID. Now, how many IDs are we giving them? This is over 3.4 million individual contributors to Wikipedia. Performing an operation over 3.4 million rows is not a problem; where you get into an issue is when you have 3.4 million contributors and you also want to do a calculation, like a GROUP BY. A large GROUP BY over 3.4 million different IDs is computationally very, very intensive, just because there's going to be a lot of communication between the workers when they try to aggregate all those results back together. Imagine you have contributor number one, and we're looking at the articles they've edited, and they've edited only three. You perform a log base ten of three, and you get the bucket. And this has to be done 3.4 million times, across a very, very wide and potentially shallow data set, because the majority of the folks on Wikipedia have probably written or edited fewer than, say, a thousand articles. What we've actually done is forced BigQuery to execute this subquery first. And although we've bucketed the log edits, we've done nothing to reduce the number of individual contributor IDs. So all of this is being processed, and you can see the large computational time here in the query explanation map.

Now let's take a look at what's actually going on behind the scenes. You can drill into each of these stages and see what's happening. Here, we're reading data from the Wikipedia dataset, not a problem. Then we're doing the aggregation: the COUNT(*), grouping by each individual contributor ID. Once that's done on each worker, and don't forget, each worker likes to have all the records for each ID that you're looking at, comes the communication between the workers: this is your shuffle step. The worker writes that information to temporary storage, where it's going to be shuffled later to other workers. Next, in stage two, instead of reading from the original data source, you're reading from the shuffle output of stage one. And now we need to perform an aggregation: we need to sum up all the different counts, for all the different articles, for every single one of the 3.4 million contributors. This is where you're going to be spending a lot of your time. Computing the actual log across that many records is actually not that bad; it's the GROUP BY across that many results that's very computationally intensive. And last but not least, you're reading from the second shuffle step and producing the final output. So just generally, when you see two, three, four, or five stages, that represents a lot of data communication happening between each of your different workers.

What we can also see here is that there's not too much of a bottleneck between one particular worker and the others. One bottleneck you could potentially imagine is, if one author did have a million, or two million, or three million edits, maybe all of those records landing on one particular worker would just bottleneck it, and you would expect to see a very large, light orange bar out here.
But there may be a little bit of that happening, because the max time is just a little bit over the average time; in my opinion, though, that's not too bad. So you might be asking: what are some of the things I can do to prevent this in the future? Or, if I did actually want those results, how can I get them in a more efficient way? Unfortunately, if you want to do a count or a GROUP BY over every single individual Wikipedia contributor and then bucket the results, you need to look at every single row and every single article, and for Wikipedia you've got 3.4 million distinct authors. So there's not much you can do; you'll eventually be doing a GROUP BY down the road. But there are two things I want you to consider doing up front to help with your performance.

Number one is any pre-filtering of the data that you can do before you run this query. For example, we're querying the entire corpus of Wikipedia right here. Instead, you could query just the last three years of edits, or maybe a single year, like last year or the year before. Any type of filtering you can do up front, whether by date or potentially by region, will prevent your entire data set from being thrown against all those different workers.

And the second: you saw this processed in just over 20 seconds, just for this particular subquery. If you're going to perform this operation anyway, and you know that by BigQuery standards it'll still execute no problem, it'll just run whatever you give it, be sure to store the results of what you're querying into a permanent table, or add it on as a new column or field in that Wikipedia data set, if you have edit access to it. That's because if folks down the road, or other team members in your organization, want to do the same type of analysis you just did here, you'll be making it available to them through that permanent table or a cached table, and they won't have to re-perform the exact same computationally heavy query you just ran. So, to review those two things you can do: filter as much as possible beforehand, and store the results of any heavy queries like this into a permanent table, because storage is cheap and other folks, including yourself, can query those tables down the road.

So, back into the slides, we'll just review exactly what we covered in the query explanation map. The main takeaway here is that grouping by something that has many distinct values equates to many forced shuffles, or communication, between each of your different workers. They're also called shards, or slots; there are many different names for the workers that actually perform the work in your query. And that's because, again, the key thing to remember is that each individual worker, and sometimes you can have upwards of 2,000 workers working on a query at once, is selfish: it likes to have a partition of your data where it holds all of the records for its IDs in its own local memory, so it can perform those aggregation operations, or any other kind of query operations, across that subset of your data very efficiently. A Wikipedia example of this would be one worker holding all of the records for a single contributor ID and performing a count of all their edits over time. And again, here you can think of 3.4 million different teeny little partitions of Wikipedia data against a limit of 2,000 workers, which means you're going to have to perform these shuffle operations across multiple periods in time, called stages, to ultimately process through all of them and re-aggregate them at the end.
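Before moving on, here is a sketch tying together the two remedies just discussed, pre-filtering and saving to a permanent table; it assumes the same table as before plus a Unix-seconds timestamp column, which is an assumption about the schema rather than something confirmed on screen:

```sql
-- Remedy 1: pre-filter to recent years before the heavy GROUP BY.
-- Remedy 2: persist the expensive result into a permanent table so
-- teammates can query it later without re-running the aggregation.
CREATE TABLE `your_project.your_dataset.contributor_edit_buckets` AS
SELECT
  contributor_id,
  CAST(FLOOR(LOG10(COUNT(*))) AS INT64) AS log_edits
FROM `bigquery-public-data.samples.wikipedia`
WHERE contributor_id IS NOT NULL
  -- 'timestamp' as Unix seconds is an assumed schema detail.
  AND timestamp >= UNIX_SECONDS(TIMESTAMP('2015-01-01'))
GROUP BY contributor_id;
```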
So that's just a quick review. Now, two more concepts that we want to cover. One, for performance reasons, we touched on this when we talked about performance remedies for the Wikipedia example, by saying, hey, if you wanted to look at just one particular year, like last year, you could save that data in a separate permanent table. Similar to that is the concept of table sharding. What table sharding does is this: if you have a large data table, but you're only using a certain section of it over and over again, you shouldn't have to pay the performance cost of scanning every single record.

All the way on the left, what you see is how it traditionally worked in relational database systems: you'd have a database administrator, or DBA, who would take a look at your database and then manually partition, or pre-allocate, that space in very, very defined partitions on disk. As you can imagine, you pretty much had to be a hardware master to get those individual partitions set up and performing well alongside your queries. Second, later on down the road, came the ability to manually shard your own tables. Typically, as you saw with the GSOD weather examples covered in joins and unions, you'll see a year suffix on the end of the table name. There, you saw gsod1929 all the way up to today's table. It results in many, many different tables, but you're not scanning all of those tables' records if you just want to query last year's temperature data. Now, one caveat I'll introduce for that middle section, the sharding of tables, is that there is what I'll call a transaction cost, or overhead cost, that you incur by creating all those different tables. That's because there's metadata associated with each of those tables that BigQuery itself needs to store.

So the last and best way, if you come across this issue and you really just want to query the most recent data, like 2016, 2017, and 2018, is to set up a special version of your table that we'll call a date-partitioned table. It can look just like a normal table, except that it has what we'll call a partition column in it, which is just a date/time field, as you see all the way on the right. The example I'll go with is this: you have a massive amount of historical customer orders going back, say, 30 years, but you're really only interested in querying things from the last 3 years. You want to partition off the table itself, so that you're not scanning over everything, as is typically the case when you're doing a SELECT query. What you can do is pre-define your table when you create it. The caveat here is that the table has to be set up this way, as a partitioned table, when you actually create it. Then you can specify a date column as an automatic partitioning field, where you can partition by day, by month, or by year. So let's take a look at what a query against a partitioned table would look like, going with that third, best-case example where you have a single table, so you don't have to worry about doing any union wildcards or anything like that; the sketch below shows both setting one up and querying it.
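Here is a hedged sketch of both approaches: a wildcard query over the manually sharded GSOD tables (column names assumed), and a hypothetical date-partitioned orders table, created and then queried so that only the relevant partitions are scanned:

```sql
-- Manually sharded tables: one table per year, queried with a wildcard
-- so only the matching years are scanned (GSOD column names assumed).
SELECT stn, AVG(temp) AS avg_temp
FROM `bigquery-public-data.noaa_gsod.gsod*`
WHERE _TABLE_SUFFIX BETWEEN '2016' AND '2018'
GROUP BY stn;

-- Date-partitioned table: a hypothetical orders table partitioned on
-- its date column at creation time.
CREATE TABLE `your_project.your_dataset.orders`
PARTITION BY order_date AS
SELECT order_id, order_date, amount
FROM `your_project.your_dataset.orders_raw`;

-- Filtering on the partition column prunes partitions before the scan.
SELECT order_id, amount
FROM `your_project.your_dataset.orders`
WHERE order_date BETWEEN DATE '2016-01-01' AND DATE '2018-12-31';

-- For an ingestion-time partitioned table (also hypothetical here), the
-- reserved _PARTITIONTIME pseudo-column plays the same role, as
-- described next.
SELECT order_id, amount
FROM `your_project.your_dataset.orders_by_ingestion`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01')
                         AND TIMESTAMP('2018-12-31');
```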
With a partition column set up there on each of those different days, the way you query it is through one of the reserved fields, which you see there in the WHERE clause on the left, called _PARTITIONTIME. What we can do there is set up a BETWEEN, with a timestamp of such and such and then a different endpoint. What that does is this: the query will not scan through every single record you have in your massive table; it will only look at those partitions. So this is one of those unique exceptions to what we talked about earlier, that a WHERE clause still means scanning every single record in the table. When you have a table automatically set up with date partitions, the WHERE clause can actually filter out individual partitions before finding the results of your SELECT statement. It's a pretty cool concept.

Last up to cover: if you're really interested in performance, all of the BigQuery logs are actually output to Google Cloud Storage and then read into Stackdriver. Stackdriver is the performance monitoring tool not just for BigQuery logs, but for all other Google Cloud Platform products as well. So you can export all of your BigQuery logs directly into Stackdriver and take a look at things like your worker, or slot, utilization, how many queries you have in flight, how many bytes you've actually uploaded, and how much you're storing on persistent disk. And if you take that to the next level, on this next slide you'll see that using Stackdriver logs, you can pull out individual queries that are potentially performing poorly, or the queries that are costing your organization the most under that pricing model. You can perform what I like to call the BigQuery audit, and if you're interested, we'll include a link in the slides that you can download to follow this tutorial. Now, one of the big tips I have for you: if you're setting up your Google Cloud project for the first time, be sure to go into Billing and, as the second bullet point there says, tick the box that exports your billing logs into BigQuery. That will make all of your auditing, all of your billing history, directly queryable in BigQuery, so you can run SQL analysis on it. And then you'll be able to build Google Data Studio dashboards like the one you see on the left.

It's time for a recap. On performance: since you're paying for the total number of bytes that you process, you first want to limit the number of columns you're returning, and also consider filtering out rows by using that WHERE filter whenever you can. This doesn't mean you can't write queries that process your entire data set; BigQuery, of course, was built for petabyte scale. You just want to be mindful of the resources you're consuming. We then covered the query explanation map, where we get those cool visual cues about the types of work most demanded by your queries. Note here that a large difference between the average and the max could indicate heavy data skew, which can be helped by filtering your data as early as possible. We also covered some SQL bad behavior, like selecting every record and ordering it without using a LIMIT clause. Finally, if you only access recent data, like the most recent events, consider using table partitioning to reduce the total number of bytes scanned by your query.
Now, let's troubleshoot some queries in this next lab.