Instructors

Google Cloud Training

Video transcript

Hi. My name is Carl Osipov, and I'm a program manager at Google. I work with our customers that use Google Cloud, and I help them succeed with deploying machine learning systems that are scalable and production-ready. This section of the module covers input data preprocessing and feature creation, two techniques that can help you prepare a feature set for a machine learning system. To get started, you'll take a look at examples of preprocessing and feature creation, and learn about the challenges involved in applying these techniques as part of feature engineering. Then, in the remaining two parts of the section, you will see how tools like Google Cloud Dataflow and Cloud Dataprep can help you with these challenges. Okay. First, here are a few examples that will give you some intuition as to when you should use preprocessing and feature creation. Some values in a feature set need to be normalized or rescaled before they can be used by the ML model. Here, scaling means changing a real-valued feature, like a price, to a range from zero to one using the formula shown. Rescaling can be done for many reasons, but most of the time it's done to improve the performance of ML training, specifically the performance of gradient descent. Notice that to compute the rescaling formula, you need to know both the minimum and maximum values for a feature. If you don't know these values, you may need to preprocess your entire dataset to find them. Preprocessing can also be useful for categorical values in the dataset, like names of cities, as shown in the code snippet on the slide. For example, to use a one-hot encoding technique in TensorFlow, which will help you represent different cities as binary-valued features in your feature set, you can use the categorical_column_with_vocabulary_list method from the feature columns API. To use this method, you need to pass it a list of values, which in this example are different city names.
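The rescaling formula mentioned above is standard min-max scaling. A minimal sketch in plain Python (the example prices are illustrative, not from the course):

```python
def rescale(value, min_value, max_value):
    """Min-max scaling: map a real-valued feature into [0, 1].

    min_value and max_value must be computed over the entire
    dataset in a separate preprocessing pass, which is why the
    full dataset has to be scanned before training."""
    return (value - min_value) / (max_value - min_value)

# A $15 price with a dataset minimum of $5 and maximum of $25:
print(rescale(15.0, 5.0, 25.0))  # -> 0.5
```

The minimum maps to 0 and the maximum to 1, so any value seen at serving time that falls outside the training-time range would land outside [0, 1], another reason the statistics must come from the full dataset.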
If you don't have this dictionary of values for a key, you may also want to create it as a preprocessing step over the entire dataset. In this module, you'll learn about three technologies that will help you implement preprocessing. BigQuery and Apache Beam will be used to process the full input dataset prior to training. This covers operations like excluding some data points from the training dataset, and also computing summary statistics and vocabularies over the entire input dataset. Keep in mind that for some features, you will need statistics over a limited time window. For example, you may need to know the average number of products sold by a website over the past hour. For these types of time-windowed features, you will use Beam's batch and streaming data pipelines. Other features that can be preprocessed one data point at a time can be implemented either in TensorFlow directly or using Beam. So, as you can see, Apache Beam and the complementary Google Cloud technology called Cloud Dataflow will be important to this part of the module. So, first, I will describe some limitations of using only BigQuery and TensorFlow for feature engineering, and then explain how Beam can help. BigQuery is a massively scalable, very fast, and fully managed data warehouse available as a service from Google Cloud. BigQuery can help you with feature engineering because it lets you use standard SQL to implement common preprocessing tasks. For example, if you are preprocessing a dataset with 10 billion records of taxi rides in New York City, some of the records may happen to have bogus data, like expensive rides showing a distance of zero miles. You can write a SQL statement to filter out the bogus data from your training examples dataset and run the SQL on BigQuery in seconds. Of course, you can also write other statements using standard SQL math and data-processing functions.
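The taxi-ride filtering described above could look like the sketch below. The SQL string and the table and column names are assumptions for illustration, not the course's actual query; the Python predicate shows the same filtering logic applied one record at a time:

```python
# Illustrative BigQuery standard SQL for dropping bogus rides
# (table and column names are hypothetical, not from the course).
QUERY = """
SELECT fare_amount, trip_distance, pickup_datetime
FROM `project.dataset.taxi_rides`
WHERE trip_distance > 0
"""

def is_valid_ride(ride):
    """Same filtering predicate, applied to a single record."""
    return ride['trip_distance'] > 0

rides = [
    {'trip_distance': 0.0, 'fare_amount': 52.0},  # bogus: zero miles
    {'trip_distance': 3.2, 'fare_amount': 12.5},
]
valid = [r for r in rides if is_valid_ride(r)]
print(len(valid))  # -> 1
```

Whichever form you use, the key point from the transcript holds: the same exclusion rule must be applied consistently wherever training examples are produced.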
These can be valuable for simple calculations, like additions over source data, and also for parsing common data formats, for instance, to extract details about the time of day from records with timestamps. If you do decide to use SQL to preprocess training examples, it is absolutely critical that you take care to implement exactly the same preprocessing logic in TensorFlow. Next, you will see two approaches for how to write this preprocessing code in TensorFlow. In practice, you may find yourself using the first or the second approach, and sometimes you may use both. Keep in mind that many common preprocessing steps can be written using one of the existing methods from the TensorFlow feature columns API. For example, if you need to change a real-valued feature into a discrete one, you can use the bucketized_column method. If the feature preprocessing step that you need is not available in the TensorFlow APIs, you can modify the functions used in the input parameters during training, validation, and test. The upcoming slides will explain this in more detail. With the first option, you implement your own preprocessing code. In this example, the preprocessing code is packaged in the add_engineered method, and the implementation does not need any global statistics from the source dataset. To compute the Euclidean distance feature from the existing lat/long coordinates for a data point, the code just returns the original features dictionary along with the new feature value computed using the distance formula. To ensure that the Euclidean distance feature gets included during the training, evaluation, and serving steps, all of the corresponding input_fn functions wrap the call to the add_engineered method around the unpreprocessed feature set. If the preprocessing step that you need already exists in the TensorFlow API, you're in luck, because you can simply call the appropriate helper methods when defining your feature columns list.
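The add_engineered pattern described above can be sketched as follows. The feature names (pickup_lat and so on) are assumptions for illustration; the course's slide may use different keys:

```python
import math

def add_engineered(features):
    """Add a Euclidean distance feature computed from existing
    lat/long coordinates. No global dataset statistics are needed,
    so this can run one data point at a time."""
    lat_diff = features['pickup_lat'] - features['dropoff_lat']
    lon_diff = features['pickup_lon'] - features['dropoff_lon']
    features['euclidean'] = math.sqrt(lat_diff ** 2 + lon_diff ** 2)
    # Return the original features along with the new one.
    return features

# Each input_fn (train, eval, serving) would wrap its feature set
# in add_engineered so all three see the identical feature.
example = {'pickup_lat': 40.7, 'pickup_lon': -74.0,
           'dropoff_lat': 40.8, 'dropoff_lon': -73.9}
print(round(add_engineered(example)['euclidean'], 3))  # -> 0.141
```

Wrapping every input_fn in the same call is what keeps training and serving consistent; forgetting it in just one of the three paths is a classic source of training/serving skew.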
In this example, the bucketized_column method is used to take the latitude coordinates from the source data and make sure that the values are in the range from 38 to 42. Next, the original values for the latitude are placed into one of several mutually exclusive buckets, such that the number of buckets in the range is controlled by the nbuckets parameter. Maintaining preprocessing code in SQL for BigQuery and in TensorFlow can get complex and difficult to manage. As you saw earlier, one of the advantages of using Apache Beam to preprocess features is that the same code can be used during both training and serving of a model. However, when using Apache Beam, you will not have access to the convenient helper methods from TensorFlow. This means, as shown in this example, that you will need to implement your own preprocessing code. In this part of the module, you have reviewed specific examples where Apache Beam can help you with preprocessing.
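The bucketing logic that bucketized_column provides can be sketched in plain Python, which is roughly what you would have to write yourself in a Beam pipeline where the TensorFlow helper is unavailable. The boundary values mirror the latitude range of 38 to 42 from the slide; the helper function itself is an illustrative stand-in, not the TensorFlow implementation:

```python
import bisect

def bucketize(value, boundaries):
    """Place a real value into one of len(boundaries) + 1 mutually
    exclusive buckets, mimicking the behavior of TensorFlow's
    bucketized_column: bucket 0 is below the first boundary."""
    return bisect.bisect_right(boundaries, value)

# nbuckets controls how many buckets cover the latitude range [38, 42].
nbuckets = 4
width = (42 - 38) / nbuckets
boundaries = [38 + i * width for i in range(nbuckets + 1)]  # [38, 39, 40, 41, 42]
print(bucketize(40.7, boundaries))  # -> 3
```

A discretized feature like this lets a linear model learn a separate weight per latitude band instead of a single slope across the whole range.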