Taught by

Google Cloud Training

Transcript

All right. So, here's our list. Good feature columns must be: related to the objective, known at prediction time, numeric with a meaningful magnitude, and backed by enough examples. And lastly, you're going to bring your own human insight to the problem. First up, a good feature needs to be related to what you're actually predicting. You need some kind of reasonable hypothesis for why a particular feature value might matter for this particular problem. You can't just throw arbitrary data in there and hope that there's some relationship somewhere for your model to figure out, because the larger your dataset is, the more likely it is that there are lots of spurious or strange correlations for your model to learn.

Let's take a look at this. What are the good features shown here for horses? Oh, it's a trick question. If you said it depends on what you're predicting, you're exactly right. I didn't tell you the objective, what we're after. If the objective is to find what features make for a good racehorse, you might go with the data points on breed and age. Does the color of the horse's eyes really matter that much for racing? However, if the objective were to determine whether certain horses are more predisposed to eye disease, eye color may be a valid feature. The point here is that you can't look at your feature columns in isolation and say whether or not one is a good feature. It all depends on what you're trying to model, on what your objective ultimately is.

All right, number two: you need to know the value at the time that you're doing the prediction. Remember, the whole reason to build an ML model is so that you can predict with it. If you can't predict with it, there's no point in building and training an ML model. So, a common mistake you're going to see a lot out there is to just look at the data warehouse you have, take all that data, all the related fields, and throw them all into a model.
So, if you take all these fields and just throw them into an ML model, what's going to happen when you go to predict with it? Well, at prediction time you may discover a problem. Your warehouse had all kinds of good historical sales data, and even if it's perfectly clean, that's fine, so you use it as an input for your model. Say you wanted to know how many things were sold on the previous day; that's now an input to your model. But here's the tricky part. It turns out that daily sales data actually comes into your system a month later. It takes some time for the information to come out of your stores. There's a delay in collecting and processing this data, but your data warehouse has the information because somebody went through all the trouble of taking all the data, joining all the tables, and putting it in there. At prediction time, in real time, you don't have that data.

Now, the third key aspect of a good feature is that all your features have to be numeric, and they have to have a meaningful magnitude. Why is that, you ask? Well, ML models are simply adding, multiplying, and weighting machines. When it's actually training, your model is just doing arithmetic operations and computing trigonometric and algebraic functions on your input variables behind the scenes. So, inputs need to be numbers, and crucially, your magnitudes need to have a useful meaning: a value of two should mean, in some sense, twice as much as a value of one.

Let's do a quick example. Here we're trying to predict the number of promo coupons that are going to be used, and we'll look at the different features of that promotional coupon. First up is the discount percentage, like 10 percent off, 20 percent off, et cetera. Is that numeric? Sure, yeah. Absolutely, it's a number. Is the magnitude meaningful? Yeah, in this case absolutely: a 20 percent off coupon is worth twice as much as a 10 percent off coupon.
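Coming back to the prediction-time point for a moment, the sales-data pitfall above can be sketched as a simple pre-training check. This is a minimal illustration; the column names and the availability map are made up for the example, not part of any real pipeline:

```python
# Drop features that won't be available when we actually serve predictions.
# Column names and availability flags are hypothetical examples.

TRAINING_COLUMNS = ["store_id", "day_of_week", "sales_previous_day", "discount_pct"]

# Which features actually arrive before we need to predict?
AVAILABLE_AT_PREDICTION_TIME = {
    "store_id": True,
    "day_of_week": True,
    "sales_previous_day": False,  # lands in the warehouse about a month late
    "discount_pct": True,
}

def usable_features(columns, availability):
    """Keep only columns whose value is known at prediction time."""
    return [c for c in columns if availability.get(c, False)]

print(usable_features(TRAINING_COLUMNS, AVAILABLE_AT_PREDICTION_TIME))
# ['store_id', 'day_of_week', 'discount_pct']
```

The point of the sketch is simply that the screen happens before training, not after you've shipped the model and discovered the input is missing.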
So there's no problem there; this is a perfect example of a numeric input. What about the size of the coupon? Say I define it as 4 square centimeters, 24 square centimeters, and 48 square centimeters. Is that numeric? Yeah, sure. But is a coupon that's 24 square centimeters six times as visible as one that's 4 square centimeters? You can treat this as numeric, but it's unclear whether that magnitude is really meaningful. Now, if this were an ad you're placing, say the size of a banner ad, larger ads are better, and you could argue that the magnitude makes sense. But if it's a physical coupon, something that goes out in the newspaper, you've got to wonder whether a bigger 48 square centimeter coupon really is twice as good as a 24 square centimeter one.

Now, let's change the problem a little bit. Suppose we define the size of the coupon as small, medium, and large. At that point, are small, medium, and large numeric? No, not at all. Now, I'm not saying you can't have categorical variables as inputs to your models; you can. But you can't use small, medium, and large directly. We have to do something smart with them, and we'll take a look at how to do that shortly.

All right, let's go with the font of an advertisement: Arial 18, Times New Roman 24. Is this numeric just because it has numbers in it? No. Well, how do you convert something like Times New Roman to numeric? You could say Arial is number one, Times New Roman is number two, Roboto is number three, and Comic Sans is number four. But those are just number codes; they don't have meaningful magnitudes. If we code Arial as one and Times New Roman as two, Times New Roman is not twice as good as Arial. So, the meaningful-magnitude part is really important. What about the color of the coupon? Red, black, blue. Again, these aren't numeric values, and they don't have meaningful magnitudes.
We could come up with something like RGB values to make the colors numbers, but again, they're not going to be meaningful numerically. If I subtract two colors and the difference between them is three, and then I subtract two other colors and their difference is also three, are those two differences commensurate? No, and that's the problem with magnitude. All right, how about item category: one for dairy, two for deli, three for canned goods? As you've seen before, these are categorical; they're not numeric. Again, I'm not saying you can't use non-numeric values; we simply have to do something to them first, and we'll look at what needs to be done.

So, here's an example. Suppose you have words in a natural language processing system. One thing you can do to make words numeric is to run something called Word2vec. It's a very standard technique: you take all of your words and apply this technique to turn each word into a numerical vector, which, as you know, has a magnitude. At the end of Word2vec, when you look at these vectors, they have a striking property: if you take the vector for man and the vector for woman and subtract them, the difference you get is very similar to the difference you get if you take the vector for king and the vector for queen and subtract them. That's what Word2vec does. So, changing an input variable that's not numeric into something numeric is not a simple matter; it's a little bit of work. You could just throw some random encoding in there, but your ML model is not going to be as good as if you started with a vector encoding that's nice, one that understands the context of things like male and female, man and woman, king and queen. That's what we're talking about when we say numeric features with meaningful magnitudes.
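The vector-arithmetic property described above can be illustrated with a contrived sketch. These tiny "embeddings" are invented so that the man/woman and king/queen relation holds by construction; real Word2vec vectors are learned from a text corpus (for example with the gensim library) and have hundreds of dimensions:

```python
# Contrived 3-dimensional "embeddings" built so that the
# woman-man difference matches the queen-king difference.
import math

embeddings = {
    "man":   [0.9, 0.1, 0.2],
    "woman": [0.9, 0.1, 0.8],
    "king":  [0.1, 0.9, 0.2],
    "queen": [0.1, 0.9, 0.8],
}

def diff(a, b):
    """Element-wise vector difference a - b."""
    return [x - y for x, y in zip(a, b)]

def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return dot / norm

d1 = diff(embeddings["woman"], embeddings["man"])    # the "gender direction"
d2 = diff(embeddings["queen"], embeddings["king"])
print(cosine(d1, d2))  # 1.0 here, by construction
```

In a learned embedding the two difference vectors would be similar rather than identical, but the idea is the same: distances and directions in the vector space carry meaning, which is exactly the "meaningful magnitude" property we want.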
They have to be usable for the arithmetic operations your ML processing phase performs on them. All right, point number four: you need to have enough examples of each feature value in your dataset. A good starting point for experimentation is to have at least five examples of any value before you use it in your model; at least five examples of a value before you use it in training, or validation, and so on. So, going back to our promo code example: if you want to run an ML model on our promotion codes, and you gave a 10 percent discount, you may have plenty of examples of 10 percent promo coupons in your training dataset. But what if you gave a few users a one-time discount code of 87 percent off? Do you think you'll have enough instances of an 87 percent discount code in your dataset for your model to learn from? Likely not. You want to avoid values for which you don't have enough examples to learn from. Notice I'm not saying that you need at least five categories, like 10 percent off, 20 percent off, 30 percent off, and I'm not saying that you need at least five records in a column. I'm saying that for every value of a particular column, you need to have at least five examples. So in this case, we'd want at least five instances of that 87 percent off discount coupon code before we even consider using it for ML.

Last but not least, bring your human insight to the problem, and reason about what makes something a good feature or not. You need subject matter expertise and a curious mind to think of all the ways you could construe a data field as a feature. Remember that feature engineering is not done in a vacuum. After you train your first model, you can always come back and add or remove features for model number two.
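The "at least five examples per value" rule of thumb above is easy to check mechanically. A minimal sketch, using a made-up list of discount percentages:

```python
# Flag feature values that appear fewer than 5 times in the dataset.
# The discount data here is a made-up illustration.
from collections import Counter

MIN_EXAMPLES = 5

discounts = [10] * 40 + [20] * 25 + [87] * 2  # 87% was a rare one-time code

counts = Counter(discounts)
rare_values = sorted(v for v, n in counts.items() if n < MIN_EXAMPLES)
print(rare_values)  # [87]
```

Values that come back in `rare_values` are candidates to drop, bucket together, or collect more data for before training.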