A Framework for Feature Engineering, Part 1: Intuition

The data science field revolves around discussions of quantification. So it’s noticeable when something in that field gets regularly described as “an art,” particularly something whose name includes the word “engineering.”

That’s the case with feature engineering—the secret sauce of the data science process that springs, in theory, from the experience and ingenuity of the people building the models.

But it’s not an esoteric art, is it? Can we articulate practices and principles around feature engineering? If we find ways to approach feature engineering as a process rather than as a bolt of genius, we’ll have an easier time consistently engineering useful features.

So what is feature engineering?

Amit Shekhar describes it as the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. Put another way, feature engineering turns your inputs into things the algorithm can understand.

Shekhar gives a couple of examples. One of them: extracting an “Hour of Day” feature from flight times and using that feature to predict timeliness.
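Here is a rough sketch of what that extraction might look like. It assumes the flight times arrive as ISO 8601 strings; the sample timestamps and the function name `hour_of_day` are made up for illustration.

```python
from datetime import datetime


def hour_of_day(flight_time: str) -> int:
    """Return the hour (0-23) from an ISO 8601 timestamp string."""
    return datetime.fromisoformat(flight_time).hour


# Hypothetical raw flight times; not real flight data.
raw_flight_times = [
    "2018-03-02T06:15:00",
    "2018-03-02T17:40:00",
    "2018-03-02T23:05:00",
]

hours = [hour_of_day(t) for t in raw_flight_times]
print(hours)  # [6, 17, 23]
```

The raw timestamp carries the date, minutes, and seconds as well. The engineered feature keeps only the part we suspect is related to timeliness.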

So what is a feature, and why would we want to extract one?

Choosing Where to Go for Dinner

Suppose you are deciding where you would like to go (alone) for dinner. You make this decision all the time, so you have decided to build a model to do it for you.

Suppose, also, that you know a couple of things about several restaurants: their addresses and the type of food they serve.

You eat out on Fridays. You’re feeling very tired, but you’re happy that the week is over. You want a restaurant that is convenient to your home: in the same neighborhood, perhaps, or connected via public transport without any bus or train transfers. You also want a restaurant that will serve you very flavorful food, but not very spicy.

So there’s the information that’s important to you to make a decision: convenience to your house, flavor, and spiciness. Your data doesn’t have that information—at least, not directly. But it does have the prerequisites to get that information: you can figure out neighborhoods as well as transport routes based on addresses. You can fit a probabilistic model on flavor based on the type of food it is (probably more flavorful at an Ethiopian restaurant and less flavorful at a potato shack). You can do the same for spiciness (Indian food would probably come in on top at spiciness, while British food would not).
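As a minimal sketch of the flavor and spiciness idea, a lookup table can stand in for the fitted model. The scores below are invented placeholders, not estimates from any data, and `taste_features` is a hypothetical helper name.

```python
# Hypothetical scores in [0, 1]; placeholders for what a fitted model might estimate.
CUISINE_PROFILE = {
    "ethiopian": {"flavor": 0.9, "spiciness": 0.6},
    "indian": {"flavor": 0.9, "spiciness": 0.8},
    "british": {"flavor": 0.4, "spiciness": 0.1},
    "potato shack": {"flavor": 0.2, "spiciness": 0.1},
}

# Neutral guess used when a cuisine is not in the table (an assumption).
NEUTRAL = {"flavor": 0.5, "spiciness": 0.5}


def taste_features(cuisine: str) -> dict:
    """Map a raw cuisine label to flavor and spiciness features."""
    return CUISINE_PROFILE.get(cuisine.lower(), NEUTRAL)


print(taste_features("Ethiopian"))
```

The raw data only says "Ethiopian." The features say what you actually care about: how flavorful and how spicy dinner is likely to be.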

Why feature engineer?

Each restaurant has an address. There is a relationship between these addresses and where you’ll choose to go to dinner. What is that relationship? Well, if we extract neighborhoods from each address, your preference might lean toward one or two neighborhoods. If we extract public transport routes between the restaurant addresses and your address, we might notice your preference lean toward simpler or shorter routes. This might simplify even further to a binary feature: “Is this place convenient to you?” Maybe we can measure that even more precisely, for example: “Is this place within 15 minutes’ commute from your house?”
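One possible way to write down that binary feature is sketched below. It assumes some routing step has already turned the two addresses into a commute time and a transfer count; the `Route` class, the restaurant names, and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Route:
    """A hypothetical public transport route from your house to a restaurant."""
    minutes: int
    transfers: int


def is_convenient(route: Route, max_minutes: int = 15) -> bool:
    """Binary feature: reachable within max_minutes with no transfers."""
    return route.minutes <= max_minutes and route.transfers == 0


# Made-up routes for illustration only.
routes = {
    "Restaurant A": Route(minutes=12, transfers=0),
    "Restaurant B": Route(minutes=10, transfers=1),
    "Restaurant C": Route(minutes=35, transfers=0),
}

features = {name: is_convenient(route) for name, route in routes.items()}
print(features)
```

Whether "convenient" should mean 15 minutes, zero transfers, or both is a judgment call about what drives your choice, which is exactly the kind of decision feature engineering involves.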

The restaurant’s address and your address each have only an indirect relationship to your choice, but the answer to this question informs your choice directly. That theoretical distinction has three practical implications for our models.

The practical reasons:

1. The right features allow us to get accurate results with fewer confounds.

With raw addresses, the model can run into confounding factors. What if it finds “Clark Street” in the addresses of restaurants you like to choose and starts up-weighting restaurants on “Clark Lane” all the way across town? That type of thing is less likely to happen if you have a numeric or categorical variable representing convenience. Removing confounds makes the patterns in the data more apparent, which in some cases allows models to create predictive value with fewer training examples than they could with raw features alone.
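To see how that confusion could arise, here is a small sketch that assumes the model sees each address as a bag of lowercase words. The addresses are invented, and real models may tokenize text differently.

```python
def address_tokens(address: str) -> set:
    """Naive bag-of-words features: the lowercase words in an address."""
    return set(address.lower().replace(",", "").split())


# Hypothetical addresses for illustration.
liked = address_tokens("2100 N Clark Street")
across_town = address_tokens("88 Clark Lane")

# Any word the two addresses share looks like common ground to the model.
shared = liked & across_town
print(shared)  # {'clark'}
```

The two restaurants have nothing in common from your point of view, yet they share a token. A convenience feature leaves no room for that accident.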

2. The right features allow us to get more robust results.

Your restaurant choice has less to do with the restaurant’s address or your address than with the relationship between those two addresses. Suppose, now, that your lease ends and you move to another neighborhood. The restaurants’ addresses don’t change, but your preference for them does. Why? Because different restaurants are convenient to you now, and that relationship (convenience) is the feature that matters to you. For this reason, a model that works off of a “convenience” feature is more robust than a model that works solely off an “address” feature. Engineering the right features allows our models to adapt better to new situations or changes in underlying data.
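The move can be sketched in a few lines. This assumes locations are (x, y) positions on a street grid and uses distance in city blocks as a stand-in for commute time; the coordinates, the 10-block threshold, and the function name are all made up.

```python
def convenient(home, restaurant, max_blocks=10):
    """Binary convenience feature from two (x, y) grid positions.

    Manhattan distance in city blocks is an assumed stand-in for commute time.
    """
    distance = abs(home[0] - restaurant[0]) + abs(home[1] - restaurant[1])
    return distance <= max_blocks


# Hypothetical restaurant locations; these never change.
restaurants = {"Restaurant A": (2, 3), "Restaurant B": (40, 38)}

old_home = (0, 0)
new_home = (42, 35)

before = {name: convenient(old_home, loc) for name, loc in restaurants.items()}
after = {name: convenient(new_home, loc) for name, loc in restaurants.items()}

print(before)
print(after)
```

The rule "prefer convenient restaurants" stays the same across the move. Only the feature values are recomputed, so nothing about the model has to be relearned.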

3. The right features allow us to get accurate results with less input data.

If we can take two pieces of information (the restaurant’s address and your address) and reduce them to one piece of information (is this restaurant convenient for you?), then our model can get the same or better results with fewer inputs. Reducing the amount of information that the model has to process can mean major speed improvements for the model. Going from two pieces of information to one per example doesn’t sound like a huge saving, and maybe it’s not. This advantage becomes more material when we are working with individually data-rich examples like texts, images, videos, and sound files.

How do we engineer features?

It surprises me how little literature I have found on this topic. It’s often described as an art, but even in artistic pursuits, there are practices that we can extract to put ourselves on the path to a well-crafted piece. My hope, in the rest of this series, is to explore some options for practices that may generalize well for many machine learning problems.

Let’s start with an example from optical character recognition: the classification of handwritten numbers. For this problem, many models use the pixels of the image as features. Each of the images in the MNIST handwritten number dataset comes represented as a 28 x 28 array of numbers, with larger numbers corresponding to a darker color for that pixel.
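To make "pixels as features" concrete, the sketch below builds a stand-in image by hand rather than loading the real dataset. The stroke and its intensity value are invented for illustration.

```python
# A blank 28 x 28 image: 28 rows of 28 pixel intensities (0 = background).
image = [[0] * 28 for _ in range(28)]

# Draw a crude vertical stroke, loosely like a handwritten "1" (made-up values).
for r in range(4, 24):
    image[r][14] = 255

# Flatten row by row so that each pixel becomes one input feature.
pixel_features = [pixel for row in image for pixel in row]
print(len(pixel_features))  # 784
```

Every example handed to the model this way carries 784 numbers, most of which are background.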

When we run a model on the pixels from 2,000 examples of that dataset, we get pretty high accuracy.

But suppose that, for some reason, we needed this model to run fast, run on less data, or generalize better. In those cases, it might be worthwhile for us to think about the kinds of features we might engineer from these handwritten numbers.
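As one hypothetical illustration of what "fewer features" could mean here (not necessarily an approach this series recommends), the sketch below summarizes a 28 x 28 image by how many inked pixels fall in each row and each column. The image is a hand-built stand-in, and `ink_profile` is an invented name.

```python
def ink_profile(image):
    """Summarize an image by the count of inked pixels in each row and column."""
    row_counts = [sum(1 for pixel in row if pixel > 0) for row in image]
    col_counts = [
        sum(1 for row in image if row[c] > 0) for c in range(len(image[0]))
    ]
    return row_counts + col_counts


# Stand-in image: a crude vertical stroke on a blank 28 x 28 grid.
image = [[0] * 28 for _ in range(28)]
for r in range(4, 24):
    image[r][14] = 255

features = ink_profile(image)
print(len(features))  # 56
```

That is 56 numbers per example instead of 784. Whether a summary like this keeps enough information to tell digits apart is an empirical question that this sketch does not answer.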

We start talking about some options for feature engineering on this dataset in the next post. In the meantime, it’s worth our time to look at those handwritten examples in the image above and consider: when we are looking at a number with our eyeballs, what is it about a handwritten “one” that allows us to recognize it as a 1? What is the essence, say, of “three-ness” that lets us recognize a 3?

For now, that’s the question that our feature engineering examples will aim to answer. Write down your conclusions so we can compare notes!

Conclusion

Feature engineering is often described as “an art,” with surprisingly little literature on how to do it well. In this series of posts, we’ll attempt to derive some feature engineering practices that we can apply to most machine learning problems.

What is feature engineering? Amit Shekhar describes it as “the process of transforming raw data into features that better represent the underlying problem.” It often takes the form of turning raw data, like addresses or food types, into more complex features like convenience or conformity to a diner’s tastes.

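There are a few big reasons that feature engineering may be useful to us. The theoretical one is that a well-chosen feature informs the outcome directly, while the raw data relates to it only indirectly. That distinction has three practical implications: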

The right features allow us to get accurate results with fewer confounds.

The right features allow us to get more robust results.

The right features allow us to get accurate results with less input data.

The next step is to figure out what kind of framework we can apply to how we engineer features. We’ll start with an example from optical character recognition: the classification of handwritten numbers.

So, what makes a one a one? What is the “three-ness” that makes a three recognizable as a three? In the next post, we start to look at some approaches to answering questions like these.