The course will begin with what is familiar to many business managers and to those who have taken the first two courses in this specialization. The first set of tools will cover data description, statistical inference, and regression. We will extend these concepts to other statistical methods used for prediction when the response variable is categorical, such as winning or not winning an auction. In the next segment, students will learn about tools for identifying important features in a dataset, which can reduce its complexity or help explain behavior.

Reviews


4.3 (33 ratings)

5 stars: 19 ratings
4 stars: 9 ratings
3 stars: 3 ratings
1 star: 2 ratings

From the lesson

Module 4: Curse of Dimensionality

There has been a tremendous increase in the amount of data generated in industry via sensors, digital platforms, user-generated content, and other sources. For example, sensors continuously record data and store it for later analysis. In the way data gets captured, there can be a lot of redundancy, and with more variables comes more trouble! There may be very little (or no) incremental information gained from these additional sources. This is the problem of a high number of unwanted dimensions. To avoid this pitfall, data transformation and dimension reduction come to the rescue by extracting fewer dimensions while ensuring that they still convey the full information concisely.

Taught By

Sridhar Seshadri

Professor of Business Administration

Transcript

So here we run into a problem which we call the curse of dimensionality. Okay. So why is more not good enough? The reason is that in a single dimension you just have a line, and you can have data points on that line. You can have 100 data points on a line, and they can be really close to each other. Now let's say I move to a plane. I still have 100 data points, but now distances are being measured along two axes, so the points start looking farther apart. Now let's take a room and put 100 data points in it. Suddenly, even though they are the same 100 points, the distances between the points are even bigger. Imagine I had 100 points in ten-dimensional space. The chance of one point being close to another is very small, because the points just don't cover the entire space. So one of the biggest problems of having a huge number of dimensions is that the distances between points keep increasing. The data starts looking very sparse. Because of that, we run into an obvious problem: we cannot predict accurately, and we cannot say that one phenomenon captures another. So causality is an issue, and predictive accuracy is an issue. You have to work through this exercise a couple of times mentally to really grasp what the curse of dimensionality is. I'm going to apply it to three toy examples. The examples I'm giving you are really not big data. But let's say I have 200 stocks and I want to create an investment portfolio out of them. Rather than using all 200 stocks directly, I might first capture what these 200 stocks have in common and reduce them into groups of stocks before using them to create the portfolio. You can say value stocks. You can say growth stocks. You can say mid-cap. You can probably have fixed income. So you start asking: is there a way of grouping the stocks?
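The growth in distances described above is easy to see numerically. The sketch below (using NumPy, purely for illustration) scatters 100 random points in a unit "line," "square," "room," and 10-dimensional hypercube, and measures the average distance between pairs of points as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pairwise_distance(n_points, n_dims):
    """Average Euclidean distance between random points in the unit hypercube."""
    pts = rng.random((n_points, n_dims))
    diffs = pts[:, None, :] - pts[None, :, :]          # all pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))         # pairwise distances
    return dists.sum() / (n_points * (n_points - 1))   # average over distinct pairs

# Same 100 points, but the average distance grows with the dimension:
for d in (1, 2, 3, 10):
    print(f"{d:>2} dimensions: mean distance {mean_pairwise_distance(100, d):.3f}")
```

The same 100 points that looked crowded on a line become sparse in ten dimensions, which is exactly why prediction gets harder as dimensions pile up.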
So then, instead of working with the entire set of stocks, you start looking at a smaller number of groups. We will do this example. Second, take universities. Universities have lots of features, as you know: their size, their student body, the programs they offer, their reputation, the amount of research dollars they have, the male-female ratio, the admission ratio, the number of campuses they have, the faculty-student ratio. Which of these are important? How can we use any of this to understand what we pay for an education, and why education costs so much, or so little? A third problem: think of serving an ad. I get visitors to my website, and I want to serve an ad for buying, let's say, oatmeal. I want to present the ad to the people who are most likely to buy it, so I'm looking at the conversion ratio. When somebody visits my website, which we are designing, we get a lot of data about them, because there are cookies that will tell you: what's their health? What's their wealth? What are their interests? How many cars do they own? Where are they coming from? What's the size of their family? You can get data on a large number of variables if you're willing to pay for it. Now, the question is how to serve the ad, because conversions are what you get paid for: if you can ensure that the conversion ratio is higher, advertisers are willing to pay you more for serving their ads. So here are three simple examples where the variables are probably a mix of numerical and categorical variables. There are many of them, and we want to extract the most meaning out of them. Here are some potential remedies. So how do you eliminate features? The common advice given to most people is to plot the distribution of each feature; maybe some features don't vary much. Let's say the height of people in a university.
By the way, depending on the university, the average height could be six feet or it could be five feet six inches. I have found that in some less-developed areas, the average height actually drops, and the weight drops too. But as far as we're concerned, height may not be the best way of looking at the reputation of a university, and therefore we may be willing to drop it. Second, we can look at features and see whether they are highly correlated. Say the endowment is highly correlated with the number of buildings a university owns; we may not need both variables. So we may decide either to combine features or to drop one of them when they correlate very, very highly. The third method applies when the features are distributed almost the same way. What I mean by that is: here are the people who buy the product, and here are the people who don't buy the product. You look at the distribution of wealth in the two groups and you find the distributions are the same. The advice a statistics professor would give you is to draw box plots side by side. People who buy my product, people who don't buy my product: what's their wealth? Is there a difference in the box plots of wealth between the two groups? People who buy my product, people who don't buy my product: do they own a car? What percentage own a car in each group? Is there a difference? If there is not much of a difference, the feature tells you little. So you can actually do a side-by-side box plot and maybe eliminate a few variables. In this usage, when I say the features are nearly the same, I mean that the distribution of the feature is almost the same in both populations. And of course, there is domain knowledge. If an expert says, "Hey, it doesn't matter," then at first cut you may say, "Okay, let's not worry about that variable." So in the last module, I looked at a clock auction, and I hope you trusted me when I said these are the variables that matter for a clock. Maybe you didn't believe me.
But initially, the model you built was based on the expertise of somebody who has been buying clocks for a long time, has been observing auctions, and knows that the color of the clock is not that important. Actually, it may be; there may be other views. So these are simple potential remedies: look at features that don't vary much and get rid of them; look at features that correlate highly and get rid of a few; look at features whose distributions are very similar across groups and get rid of a few; and finally, ask an expert whether this makes sense, then proceed. All of these are remedies that any modeler will know. In the second part, here is an example where domain knowledge helped us a lot. In one of the projects we are doing in the Indian manufacturing sector, we found there were different clusters of industries. What I did then was go through them and check how the industries, which have three-digit and four-digit classifications, related to each cluster. Based on that, we were able to say, "Okay, this feature doesn't vary a lot in this cluster, so we should not use it in modeling." Take debt, for example. Debt may not be an important variable for a cluster of industries that hold little plant and equipment, so you may say, "Okay, I'm not going to include that when modeling this particular cluster." And we ask ourselves, does it make sense? Say you have heavy industry. Heavy industry needs plant and equipment, so I might use that variable there, but I may not use it for an industry that is mostly doing outsourced work. So domain knowledge is probably underestimated, but it is always very important to check.
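The first two remedies in the summary above, dropping features that barely vary and dropping one of each highly correlated pair, can be sketched in a few lines. The feature names and thresholds below are illustrative assumptions, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical university features (names and numbers invented for illustration).
features = {
    "height_of_students": rng.normal(5.7, 0.01, n),   # barely varies
    "research_dollars":   rng.normal(100, 30, n),
    "admission_ratio":    rng.uniform(0.05, 0.9, n),
}
# num_buildings tracks research_dollars almost perfectly (a redundant feature).
features["num_buildings"] = features["research_dollars"] * 0.5 + rng.normal(0, 1, n)

names = list(features)
X = np.column_stack([features[k] for k in names])

# Remedy 1: drop features whose spread is tiny relative to their mean.
low_var = [k for k in names if np.std(features[k]) / abs(np.mean(features[k])) < 0.01]

# Remedy 2: for each very highly correlated pair, keep only the first member.
corr = np.corrcoef(X, rowvar=False)
redundant = set()
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(corr[i, j]) > 0.95:
            redundant.add(names[j])

keep = [k for k in names if k not in low_var and k not in redundant]
print("dropping (low variance):", low_var)
print("dropping (redundant):   ", sorted(redundant))
print("keeping:                ", keep)
```

Here `height_of_students` falls to the low-variance check and `num_buildings` to the correlation check, leaving the two informative features, which mirrors the verbal recipe: eliminate what doesn't vary, eliminate what duplicates something else, then ask an expert about what remains.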
