This course introduces you to sampling and exploring data, as well as basic probability theory and Bayes' rule. You will examine various types of sampling methods, and discuss how such methods can impact the scope of inference. A variety of exploratory data analysis techniques will be covered, including numeric summary statistics and basic data visualization. You will be guided through installing and using R and RStudio (free statistical software), and will use this software for lab exercises and a final project. The concepts and techniques in this course will serve as building blocks for the inference and modeling courses in the Specialization.

Instructor:

Mine Çetinkaya-Rundel

Associate Professor of the Practice

Transcript

In this video on visualizing numerical data, we will discuss scatter plots for paired data and other visualizations for describing distributions of numerical variables. The data come from Gapminder, which pulls this information from a variety of data sources. We will be working with two numerical variables: income per person, in US dollars, and life expectancy, in years, for the year 2012. Each observation in this data set is a country. The data set contains data from most but not all countries, since this information wasn't available for certain countries. A common tool for visualizing the relationship between two numerical variables is a scatter plot. To identify the explanatory variable in a pair of variables, we identify which of the two is suspected of affecting the other and plan an appropriate analysis. Since we might suspect that the economic wealth of a country might affect the average life expectancy of its people, we have set up our analysis with income as the explanatory variable and life expectancy as the response variable. Generally, in a scatter plot, we place the explanatory variable on the x axis and the response variable on the y axis. It's very important to note that labeling variables as explanatory and response does not guarantee that the relationship between the two is actually causal, even if an association between the two variables is identified. We use these labels only to keep track of which variable we suspect affects the other. In fact, since these data are observational and do not come from a randomized controlled experiment, we know that we can only talk about correlation and not causation between the two variables. So what is the relationship between these two variables? The best way to answer this question is to visualize a line or a curve going through the cloud of data.
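One rough way to trace such a curve with numbers rather than by eye is to sort the points by the explanatory variable, split them into a few groups, and average each group. The sketch below is in Python for illustration (the course labs use R). The function name `binned_trend` and the income and life expectancy values are made up for this example and are not the Gapminder data.

```python
from statistics import mean


def binned_trend(xs, ys, n_bins=4):
    """Summarize a scatter of (x, y) pairs as a coarse curve.

    Sorts the pairs by x, splits them into n_bins consecutive groups of
    roughly equal size, and returns (mean x, mean y) for each group.
    """
    pairs = sorted(zip(xs, ys))
    size = len(pairs) / n_bins
    trend = []
    for i in range(n_bins):
        group = pairs[round(i * size):round((i + 1) * size)]
        if group:  # skip empty groups when there are few points
            trend.append((mean(x for x, _ in group),
                          mean(y for _, y in group)))
    return trend


# Hypothetical values, invented for illustration only.
income = [1000, 2000, 4000, 8000, 16000, 32000, 64000, 90000]
life_exp = [55, 60, 66, 71, 75, 79, 81, 82]

trend = binned_trend(income, life_exp)
for avg_income, avg_life in trend:
    print(f"income ~{avg_income:>8.0f}  life expectancy ~{avg_life:.1f}")
```

With these invented values the group averages rise quickly at low incomes and by much less at high incomes, which is the kind of pattern a hand-drawn curve is meant to capture.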
So here I'm drawing a curve that first shows an increase in life expectancy as income increases, and then the relationship levels off such that countries with income levels above a certain point still have roughly 80 to 85 years of average life expectancy. The relationship is very strong, with not too much scatter around the curve. In addition, there are a few countries that stand out from the rest as potential outliers. We will discuss those in detail in a moment. Let's summarize quickly what we've learned about relationships between numerical variables. When evaluating the relationship between two numerical variables, we should make sure to examine the direction of the relationship: is it increasing or decreasing? The shape of the relationship: is it linear, or does it follow some other form? The strength of the relationship: is it strong, indicated by little scatter, or weak, indicated by lots of scatter? And any potential outliers. These can be individual observations or a group of observations. It's always a good idea to investigate these points carefully to make sure they're not data entry errors. Let's take a closer look at the outliers. Some of them have pretty high income levels: Luxembourg, a rich country with a small population and a high income per person; Macao, a special administrative region of China; and Qatar, a country with a small population and lots of oil. Another potential outlier is Nepal, where the life expectancy is considerably higher than what would be expected for its low income level compared to others. These are countries that we would indeed expect to behave differently than the majority of the countries, so it's not surprising that they stand out from the rest. One naive way of dealing with outliers in data analysis is to immediately exclude them. But we're calling that approach naive because it's often not the right approach.
This is a good example of a case where the outliers might be very interesting, and handling them with careful consideration of the research question and other associated variables is important. Now, let's take a look at the distributions of the variables individually. One good way of visualizing the distribution of a numerical variable is a histogram. In a histogram, data are binned into intervals, and the heights of the bars represent the number of cases that fall into each interval. In other words, a histogram provides a view of the data density: higher bars represent where data are relatively more common. For example, we can see that the majority of the countries have average life expectancies between 65 and 85 years. Histograms are also very useful for identifying shapes of distributions. In this case, the distribution of life expectancies appears to be left skewed, which is expected due to the leveling off of life expectancies we identified earlier. There's a physiological limit to how long people live, and in most countries people live up to around that age, but there are some countries with much lower life expectancies, and fewer and fewer countries with lower and lower life expectancies, resulting in a long left tail. The distribution of income, on the other hand, is right skewed. Incomes can't be negative, so we have a natural boundary at zero, but there is no real upper limit to how high incomes can go. However, as we go higher and higher, we have fewer and fewer countries with such high levels of personal income, resulting in a long right tail. A shared characteristic between these two distributions is that they're both unimodal. Let's focus on these statements on skewness and modality for a bit. First off, skewness. Distributions are said to be skewed to the side of the long tail. In a left skewed distribution, the longer tail is on the left, the negative end. If no skewness is apparent, then the distribution is said to be symmetric.
And in a right skewed distribution, the longer tail is on the right, the positive end. As you can see, the best way to assess the shape of distributions is to step back and imagine a smooth curve outlining the distribution, instead of focusing on the jagged edges of the bars in the histogram. Another important aspect of shape is modality. A distribution might be unimodal with one prominent peak, bimodal with two prominent peaks, or uniform with no prominent peaks. With more than two prominent peaks, a distribution is usually said to be multimodal. The distribution that you will work with most closely in an introductory statistics course is unimodal: the normal distribution, which you may also know as the bell curve. A bimodal distribution might indicate that there are two distinct groups in your data. For example, here's a distribution of heights of individuals at a preschool. The first peak might be the kids and the second might be the teachers. A uniform distribution means there's no apparent trend in the data: high and low values of the variable are equally likely to occur. Here's a distribution of the last digits of a random sample of people's social security numbers. As expected, the data show no trend, since it's just as likely to have a social security number that ends with a zero as with a six or a nine. Assessing modality, like skewness, is also best done by imagining a smooth curve outlining the distribution. Here is a trick: think of the bars of the histogram as wooden blocks, imagine dropping a limp spaghetti noodle over them, and picture how it would fall over and between the blocks. Peaks that are far from each other will likely result in distinct prominent peaks, and peaks that are close to each other, like the ones around zero and two, may not. Identifying the number of modes is not an exact science, and not one that you should dwell on too much.
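To make the binning and peak counting concrete, here is a minimal sketch in Python (for illustration; the course labs use R). The helper names `histogram_counts` and `count_peaks` and the preschool heights are invented for this example. The peak rule used, a bar strictly taller than both neighbors, is a crude stand-in for the smooth curve idea and not a standard definition of a mode.

```python
def histogram_counts(values, bin_width, start):
    """Count how many values fall in each interval [left, left + bin_width).

    Bins begin at `start`, which is assumed to be <= min(values).
    """
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return counts


def count_peaks(counts):
    """Count bars strictly taller than both neighbors (a crude mode count)."""
    padded = [0] + counts + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


# Hypothetical heights in cm at a preschool (12 kids, 8 teachers), invented.
heights = [95, 98, 100, 102, 103, 104, 105, 106, 108, 110, 112, 115,
           160, 165, 168, 170, 171, 172, 175, 180]

counts = histogram_counts(heights, bin_width=10, start=90)
for i, c in enumerate(counts):
    left = 90 + 10 * i
    print(f"{left:>3}-{left + 10:<3} {'#' * c}")

print("peaks with 10 cm bins: ", count_peaks(counts))
print("peaks with 100 cm bins:", count_peaks(histogram_counts(heights, 100, 90)))
```

With 10 cm bins the two groups show up as two separate peaks. With a single 100 cm bin everything collapses into one bar, a reminder that the number of modes you see depends partly on how the data are binned.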
Usually all you need to do is to determine whether the distribution is uniform, unimodal, or something else. We should also note that the chosen bin width of the histogram can alter the story the histogram is telling. When the bin width is too wide, we might lose interesting details. When the bin width is too narrow, it might be difficult to get an overall picture of the distribution. The ideal bin width depends on the data you're working with, so you should try playing with it until you're satisfied with the visualization. Let's go back to the life expectancy data we were working with. Another technique for visualizing such data is a dot plot. A dot plot is especially useful when individual values are of interest. However, as the sample size increases, the dot plot may get too busy. Yet another visualization technique, one that is especially useful for highlighting outliers, is a box plot. A box plot also readily displays the median, the midpoint of the distribution, which is the thick line inside the box, and the interquartile range, which is the width of the box. According to this box plot, the median life expectancy is roughly 73 years, and the middle 50% of countries have average life expectancies between 65 and 77 years. In addition, countries with life expectancies below 48 years are considered to have unusually low life expectancies. A box plot of the income distribution shows the same right skewed distribution we identified before, and the outlying countries with unusually high per person income levels stand out in this visualization as well. One way of determining the skewness of a distribution from a box plot is to imagine what the histogram would look like. The peak of the distribution will be roughly around the median, and the tails will extend out to the whiskers and any outlying points in the box plot. There's one more visualization method that we will discuss in this video: an intensity map.
For certain types of data, like the ones we've been working with in this video, it might be useful to view the spatial distribution. These displays reveal trends in the data that many of the others did not. For example, we can see that both income and life expectancy are lower in Africa but higher in North America and Europe.
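Returning to the box plot discussed earlier, the numbers it displays can be computed directly. The sketch below is in Python for illustration (the course labs use R). It assumes the common convention that points more than 1.5 times the interquartile range beyond the quartiles are flagged as outliers; the video does not state the rule it uses. The function name `box_plot_summary` and the life expectancy values are invented and are not the Gapminder data. Quartile conventions also differ between tools, so results may not match other software exactly.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the quartiles, IQR, fences, and flagged points for a box plot.

    Assumes the common 1.5 * IQR rule for flagging outliers.
    """
    q1, median, q3 = quantiles(values, n=4)  # default "exclusive" method
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Hypothetical life expectancies in years, invented for illustration only.
life_exp = [46, 60, 64, 66, 68, 70, 72, 73, 74, 75, 76, 78, 80, 82, 83]

summary = box_plot_summary(life_exp)
for name, value in summary.items():
    print(f"{name:>12}: {value}")
```

In this made-up sample the single value far down the left tail falls below the lower fence and is the only point flagged, which matches how a box plot draws unusually low values as separate points beyond the whisker.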