This course introduces you to sampling and exploring data, as well as basic probability theory and Bayes' rule. You will examine various types of sampling methods, and discuss how such methods can impact the scope of inference. A variety of exploratory data analysis techniques will be covered, including numeric summary statistics and basic data visualization. You will be guided through installing and using R and RStudio (free statistical software), and will use this software for lab exercises and a final project. The concepts and techniques in this course will serve as building blocks for the inference and modeling courses in the Specialization.

HD

The tutor makes it really simple. The given examples really helped to understand the concepts and apply it to a wide range of problems. Thank you for this. Wish I could complete the assignments too.

SS

Jul 27, 2017


Great course! Explained the concepts so clear and crisp and the exercises with R are great. The project reinforces all the concepts. All in all, a great course for beginners in statistics and R.

In this lesson

Introduction to Probability

Welcome to Week 3 of Introduction to Probability and Data! Last week we explored numerical and categorical data. This week we will discuss probability, conditional probability, and Bayes' theorem, and provide a light introduction to Bayesian inference. Thank you for your enthusiasm and participation, and have a great week! I'm looking forward to working with you on the rest of this course.

Taught by

Mine Çetinkaya-Rundel

Associate Professor of the Practice

Transcript

In this video, we will define what we mean by independent events, learn ways of assessing independence, and introduce the multiplication rule for independent events.

Two processes are said to be independent if knowing the outcome of one provides no useful information about the outcome of the other. For example, knowing that the coin landed on heads on the first toss does not provide any useful information for determining what the coin will land on in the second toss. The probability of a head or a tail on the second toss is 0.5, regardless of the outcome of the first toss. Therefore, outcomes of two coin tosses are said to be independent.

On the other hand, knowing that the first card drawn from a deck is an ace does provide useful information for calculating the probabilities of outcomes in the second draw. This is for drawing the cards without replacement, in other words, not putting the cards back into the deck after we draw them. For example, the probability of drawing yet another ace is going to be 3 over 51: we have 51 cards left in the deck, and only three of them are aces. The probability of drawing a jack, meanwhile, is going to be 4 over 51, since we still have four jacks left in the deck. Therefore, outcomes of two draws from a deck of cards without replacement are dependent.

Based on this definition, we can develop a general rule for checking for independence between random processes. If the probability of event A occurring, given that event B occurred, is the same as the probability of event A occurring in the first place, then events A and B are said to be independent; in symbols, P(A | B) = P(A). This rule basically says that knowing B tells us nothing about A. Note that we use this vertical line notation to mean "given", so P(A | B) is the probability of A given B. So let's put that rule to use real quick.
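As a brief aside before the example, the card-drawing probabilities above can be checked with exact fractions. This is only a minimal sketch, assuming a standard 52-card deck with four aces and four jacks; the variable names are illustrative and not part of the lecture.

```python
from fractions import Fraction

# Assumption: a standard 52-card deck with 4 aces and 4 jacks.
DECK_SIZE = 52
ACES = 4
JACKS = 4

# Probability that a single card drawn from a full deck is an ace.
p_ace = Fraction(ACES, DECK_SIZE)

# After one ace is removed (no replacement), 51 cards remain:
# 3 of them are aces and all 4 jacks are still there.
p_ace_given_first_ace = Fraction(ACES - 1, DECK_SIZE - 1)
p_jack_given_first_ace = Fraction(JACKS, DECK_SIZE - 1)

# The draws are dependent because the conditional probability
# differs from the unconditional one: P(A | B) != P(A).
draws_are_dependent = p_ace_given_first_ace != p_ace

print(p_ace_given_first_ace, p_jack_given_first_ace, draws_are_dependent)
```

Using `Fraction` keeps the values exact, so the comparison between the conditional and unconditional probabilities is not affected by floating-point rounding.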
In 2013, SurveyUSA interviewed a random sample of 500 North Carolina residents, asking them whether they think widespread gun ownership protects law-abiding citizens from crime or makes society more dangerous. 58% of all respondents said it protects citizens. 67% of white respondents, 28% of black respondents, and 64% of Hispanic respondents shared this view. Based on these, we want to fill in the blank in the following sentence: opinion on gun ownership and race/ethnicity are most likely which of the following? Complementary, mutually exclusive, independent, dependent, or disjoint. These should all be terms that you're familiar with by now.

Let's take a look at what we're given. We're given that the probability that a randomly chosen resident believes that guns protect citizens is 0.58. We also know that if the resident is white, then this probability is 0.67. Once again, we use the vertical line notation to say the probability that somebody believes that guns protect citizens, given that they're white, and that probability is 0.67. If they're black, the probability is 0.28, and lastly, if the resident is Hispanic, the probability that they believe that guns protect citizens is 0.64. Since the probabilities of thinking that guns protect citizens vary greatly based on the person's race or ethnicity, opinion on gun ownership and race/ethnicity are most likely dependent. So knowing somebody's ethnicity might actually give us useful information about their opinion on guns, and therefore we are saying that the two variables are most likely dependent on each other.

We've been using wording like "most likely dependent" since we're working with sample data, and we're not yet using statistical inference tools that allow us to take the results that we get from our sample and extend them to the population at large. If we observe a difference between the conditional probabilities that we calculate based on the sample, we say that these data suggest dependence.
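The comparison in this example can be sketched in a few lines of Python. The probabilities are the ones quoted from the survey; the function name `suggests_dependence` and the 0.05 tolerance are assumptions made purely for illustration, and the check is not a substitute for a formal hypothesis test.

```python
def suggests_dependence(p_marginal, p_conditionals, tolerance=0.05):
    """Return True if any conditional probability differs from the
    marginal probability by more than `tolerance`.

    `tolerance` is a hypothetical cutoff chosen for illustration only;
    it is not a statistical test.
    """
    return any(abs(p - p_marginal) > tolerance for p in p_conditionals.values())


# P(protects citizens) across all respondents.
p_protects = 0.58

# P(protects citizens | race/ethnicity), as reported in the example.
p_protects_given = {"white": 0.67, "black": 0.28, "Hispanic": 0.64}

# Difference between each conditional probability and the marginal one.
gaps = {group: round(p - p_protects, 2) for group, p in p_protects_given.items()}

print(gaps)
print(suggests_dependence(p_protects, p_protects_given))
```

Under independence, every conditional probability would equal 0.58, so all the gaps would be close to zero. Here they are not, which is why the data suggest dependence.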
The next natural step would then be to actually conduct a hypothesis test, to see whether the differences that we observed could have just happened due to chance, or natural random sampling, or whether there's actually a real difference in the population. We've done a little bit of that at the end of the last unit, and we're going to get back to doing that in the next unit as well. But for now, we're picking up building blocks to get us there.

However, before we get there, we can actually do a little bit of speculating based on the magnitude of the differences that we observe as well as the sample size. For example, take the conditional probabilities we were just looking at: the probability of believing that guns protect citizens, given that somebody is white, versus given that they're black, versus given that they're Hispanic. If these conditional probabilities vary greatly, in other words, if the differences are large, then there is stronger evidence that the difference is real, and that we would see something similar had we had data from the entire population as well. On the other hand, if the sample size is large, even small differences in conditional probabilities might provide strong evidence of a real difference.

Now that we know how to check for independence, let's see what we can do with events once we find out that they're independent. The product rule for independent events says that if A and B are independent, then the probability of A and B happening is simply the product of their probabilities: P(A and B) = P(A) × P(B).

Say you toss a coin twice. What is the probability of getting two tails in a row? Sounds pretty simple, eh? The probability of two tails in a row is simply going to be the probability of a tail on the first toss times the probability of a tail on the second toss. We've talked before about how coin tosses are independent of each other.
Therefore, we are able to apply this rule that we've just learned. The probability of a tail on either toss is simply 0.5, or 1 over 2, so the overall probability is going to be a quarter, or 25%.

A quick note: this rule isn't limited to just two events, and it can actually be extended to as many independent events as you need. So if, instead of doing two coin tosses, we had a hundred of them, we could simply multiply a hundred of the same probabilities together. Generically said, if A1, A2, all the way through Ak are independent, then the probability of all of these events happening at once is simply going to be the product of the individual probabilities of the events.

Let's put what we just learned to use with some real data. A 2012 Gallup poll suggests that West Virginia has the highest obesity rate among US states, with 33.5% of West Virginians being obese. Assuming that the obesity rate stays constant, what is the probability that two randomly selected West Virginians are both obese? We're given that 33.5% of West Virginians are obese, which we can denote as the probability of being obese being 0.335. It's often useful to make a list of the givens in the problem, as we have been doing in the past couple of examples. This helps to keep everything neat and organized, and it makes it easier for you to refer back to these values when you need them later in your calculations.

We're told that the two individuals are randomly selected, which means that they're going to be independent of each other with respect to their obesity status. By contrast, if we picked two people from the same household and one is obese, the other one might be more likely to be obese as well, given that people who live in the same household are more likely to have shared eating and exercising habits. However, since we're randomly selecting these individuals, we can say that they're independent.
And since the two are independent, the probability of both of them being obese will simply be the probability of the first one being obese times the probability of the second one being obese, each of which is 0.335, resulting in about an 11% chance of two randomly selected West Virginians both being obese.

This value, the 11% probability of both of these people being obese, is less than the probability of either one of them being obese, which makes sense for two reasons. Mathematically speaking, we're multiplying two values between zero and one, so the product will necessarily be a value lower than either one of them. And conceptually, we want to find two people who fit a certain criterion at the same time. Therefore, the likelihood of getting what we want should be lower than the likelihood of finding just one person who fits that criterion.

Reasoning through the final numerical answer this way is often useful. It helps us really understand why the formulas that we're using work the way they do, without getting into theoretical proofs. And it's also useful for checking the final numerical answer in the context of the data that you're working with. In other words, it's really a good way to check your work.
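The multiplication rule for independent events used in this video can be sketched in Python as follows. The helper name `prob_all_independent` is an assumption for illustration; the numbers are the ones from the coin-toss and West Virginia examples, and the code relies on the events being independent, as the video assumes.

```python
import math


def prob_all_independent(probabilities):
    """Multiplication rule: P(A1 and A2 and ... and Ak) for independent
    events is the product of the individual probabilities."""
    return math.prod(probabilities)


# Two tails in a row with a fair coin.
p_two_tails = prob_all_independent([0.5, 0.5])  # 0.25

# The rule extends to any number of independent events,
# for example a hundred tails in a row.
p_hundred_tails = prob_all_independent([0.5] * 100)

# Two randomly selected West Virginians both being obese.
p_obese = 0.335
p_both_obese = prob_all_independent([p_obese, p_obese])  # about 0.112

# Sanity check from the video: the joint probability is smaller
# than the probability for either individual alone.
joint_is_smaller = p_both_obese < p_obese

print(p_two_tails, p_both_obese, joint_is_smaller)
```

The final comparison mirrors the reasoning at the end of the video: multiplying two values between zero and one always gives a result smaller than either of them.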