This course introduces you to sampling and exploring data, as well as basic probability theory and Bayes' rule. You will examine various types of sampling methods, and discuss how such methods can impact the scope of inference. A variety of exploratory data analysis techniques will be covered, including numeric summary statistics and basic data visualization. You will be guided through installing and using R and RStudio (free statistical software), and will use this software for lab exercises and a final project. The concepts and techniques in this course will serve as building blocks for the inference and modeling courses in the Specialization.

HD

The tutor makes it really simple. The given examples really helped to understand the concepts and apply it to a wide range of problems. Thank you for this. Wish I could complete the assignments too.

SS

Jul 27, 2017

Rating: 5 out of 5 stars

Great course! Explained the concepts so clear and crisp and the exercises with R are great. The project reinforces all the concepts. All in all, a great course for beginners in statistics and R.

From the lesson

Introduction to Probability

Welcome to Week 3 of Introduction to Probability and Data! Last week we explored numerical and categorical data. This week we will discuss probability, conditional probability, and Bayes' theorem, and provide a light introduction to Bayesian inference. Thank you for your enthusiasm and participation, and have a great week! I'm looking forward to working with you on the rest of this course.

Instructor:

Mine Çetinkaya-Rundel

Associate Professor of the Practice

Transcript

In this video, we will learn to use probability trees to solve for conditional probabilities, highlighting that they're especially useful when the probability we're asked for is the reverse of what we're given. Let's start with a simple example, and then we'll work our way to more involved situations.
You have 100 emails in your inbox. 60 of them are spam, and 40 are not. Of the 60 spam emails, 35 contain the word "free". Of the rest, only three contain the word "free". If an email contains the word "free", what is the probability that it is spam?
So what we want to do first is to organize this information into a probability tree. We're going to start by dividing our population (our inbox, in this case) into two, based on whether the email is spam or not spam. So we have 60 emails that are spam, and 40 emails that are not spam. Now that we've done this branching, we can branch out further and list how many of the spam emails have the word "free" in them and how many do not, and likewise for the non-spam emails. Of the 60 spam emails, 35 have the word "free" in them, and the remaining 25 do not. And of the non-spam emails, only three have the word "free" in them, and 37 do not.
Now that we have organized the information we're given into a probability tree, what we want to do next is go back to the question and figure out exactly what we're being asked for. The question is: if an email contains the word "free", what is the probability that it is spam? We know that the email contains the word "free", so that's going to be our given, and we're asked for the probability that it's spam. So we can denote this as the probability of spam given that the word "free" is in the email. Since we're saying that we know the word "free" is in the email, we're basically saying we can ignore the rest of the emails. So first what we want to do is figure out how many emails in total have the word "free" in them.
35 of them come from the spam folder and three of them come from the non-spam folder, for a total of 38, and of these, only 35 are of interest to us because those are the spam emails. So 35 out of 38 gives us roughly 92%. Here we've implicitly made use of Bayes' theorem. What we have in the numerator is the joint probability, spam and "free", and what we have in the denominator is the marginal probability of what we're conditioning on, "free". Except that instead of working with probabilities in this case, to keep things simple we've worked with counts. So what we're going to do next is move on to a situation where we're working with probabilities from the get-go, and we don't know the sample size or the population size that we're dealing with.
Swaziland has the highest HIV prevalence in the world. 25.9% of this country's population is infected with HIV. The ELISA test is one of the first and most accurate tests for HIV. For those who carry HIV, the ELISA test is 99.7% accurate. For those who do not carry HIV, the test is 92.6% accurate. Note, by the way, that these probabilities are estimates. If an individual from Swaziland has tested positive, what is the probability that he carries HIV?
So, we're told that 25.9% of this country's population is infected with HIV. So the probability of having HIV is 0.259. We also know something about the accuracy of the test, which seems to vary depending on whether the person has HIV or not. This is very common for medical tests. They tend to have different accuracy rates depending on whether the patient has the disease or does not have the disease. The statement "for those who carry HIV, the ELISA test is 99.7% accurate" refers to the probability of testing positive, because that's what an accurate result would be if a person has HIV. So the probability of positive given HIV is 0.997.
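As an aside on the inbox example above, the count-based calculation can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and variable names are assumptions made for this sketch, not part of the lecture, and the counts are the ones quoted for the 100 emails.

```python
# Counts from the inbox example; the dictionary layout is an illustrative choice.
counts = {
    ("spam", "free"): 35,
    ("spam", "no free"): 25,
    ("not spam", "free"): 3,
    ("not spam", "no free"): 37,
}

# Marginal count: every email containing "free", regardless of spam status.
n_free = sum(n for (_, word), n in counts.items() if word == "free")

# Joint count: emails that are both spam and contain "free".
n_spam_and_free = counts[("spam", "free")]

# Conditional probability P(spam | free) = joint / marginal.
p_spam_given_free = n_spam_and_free / n_free

print(f"P(spam | free) = {n_spam_and_free}/{n_free} = {p_spam_given_free:.3f}")
```

Dividing both counts by 100 would turn them into the joint and marginal probabilities, and the ratio would stay the same, which is why working with counts was enough here.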
The statement "for those who do not carry HIV, the test is 92.6% accurate" refers to the probability of testing negative, because that's what accurate would mean in this case. So the probability of negative given that the patient does not have HIV is 0.926.
The question says: if an individual from Swaziland has tested positive, what is the probability that he carries HIV? So what we know is that the person tested positive. We're looking to see what the probability is that they have HIV. What we can see here is that we have a situation where we're asked for a conditional probability, and the condition has been reversed from one of the things that we are given, so we should really think about a tree diagram in this case. Tree diagrams tend to be the most effective method for getting to the answer. There are definitely other ways that you can solve this problem and organize the information that's given to you. But a tree diagram tends to be one where you can really efficiently and effectively organize the information that you're given. And you're going to get to the right answer if you do it the right way.
So, the first branch in the tree is always made up of marginal probabilities, since we're dividing up our population without conditioning on any other attributes. Some people in the population have HIV. That's the top branch. And others don't. That's the bottom branch. So the probability of having HIV, as we saw, was 0.259 in Swaziland. And the probability of not having HIV is the complement of that: 1 minus 0.259 is going to give us 0.741. So about 74.1% of the population in Swaziland does not have HIV. Note that probabilities on a set of branches always add up to 1.
Next, we move on to conditional probabilities. Let's start with the part of the population who have HIV, so we're going to be working with the top branch here. When these people take the test, they may get a positive or a negative result, because the test isn't 100% accurate.
Therefore, we divide up the HIV population into two: those who test positive, and those who test negative. Based on the information on the test that we were provided earlier, we know that the probability of testing positive if someone has HIV is 0.997. Then the probability of testing negative if someone has HIV, which would be a false negative, is the complement of that, 0.003. Similarly, among those who don't have HIV, some still test positive, and some test negative. The probability of accurately testing negative if the patient doesn't have HIV is 0.926. And the probability of a false positive, that is, testing positive even though the patient does not have HIV, is the complement of that, 0.074.
Remember, our goal is to find the probability of having HIV, given that the patient has tested positive, which based on Bayes' theorem should be the probability of HIV and positive divided by the probability of testing positive. Remember, the numerator is always the joint probability, and the denominator is the marginal probability of what we're conditioning on. So far, we don't have the building blocks we need to calculate the probability that we're interested in.
To get the joint probabilities, like the one in the numerator, using the probability tree, all we need to do is multiply across the branches. This is why a probability tree is useful: it organizes the information for you in a way where you no longer have to think, "What should I multiply with what?" All you need to do is carry along the branches and pick up the building blocks along the way. We start with the marginal probability of having HIV and we multiply it by the probability of testing positive, given that the patient has HIV. So I'm following the very top branch here, which is going to yield the joint probability of having HIV and testing positive. So what we get is 0.259 from the first branch times 0.997 from the second branch, which gives us 0.2582.
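The "multiply across the branches" step can be sketched in Python as follows. This is a minimal illustration rather than anything shown in the lecture: the nested-dictionary representation of the tree and all variable names are assumptions made for this sketch, and the three input probabilities are the estimates quoted above.

```python
# Branch probabilities from the ELISA example (estimates quoted in the lecture).
p_hiv = 0.259               # first-level (marginal) branch
p_pos_given_hiv = 0.997     # test accuracy among those who carry HIV
p_neg_given_no_hiv = 0.926  # test accuracy among those who do not

# Probabilities on a set of branches add up to 1, so the rest are complements.
first_level = {"HIV": p_hiv, "no HIV": 1 - p_hiv}
second_level = {
    "HIV": {"+": p_pos_given_hiv, "-": 1 - p_pos_given_hiv},
    "no HIV": {"+": 1 - p_neg_given_no_hiv, "-": p_neg_given_no_hiv},
}

# Joint probability of a leaf = product of the probabilities along its path.
joint = {
    (status, result): first_level[status] * p_result
    for status, branches in second_level.items()
    for result, p_result in branches.items()
}

for (status, result), p in joint.items():
    print(f"P({status} and {result}) = {p:.4f}")
```

Because the four leaves cover every possible combination of HIV status and test result, their joint probabilities should sum to 1, which is a quick sanity check on any tree built this way.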
So, there's a 25.82% chance that a randomly drawn person from the Swaziland population has HIV and tests positive. Similarly, the probability of HIV and negative is going to be the probability of HIV, 0.259, times the probability of negative given HIV, 0.003. That's a really tiny probability, 0.0008. We can keep going and calculate similar probabilities for the lower branch, the no-HIV population, as well. The probability of no HIV and positive comes out to be 5.48%, and the probability of no HIV and negative comes out to be 68.62%.
We've done a bunch of calculations so far, but let's go back to the task at hand. We're only interested in those who test positive, because that's what our given is. And among these, we're especially interested in those who actually have HIV. So, the probability of HIV and positive is 0.2582; that's the numerator, the joint probability. And the denominator is made up of the two segments of the population who test positive: a person can test positive because they have HIV, or even though they don't have HIV. Since we're saying "or" and these are disjoint outcomes, to get the overall probability of testing positive we add the two probabilities: 0.2582 plus 0.0548 gives 0.3130. Dividing 0.2582 by 0.3130, the result comes out to roughly 0.82.
So, to recap, we were asked: if an individual from Swaziland has tested positive, what is the probability that he carries HIV? And the result we found was that the probability of HIV given positive is 0.82. What this means is that there is an 82% chance that an individual from Swaziland who tested positive actually has HIV.
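The whole calculation can also be condensed into one small function, sketched below in Python. The function name and its parameter names are hypothetical choices made for this sketch. "Sensitivity" and "specificity" are the standard terms for what the lecture calls the test's accuracy among those who carry HIV and among those who do not; the lecture itself does not use them.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """Return P(condition | positive test) using Bayes' theorem.

    prevalence  -- marginal probability of having the condition
    sensitivity -- P(positive | condition)
    specificity -- P(negative | no condition)
    """
    # Numerator: joint probability of having the condition and testing positive.
    true_positive = prevalence * sensitivity
    # Joint probability of not having the condition and still testing positive.
    false_positive = (1 - prevalence) * (1 - specificity)
    # Denominator: marginal probability of testing positive (disjoint outcomes add).
    p_positive = true_positive + false_positive
    return true_positive / p_positive


# Estimates quoted in the lecture for Swaziland and the ELISA test.
p = posterior_given_positive(prevalence=0.259, sensitivity=0.997, specificity=0.926)
print(f"P(HIV | positive) = {p:.2f}")
```

Writing it as a function makes it easy to see how strongly the answer depends on the first branch of the tree: with the same test but a much lower prevalence, the false-positive branch would carry more of the weight in the denominator and the resulting probability would drop.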