Statistics Introduction: Chapter 2

Probability Density Functions

Introduction to Probability Density Functions

There are many statistical distributions. In this chapter, we will examine several industrially important distributions, but our first example is purely educational and has already been presented: the two-dice example of Chapter 1. Each of the industrially important distributions has at least one Probability Density Function (PDF) similar to Ch 1 Figure 1. Some distributions have an entire family of PDFs.

Please remember that a PDF is a pictorial representation of a statistical distribution and provides clues on how to get useful answers. A PDF is thus a kind of map. Actual answers are cranked out using math methods presented later, but knowing how to do the math and what the answers mean is difficult if you don't understand the PDF.

Characteristics of a PDF:
A Probability Density Function (or PDF) is a graph that describes the likelihood of different outcomes for a chance event. Consider flipping 20 coins. You would expect about 10 heads (and 10 tails). But you would not be surprised if the result were 12 heads on Tuesday or 9 on Friday. The PDF for coin flipping describes the likelihood of both these results, and it also shows a very small likelihood (almost zero) of getting only 2 or 3 heads. The PDF is presented below:

Ch 2 Figure 1
Probabilities of 'Heads' Out of 20 Coin Trials

The coin flipping PDF shows all possible outcomes (which of course total 100%). For a PDF, the area under the curve represents probability (don't look at the Y value). From Ch 2 Figure 1, we can surmise that the probability of getting 9, 10 or 11 heads is about 49.7% (add the three center bar areas). The probability of getting 0 or 1 heads is remote (almost zero percent).
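
The bar areas in a figure like Ch 2 Figure 1 can be reproduced with a few lines of Python. This is only a sketch, assuming a fair coin (50% chance of heads) and the standard binomial counting formula; the function name heads_probability is made up for this illustration.

```python
from math import comb

def heads_probability(k, n=20, p=0.5):
    """Probability of exactly k heads in n flips (each flip has chance p of heads)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# One value per bar of the PDF, for 0 through 20 heads.
# Each bar is one unit wide, so its area equals this value.
bars = [heads_probability(k) for k in range(21)]

print(f"All bars together: {sum(bars):.4f}")
print(f"9, 10 or 11 heads: {sum(bars[9:12]):.4f}")
print(f"0 or 1 heads:      {sum(bars[0:2]):.6f}")
```

Summing the unrounded bars for 9, 10 and 11 heads gives roughly 0.4966. Adding bar values that were first rounded to one decimal place gives a slightly lower total.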

PDFs for all distributions are interpreted in the same manner. Please note that on any given day, the observed results of a random event may be different, but on most days you will observe results where the large areas indicate a high probability of occurrence.
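
The day-to-day variation described above can be imitated with a small simulation. The sketch below is an illustration only: it assumes a fair coin, uses Python's built-in random number generator with a fixed seed, and the number of simulated "days" (10,000) is an arbitrary choice.

```python
import random

random.seed(1)  # fixed seed so the sketch gives the same result on every run

def flip_20_coins():
    """Return the number of heads seen in 20 simulated fair coin flips."""
    return sum(random.random() < 0.5 for _ in range(20))

days = 10_000  # hypothetical number of "days" to simulate
results = [flip_20_coins() for _ in range(days)]

# Share of days landing in the tall center bars versus the far left tail
near_center = sum(1 for heads in results if 9 <= heads <= 11) / days
very_few = sum(1 for heads in results if heads <= 3) / days

print("First week of results:", results[:7])
print(f"Share of days with 9, 10 or 11 heads: {near_center:.3f}")
print(f"Share of days with 3 or fewer heads:  {very_few:.4f}")
```

Individual days differ, but over many days the share of results falling in the large central bars settles near the area those bars occupy in the PDF.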

The Binomial Distribution PDF

The binomial distribution is widely used by industry to solve a range of problems. The following are real industrial problems similar to ones I dealt with as an engineer that can be solved via the binomial distribution. I have also included an example showing how statistics can be used to gain historical insight or to facilitate computer game modeling.

An aircraft flight control system is controlled by four computers, each performing the same calculation. Each computer checks its answers against those of its brothers. The probability of any computer failing during a flight is 0.01 percent. The aircraft can fly safely with only two operational computers. What is the probability of an aircraft having more than 2 failures during a flight?

A WWII battleship could survive four torpedo hits without sinking. If 10 torpedoes are fired, each with a 35% chance of hitting, how likely is it that the ship will be hit five or more times?

Johnny's class has 15 students. School records show that about 10% of students fail the class. How likely is it that exactly four students will fail?

A bolt factory produces batches of 12 specialty bolts. Over the last year, about 10% of bolts were rejected by Quality Control. Faced with four failures from a single batch, the plant manager wants to know, "Is this likely, or should I investigate what has gone wrong with the process?"
OPTIONAL: Forensic Statistics Use - The Negative Case
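
As a preview, the four example problems above can be set up with the standard binomial formula. The sketch below is not the method of any later chapter, just one way to get numbers. It assumes the trials are independent and that the stated probabilities are exact; the helper names binomial_pmf and at_least are invented for this example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def at_least(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))

# Aircraft: 4 computers, 0.01 percent failure chance each, more than 2 failures
aircraft = at_least(3, n=4, p=0.0001)

# Battleship: 10 torpedoes, 35% hit chance each, five or more hits
battleship = at_least(5, n=10, p=0.35)

# Johnny's class: 15 students, 10% failure rate, exactly four failures
classroom = binomial_pmf(4, n=15, p=0.10)

# Bolt factory: batch of 12, 10% reject rate.
# Assumption: the manager's question is read as "four or more rejects".
bolts = at_least(4, n=12, p=0.10)

print(f"Aircraft, more than 2 failures: {aircraft:.3e}")
print(f"Battleship, 5 or more hits:     {battleship:.4f}")
print(f"Class, exactly 4 failures:      {classroom:.4f}")
print(f"Bolts, 4 or more rejects:       {bolts:.4f}")
```

In each case the "success" being counted is the event of interest (a computer failure, a torpedo hit, a failing student, a rejected bolt), even when that event is bad news.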

Criteria for Using the Binomial Distribution
When the following four conditions are ALL true, the Binomial Distribution can be confidently used.

The number of trials is known and (usually) less than 30 (e.g. 4 computers, 10 torpedoes, 12 bolts etc.)

Each trial has a binary result (i.e. a torpedo hits or does not hit, a student fails or does not fail, a student eats liver or does not)

The probability of 'success' is known (e.g. 0.15%, 35%, 50% etc.)

The success of each trial is independent of the other trials. This means that after a torpedo hit, the next torpedo is neither more nor less likely to hit.

Homework: Using the four criteria above:

Decide whether or not the binomial distribution can be used to analyze the toss of 20 coins that was discussed above (yes or no)

Write a brief list of steps that justify your answer.

LINK Check your answer here

How Many PDFs does the Binomial Have?
The following paragraph will begin to familiarize the student with the trends of the binomial distribution. Please read the material, try to make sense of it and move on. Memorization of trends is not required.

The Binomial is actually a family of PDFs. A binomial PDF is mathematically generated from the probability of success p and the number of trials n. Thus, each different (n, p) pair will generate a different PDF, but all the PDFs do look rather similar as shown below:

Binomial PDFs - Variation of Parameters n,p
Ch 2 Figure 2

The top row of the figure above shows how the PDF changes when n=4 and p (probability of success) varies from 20% to 50% and finally to 65%. The second row presents the same data for n=8 and p varying through the same range of values. The following general observations can be made:

For the binomial family of PDFs, the general shape is similar. The PDF in all cases is much like the cross section of a bell, but it is sometimes distorted.

When n is larger, the PDF remains the same general shape but has more bars and becomes smoother.

The value of p determines the distortion of the "bell" shape. p determines if the PDF is skewed left, skewed right or symmetric.

The binomial distribution is discrete. That means the X axis consists of counting numbers (0, 1, 2, etc.)

If p is less than 50%, the bulk of the PDF sits at the left, meaning that successes are concentrated at the "few" end of the X axis. For the top left graph, since the probability of success is only 20%, we should expect this. (Statisticians call this shape right-skewed, because the long tail points to the right.)

If p is greater than 50%, the bulk of the PDF sits at the right, meaning that successes are concentrated at the "many" end of the X axis. A large value of p means "success" will occur more often than not, so we should expect larger percentages at the right end of the graph. (This shape is called left-skewed, because the long tail points to the left.)
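
The observations above can be explored by generating the PDFs directly. The following sketch uses the standard binomial formula to build a rough text version of each panel described for Ch 2 Figure 2 (n of 4 and 8, p of 20%, 50% and 65%). The function names and the '#' bar chart are inventions for this illustration, not the way the original figure was drawn.

```python
from math import comb

def binomial_pdf(n, p):
    """List of probabilities for 0, 1, ..., n successes."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def text_chart(n, p, width=40):
    """Build a crude horizontal bar chart of the PDF using '#' characters."""
    lines = [f"n={n}, p={p:.0%}"]
    for k, prob in enumerate(binomial_pdf(n, p)):
        lines.append(f"{k:2d} | {'#' * round(prob * width)} {prob:.3f}")
    return "\n".join(lines)

# The six (n, p) pairs described for Ch 2 Figure 2
for n in (4, 8):
    for p in (0.20, 0.50, 0.65):
        print(text_chart(n, p))
        print()
```

Changing n or p in the loop and rerunning is a quick way to see how the bars pile up toward the "few" end or the "many" end.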

The Gaussian Distribution PDF

The Gaussian distribution is the most widely used statistical distribution in existence. It is typically the first statistical analysis attempted by engineers and scientists. When it doesn't work satisfactorily, they look at other distributions. The Gaussian distribution is elegant and has well defined procedures that can solve a wide range of problems.

Very often, the Gaussian is simply "presumed" to be appropriate and is applied to data. Then, engineers and scientists do a reasonableness check before proceeding further. Below are three examples of phenomena where the Gaussian distribution applies and two examples where it does not work well.

IQ tests were used on a large scale during WWI to help place soldiers into appropriate jobs. The IQ scores of a large sample of people are known to follow a Gaussian distribution.

When production lines manufacture parts (bolts, brackets, pistons, bearing rings etc.) it is routinely assumed that every dimension on the manufactured part will statistically vary according to a Gaussian distribution. This assumption has proven good over time and is the basis for statistical process control.

A botanist's first guess would be that the size of the leaves from a certain oak tree varies in a Gaussian fashion.

Counter Example: For many products, it has been noticed that failures of "new" items are common. But if the item survives the first few months, it is good for years. This kind of failure is called "infant mortality" and is usually modeled with TBD.

Counter Example: The wear-out life of a jackhammer piston is likely not Gaussian. Wear-out is the result of friction and metal fatigue. These kinds of phenomena are typically fitted to a Weibull Distribution.

We just looked at some populations which are Gaussian and some that are not. Below are two clearly stated examples of industrial problems that can be solved using the Gaussian Distribution and methods that will be presented in Chapter 4. Chapter 4 will also explain the technical terms "mean" and "standard deviation", and will show you how to compute the mean and standard deviation for a sample of parts coming off a production line.

Percentage of Bad Product: A bolt factory makes foundation bolts that are supposed to comply with industry standard ASTM F1554 Grade 36 which requires a minimum strength of 36 thousand pounds per square inch (36 KSI). Trial runs of the production line show an average bolt strength of 45 KSI with a standard deviation of 8 KSI. If adjustments are not made, what percentage of bolts produced will fall below the required strength?

An armaments manufacturer makes cannon shells. The amount of propellant must be controlled carefully so that all shells are accurate. For a small howitzer shell, the production line can provide 8.1 pounds of propellant (the mean) with a standard deviation of 0.2 pounds. The specification requirement is that propellant must be between 8.0 and 8.2 pounds. What percentage of production shells will fail to meet the requirement?

Comment: The mean is another way of saying the average value. The standard deviation will be explained in Chapter 4. The pair of values (mean, standard deviation) is the key to defining the Gaussian distribution for a particular product (i.e. for a population of produced items).
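
For readers who want to see numbers before Chapter 4, the two problems above can be sketched with the NormalDist class from Python's standard statistics module. This is just one convenient tool, not necessarily the procedure Chapter 4 will teach, and it assumes the strength and propellant weight really do follow a Gaussian distribution with the stated mean and standard deviation.

```python
from statistics import NormalDist

# Foundation bolts: mean 45 KSI, standard deviation 8 KSI, minimum 36 KSI
bolt_strength = NormalDist(mu=45, sigma=8)
weak_bolts = bolt_strength.cdf(36)  # area under the PDF to the left of 36 KSI

# Howitzer shells: mean 8.1 lb, standard deviation 0.2 lb, limits 8.0 to 8.2 lb
propellant = NormalDist(mu=8.1, sigma=0.2)
in_spec = propellant.cdf(8.2) - propellant.cdf(8.0)  # area between the limits
out_of_spec = 1 - in_spec

print(f"Bolts below 36 KSI:         {weak_bolts:.1%}")
print(f"Shells outside 8.0-8.2 lb:  {out_of_spec:.1%}")
```

Both answers are simply areas under the relevant Gaussian PDF: the area left of the minimum strength for the bolts, and the area outside the two specification limits for the shells.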

How Many PDFs does the Gaussian Have?
The following paragraph will begin to familiarize the student with the trends of the Gaussian distribution. Please read the material, try to make sense of it. The student should read and re-read the material until the general trends of mean and standard deviation are committed to memory.

The Gaussian Distribution is actually a family of PDFs. A particular Gaussian PDF is mathematically generated, based on the average value (mean), and the standard deviation (which we will explain mathematically in Chapter 4). Thus, each unique pair of parameters (mean,StdDev) can be used to generate a corresponding Gaussian PDF; and yet, all the PDFs do look rather similar as shown below:

Ch 2 Figure 3

The Gaussian Distribution has a family of PDF curves, each defined by its mean and standard deviation. The figure above (Ch 2 Figure 3) shows how the PDF changes as the mean increases. Basically, the Gaussian PDF shifts to the right so that its center aligns with larger and larger values of the mean. This concept proves very useful because on Tuesday our bolt factory may be making 36 KSI foundation bolts, but on Friday we may be making high strength 150 KSI aircraft bolts. We can use the Gaussian PDF model for both by simply adjusting the value of the mean!

The Gaussian Distribution has a second parameter: the standard deviation, which we will study in Chapter 4. Looking at the lower part of Ch 2 Figure 3, we can see the effect of varying the standard deviation. The lower left graph has a standard deviation of 2.5 and is narrow. The total area under the curve is still 100%, but all the area is fairly close to the central (mean) value. In practice, this means that when such bolts are sent to build foundations, the strength values of all will be pretty close to the mean value (because all the PDF area is close to the mean value).

Examining the lower middle graph, the area is still 100%, but the graph is wider and shorter because the standard deviation has increased. This means that a process described by this kind of Gaussian PDF will send bolts with considerable strength variation into the buildings and airplanes they are intended for (not so good). The lower right graph is wider still, indicating large variation in the properties of the bolt or other product it describes. In general, industrial processes with a large standard deviation are not a good thing.

We will study the significance of varying the mean and/or standard deviation in Chapter 4. For now, the student should understand that the Gaussian PDF can describe items with low or high values of strength, ductility, or penguin population. The Gaussian PDF is also capable of describing situations where there is little or much variation in the observed quantity, be it strength or penguin population.

The X axis is a full number line with decimal fractions (It certainly has numbers like 1, 2, 3 ... but it also has everything in-between. Unlike the Binomial, it has numbers like 2.8, 5.325 and all that).

Unlike the Binomial, it is not a discrete distribution. As a result, the area under the curve is probability. The value on the Y axis has little significance.

For your information, the Gaussian PDF is an equation. The illustrations herein were created by using that equation.

Chapter 4 will explain the math of the Gaussian, and how to use it to solve industrial and historical problems.
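
To make the trends above concrete, here is a minimal sketch that evaluates the Gaussian PDF equation in its standard textbook form and measures areas numerically. The mean of 45 and the standard deviations of 5 and 10 are made-up comparison values (only the 2.5 comes from the figure discussion), and the simple midpoint-rule area function is a stand-in for the methods of Chapter 4.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, std_dev):
    """Height of the Gaussian curve at x (standard bell-curve equation)."""
    z = (x - mean) / std_dev
    return exp(-0.5 * z * z) / (std_dev * sqrt(2 * pi))

def area(lo, hi, mean, std_dev, steps=20_000):
    """Approximate area under the curve between lo and hi (midpoint rule)."""
    dx = (hi - lo) / steps
    heights = (gaussian_pdf(lo + (i + 0.5) * dx, mean, std_dev) for i in range(steps))
    return sum(heights) * dx

mean = 45.0  # hypothetical mean, e.g. bolt strength in KSI
for std_dev in (2.5, 5.0, 10.0):  # 2.5 is from the text; 5 and 10 are assumptions
    peak = gaussian_pdf(mean, mean, std_dev)
    total = area(mean - 8 * std_dev, mean + 8 * std_dev, mean, std_dev)
    close = area(mean - 5, mean + 5, mean, std_dev)
    print(f"std dev {std_dev:4.1f}: peak height {peak:.4f}, "
          f"total area {total:.4f}, area within 5 of mean {close:.4f}")
```

As the standard deviation grows, the peak height drops and less of the area lies close to the mean, while the total area stays at 100%. This is why the Y axis value by itself says little and the area is what counts.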

The Weibull Distribution PDF

The Weibull distribution is a general purpose set of math that can be applied to a wide range of problems that includes:

The size of particles in a smoke stack

The height of men in the United Kingdom

The wear-out life of front brake pads on a Honda Accord

The fatigue life of solder joints inside a computer

The failure life of capacitors inside a computer

etc.

The Weibull Distribution was invented in 1920?? by TBD but "lived like a recluse" until 1951 when Swedish engineer Waloddi Weibull published a short paper explaining its usefulness and the wide variety of problems it can handle. Like the Binomial and the Gaussian distributions, the Weibull distribution has well established procedures for use and can handle a wide variety of problems, but it is generally more difficult to use than either of them. The Weibull has become the "darling" of the reliability world because it handles failure and wear-out problems very well (i.e. its mathematical form corresponds closely to a wide range of equipment failure data).

Like the Gaussian Distribution, the Weibull PDF is generated by an equation which accepts any kind of decimal number (3.2, 5.431, 176, Pi etc.). As a result, it is a continuous distribution (not discrete), and areas under the PDF are equivalent to probability. The Y axis values have little or no meaning.
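
This chapter does not give the Weibull equation, so the sketch below assumes the commonly used two-parameter form with a "shape" and a "scale" parameter. Those parameter names, the example values (shape 2, scale 50,000 miles for a hypothetical brake pad) and the helper functions are all assumptions made for illustration. The point is only to show that probability comes from area under the PDF, not from the Y axis value.

```python
from math import exp

def weibull_pdf(x, shape, scale):
    """Height of the two-parameter Weibull curve at x (only positive x is used here)."""
    if x <= 0:
        return 0.0
    return (shape / scale) * (x / scale) ** (shape - 1) * exp(-((x / scale) ** shape))

def weibull_cdf(x, shape, scale):
    """Closed-form area under the PDF from 0 to x."""
    return 1 - exp(-((x / scale) ** shape)) if x > 0 else 0.0

def area(lo, hi, shape, scale, steps=20_000):
    """Approximate area under the PDF between lo and hi (midpoint rule)."""
    dx = (hi - lo) / steps
    heights = (weibull_pdf(lo + (i + 0.5) * dx, shape, scale) for i in range(steps))
    return sum(heights) * dx

shape, scale = 2.0, 50_000.0  # hypothetical brake pad wear-out parameters (miles)
numeric = area(0, 30_000, shape, scale)
exact = weibull_cdf(30_000, shape, scale)

print(f"PDF height at 30,000 miles:       {weibull_pdf(30_000, shape, scale):.2e}")
print(f"Area from 0 to 30,000 (numeric):  {numeric:.4f}")
print(f"Area from 0 to 30,000 (formula):  {exact:.4f}")
```

The PDF height at 30,000 miles is a tiny number with no direct meaning, while the area to the left of 30,000 miles is the probability of wear-out by that mileage under these assumed parameters.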

The Weibull distribution is particularly interesting because
WORK IN PROGRESS ---

End of Chapter 2
