If you have a small dataset, each individual data-point can be displayed,
which, of course, fully shows the distribution of the data. Here are 10 data-points
sampled from a normal distribution:

However, with more numerous datasets, the point symbols will overlap,
making the full display of every data-point difficult to interpret.
These effects can be mitigated by using smaller point symbols and
by randomly "jittering" them to spread them out in the horizontal
direction. Here are 100 data-points
sampled from a normal distribution:
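
A plot along these lines can be produced with a short script; here is a minimal
sketch using Python's numpy and matplotlib (the seed and jitter width are
arbitrary choices, not part of the original figure):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility
    y = rng.normal(size=100)                  # 100 points from a normal distribution
    x = rng.uniform(-0.1, 0.1, size=100)      # random horizontal jitter

    plt.plot(x, y, '.', markersize=3)         # small point symbols
    plt.xlim(-1, 1)
    plt.xticks([])                            # horizontal position carries no information
    plt.ylabel('value')
    plt.show()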

Clearly even this "bee swarm" approach to full disclosure
of the dataset has its limits, and we must seek some other approach
to displaying the distribution of data.

Descriptive statistics are used to
summarize the distribution of our data. For example, our measurements
of the size of 100 maple leaves might be summarized by reporting a
typical value and a range of variation. These summaries can be reported
in the form of a plot with "error bars". For example, if 100 maple
leaves were collected from three different
sites (parking lots, prairie, and the woods), we can display typical values and ranges
of variation:

This plot shows that the typical leaf from parking lots was small, but there was
a lot of variation. It is likely that the largest parking lot leaf
was larger than the smallest prairie leaf.

Notice that both datasets are approximately balanced around
zero; evidently the mean in both cases is "near" zero.
However, there is substantially more variation in
A2, which ranges approximately from -6 to 6,
whereas A1 ranges approximately from -2½ to 2½.

One case of particular concern is when the data is distributed
into "two lumps" rather than the "one lump" cases we've considered so far.

The "bee swarm" plot shows that there are lots of data near 10 and 15
but relatively few in between. See that a box plot would not give you
any evidence of this.

A cumulative fraction plot shows that the number of points included increases
rapidly near 10 and 15, whereas hardly any new points are added between
12 and 13.
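
The steps of such a plot come directly from the sorted data: just above the
i-th sorted point, i of the N points are included, so the cumulative fraction
jumps to i/N. A minimal sketch in Python (the bimodal data here is a stand-in,
not the original dataset):

    import numpy as np
    import matplotlib.pyplot as plt

    # stand-in data: two normal "lumps" near 10 and 15
    rng = np.random.default_rng(0)
    data = np.sort(np.concatenate([rng.normal(10, 0.8, 100),
                                   rng.normal(15, 1.2, 100)]))
    N = len(data)
    frac = np.arange(1, N + 1) / N            # fraction included just above each point

    plt.step(data, frac, where='post')        # draw the steps exactly
    plt.xlabel('value')
    plt.ylabel('cumulative fraction')
    plt.show()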

Percentile Plot

Also known as: Estimated Distribution Function, Ogive

Related keyword: Order Statistics

The steps of the cumulative fraction plot look strange to our
eyes: we are used to seeing continuous curves. Of course the
steps are exactly correct: just above a data-point there is
one more included data-point and hence
a bit more cumulative fraction than just below a data-point.
We seek something quite similar to cumulative fraction, but
without the odd steps. Percentile is a very
similar idea to cumulative fraction. If we have a dataset
with five data-points:

{-0.45, 1.11, 0.48, -0.82, -1.26}

we can sort this data from smallest to largest:

{ -1.26, -0.82, -0.45, 0.48, 1.11 }

The exact middle data-point (-0.45) is called the median, but
it is also the 50th-percentile or percentile=.50. Note that
at x=-0.45 the cumulative fraction makes a step from .4 to .6.
The percentile value will always lie somewhere in the step region.
In general, the percentile is calculated from the point's
rank in the sorted dataset, r, divided
by the number of data-points plus one, N+1.
Thus in the above example, the percentile for -.45 is 3/6=.5.
In summary:

percentile = r/(N+1)

Thus we have the following set of (datum,percentile) pairs:

{ (-1.26,.167), (-0.82,.333), (-0.45,.5), (0.48,.667), (1.11,.833) }
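
The pairs above can be reproduced with a few lines of Python applying the
r/(N+1) rule:

    data = sorted([-0.45, 1.11, 0.48, -0.82, -1.26])
    N = len(data)
    pairs = [(x, r / (N + 1)) for r, x in enumerate(data, start=1)]
    print(pairs)
    # approximately: [(-1.26, .167), (-0.82, .333), (-0.45, .5), (0.48, .667), (1.11, .833)]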

We can connect adjacent data-points with straight line segments.
(The resulting collection of connected straight line segments
is called an ogive.)
The plot below compares the percentile plot (red) to the
cumulative fraction plot.

There are a couple of reasons for preferring percentile
plots to cumulative fraction plots. It turns out that
the percentile plot is a better estimate of the distribution
function (if you know what that is). And plotting percentiles
allows you to use "probability graph paper": plots with
specially scaled axis divisions. A probability scale
on the y-axis lets you see how "normal" the data is:
normally distributed data will plot as a straight line on
probability paper. Lognormal data will plot as a straight line
with probability-log scaled axes. (Incidentally, uniformly distributed
data will plot as a straight line using the usual linear y-scale.)
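
Actual probability graph paper is rarely at hand, but the same effect can be
obtained by transforming the percentile axis with the inverse normal cumulative
distribution function. A sketch, assuming scipy is available (norm.ppf is
scipy's inverse normal CDF):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    data = np.sort(rng.normal(size=100))
    N = len(data)
    p = np.arange(1, N + 1) / (N + 1)         # percentile = r/(N+1)

    # Plotting each datum against norm.ppf(p) mimics probability paper:
    # normally distributed data falls near a straight line.
    plt.plot(data, norm.ppf(p), '.')
    plt.xlabel('datum')
    plt.ylabel('z (probability scale)')
    plt.show()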

The B2 data
was approximately lognormal with geometric mean 2.563
and multiplicative standard deviation 6.795.
In the plot below, I display the percentile plot of this data
(in red) along with the behavior expected for the above
lognormal distribution (in blue).
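
The expected curve can be computed from the quoted parameters: a lognormal with
geometric mean gm and multiplicative standard deviation msd has quantile
gm * msd**z at percentile p, where z is the inverse normal CDF of p. A sketch
(the B2 data itself is not reproduced here):

    import numpy as np
    from scipy.stats import norm

    gm, msd = 2.563, 6.795                    # geometric mean, multiplicative std. dev.
    p = np.linspace(0.01, 0.99, 99)
    z = norm.ppf(p)
    expected = gm * msd**z                    # lognormal quantile at percentile p
    # plotting (expected, p) gives the "expected" curve to lay over the data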

Similar consideration of the A2 data leads to
the following plot. Here the data was approximately
normally distributed with mean=.8835 and standard deviation=4.330
(plotted in blue).

Histograms

Consider again the bimodal dataset discussed above.
We found that data clustered around 10 and 15; that is,
there were lots of points in the range of 10 to 11 and 14 to 15,
but fewer points in similar ranges (for example 12 to 13 or 7 to 8
or 17 to 18). We can make this explicit by counting the number
of data in various "bins", i.e., ranges.

Count of Points in Various Ranges

Range:   7-8   8-9  9-10 10-11 11-12 12-13 13-14 14-15 15-16 16-17 17-18
Count:     1     8    37    41     9     6    19    29    27    17     6

A plot of the count-in-bin vs the bin-location is called a histogram.
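
The counting itself is mechanical; numpy's histogram function does it directly.
A sketch (the data here is a stand-in for the 200-point bimodal dataset, so the
counts will only resemble the table above):

    import numpy as np

    rng = np.random.default_rng(0)
    # stand-in for the 200-point bimodal dataset (the original data is not reproduced here)
    data = np.concatenate([rng.normal(10.3, 0.9, 100), rng.normal(15.2, 1.3, 100)])

    counts, edges = np.histogram(data, bins=np.arange(7, 19))   # bins 7-8, 8-9, ..., 17-18
    for lo, n in zip(edges[:-1], counts):
        print(f'{lo}-{lo + 1}: {n}')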

Histograms have the great advantage of showing exactly which ranges
are highly populated and which are not. However, the count in a particular
bin will generally vary if a new set of data is collected. We can estimate
this variation in count by applying Poisson statistics: the variation in count
will generally be comparable to the square root of the count. If we
express this likely variation as an error bar, the result is:
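
Continuing the sketch above (reusing the counts and edges computed there), the
square-root error bars take one extra call:

    import matplotlib.pyplot as plt

    centers = edges[:-1] + 0.5
    plt.bar(centers, counts, width=1, edgecolor='black')
    plt.errorbar(centers, counts, yerr=np.sqrt(counts),   # Poisson: sigma ~ sqrt(count)
                 fmt='none', ecolor='red', capsize=3)
    plt.xlabel('value')
    plt.ylabel('count in bin')
    plt.show()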

From the relatively large size of the error bars you can see that
a lot of variation is expected in this histogram. (Nevertheless
note that the expected variation will not wash out the two-humped
distribution.) As an approximate rule of thumb, expect that around 1000
data-points are needed for a relatively accurate histogram. The above
somewhat crude histogram used 200 data-points.

In constructing a histogram you must choose the bins.
Narrow bins will collect few data-points and will therefore show relatively large
variation. Wide bins may lump together regions that
are really different, thus distorting (muting) the real distribution
of the data. Obviously the choice of bins affords you the
opportunity to Lie with Statistics.

Most commonly, bins are chosen to be equally sized. However, this
is not a requirement. When using non-uniform bin sizes, plot the
probability density:

probability density = (fraction of data in bin)/(bin size)
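
A sketch with illustrative (assumed) non-uniform bin edges; numpy's
density=True option computes exactly this fraction-over-width quantity:

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(size=200)

    edges = np.array([-4, -2, -1, -0.5, 0, 0.5, 1, 2, 4])    # non-uniform bins
    counts, _ = np.histogram(data, bins=edges)
    density = counts / len(data) / np.diff(edges)            # (fraction in bin)/(bin size)
    # equivalently (when all data fall within the bins):
    # density, _ = np.histogram(data, bins=edges, density=True)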

Note that since the fraction of data in a bin will be the difference
in the cumulative fraction at either side of the bin, the
probability density is the slope of the secant line that connects
the bin sides on a cumulative fraction plot (slope = rise/run).
Approximately speaking, the histogram plot is the derivative
of the cumulative fraction plot. Large histogram values
(i.e., highly populated bins) correspond to regions of high slope
on the cumulative fraction curve.