How can I show scale breaks on graphs?

Stata’s graphics commands do not include facilities for a scale break in
which either the y axis or the x axis of a graph is
interrupted. The presumption is that when faced with, for example,
outliers in a dataset you will be better advised to consider a log scale by
using a yscale(log) or xscale(log) option. Alternatively,
perhaps your data would benefit from some other nonlinear transformation
before graphing. Either way, many writers on graphics discourage the use of
scale breaks as being at best awkward and at worst difficult to interpret
correctly.
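As a minimal sketch of the log-scale alternative, the lines below use the auto dataset shipped with Stata; the choice of dataset and variables is ours, purely for illustration.

```stata
* Illustration only: auto.dta stands in for data with a skewed response
sysuse auto, clear

* Same data, same scatter command; only the y axis scale changes
scatter price weight, yscale(log) ylabel(4000 8000 16000)
```

Explicit ylabel() values are usually worth giving with a log scale because the default labels can bunch up.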

Without moralizing too much on what you should or should not be
doing, we must point out another issue. Stata’s graphics, particularly
twoway graphs, are
designed to allow you to superimpose or combine graphs that are compatible.
To allow both this and scale breaks is well nigh impossible or, at least, was
judged unworthy of the effort.

Nevertheless, there are cases when a log scale is not advisable or when you
decide that a scale break is preferable anyway. Scale breaks can indeed be
simulated in Stata to some extent with various little tricks. Let’s
look at two examples.

Population change: A “break” on the x axis

Consider these population estimates from McEvedy and Jones (1978,
342–351). The variables are year (negative values denote BCE) and
estimated world population in millions.

The sparsity of data for the earlier part of the record and the rapid rate
of increase in the last few centuries combine to produce a crowded
right-hand portion of the graph. Yet a log scale for year would
certainly not help here, as it would exacerbate the problem, even if we
could decide on an appropriate origin for log(year − origin).
(A log scale for population would be sensible, but that is a separate
question.)

The gap of 5,000 years between the first two values of year is almost
5/12 of the range of that variable. We will show how to move the first value
closer to the rest of the values and thus simulate a scale break.

We will copy year and, in the copy, move the first value closer to the
rest, attaching a value label so that the axis label still shows the true
year. Then the graph can be drawn with a vertical line to mark the break.
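The steps just described might be sketched as follows. The data values are made-up round placeholders, not the McEvedy and Jones estimates, and the names year2 and pop, the fudged position −7000, and the break line at −6000 are assumptions chosen for illustration.

```stata
* Placeholder values only, NOT the McEvedy and Jones estimates
clear
input year pop
-10000    5
 -5000   10
 -4000   15
 -3000   20
 -2000   40
 -1000   80
     0  200
  1000  300
  2000 6000
end

* Copy year and pull the first value in from -10000 to -7000
generate year2 = year
replace year2 = -7000 in 1

* Value label so that the axis still reads -10000 at the fudged position
label define year2 -7000 "-10000"
label values year2 year2

* xline() marks where the scale has been broken
scatter pop year2, xlabel(-7000 -5000(1000)2000, valuelabel) ///
    xline(-6000, lpattern(dash)) xtitle("year")
```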

However, value labels can be attached only to integers; see
[D] label. A more
general trick is that we can type something like xlabel(-7000 "-10000"
-5000(1000)2000), indicating that −7000 is really −10000. We
can do that with nonintegers also. The numerical values in the variables
have been fudged for this purpose.
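Here is a small sketch of the same trick with noninteger values, where a value label could not be used. The variables x, y, and x2 and all the numbers are hypothetical.

```stata
* Hypothetical data with one far-out noninteger x value
clear
input x y
0.5 2
1.0 3
1.5 5
2.0 4
9.5 6
end

* Fudge the outlying x value so that it sits near the others
generate x2 = x
replace x2 = 3.5 if x == 9.5

* The text "9.5" is shown at the fudged position 3.5
scatter y x2, xlabel(0.5(0.5)2 3.5 "9.5") xline(2.75, lpattern(dash)) ///
    xtitle("x")
```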

Another way to simulate a scale break is to plot the values separately and
then combine them into one graph. This approach creates a visible break in
the axis, but it requires more complicated graph statements.

The first graph will be the left panel. We do not want the two panels to be
the same size, so we need to specify the fxsize() option. Because we are
plotting only one point, we specify two labels on the x axis, one of
them an explicit blank, " ". We also remove the two tick marks that come
with those labels and add a single tick mark at −10000.
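One way the two panels could be drawn and combined is sketched below. The data are made-up placeholders rather than the McEvedy and Jones estimates, and the graph names left and right, the blank label at −9000, and fxsize(20) are assumptions for illustration.

```stata
* Placeholder values only, NOT the McEvedy and Jones estimates
clear
input year pop
-10000    5
 -5000   10
 -4000   15
 -3000   20
 -2000   40
 -1000   80
     0  200
  1000  300
  2000 6000
end

* Left panel: the single early point. Two labels are given, one an explicit
* blank; noticks drops both tick marks and xtick() puts one back at -10000.
scatter pop year in 1,                                          ///
    xlabel(-10000 "-10000" -9000 " ", noticks) xtick(-10000)    ///
    xtitle("") fxsize(20) name(left, replace) nodraw

* Right panel: all the remaining points
scatter pop year in 2/l,                                        ///
    xlabel(-5000(1000)2000) ytitle("") name(right, replace) nodraw

* ycommon puts both panels on the same y scale
graph combine left right, ycommon
```

The nodraw option keeps the two component graphs from being displayed separately; only the combined graph appears.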

Spiky time series: Leaving gaps but showing outlier details too

To illustrate another approach, we make ourselves a sandbox to play in by
generating some spiky time series as the reciprocals of uniformly
distributed random numbers. We expect a minimum of 1 and a median of 2 but
will sometimes get some much larger numbers.
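A sandbox of this kind might be set up as follows. The number of observations, the seed, and the names t and y1 to y3 are arbitrary choices for this sketch.

```stata
clear
set obs 200
set seed 2803

* A time variable and three spiky series, each the reciprocal of a uniform
generate t = _n
forvalues j = 1/3 {
    generate y`j' = 1/runiform()
}

* A peek at the distributions: minimum near 1, median near 2, long upper tail
summarize y1-y3, detail
```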

After a peek at summary statistics, we choose to chop values at 100 but show
higher values by text on the graph positioned just above that. In practice,
we may want to loop over responses, so we initialize what we show and where
we show it:

. gen high = ""
. gen High = 105

In a loop, we use clonevar
to keep the originals safe. We then replace the large
values with missing values but put their values into the string variable
high that we just initialized. In the graph command, note the
option cmissing(n), which leaves gaps where values are missing, and the
marker label options. Horizontal text labels for the outliers are preferable
whenever the outliers are not so close together as to rule that out, but we
leave them vertical here. In real data, outliers are much more likely to be
supported by values on both sides, so vertical labels may then be the best
option.
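The loop could look something like the sketch below. It rebuilds the sandbox so that it runs on its own; the seed, the series names y1 to y3, the clone suffix _2, the graph names, and the particular marker label options are all assumptions for illustration.

```stata
* Rebuild the sandbox (arbitrary seed and names)
clear
set obs 200
set seed 2803
generate t = _n
forvalues j = 1/3 {
    generate y`j' = 1/runiform()
}

* What we show for chopped values and where we show it
generate high = ""
generate High = 105

foreach v of varlist y1-y3 {
    * Work on a clone so that the original stays safe
    clonevar `v'_2 = `v'

    * Record values above 100 as text, then blank them out in the clone
    replace high = ""
    replace high = string(`v', "%4.0f") if `v' > 100
    replace `v'_2 = . if `v' > 100

    * cmissing(n) leaves a gap at each chopped value; the text sits at 105
    twoway (line `v'_2 t, cmissing(n))                               ///
           (scatter High t if high != "", msymbol(none)              ///
               mlabel(high) mlabposition(12) mlabangle(vertical)),   ///
           legend(off) ytitle("`v'") name(g_`v', replace)
}
```

Resetting high at the top of each pass stops labels from one series leaking into the graph for the next.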