Fitting a Curve through a Scatter Plot

PROC TRANSREG can fit curves through data and detect nonlinear relationships among variables. This example uses a subset of
the data from an experiment in which nitrogen oxide emissions from a single-cylinder engine are measured for various combinations
of fuel and equivalence ratio (Brinkman, 1981). The gas data set is available from the Sashelp library. The following step creates a subset of the data for analysis:

title 'Gasoline and Emissions Data';

data gas;
   set sashelp.gas;
   if fuel in ('Ethanol', '82rongas', 'Gasohol');
run;

The SOLVE algorithm option, or a-option, requests a direct solution for both the transformation and the parameter estimates.
For many models, PROC TRANSREG with the SOLVE a-option can produce exact results without iteration. The SS2 (Type II sums of squares)
a-option requests regression and ANOVA results. The PLOTS= option requests plots of the variable transformations, a plot of the
observed values by the predicted values, and a plot of the residuals. The dependent variable NOx is specified with an IDENTITY
transformation, which means that it is not transformed, just as in ordinary regression. The independent variable EqRatio, in contrast,
is transformed by using a cubic spline with four knots. The NKNOTS= option is known as a transformation option, or t-option.
Graphical results are produced when ODS Graphics is enabled. The results are shown in Figure 104.1 through Figure 104.5.
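
The options described above can be combined into a single step. The following is a sketch assembled from that description, not a verbatim listing: the comment text and the order of the PLOTS= suboptions are assumptions, and the step assumes the gas data set created earlier.

```sas
ods graphics on;

* Sketch: cubic spline transformation of EqRatio with four knots;
proc transreg data=gas solve ss2 plots=(transformation residuals obp);
   model identity(NOx) = spline(EqRatio / nknots=4);
run;
```

SPLINE fits a cubic (degree 3) spline by default, so only the number of knots needs to be stated.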

PROC TRANSREG increases the squared multiple correlation from the original value of 0.00917 to 0.82429. Iteration 0 shows
the fit before the data are transformed, and iteration 1 shows the fit after the transformation, which was directly solved
for in the initial iteration. The change values for iteration 0 show the change from the original EqRatio variable to the transformed EqRatio variable. For this model, no improvement on the initial solution is possible, so in iteration 1, all change values are zero.
The ANOVA and regression results show that you are fitting a model with 7 model parameters, 4 knots plus a degree 3 or cubic
spline. The overall model fit is identical to the test for the spline transformation, since there is only one term in the
model besides the intercept, and the results are significant at the 0.0001 level. The transformations are shown next in Figure 104.2.

Figure 104.2: Transformations

The transformation plots show the identity transformation of NOx and the nonlinear spline transformation of EqRatio. These plots are requested with the PLOTS=TRANSFORMATION
option. The plot on the left shows that NOx is unchanged, which is always the case with the IDENTITY transformation.
In contrast, the spline transformation of EqRatio is nonlinear. It is this nonlinear transformation of EqRatio that accounts for the increase in fit that is shown in the iteration history table.

Figure 104.3: Residuals

The residuals plot in Figure 104.3 shows the residuals as a function of the transformed independent variable.

The "Spline Regression Fit" plot in Figure 104.4 displays the nonlinear regression function plotted through the original data, along with 95% confidence and prediction limits.
This plot clearly shows that nitrogen oxide emissions are largest in the middle range of equivalence ratio, 0.8 to 1.0, and
are much lower for the extreme values of equivalence ratio, such as around 0.6 and 1.2.

Figure 104.4: Fitting a Curve through a Scatter Plot

This plot is produced by default when ODS Graphics is enabled and when there is an IDENTITY dependent variable and one non-CLASS
independent variable. The plot consists of an ordinary scatter plot of NOx plotted as a function of EqRatio. It also contains the predicted values of NOx, which are a function of the spline transformation of EqRatio (or TEqRatio shown previously), and are plotted as a function of EqRatio. Similarly, it contains confidence limits based on NOx and TEqRatio.

The "Observed by Predicted" values plot in Figure 104.5 displays the dependent variable plotted as a function of the regression predicted values along with a linear regression line,
which for this plot always has a slope of 1. This plot is requested with the OBP or OBSERVEDBYPREDICTED suboption in the PLOTS= option.
The residual differences between the transformed data and the regression line show how well the nonlinearly transformed
data fit a linear regression model. The residuals look mostly random; however, they are larger for larger values of NOx, suggesting that this might not be the optimal model. You can also see this by examining the fit of the function through
the original scatter plot in Figure 104.4. Near the middle of the function, the residuals are much larger. You can refit the model, this time requesting separate functions
for each type of fuel. You can request the original scatter plot, without any regression information and before the variables
are transformed, by specifying the SCATTER suboption in the PLOTS= option.
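
As a brief illustration of the SCATTER suboption, the following sketch requests only the plot of the untransformed data. It assumes the gas data set created earlier and reuses the same MODEL statement as the spline fit; the choice to request no other plots is an assumption made for brevity.

```sas
* Sketch: request only the scatter plot of the original data;
proc transreg data=gas plots=scatter;
   model identity(NOx) = spline(EqRatio / nknots=4);
run;
```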

Figure 104.5: Observed by Predicted

These next statements fit an additive model with separate functions for each of the different fuels. The statements produce
Figure 104.6 through Figure 104.9.

The ADDITIVE a-option requests an additive model, where the regression coefficients are absorbed into the transformations, and so the final regression
coefficients are all one. The specification CLASS(Fuel / ZERO=NONE) recodes fuel into a set of three binary variables, one for each of the three fuels in this data set.
The vertical bar between the CLASS and SPLINE specifications requests both main effects and interactions. For this model, it requests both a separate intercept and a separate
spline function for each fuel. The original two variables, Fuel and EqRatio, are replaced by six variables: three binary intercept terms and three spline variables. The three spline variables are zero
when their corresponding intercept binary variable is zero, and nonzero otherwise. The nonzero parts are optimally transformed
by the analysis. The AFTER t-option specified with the SPLINE transformation specifies that the four knots should be selected independently for each of the three
spline transformations, after EqRatio is crossed with the CLASS variable. Alternatively, and by default, the knots are chosen by examining EqRatio before it is crossed with the CLASS variable, and the same knots are used for all three transformations. The results are
shown in Figure 104.6.
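
The specification described above can be sketched as follows. This is an illustration assembled from the options named in the text, not a verbatim listing: the TITLE2 text and the PLOTS= suboption list are assumptions, and the step assumes the gas data set created earlier.

```sas
title2 'Separate Curves, Separate Intercepts';

* Sketch: separate intercept and separate spline for each fuel;
proc transreg data=gas solve ss2 additive plots=(transformation obp);
   model identity(NOx) = class(Fuel / zero=none) |
                         spline(EqRatio / nknots=4 after);
run;
```

Dropping the AFTER t-option from this sketch gives the default behavior, in which one set of knots is chosen from EqRatio and shared by all three splines.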

ZERO=SUM and ZERO=NONE coefficient tests are not exact when there are iterative transformations. Those tests are performed
holding all transformations fixed, and so are generally liberal.

The first iteration history table in Figure 104.6 shows that PROC TRANSREG increases the squared multiple correlation from the original value of 0.18543 to 0.95870. The remaining
iteration histories pertain to PROC TRANSREG’s process of comparing models to test hypotheses. The important thing to look
for is convergence in all of the tables.

Figure 104.7: Transformations

The transformations, shown in Figure 104.7, show that for all three groups, the transformation of EqRatio is approximately quadratic.

Figure 104.8: Fitting Curves through a Scatter Plot

The fit plot, shown in Figure 104.8, shows that there are in fact three distinct functions in the data. The increase in fit over the previous model comes from
individually fitting each group instead of providing an aggregate fit.

Figure 104.9: Observed by Predicted

The residuals in the observed by predicted plot displayed in Figure 104.9 are much better for this analysis.

You could fit a model that is "in between" the two models shown previously. This next model provides for separate intercepts
for each group, but calls for a common function. There are still three functions, one per group, but their shapes are the
same, and they are equidistant or parallel. This model is requested by omitting the vertical bar so that separate intercepts
are requested, but not separate curves within each group. The following statements fit the separate intercepts model and create
Figure 104.10:
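
A minimal sketch of such a step, assuming the same data set and a-options as the earlier models, with the vertical bar omitted. The TITLE2 text is an assumption.

```sas
title2 'Separate Intercepts, Common Curve';

* Sketch: one intercept per fuel, one shared spline of EqRatio;
proc transreg data=gas solve ss2 additive;
   model identity(NOx) = class(Fuel / zero=none)
                         spline(EqRatio / nknots=4);
run;
```

Without the vertical bar, CLASS and SPLINE enter the model as separate main effects, so only the intercepts vary by fuel.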

The squared multiple correlation is now 0.9005, which is smaller than that of the model with the unconstrained separate curves but larger
than that of the model with only one curve. Because of the restrictions on the shapes, these curves do not track the data as well
as those of the previous model. However, this model is more parsimonious, with far fewer parameters.

There are other ways to fit curves through scatter plots in PROC TRANSREG. For example, you could use smoothing splines or
penalized B-splines, as is illustrated next. The following statements fit separate curves through each group by using penalized
B-splines and produce Figure 104.11:

This example asks for a separate penalized B-spline transformation, PBSPLINE, of equivalence ratio for each type of fuel.
The LPREFIX=0 a-option is specified in the PROC statement so that zero characters of the CLASS
variable name (Fuel) are used in constructing the labels for the coded variables. The result is label components like "Ethanol" instead of the
more redundant "Fuel Ethanol". The results of this analysis are shown in Figure 104.11.
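
The penalized B-spline request described above might look like the following sketch. The TITLE2 text, the SS2 a-option, the PLOTS= value, and the use of the `*` operator to cross Fuel with the PBSPLINE transformation are assumptions; only PBSPLINE, the CLASS variable, and LPREFIX=0 come from the description.

```sas
title2 'Penalized B-Splines, Separate Curves';

* Sketch: separate penalized B-spline of EqRatio for each fuel;
proc transreg data=gas ss2 plots=transformation lprefix=0;
   model identity(NOx) = class(Fuel / zero=none) * pbspline(EqRatio);
run;
```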

With penalized B-splines, the degrees of freedom are based on the trace of the transformation hat matrix and are typically
not integers. The first panel of plots shows AICC as a function of lambda, the smoothing parameter. The smoothing parameter
is automatically chosen, and since the smoothing parameters range from essentially 0 to almost 800, it is clear that some
functions are smoother than others. The plots of the criterion (AICC in this example) as a function of lambda use a linear
scale for the horizontal axis when the range of lambdas is small, as in the first and third plot, and a log scale when the
range is large, as in the second plot. The transformation for equivalence ratio for Ethanol required more smoothing than for
the other two fuels. All three have an overall quadratic shape, but for Ethanol, the function more closely follows the smaller
variations in the data. You could get similar results with SPLINE by using more knots.