Example 76.2 Aerobic Fitness Prediction

Aerobic fitness (measured by the ability to consume oxygen) is fit to some simple exercise tests. The goal is to develop an equation to predict fitness based on the exercise tests rather than on expensive and cumbersome oxygen consumption measurements. Three model-selection methods are used: forward selection, backward selection, and MAXR selection. Here are the data:

Output 76.2.1 shows the sequence of models produced by the FORWARD model-selection method.

Output 76.2.1
Forward Selection Method: PROC REG

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen

Forward Selection: Step 1

Variable RunTime Entered: R-Square = 0.7434 and C(p) = 13.6988

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1         632.90010      632.90010      84.01    <.0001
Error              29         218.48144        7.53384
Corrected Total    30         851.38154

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept              82.42177           3.85530    3443.36654     457.05    <.0001
RunTime                -3.31056           0.36119     632.90010      84.01    <.0001

Bounds on condition number: 1, 1

Forward Selection: Step 2

Variable Age Entered: R-Square = 0.7642 and C(p) = 12.3894

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2         650.66573      325.33287      45.38    <.0001
Error              28         200.71581        7.16842
Corrected Total    30         851.38154

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept              88.46229           5.37264    1943.41071     271.11    <.0001
Age                    -0.15037           0.09551      17.76563       2.48    0.1267
RunTime                -3.20395           0.35877     571.67751      79.75    <.0001

Bounds on condition number: 1.0369, 4.1478

Forward Selection: Step 3

Variable RunPulse Entered: R-Square = 0.8111 and C(p) = 6.9596

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3         690.55086      230.18362      38.64    <.0001
Error              27         160.83069        5.95669
Corrected Total    30         851.38154

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept             111.71806          10.23509     709.69014     119.14    <.0001
Age                    -0.25640           0.09623      42.28867       7.10    0.0129
RunTime                -2.82538           0.35828     370.43529      62.19    <.0001
RunPulse               -0.13091           0.05059      39.88512       6.70    0.0154

Bounds on condition number: 1.3548, 11.597

Forward Selection: Step 4

Variable MaxPulse Entered: R-Square = 0.8368 and C(p) = 4.8800

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               4         712.45153      178.11288      33.33    <.0001
Error              26         138.93002        5.34346
Corrected Total    30         851.38154

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept              98.14789          11.78569     370.57373      69.35    <.0001
Age                    -0.19773           0.09564      22.84231       4.27    0.0488
RunTime                -2.76758           0.34054     352.93570      66.05    <.0001
RunPulse               -0.34811           0.11750      46.90089       8.78    0.0064
MaxPulse                0.27051           0.13362      21.90067       4.10    0.0533

Bounds on condition number: 8.4182, 76.851

Forward Selection: Step 5

Variable Weight Entered: R-Square = 0.8480 and C(p) = 5.1063

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               5         721.97309      144.39462      27.90    <.0001
Error              25         129.40845        5.17634
Corrected Total    30         851.38154

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept             102.20428          11.97929     376.78935      72.79    <.0001
Age                    -0.21962           0.09550      27.37429       5.29    0.0301
Weight                 -0.07230           0.05331       9.52157       1.84    0.1871
RunTime                -2.68252           0.34099     320.35968      61.89    <.0001
RunPulse               -0.37340           0.11714      52.59624      10.16    0.0038
MaxPulse                0.30491           0.13394      26.82640       5.18    0.0316

Bounds on condition number: 8.7312, 104.83

The final variable available to add to the model, RestPulse, is not added because it does not meet the 50% significance-level criterion for entry into the model. (For FORWARD selection, the default value of the SLE option is 0.5.)
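The entry decision for RestPulse can be checked by hand. The following is a minimal sketch in Python (not SAS), assuming only the error sums of squares and the RestPulse p-value that PROC REG reports for the five-variable and six-variable models in this example. The variable names are illustrative and are not part of PROC REG.

```python
# Partial F statistic for adding RestPulse to the five-variable model.
# All numbers are copied from the PROC REG output in this example.
sse_without = 129.40845   # Error SS, five-variable model (25 error DF)
sse_with = 128.83794      # Error SS, full six-variable model (24 error DF)
df_error_with = 24

sle = 0.5                 # default entry level for FORWARD selection
p_reported = 0.7473       # Pr > F reported for RestPulse in the full model

# Sum of squares explained by RestPulse after the other five regressors.
ss_restpulse = sse_without - sse_with

# F statistic: added sum of squares (1 DF) over the full-model mean square error.
f_to_enter = ss_restpulse / (sse_with / df_error_with)

# RestPulse would enter only if its p-value were below the SLE level.
enters = p_reported < sle

print(f"SS for RestPulse = {ss_restpulse:.5f}")
print(f"F to enter       = {f_to_enter:.2f}")
print(f"RestPulse enters = {enters}")
```

The computed sum of squares and F statistic agree with the Type II SS and F Value reported for RestPulse in the full model, and the reported p-value is well above 0.5, so the variable stays out.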

The BACKWARD model-selection method begins with the full model. Output 76.2.2 shows the steps of the BACKWARD method. RestPulse is the first variable deleted, followed by Weight. No other variables are deleted from the model because the remaining variables (Age, RunTime, RunPulse, and MaxPulse) are all significant at the 10% significance level. (For BACKWARD elimination, the default value of the SLS option is 0.1.)
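The elimination rule described above can be sketched in a few lines. This is an illustration in Python, not the PROC REG implementation: the function name variable_to_remove is hypothetical, the p-values are copied from Output 76.2.2, and values printed as <.0001 are entered as 0.0001 for convenience.

```python
SLS = 0.10  # default stay level for BACKWARD elimination

def variable_to_remove(p_values, sls=SLS):
    """Return the least significant variable if its p-value exceeds sls, else None."""
    name, p = max(p_values.items(), key=lambda item: item[1])
    return name if p > sls else None

# Pr > F values at Steps 0, 1, and 2 of the backward elimination.
steps = [
    {"Age": 0.0322, "Weight": 0.1869, "RunTime": 0.0001,
     "RunPulse": 0.0051, "RestPulse": 0.7473, "MaxPulse": 0.0360},
    {"Age": 0.0301, "Weight": 0.1871, "RunTime": 0.0001,
     "RunPulse": 0.0038, "MaxPulse": 0.0316},
    {"Age": 0.0488, "RunTime": 0.0001, "RunPulse": 0.0064, "MaxPulse": 0.0533},
]

# One decision per step; None means that no variable is removed.
removed = [variable_to_remove(p) for p in steps]
print(removed)
```

The first two decisions name RestPulse and Weight, and the third is None because the largest remaining p-value (0.0533 for MaxPulse) is below 0.1.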

Output 76.2.2
Backward Selection Method: PROC REG

Backward Elimination: Step 0

All Variables Entered: R-Square = 0.8487 and C(p) = 7.0000

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               6         722.54361      120.42393      22.43    <.0001
Error              24         128.83794        5.36825
Corrected Total    30         851.38154

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept             102.93448          12.40326     369.72831      68.87    <.0001
Age                    -0.22697           0.09984      27.74577       5.17    0.0322
Weight                 -0.07418           0.05459       9.91059       1.85    0.1869
RunTime                -2.62865           0.38456     250.82210      46.72    <.0001
RunPulse               -0.36963           0.11985      51.05806       9.51    0.0051
RestPulse              -0.02153           0.06605       0.57051       0.11    0.7473
MaxPulse                0.30322           0.13650      26.49142       4.93    0.0360

Bounds on condition number: 8.7438, 137.13

Backward Elimination: Step 1

Variable RestPulse Removed: R-Square = 0.8480 and C(p) = 5.1063

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               5         721.97309      144.39462      27.90    <.0001
Error              25         129.40845        5.17634
Corrected Total    30         851.38154

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept             102.20428          11.97929     376.78935      72.79    <.0001
Age                    -0.21962           0.09550      27.37429       5.29    0.0301
Weight                 -0.07230           0.05331       9.52157       1.84    0.1871
RunTime                -2.68252           0.34099     320.35968      61.89    <.0001
RunPulse               -0.37340           0.11714      52.59624      10.16    0.0038
MaxPulse                0.30491           0.13394      26.82640       5.18    0.0316

Bounds on condition number: 8.7312, 104.83

Backward Elimination: Step 2

Variable Weight Removed: R-Square = 0.8368 and C(p) = 4.8800

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               4         712.45153      178.11288      33.33    <.0001
Error              26         138.93002        5.34346
Corrected Total    30         851.38154

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept              98.14789          11.78569     370.57373      69.35    <.0001
Age                    -0.19773           0.09564      22.84231       4.27    0.0488
RunTime                -2.76758           0.34054     352.93570      66.05    <.0001
RunPulse               -0.34811           0.11750      46.90089       8.78    0.0064
MaxPulse                0.27051           0.13362      21.90067       4.10    0.0533

Bounds on condition number: 8.4182, 76.851

The MAXR method tries to find the "best" one-variable model, the "best" two-variable model, and so on. Output 76.2.3 shows that the one-variable model contains RunTime; the two-variable model contains RunTime and Age; the three-variable model contains RunTime, Age, and RunPulse; the four-variable model contains Age, RunTime, RunPulse, and MaxPulse; the five-variable model contains Age, Weight, RunTime, RunPulse, and MaxPulse; and finally, the six-variable model contains all the variables in the MODEL statement.

Output 76.2.3
Maximum R-Square Improvement Selection Method: PROC REG

Maximum R-Square Improvement: Step 1

Variable RunTime Entered: R-Square = 0.7434 and C(p) = 13.6988

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1         632.90010      632.90010      84.01    <.0001
Error              29         218.48144        7.53384
Corrected Total    30         851.38154

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept              82.42177           3.85530    3443.36654     457.05    <.0001
RunTime                -3.31056           0.36119     632.90010      84.01    <.0001

Bounds on condition number: 1, 1

The above model is the best 1-variable model found.

Maximum R-Square Improvement: Step 2

Variable Age Entered: R-Square = 0.7642 and C(p) = 12.3894

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2         650.66573      325.33287      45.38    <.0001
Error              28         200.71581        7.16842
Corrected Total    30         851.38154

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept              88.46229           5.37264    1943.41071     271.11    <.0001
Age                    -0.15037           0.09551      17.76563       2.48    0.1267
RunTime                -3.20395           0.35877     571.67751      79.75    <.0001

Bounds on condition number: 1.0369, 4.1478

The above model is the best 2-variable model found.

Maximum R-Square Improvement: Step 3

Variable RunPulse Entered: R-Square = 0.8111 and C(p) = 6.9596

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3         690.55086      230.18362      38.64    <.0001
Error              27         160.83069        5.95669
Corrected Total    30         851.38154

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept             111.71806          10.23509     709.69014     119.14    <.0001
Age                    -0.25640           0.09623      42.28867       7.10    0.0129
RunTime                -2.82538           0.35828     370.43529      62.19    <.0001
RunPulse               -0.13091           0.05059      39.88512       6.70    0.0154

Bounds on condition number: 1.3548, 11.597

The above model is the best 3-variable model found.

Maximum R-Square Improvement: Step 4

Variable MaxPulse Entered: R-Square = 0.8368 and C(p) = 4.8800

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               4         712.45153      178.11288      33.33    <.0001
Error              26         138.93002        5.34346
Corrected Total    30         851.38154

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept              98.14789          11.78569     370.57373      69.35    <.0001
Age                    -0.19773           0.09564      22.84231       4.27    0.0488
RunTime                -2.76758           0.34054     352.93570      66.05    <.0001
RunPulse               -0.34811           0.11750      46.90089       8.78    0.0064
MaxPulse                0.27051           0.13362      21.90067       4.10    0.0533

Bounds on condition number: 8.4182, 76.851

The above model is the best 4-variable model found.

Maximum R-Square Improvement: Step 5

Variable Weight Entered: R-Square = 0.8480 and C(p) = 5.1063

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               5         721.97309      144.39462      27.90    <.0001
Error              25         129.40845        5.17634
Corrected Total    30         851.38154

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept             102.20428          11.97929     376.78935      72.79    <.0001
Age                    -0.21962           0.09550      27.37429       5.29    0.0301
Weight                 -0.07230           0.05331       9.52157       1.84    0.1871
RunTime                -2.68252           0.34099     320.35968      61.89    <.0001
RunPulse               -0.37340           0.11714      52.59624      10.16    0.0038
MaxPulse                0.30491           0.13394      26.82640       5.18    0.0316

Bounds on condition number: 8.7312, 104.83

The above model is the best 5-variable model found.

Maximum R-Square Improvement: Step 6

Variable RestPulse Entered: R-Square = 0.8487 and C(p) = 7.0000

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               6         722.54361      120.42393      22.43    <.0001
Error              24         128.83794        5.36825
Corrected Total    30         851.38154

Variable     Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept             102.93448          12.40326     369.72831      68.87    <.0001
Age                    -0.22697           0.09984      27.74577       5.17    0.0322
Weight                 -0.07418           0.05459       9.91059       1.85    0.1869
RunTime                -2.62865           0.38456     250.82210      46.72    <.0001
RunPulse               -0.36963           0.11985      51.05806       9.51    0.0051
RestPulse              -0.02153           0.06605       0.57051       0.11    0.7473
MaxPulse                0.30322           0.13650      26.49142       4.93    0.0360

Bounds on condition number: 8.7438, 137.13

Note that for all three of these methods, RestPulse contributes least to the model. In the case of forward selection, it is not added to the model. In the case of backward selection, it is the first variable to be removed from the model. In the case of MAXR selection, RestPulse is included only for the full model.
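The small contribution of RestPulse is also visible in the R-square values. The following Python sketch (an illustration, not SAS code) recomputes R-square at each step as the model sum of squares divided by the corrected total sum of squares, using the values reported in Output 76.2.1 and Output 76.2.3, and then takes the gain at each step. The names model_ss and gains are illustrative.

```python
sst = 851.38154  # Corrected Total sum of squares (same at every step)

# Model sum of squares after each variable enters, in order of entry.
model_ss = [
    ("RunTime", 632.90010),
    ("Age", 650.66573),
    ("RunPulse", 690.55086),
    ("MaxPulse", 712.45153),
    ("Weight", 721.97309),
    ("RestPulse", 722.54361),
]

# R-square after each entry: Model SS / Corrected Total SS.
r_square = {name: ss / sst for name, ss in model_ss}

# Gain in R-square attributable to each variable at the step it enters.
gains = {}
previous = 0.0
for name, _ in model_ss:
    gains[name] = r_square[name] - previous
    previous = r_square[name]

for name, _ in model_ss:
    print(f"{name:<10} R-square = {r_square[name]:.4f}  gain = {gains[name]:.4f}")

smallest = min(gains, key=gains.get)
print("Smallest gain:", smallest)
```

These gains depend on the order of entry, so they describe this particular sequence of models rather than a unique ranking of the regressors.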

For the STEPWISE, BACKWARD, and FORWARD selection methods, you can control the amount of detail displayed by using the DETAILS option, and you can use ODS Graphics to produce plots that show how selection criteria progress as the selection proceeds. For example, the following statements display only the selection summary table for the FORWARD selection method (Output 76.2.4) and produce the plots shown in Output 76.2.5 and Output 76.2.6.

Output 76.2.5 shows how six fit criteria progress as the forward selection proceeds. The step at which each criterion achieves its best value is indicated. For example, the BIC criterion achieves its minimum value for the model at step 4. This does not mean that the model at step 4 has the smallest BIC among all possible models that use a subset of the regressors; it has the smallest BIC only among the models visited at the steps of the forward selection. Output 76.2.6 shows the progression of the SBC statistic in its own plot. If you want to see each of the six selection criteria in an individual plot, you can specify the UNPACK suboption of the PLOTS=CRITERIA option in the PROC REG statement.
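To make the idea of tracking a criterion across steps concrete, the following Python sketch recomputes SBC for the five forward-selection models. It assumes the form SBC = n ln(SSE/n) + p ln(n), where p counts the intercept; that formula is an assumption of this sketch and is not stated in this example. The error sums of squares come from Output 76.2.1, and n = 31 follows from the 30 corrected total degrees of freedom.

```python
import math

n = 31  # observations: Corrected Total DF (30) + 1

# (step, error sum of squares, number of parameters including the intercept)
steps = [
    (1, 218.48144, 2),
    (2, 200.71581, 3),
    (3, 160.83069, 4),
    (4, 138.93002, 5),
    (5, 129.40845, 6),
]

def sbc(sse, p, n_obs=n):
    """Assumed form of SBC: n*ln(SSE/n) + p*ln(n). Smaller is better."""
    return n_obs * math.log(sse / n_obs) + p * math.log(n_obs)

values = {step: sbc(sse, p) for step, sse, p in steps}
best_step = min(values, key=values.get)

for step, value in values.items():
    marker = "  <- smallest" if step == best_step else ""
    print(f"Step {step}: SBC = {value:.3f}{marker}")
```

As with BIC, the step flagged here is the best only among the models that forward selection visits, not among all subsets of the regressors.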

Output 76.2.5
Fit Criteria

Output 76.2.6
SBC Criterion

Next, the RSQUARE model-selection method is used to request R-square and C(p) statistics for all possible combinations of the six independent variables. The following statements produce Output 76.2.7:

The models in Output 76.2.7 are arranged first by the number of variables in the model and then by the magnitude of R-square for the model.

Output 76.2.8 shows the panel of fit criteria for the RSQUARE selection method. The best models (based on the R-square statistic) for each subset size are indicated on the plots. The LABEL suboption specifies that these models are labeled by the model number that appears in the summary table shown in Output 76.2.7.

Output 76.2.8
Fit Criteria

Output 76.2.9 shows the plot of the C(p) criterion by number of regressors in the model. Useful reference lines suggested by Mallows (1973) and Hocking (1976) are included on the plot. However, because all possible subset models are included on this plot, the better models are all compressed near the bottom of the plot.

Output 76.2.9
C(p) Criterion

The following statements use the BEST=20 option in the MODEL statement and SELECTION=CP to restrict attention to the models that yield the 20 smallest values of the C(p) statistic:

Output 76.2.10 shows the summary table listing the regressors in the 20 models that yield the smallest C(p) values, and Output 76.2.11 presents the results graphically. Two reference lines are shown on this plot. See the PLOTS=CP option for interpretations of these lines. For the Fitness data, these lines indicate that a six-variable model is a reasonable choice for doing parameter estimation, while a five-variable model might be suitable for doing prediction.

Output 76.2.10
Selection Summary: PROC REG

The REG Procedure

Model: MODEL1

Dependent Variable: Oxygen

C(p) Selection Method

Model Index    Number in Model       C(p)    R-Square    Variables in Model
          1                  4     4.8800      0.8368    Age RunTime RunPulse MaxPulse
          2                  5     5.1063      0.8480    Age Weight RunTime RunPulse MaxPulse
          3                  5     6.8461      0.8370    Age RunTime RunPulse RestPulse MaxPulse
          4                  3     6.9596      0.8111    Age RunTime RunPulse
          5                  6     7.0000      0.8487    Age Weight RunTime RunPulse RestPulse MaxPulse
          6                  3     7.1350      0.8100    RunTime RunPulse MaxPulse
          7                  4     8.1035      0.8165    Age Weight RunTime RunPulse
          8                  4     8.2056      0.8158    Weight RunTime RunPulse MaxPulse
          9                  4     8.8683      0.8117    Age RunTime RunPulse RestPulse
         10                  4     9.0697      0.8104    RunTime RunPulse RestPulse MaxPulse
         11                  5     9.9348      0.8176    Age Weight RunTime RunPulse RestPulse
         12                  5    10.1685      0.8161    Weight RunTime RunPulse RestPulse MaxPulse
         13                  3    11.6167      0.7817    Age RunTime MaxPulse
         14                  2    12.3894      0.7642    Age RunTime
         15                  2    12.8372      0.7614    RunTime RunPulse
         16                  4    12.9039      0.7862    Age Weight RunTime MaxPulse
         17                  3    13.3453      0.7708    Age Weight RunTime
         18                  4    13.3468      0.7834    Age RunTime RestPulse MaxPulse
         19                  1    13.6988      0.7434    RunTime
         20                  3    13.8974      0.7673    Age RunTime RestPulse
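Several of the C(p) values in Output 76.2.10 can be reproduced from the error sums of squares reported earlier in this example. The following Python sketch assumes the usual form of Mallows' statistic, C(p) = SSE_p / s^2 - (n - 2p), where s^2 is the mean square error of the full model and p counts the intercept. The function name mallows_cp and the dictionary layout are illustrative.

```python
n = 31                      # observations: Corrected Total DF (30) + 1
sse_full = 128.83794        # Error SS of the full six-variable model
df_full = 24                # Error DF of the full model
s2 = sse_full / df_full     # full-model mean square error

def mallows_cp(sse, k):
    """C(p) for a subset model with k regressors plus an intercept."""
    p = k + 1
    return sse / s2 - (n - 2 * p)

# Error SS and number of regressors for models that appear in the outputs.
models = {
    "RunTime": (218.48144, 1),
    "Age RunTime": (200.71581, 2),
    "Age RunTime RunPulse": (160.83069, 3),
    "Age RunTime RunPulse MaxPulse": (138.93002, 4),
    "Age Weight RunTime RunPulse MaxPulse": (129.40845, 5),
    "Age Weight RunTime RunPulse RestPulse MaxPulse": (128.83794, 6),
}

cp = {name: mallows_cp(sse, k) for name, (sse, k) in models.items()}
for name, value in cp.items():
    print(f"{value:8.4f}  {name}")
```

For the full model the statistic reduces to the number of parameters (7), which is why that row always shows C(p) equal to the parameter count. Small differences in the last decimal place can arise because the inputs are rounded.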

Output 76.2.11
C(p) Criterion

Before making a final decision about which model to use, you would want to perform collinearity diagnostics. Note that, because many different models have been fit and the choice of a final model is based on a fit criterion, the statistics are biased and the p-values for the parameter estimates are not valid.