We evaluated various algorithms over the standard 25 benchmark datasets used by Friedman et al. [FGG97]: 23 from the UCI repository [BM00], plus “mofn-3-7-10” and “corral”, which were developed by [KJ97] to study feature selection. We also follow the same 5-fold cross-validation and Train/Test learning schemes (see the table below). As part of data preparation, continuous attributes are always discretized using the supervised entropy-based approach. (You can download the discretized data here.)
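As a rough illustration of the entropy-based step (a sketch only; the actual preprocessing follows the standard supervised method, which applies this split recursively with an MDL stopping criterion), the core computation picks the cut point that minimizes the class-weighted entropy of the two induced intervals:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_entropy_cut(values, labels):
    """Return the cut point on a continuous attribute that minimizes the
    class-weighted entropy of the two induced intervals -- the core step
    of supervised, entropy-based discretization."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:  # only consider boundaries between distinct values
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_cut, best_score = (lo + hi) / 2, score
    return best_cut
```

For example, with values {1, 2, 3} labeled 0 and {10, 11, 12} labeled 1, the cut lands at 6.5, yielding two pure (zero-entropy) intervals.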

Gradient-based learners have to determine when to stop climbing. A naive implementation would climb for a fixed, pre-set number of iterations, or would continue climbing as long as the empirical accuracy is increasing. Our empirical studies (on both ELR and APN) show that these approaches are problematic, as these systems will typically overfit or underfit. To demonstrate this, we present 5-fold cross-validation learning curves from TAN+ELR training on the cleve dataset. For each cross-validation run, we performed 20 iterations over the training data, and plotted the 'Resubstitution Error' and 'Generalization Error' after each gradient iteration (see the graphs below). The 'Generalization Error' is the testing error of the resulting system on the hold-out fold after each training iteration. (That is, we divided the cleve data into 5 folds {F1, F2, F3, F4, F5}; in each iteration of the first cross-validation run, we trained on F1+F2+F3+F4, then evaluated the resulting system against the F5 hold-out testing data to produce the 'Generalization Error'.) Many of the plots show that ELR's gradient ascent starts overfitting significantly after only a few training iterations.
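The shape of these curves can be reproduced on a toy model. The sketch below (our assumption: a one-dimensional logistic model trained by gradient ascent on the conditional log-likelihood, standing in for ELR) records both errors after every iteration:

```python
import math

def train_error_curves(train, test, n_iters=20, lr=0.5):
    """Track resubstitution vs. generalization error per iteration.
    A toy 1-D logistic model, trained by gradient ascent on the
    conditional log-likelihood, stands in for ELR here."""
    w, b = 0.0, 0.0
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    err = lambda data: sum((sig(w * x + b) >= 0.5) != y
                           for x, y in data) / len(data)
    resub, gener = [], []
    for _ in range(n_iters):
        # ascend the conditional log-likelihood gradient
        gw = sum((y - sig(w * x + b)) * x for x, y in train) / len(train)
        gb = sum(y - sig(w * x + b) for x, y in train) / len(train)
        w += lr * gw
        b += lr * gb
        resub.append(err(train))   # error on the training folds
        gener.append(err(test))    # error on the hold-out fold
    return resub, gener
```

Plotting the two returned lists against the iteration number gives curves of the same kind as the graphs discussed here.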

Based on the generalization-error plots, we see that ELR should stop after {2, 1, 1, 4, 5} iterations, respectively, for these 5 cross-validation runs. Of course, ELR cannot know these "optimal iteration numbers", as they are based on the hold-out data, which is NOT available at training time.

Fortunately, ELR estimates these numbers from the available training data, using a standard method we call "cross-tuning", described on pages 9-10 of the manuscript, to identify the number of climbs (iterations) appropriate for each specific dataset. Cross-tuning first splits the training set into n parts (folds), then successively trains on n-1 folds and evaluates on the remaining one. In particular, for each fold, it runs the ELR algorithm on the other n-1 folds for a large number of iterations, and measures the quality of the resulting classifier on the held-out fold. For each run, it determines which iteration produces the smallest generalization error. Cross-tuning then picks the median value m over these runs. Later, when running on the full dataset (all n folds), it will run for m iterations before stopping.
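The procedure just described can be sketched as follows; `train_fn` and `error_fn` are hypothetical hooks standing in for "run ELR for i iterations" and "measure hold-out error":

```python
import statistics

def cross_tune(data, train_fn, error_fn, n_folds=5, max_iters=20):
    """Cross-tuning: estimate a good number of gradient iterations from
    the training data alone.  `train_fn(train, i)` is assumed to return
    the classifier obtained after i iterations, and `error_fn(clf, test)`
    its error on held-out data (both are hypothetical hooks for ELR)."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    best_iters = []
    for k in range(n_folds):
        held_out = folds[k]
        train = [x for j, f in enumerate(folds) if j != k for x in f]
        # error on the held-out fold after each iteration count
        errors = [error_fn(train_fn(train, i), held_out)
                  for i in range(1, max_iters + 1)]
        best_iters.append(errors.index(min(errors)) + 1)
    # median of the per-run optima is the iteration budget
    return int(statistics.median(best_iters))
```

The returned value m is then used as the fixed iteration count when training on the full dataset.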

Paired t-tests of the ELR results on the UCI benchmark datasets show that cross-tuning is essential in ELR learning:

NB+ELR(+xt) <-- NB+ELR(-xt) (p < 0.3) (slightly better),

while

TAN+ELR(+xt) <-- TAN+ELR(-xt) (p < 0.05) (statistically better).

(Recall 'x <-- y' means 'x is better than y'.)

Here NB+ELR(-xt) is comparable to TAN+ELR(-xt), whose performance was significantly degraded by overfitting. This shows that cross-tuning can effectively prevent overfitting, especially when learning the parameters of complex BN structures.
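For reference, the paired t statistic underlying such comparisons can be computed directly from the per-dataset accuracies of the two learners (a generic sketch, not our actual test harness):

```python
import math
import statistics

def paired_t(xs, ys):
    """Paired t statistic for per-dataset accuracies of two learners;
    a positive value favours xs.  The resulting statistic is compared
    against the t distribution with len(xs) - 1 degrees of freedom."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
```

Pairing by dataset matters here: it removes the large between-dataset variation in accuracy, so even modest but consistent per-dataset gains can reach significance.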

The obvious downside of cross-tuning, of course, is its computational expense; see the timing information.

To demonstrate how cross-tuning helps avoid overfitting, we revisit the experiments on the cleve dataset. For the first cross-validation run, we split the training data from folds {F1, F2, F3, F4} into another 5 folds for cross-tuning; call them 1CT = {1CT1, 1CT2, ..., 1CT5}. (Note: F1 + F2 + F3 + F4 = 1CT1 + 1CT2 + ... + 1CT5.) We then ran 5-fold cross-tuning on 1CT, each time using 4 folds of 1CT for training and the remaining 1CT fold for testing, over 20 iterations. Each cross-tuning run determined the iteration number that produced the smallest testing error on the hold-out 1CT fold. After the 5 cross-tuning runs, we took the median of the 5 estimates and used it as the iteration number when training on the full 1CT set.

For this first cross-validation run, this produced an estimate of 2, which we see (from the "cleve fold 1/5" graph below) is correct. We similarly computed this quantity for the other four cross-validation scenarios, producing {2, 1, 1, 3, 5} respectively for the 5 cross-validation runs. Notice cross-tuning identified the correct stopping number in 4 of the 5 cross-validation runs. The only exception is the fourth one, where it returned 3, not 4.

This page summarizes all the results of the complete-data experiments from the various papers. In short, we found that x+ELR performed comparably to C4.5 and SNB. The following table summarizes our results comparing ELR with SVM-light. (Note that we only ran over the datasets with BINARY class labels.) This page presents further details on these SVM experiments.

Each cell shows accuracy (%) followed by its standard deviation:

Data set      NB+ELR          TAN+ELR         GBN+ELR         svm-light c0.05 t1 d2 *   svm-light best value
australian     84.93 ± 1.06    84.93 ± 1.03    86.81 ± 1.11    70.29 ± 9.11              77.10 ± 2.88
breast         96.32 ± 0.66    96.32 ± 0.70    95.74 ± 0.43    93.97 ± 1.21              96.62 ± 1.23
chess          95.40 ± 0.64    97.19 ± 0.51    90.06 ± 0.92    97.65 ± 0.00              98.97 ± 0.00
cleve          81.36 ± 2.46    81.36 ± 1.78    82.03 ± 1.83    72.54 ± 4.39              80.34 ± 3.08
corral         86.40 ± 3.25   100.00 ± 0.00   100.00 ± 0.00    96.80 ± 5.22             100.00 ± 0.00
crx            86.46 ± 1.85    86.15 ± 1.70    85.69 ± 1.30    70.15 ± 8.34              70.31 ± 6.43
diabetes       75.16 ± 1.39    73.33 ± 1.97    76.34 ± 1.30    69.28 ± 5.77              76.34 ± 3.50
flare          82.82 ± 1.35    83.10 ± 1.29    82.63 ± 1.28    82.06 ± 3.81              82.91 ± 3.13
german         74.60 ± 0.58    73.50 ± 0.84    73.70 ± 0.68    66.20 ± 1.75              68.70 ± 5.75
glass2         81.88 ± 3.62    80.00 ± 3.90    78.75 ± 3.34    79.37 ± 8.45              79.37 ± 8.45
heart          78.89 ± 4.08    78.15 ± 3.86    78.89 ± 4.17    76.67 ± 2.81              83.33 ± 3.21
hepatitis      86.25 ± 5.38    85.00 ± 5.08    90.00 ± 4.24    86.25 ± 5.23              86.25 ± 5.23
mofn-3-7-10   100.00 ± 0.00   100.00 ± 0.00   100.00 ± 0.00   100.00 ± 0.00             100.00 ± 0.00
pima           75.16 ± 2.48    74.38 ± 2.58    74.25 ± 2.53    70.59 ± 4.03              75.95 ± 2.03
vote           95.86 ± 0.78    95.40 ± 0.63    95.86 ± 0.78    93.10 ± 1.15              95.17 ± 1.50
average        85.43           85.92           86.05           81.66                     84.76

* We tried many settings, and found that this specific setting, [c=0.05, polynomial kernel of degree 2 (t=1, d=2)], produced the best average for SVM. (As this selection is based on ALL the data, it does give svm-light a slight advantage.)

Finally, our companion paper [GGS97] also considers learning the parameters of a given structure towards optimizing performance on a distribution of queries. Our results here differ, as we are considering a different learning model: [GGS97] tries to minimize the squared-error score, a variant of Equation 9, that is based on two different types of samples --- one over tuples, to estimate P(C | E), and the other over queries, to estimate the probability of seeing each ``What is P(C | E = e)?'' query. By contrast, the current paper tries to minimize classification error (Equation 3) by seeking the optimal ``conditional likelihood'' score (Equation 4), with respect to a single sample of labeled instances. Moreover, our current paper includes new theoretical results, a different algorithm, and completely new empirical data.

PROOFS of the claims (Article 1 above simply stated the claims; n.b., the proofs are non-trivial)

many more EXPERIMENTS:
a complete series of results on G<T, G~T, G>T (which is completely new), for both complete and incomplete data;
comparisons not only across different BN structures (NB/TAN/GBN), but also with other parameter-learning algorithms (OFE/EM/APN, etc.)