from UCI [14] (Waveform, Grey-Landsat, Letter-Two, Letter-Two with added noise, Spam, Musk) and the P2 synthetic data set. We achieved a characterization of the bias–variance decomposition of the error in bagged and random aggregated ensembles that resembles the one obtained for single SVMs [5] (Fig. 1). For more
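The decomposition referenced above can be estimated empirically. Below is a minimal sketch for 0-1 loss in the Domingos style, where the "main prediction" is the majority vote across models trained on different bootstrap samples; the function names and the trivial learner in the usage note are illustrative, not the authors' code:

```python
from collections import Counter

def bias_variance_01(train_sets, test_set, fit, predict):
    """Estimate bias and variance under 0-1 loss.
    fit(train) -> model; predict(model, x) -> label."""
    # Train one model per (bootstrap) training set.
    models = [fit(t) for t in train_sets]
    bias_sum = 0
    var_sum = 0.0
    for x, y in test_set:
        preds = [predict(m, x) for m in models]
        # Main prediction: most frequent label across models.
        main = Counter(preds).most_common(1)[0][0]
        bias_sum += int(main != y)  # bias: main prediction is wrong
        # Variance: mean disagreement with the main prediction.
        var_sum += sum(p != main for p in preds) / len(preds)
    n = len(test_set)
    return bias_sum / n, var_sum / n
```

For example, with a stub learner that always predicts the majority class of its training sample, three bootstrap models voting [1, 0, 1] on a point already exhibit nonzero variance even when the main prediction is unbiased.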

with 4601 instances and 57 continuous attributes. 4.1.1.6 Musk The dataset (available from UCI) describes a set of 102 molecules, of which 39 are judged by human experts to be musks and the remaining 63 to be non-musks. The 166 features that describe

bags, and the number of instances contained in each bag ranges from 1 to 1,044. Detailed information on the Musk data is tabulated in Table 2. Ten-fold cross-validation is performed on each Musk data set. In each fold, Bagging is employed to build an ensemble for each of the four base multi-instance learners, i.e., Iterated-discrim APR, Diverse Density, Citation-kNN, and EM-DD. Each ensemble
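The bagging step described here differs from standard bagging only in that the bootstrap samples whole bags rather than individual instances. A minimal sketch, where `base_fit` and `base_predict` stand in for any of the four base multi-instance learners (hypothetical names, not the paper's implementation):

```python
import random

def bag_ensemble(bags, labels, base_fit, n_estimators=10, seed=0):
    """Bagging for multi-instance learners: each bootstrap replicate
    draws whole bags with replacement, keeping bag structure intact."""
    rng = random.Random(seed)
    n = len(bags)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap over bags
        models.append(base_fit([bags[i] for i in idx],
                               [labels[i] for i in idx]))
    return models

def ensemble_predict(models, base_predict, bag):
    """Majority vote over the ensemble (ties broken toward positive)."""
    votes = sum(base_predict(m, bag) for m in models)
    return int(2 * votes >= len(models))
```

Sampling at the bag level is the natural choice for multi-instance data: resampling individual instances would break the bag/label correspondence that the base learners rely on.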

maintain consistency with reported results (Quinlan, 1996). For Satimage, we used the original division into a training and test set, so the results represent one run of each algorithm. For the Musk dataset, which has 166 features, FSS and BSS took too long to run (over 24 hours for a single trial), so no results were obtained. 3.2 ACCURACY The accuracy and parameter selection results (average k or

to an interval test in the discrete domain. The three approaches have been used and compared in our experiments. 3.2 Experimental Evaluation We evaluate the effect of discretization on two datasets: the Musk dataset (available at the UCI repository [11]) and the Diterpene dataset, generously provided to us by Steffen Schulze-Kremer and Saso Dzeroski. Both datasets contain nondeterminate

Dougherty's [DKS95] work, and as such is capable of handling numerical data. For more details see [VLDDR96, BDR97b]. 5 Experimental Evaluation Experiments have been performed on several benchmark datasets: Mutagenesis [SMSK96], Musk [DLLP97, MM96], and Diterpenes [DSKH+96]. For all the experiments, Tilde's default parameters were used; only the choice of the number of thresholds for discretization

and relational learning. 7 Discussion The most serious problem hindering the advance of multi-instance learning is that there is only one widely used real-world benchmark, i.e., the Musk data sets. Although some application data have been used in a few works, they can hardly serve as benchmarks, for several reasons. For example, the COREL image database has been used by Maron and Ratan [18], Yang

BP-MIP network is used in prediction, a bag is positively labeled if and only if the output of the network on at least one of its instances is not less than 0.5. 5. Experiments 5.1 Real-world data sets The Musk data is the only real-world benchmark test data for multi-instance learning at present. The data were generated by Dietterich et al. in the manner described in Section 2. There are two data
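The prediction rule just stated is compact enough to express directly. A minimal sketch, where `net` is any trained scoring function mapping an instance to an output in [0, 1] (the name is illustrative, not BP-MIP's actual interface):

```python
def label_bag(net, bag, threshold=0.5):
    """BP-MIP decision rule as described in the text: a bag is labeled
    positive iff the network output on at least one of its instances
    is not less than the threshold."""
    return int(max(net(x) for x in bag) >= threshold)
```

Taking the maximum over instance outputs encodes the standard multi-instance assumption: one sufficiently positive instance makes the whole bag positive.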