Figure 8.

Performance of the machine learning method trained on different-sized sets of data
from SAAPdb. In each case, a balanced dataset of the required size was extracted at random from
the SAAPdb dataset of mutations mapped to protein chains (Table 2) and random forests
were trained and tested using 10-fold cross-validation. Performance drops as the
dataset size decreases, with a marked drop for datasets below 10,000 samples in size
(5,000 SNPs and 5,000 PDs).
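
The sampling and cross-validation procedure described in this caption can be sketched as follows. This is a minimal illustration only, not the code used to produce the figure: the function names `balanced_sample` and `k_fold_indices`, the 0/1 label encoding and the fixed seed are assumptions introduced here, and the random forest training step is deliberately left out.

```python
import random


def balanced_sample(snps, pds, total_size, seed=0):
    """Draw total_size items at random, half SNPs and half PDs (hypothetical helper)."""
    if total_size % 2:
        raise ValueError("total_size must be even for a balanced set")
    half = total_size // 2
    rng = random.Random(seed)
    # sample() draws without replacement and raises if a class is too small.
    # Labels are an assumed encoding: 0 = SNP, 1 = PD.
    sample = [(x, 0) for x in rng.sample(snps, half)]
    sample += [(x, 1) for x in rng.sample(pds, half)]
    rng.shuffle(sample)
    return sample


def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k contiguous folds over n items."""
    # Spread any remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

For each dataset size on the x-axis, one would call `balanced_sample` once and then train and test a classifier on each of the ten `(train, test)` splits, averaging the resulting performance measure. Because the sample is shuffled before splitting, the contiguous folds are random with respect to class, although they are not strictly stratified.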