10/16/14 - Summary results, 3 frontiers, and F-measure rankings.

The rig used for these results is different (simplified) in the following ways:

- RF only
- One parameter (maxDepth) swept from 0 to 16
- All other parameters fixed
- Three ND frontiers are returned from tuning instead of one
- Due to the multiple frontiers, pD/pF AUC is out; F-measure is used as the ranking field instead
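To make the "ND frontiers" step concrete, here is a minimal sketch of extracting one non-dominated frontier from a set of (pD, pF) results, where higher pD and lower pF are better. The function name and the toy result values are illustrative, not from the actual rig:

```python
def nondominated(points):
    """Return the (pD, pF) points not dominated by any other point.

    A point q dominates p when q's pD >= p's pD and q's pF <= p's pF
    (and q != p), i.e. q detects at least as much with no more false alarms.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)

# Toy (pD, pF) results from a maxDepth sweep; values are made up
results = [(0.90, 0.30), (0.80, 0.20), (0.95, 0.50), (0.70, 0.40)]
print(nondominated(results))  # (0.70, 0.40) is dominated by (0.80, 0.20)
```

The actual rig keeps three such frontiers from tuning rather than one, which is why a single pD/pF AUC no longer applies and F-measure is used for ranking.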

New Results Format

In this format, the top item is a comparison of the previous and current version dataset stats. These are the usual suspects, plus "overlapping instances" (where the software module name is the same in both versions) and "identical instances" (where the software module's metrics are unchanged from one version to the next).
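A minimal sketch of computing those two counts, assuming each version is represented as a mapping from module name to a tuple of metric values (the function name and toy data are hypothetical):

```python
def version_overlap(prev, curr):
    """prev/curr: dict mapping module name -> tuple of metric values.

    Returns (overlapping, identical): the modules present in both
    versions, and the subset whose metrics are unchanged.
    """
    overlapping = set(prev) & set(curr)
    identical = {m for m in overlapping if prev[m] == curr[m]}
    return overlapping, identical

# Toy example: "a.c" is identical, "b.c" overlaps but changed
prev = {"a.c": (10, 2), "b.c": (5, 1), "c.c": (7, 0)}
curr = {"a.c": (10, 2), "b.c": (6, 1), "d.c": (3, 0)}
ov, ident = version_overlap(prev, curr)
print(len(ov), len(ident))  # 2 1
```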

The rest of the chart shows the results of parameter tuning on both the previous and current versions. There is a table for each learner that lists all of its explored parameter values and the frequency with which each was selected by the grid search in the previous and current versions. For example, a parameter value of "False" appearing in 90% of the selected combinations in the previous version and 43% of the current version selections would be represented as "False: (90/43)".
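The "(90/43)" notation can be sketched as a small frequency-formatting step over the values chosen by the grid search; the function name and input lists here are assumptions for illustration:

```python
from collections import Counter

def selection_freqs(selected_prev, selected_curr):
    """Format each value's selection frequency as 'value: (prev%/curr%)'.

    selected_prev/selected_curr: the value this parameter took in each
    selected combination, for the previous and current versions.
    """
    n_prev, n_curr = len(selected_prev), len(selected_curr)
    prev_counts, curr_counts = Counter(selected_prev), Counter(selected_curr)
    table = {}
    for value in set(selected_prev) | set(selected_curr):
        p = round(100 * prev_counts[value] / n_prev)
        c = round(100 * curr_counts[value] / n_curr)
        table[value] = f"{value}: ({p}/{c})"
    return table

# 9 of 10 prev selections and 3 of 7 current selections chose "False"
freqs = selection_freqs(["False"] * 9 + ["True"],
                        ["False"] * 3 + ["True"] * 4)
print(freqs["False"])  # False: (90/43)
```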

It also shows the pD/pF performance of each learner's non-dominated tunings applied in-version and out-of-version. In this case we have four combinations:

tune on prev -> apply in-version

tune on prev -> apply out-of-version (current)

tune on current -> apply in-version

tune on current -> apply out-of-version (prev)
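The four combinations above amount to a small cross-version loop. A sketch, assuming `tune()` returns a frontier of tuned settings and `evaluate()` returns (pD, pF) for one setting applied to a dataset (both are placeholders for whatever the real rig provides):

```python
def cross_version_eval(tune, evaluate, prev, curr):
    """Run all four tune/apply combinations.

    tune(dataset)        -> list of tuned settings (the ND frontier)
    evaluate(s, dataset) -> (pD, pF) for setting s on that dataset
    """
    runs = {
        "tune prev -> apply prev (in-version)":     (prev, prev),
        "tune prev -> apply curr (out-of-version)": (prev, curr),
        "tune curr -> apply curr (in-version)":     (curr, curr),
        "tune curr -> apply prev (out-of-version)": (curr, prev),
    }
    return {name: [evaluate(s, target) for s in tune(source)]
            for name, (source, target) in runs.items()}

# Toy stand-ins just to show the shape of the output: two tuned
# maxDepth values, and a fake evaluate() that depends on both inputs
tune = lambda ds: [4, 8]
evaluate = lambda depth, ds: (0.1 * depth, 0.01 * depth * len(ds))
results = cross_version_eval(tune, evaluate, prev=[1, 2, 3], curr=[1, 2])
```

Each of the four runs produces one (pD, pF) point per frontier member, which is what the chart plots.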

In this case, the major effect we see is that the green sticks with the blue and the red sticks with the purple. This scenario arises when one dataset is harder to perform well on than the other. Beyond that, in-version and out-of-version performance look pretty comparable. There are occasional exceptions, but no real trend toward either in-version or out-of-version doing better.