Tuesday, November 13, 2012

1. For the first release of a new project, how can we learn quality prediction models from cross-project data?

Burak proposed using the NN-filter to support cross-project defect prediction and reported promising results (JASE 2010). Fayola is now working on this as well and doing much better (mostly on small test sets). There is much here worth investigating further.
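For concreteness, here is a minimal sketch of NN-filter-style training-data selection in Python with scikit-learn. The choices of k=10, Euclidean distance, and Naive Bayes as the learner are illustrative assumptions, not necessarily Burak's exact setup:

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.naive_bayes import GaussianNB

def nn_filter_fit(X_cross, y_cross, X_test, k=10):
    # For each test instance, select its k nearest cross-project instances
    # (Euclidean distance over numeric feature matrices).
    nn = NearestNeighbors(n_neighbors=k).fit(X_cross)
    _, idx = nn.kneighbors(X_test)
    keep = np.unique(idx.ravel())  # union of selected neighbors, deduplicated
    # Train the defect predictor only on the filtered cross-project subset.
    return GaussianNB().fit(X_cross[keep], y_cross[keep])

The point of the filter is that only cross-project data that resembles the test data is allowed to train the model.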

1.1 Can we do better on cross-learning performance?

-- Are there more effective ways to understand the training and test data, like TEAK?

1) At the time dimension: for multi-version projects, historical data from early releases may not be applicable to a new release.

2) At the space dimension: because of data scarcity, in some cases we must do cross-project defect prediction.

We see transfer learning as a potential solution to the data-shift problem in defect prediction, because of its capability to learn and predict when the training and test distributions differ. Our ongoing experiments support this argument.
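As one concrete illustration of learning under differing training and test distributions, here is a sketch of a standard transfer tactic, importance weighting with a domain classifier. This is not the specific method in our experiments; it only shows the family of techniques we have in mind:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

def shift_aware_fit(X_train, y_train, X_test):
    # 1. Train a domain classifier to distinguish training rows from test rows.
    X_dom = np.vstack([X_train, X_test])
    y_dom = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
    dom = LogisticRegression(max_iter=1000).fit(X_dom, y_dom)
    # 2. Weight each training row by the odds it "looks like" test data:
    #    w(x) = P(test | x) / P(train | x), the usual covariate-shift correction.
    p = dom.predict_proba(X_train)[:, 1]
    w = p / np.clip(1.0 - p, 1e-6, None)
    # 3. Fit the defect predictor on the re-weighted training data.
    return GaussianNB().fit(X_train, y_train, sample_weight=w)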

First, we observe serious performance reductions in cross-release and cross-project defect prediction (i.e., along the time and space dimensions).

Wednesday, November 7, 2012

My method produces disjoint clusters of varying dimensionality and does not suffer from the exhaustive subspace search problem of bottom-up approaches. It first finds the densest, correlated attributes and then searches the points for patterns (clusters).

As with most subspace and projected clustering algorithms, the clustering is done in cycles. In each cycle, FP Growth is used to find a candidate attribute subset, then EM clustering is performed over those attributes. EM produces multiple clusters, which are tested by classification learners. Good clusters are labeled and removed from the data set, creating disjoint instance clusters. The null and bad clusters remain in the data set for further cycles of clustering. All attributes remain available to the FP Growth step, so an attribute may appear again in later clusters.
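One cycle of the method might look roughly like the following Python sketch. The above/below-median binarization, mlxtend's fpgrowth, GaussianMixture as the EM step, and a cross-validated decision tree as the cluster-quality test are all stand-in assumptions, not a finished implementation:

import numpy as np
from mlxtend.frequent_patterns import fpgrowth
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def one_cycle(data, n_components=3, min_support=0.4, min_acc=0.8):
    # 1. FP Growth over binarized attributes -> a candidate attribute subset.
    binary = data.gt(data.median())                   # above/below-median items
    itemsets = fpgrowth(binary, min_support=min_support, use_colnames=True)
    if itemsets.empty:
        return data, []                               # stopping criterion
    attrs = list(max(itemsets["itemsets"], key=len))  # largest frequent subset
    # 2. EM clustering restricted to the candidate attributes.
    labels = GaussianMixture(n_components=n_components,
                             random_state=0).fit_predict(data[attrs])
    # 3. A cluster is "good" if a learner can separate it from the rest.
    good = []
    for c in range(n_components):
        member = labels == c
        if member.sum() < 10 or member.sum() > len(data) - 10:
            continue                                  # skip degenerate splits
        acc = cross_val_score(DecisionTreeClassifier(), data[attrs],
                              member.astype(int), cv=3).mean()
        if acc >= min_acc:
            good.append(member)
    # 4. Good clusters are labeled and removed (disjoint instance clusters);
    #    the rest stays in the pool for the next cycle.
    clusters = [data[m] for m in good]
    if good:
        data = data[~np.logical_or.reduce(good)]
    return data, clusters

Here `data` is a pandas DataFrame of numeric attributes; repeating this cycle until no frequent itemset or good cluster remains yields the disjoint clusters described above.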

This method requires several parameters: for FP Growth, for EM clustering, and for the minimum cluster-quality test and stopping criteria. I believe it will be significantly less sensitive to these parameter values than current methods. I also believe it will be more computationally efficient than existing techniques, since FP Growth finds candidate subspaces without a combinatorial search, and removing clustered instances shrinks the data set as the cycles proceed.

Literature surveys compare the performance of methods over varying data set sizes. Moise09 varies the attribute and instance counts to create several data sets, but does not test data sets where the attribute and instance counts are roughly equal. My method was designed for this case, which I call the n x m problem. Other subspace and projected clustering algorithms degrade as the number of attributes grows, whereas my method is designed to scale.

From Advances in NIPS, volume 17, MIT Press, 2005: "There is a consensus in the high-dimensional data analysis community that the only reason any methods work in very high dimensions is that, in fact, the data are not truly high-dimensional. Rather, they are embedded in a high-dimensional space, but can be efficiently summarized in a space of a much lower dimension, such as a nonlinear manifold. Then one can reduce dimension without losing much information for many types of real-life high-dimensional data, such as images, and avoid many of the 'curses of dimensionality'."