It seems like Orange is the perfect package to construct such a model, but I'm currently inexperienced. The problem I'm trying to solve is outlined here (the first project). Does this sound like the right approach?

The idea is that you can't draw good statistical inferences from a single decision tree. Random Forests builds many decision trees, each on a different bootstrap sample of the data, and combines their predictions by majority vote. It does that lots of times, thousands if you want, and the estimates get more stable the more trees it grows. It can then tell you which independent variables are the most valuable predictors of your dependent variable. In Knoware's case, you don't know what the problem is or how to fix it, so the problem is the dependent variable. It is also a means to evaluate groups, so you can see whether a certain kind of problem clusters with other factors.
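To make the two core ideas concrete, here is a minimal sketch in plain Python (not Orange code): drawing a bootstrap sample of the training data, and combining individual tree predictions by majority vote. The data and function names here are invented for illustration.

```python
# Illustrative sketch only: the two core mechanisms of Random Forests are
# (1) training each tree on a bootstrap sample of the data, and
# (2) combining the trees' predictions by majority vote.
import random
from collections import Counter

def bootstrap_sample(examples, rng):
    """Draw len(examples) examples with replacement."""
    return [rng.choice(examples) for _ in examples]

def majority_vote(predictions):
    """Return the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
data = [("sunny", "yes"), ("rainy", "no"), ("sunny", "yes"), ("rainy", "no")]
sample = bootstrap_sample(data, rng)   # same size as data, drawn with replacement
print(majority_vote(["yes", "no", "yes"]))  # yes
```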

Here is the code for Random Forests I am about to integrate into the orngEnsemble module (once I finish the documentation :-); thanks also to Janez and Minca, from whom I took the code they used for RF-based feature subset selection):

def __call__(self, examples, weight=0):
    # if the number of attributes for the subset is not set, use the square root
    if hasattr(self.learner.split, 'attributes') and not self.learner.split.attributes:
        self.learner.split.attributes = int(sqrt(len(examples.domain.attributes)))

This is a Random Forest algorithm that closely follows Breiman's proposal (Machine Learning, 2001). It constructs a number of classification trees, using the Gini index for attribute scoring, where the attributes considered at each node are chosen from a random attribute subset. Each tree is grown from a bootstrap sample of the training data.
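For readers unfamiliar with the scoring measure mentioned above, here is a plain-Python sketch of the Gini index (not Orange's implementation): 1 minus the sum of squared class proportions, so lower values mean purer nodes.

```python
# Sketch of the Gini index used for attribute scoring:
# gini = 1 - sum(p_i ** 2) over the class proportions; lower is purer.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["yes", "yes", "yes"]))  # 0.0  (pure node)
print(gini(["yes", "no"]))          # 0.5  (maximally mixed, two classes)
```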

The defaults are 100 trees, where for each node an attribute is chosen from a randomly picked subset whose size equals the square root of the number of attributes in the training data. The above code is general enough that you can change all of this, including the tree learner you wish to use for growing the forest.
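The square-root default can be sketched on its own; the helper below is hypothetical and just mirrors the rule described above (an explicitly requested size wins, otherwise fall back to the square root of the attribute count, as in the `__call__` snippet earlier).

```python
# Sketch of the default subset-size rule: if no size is requested,
# use the integer square root of the number of attributes.
from math import sqrt

def default_subset_size(n_attributes, requested=None):
    return requested if requested else int(sqrt(n_attributes))

print(default_subset_size(25))      # 5
print(default_subset_size(25, 10))  # 10
```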

The principal trick in the code is SplitConstructor_AttributeSubset, which replaces the splitting function used in Orange's tree induction. The trick is that this function behaves the same as the original, C-implemented one, except that it uses a subset instead of the complete set of attributes. This is also a nice example of how a component-based architecture can help in prototyping new methods (prototype a component in Python and then use it with an algorithm that was developed in C).
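The component-replacement idea can be illustrated without Orange at all. In this hedged sketch (all names invented), `best_split` stands in for the C-implemented split constructor, and the Python wrapper keeps the same contract while restricting it to a random attribute subset:

```python
# Sketch of the component idea: a split "constructor" that delegates to an
# existing scoring routine, but only lets it see a random subset of attributes.
import random

def best_split(attributes, score):
    """Stand-in for the original split constructor: pick the highest-scoring attribute."""
    return max(attributes, key=score)

def subset_split(attributes, score, subset_size, rng):
    """The wrapper component: same interface, restricted to a random subset."""
    subset = rng.sample(attributes, subset_size)
    return best_split(subset, score)

rng = random.Random(0)
attrs = ["a", "b", "c", "d"]
score = {"a": 0.1, "b": 0.4, "c": 0.3, "d": 0.2}.get
print(best_split(attrs, score))                   # b (best over all attributes)
print(subset_split(attrs, score, 2, rng) in attrs)  # True (best within the subset)
```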

When presented with an example, the forest's trees use simple voting (all classifiers being equal). When predicting class probabilities, the above code uses the average probability predicted by the set of classifiers (see also the note in RandomForestClassifier).
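The probability-averaging step amounts to this (plain-Python sketch, not the Orange code): each classifier returns a class-to-probability mapping, and the forest averages them element-wise.

```python
# Sketch of averaging class probabilities across an ensemble of classifiers.
def average_probabilities(distributions):
    classes = distributions[0].keys()
    n = len(distributions)
    return {c: sum(d[c] for d in distributions) / n for c in classes}

dists = [{"yes": 0.9, "no": 0.1},
         {"yes": 0.5, "no": 0.5},
         {"yes": 0.7, "no": 0.3}]
print(average_probabilities(dists))  # roughly {'yes': 0.7, 'no': 0.3}
```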

Here is the code that tests the above and compares RF to a single tree on a single data set: