I am new to data mining and found Orange Canvas while searching the internet for data-mining tools. I find Orange Canvas to be an excellent data-mining platform for beginners. Keep up the good work!!

The problem is that I've started playing with Orange Canvas, trying to classify some classic datasets (e.g. the Iris dataset). I used a Classification Tree algorithm to classify the dataset, and everything works fine up to that point. Then I wanted to view the classification tree I had just built, so I tried to add a "Classification Tree Viewer" after the Classification Tree. The problem is that I can't add the "Classification Tree Viewer": when I click on the Classification Tree Viewer tab, instead of the viewer appearing on the canvas, a new window opens with the following error:

Classification Tree Viewer 2D, which has an unfortunate name that will change to Classification Tree Graph, is a graphical presentation of the classification tree, whereas Classification Tree Viewer gives a list-based presentation (something like the file explorer in Windows). I could not replicate your error; it may be a bug that was recently fixed, so you may consider downloading the latest snapshot for MS Windows (http://www.ailab.si/orange/download/orange-snapshot.exe).

We have just started writing the documentation for widgets. Most of it is still empty, but as it turns out, the page for Classification Tree Graph (that is, Classification Tree Viewer 2D) is there already: http://www.ailab.si/orange/doc/widgets/catalog/Classify/ClassificationTreeGraph.htm

I just downloaded the latest Orange snapshot available on the website and it's still not working. It might be because I don't have administrator privileges on the computer where I installed Orange. I remember an error while installing Python, but the installation finished successfully.

I distinctly remember committing the file to the CVS, but it seems I remembered wrong. I have done it now and ran the script that builds the snapshot. You can download it and try; just make sure that the browser doesn't give you the cached file.

Administrative privileges are needed for some Python stuff, but it shouldn't matter to Orange.

Regarding C4.5: I agree that this is really annoying, but we simply got no response when we asked the Univ of New South Wales for permission to distribute the binary files. Send me an email at janez [dot ]demsar /AT/ fri.uni-lj.si and I'll send you the file.

You guys just rock!! Thank you very much again. Everything is working perfectly (including Interactive Trees & Classification Tree Viewer).

It took me quite some time to compile C45.dll, but I finally managed to do it.

I don't understand why the Univ of New South Wales isn't willing to share the binaries with you; it might be because RuleQuest is now commercializing C5.0. (To be honest, and this is just my personal opinion, I don't see much difference between C4.5 and C5.0 apart from the processing speed.)

Anyway, thanks again and keep the good work going.

Finally, one last question (and I know it's really stupid). I've been able to classify the Iris dataset using all the algorithms available in the Classify tab, and it's working just fine. Now I want to classify a new dataset using the model I built. Should I use the Classifications widget in the Evaluate tab to do automatic classification on a new set?

Feed the data from a File widget to one or more learners and connect all learners to the "Classifications" widget, so Classifications gets a bunch of models. Now take a new File widget, load some new data and feed it to Classifications as well. Classifications will then show you the predictions of all models for the examples from the second File widget.
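The learner-to-model-to-predictions flow that these widgets express can be sketched in plain Python. This is not Orange's actual API; it is a toy 1-nearest-neighbour classifier used only to illustrate training on one dataset and classifying a second, unlabeled one:

```python
# Minimal sketch of the "learner -> model -> predictions" flow the
# widgets implement. NOT Orange's API: a toy 1-nearest-neighbour
# classifier, just to show the two data inputs kept separate.

def train_1nn(rows, labels):
    """'Learning' for 1-NN is just remembering the training data."""
    return list(zip(rows, labels))

def predict_1nn(model, row):
    """Classify a new row by the label of its closest training row."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda m: dist(m[0], row))[1]

# "File widget" 1: training data (two attributes plus a class).
train_rows = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (7.9, 8.2)]
train_labels = ["small", "small", "large", "large"]

model = train_1nn(train_rows, train_labels)

# "File widget" 2: new, unclassified examples.
new_rows = [(1.1, 1.0), (8.1, 7.9)]
predictions = [predict_1nn(model, r) for r in new_rows]
print(predictions)  # -> ['small', 'large']
```

The point is only the separation of roles: one input supplies the data the model is built from, the other supplies the examples to be classified.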

I connected a File widget to a Data Sampler widget, then connected the Data Sampler widget to various learners (Naive Bayes, Classification Trees, SVM, CN2 etc.). Finally, I connected the output of all learner widgets to the Classifications widget. For this try, I connected the File widget output to the Classifications widget as well.

For this case I used the Iris dataset. The classification went well, and the Classifications widget shows me the true class along with the predicted class from each learner. Just perfect!

However, I've a few questions regarding the test file.

1) I understand that the test file needs to have the same columns as the training file. So does the test file need to have exactly the same characteristics as the training file?

2) Secondly, what about the first three rows (first line: attribute names; second line: attribute types; third line: flags)? Do I need to have these three lines in the test file as well?

3) Now this is stupid (I hope you'll forgive me), but can I leave the target column out of the test file? For the Iris test file, I would then just have a file with four columns and no target column.

4) Finally, can I make an XML file of the model? (Just wondering.)

I hope you can help me out here. And I apologize again for these really stupid questions; I just want to clarify a few ideas.
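For reference, the three-line header mentioned in question 2 looks like this in a tab-delimited Iris file (a sketch only; the exact type and flag keywords accepted can vary between Orange versions, and columns are separated by tabs):

```
sepal length	sepal width	petal length	petal width	iris
continuous	continuous	continuous	continuous	discrete
				class
5.1	3.5	1.4	0.2	Iris-setosa
4.9	3.0	1.4	0.2	Iris-setosa
```

The first row names the attributes, the second gives their types, and the third carries flags: here empty for the four measurements and "class" marking the target column.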

There is a workaround. Say you have a dataset with a complete set of attributes and another file in which some of them are missing. Open the first dataset in the File widget and the second in the "Extended File Widget" (it's right next to the File widget in the toolbar). If you connect the two widgets, the second one will use the same attributes as the first, as long as they have the same name and type.
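The "match attributes by name" behaviour that the Extended File Widget relies on can be sketched in plain Python. This is a hypothetical helper, not the widget's actual code; `?` is used here as Orange's conventional marker for an unknown value:

```python
# Sketch: reorder a second dataset's columns to match a reference
# schema by attribute name, filling attributes that are missing from
# the second file with '?' (unknown). Hypothetical helper, not the
# Extended File Widget's real implementation.

def align_to_schema(schema, header, rows):
    """Reorder columns of (header, rows) to match `schema`;
    attributes absent from `header` become unknown ('?')."""
    index = {name: i for i, name in enumerate(header)}
    return [[row[index[name]] if name in index else "?"
             for name in schema]
            for row in rows]

schema = ["sepal length", "sepal width", "petal length", "petal width"]
header = ["petal width", "sepal length"]        # partial test file
rows = [["0.2", "5.1"], ["1.8", "6.3"]]

print(align_to_schema(schema, header, rows))
# -> [['5.1', '?', '?', '0.2'], ['6.3', '?', '?', '1.8']]
```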

But I had never tried what happens in Classifications if the class is missing from the test dataset. Now I have, and Classifications throws an exception. I've written the bug down, so we'll fix it some day.

There was a widget for saving a naive Bayes model in XML, but we temporarily removed it since it was useless due to certain canvas limitations. We have nothing of the kind for any other model, but we shall surely add this in the future - soon, I hope.

Thank you very much for such a detailed reply. Needless to say, I really appreciate it.

I guess that means I can't use Orange Canvas for prediction purposes once a model has been created on the training set. I was thinking that I could put random classes in the unseen test file and let Orange do the classification.

Thank you very much for your reply. I totally agree with you about neural networks; for me they are black boxes, and one has to really trust them to use them (and at the same time they aren't very intuitive).

One small question: I am trying to use the CN2 algorithm, and unfortunately the catalog entry for this widget isn't ready yet. That's why I am bothering you yet again with another question. For getting to know Orange, I've chosen a slightly difficult dataset with 30 variables and 1200 examples. I just plugged in the CN2 widget and it gave me pretty good results. But when I opened the Properties window of the CN2 widget, I discovered that the max rule length is set to zero by default! Yet in the CN2 results I see rule lengths of 4. How is that possible if the max rule length is set to zero?

And finally, to do a very exhaustive search of rules using all my 30 variables, do you agree that I should set the rule length very high (say, 99) and at the same time reduce the beam width (say, to 1 or 2)? Any advice would be highly appreciated, as I would really like to know what settings I need to get a very exhaustive rule search.

I am responsible for the CN2 widget and algorithm, so it might be best that I answer you. If the max rule length is set to zero, this means that the maximum allowed length of a rule is infinite. I agree that it is a bit confusing, so I shall add a checkbox for specifying whether the rule length should be restricted or not.

Regarding your second question (about the most exhaustive search), there are several possibilities. Setting the minimum coverage (the minimum number of examples that a rule must cover) and the max rule length defines how long the learner keeps adding conditions; the most exhaustive variant is thus setting both to zero. The second option is to use a higher beam width - it will try more rules at each rule length. And the last option is to select weighted covering, which only partially removes covered examples and allows the algorithm to learn more rules describing the examples with different patterns.
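How the beam width and the "zero means unlimited" rule-length convention interact can be sketched with a toy beam search for a single rule. This is emphatically not Orange's CN2 code: conditions are simple (attribute index, value) equality tests, and rule quality is just the fraction of covered examples in the target class:

```python
# Toy sketch of CN2-style beam search for ONE rule - not the real
# Orange CN2. A condition is an (attribute_index, value) test; a rule
# is a list of conditions. Matching the widget's convention,
# max_rule_length == 0 is treated as "unlimited".

def covers(rule, example):
    return all(example[a] == v for a, v in rule)

def quality(rule, examples, labels, target):
    """Fraction of covered examples belonging to the target class."""
    covered = [lab for ex, lab in zip(examples, labels) if covers(rule, ex)]
    return sum(lab == target for lab in covered) / len(covered) if covered else 0.0

def find_rule(examples, labels, target, beam_width=2, max_rule_length=0):
    n_attrs = len(examples[0])
    limit = max_rule_length if max_rule_length > 0 else n_attrs  # 0 = unlimited
    conditions = [(a, v) for a in range(n_attrs)
                  for v in sorted(set(ex[a] for ex in examples))]
    beam = [[]]                                   # beam of partial rules
    best_rule, best_q = [], quality([], examples, labels, target)
    for _ in range(limit):
        # Extend each rule in the beam by one condition on a fresh attribute.
        candidates = [rule + [c] for rule in beam for c in conditions
                      if c[0] not in (a for a, _ in rule)]
        if not candidates:
            break
        candidates.sort(key=lambda r: quality(r, examples, labels, target),
                        reverse=True)
        beam = candidates[:beam_width]            # keep only the best few
        q = quality(beam[0], examples, labels, target)
        if q > best_q:
            best_rule, best_q = beam[0], q
    return best_rule, best_q

# Tiny example: the class is 'yes' exactly when attribute 0 is 'sunny'
# (plus one 'rain' example that also happens to be 'yes').
examples = [('sunny', 'hot'), ('sunny', 'cold'), ('rain', 'hot'), ('rain', 'cold')]
labels = ['yes', 'yes', 'no', 'yes']
print(find_rule(examples, labels, 'yes'))  # -> ([(0, 'sunny')], 1.0)
```

With a wider beam, more candidate rules survive each extension step; with max_rule_length left at zero, the only things limiting the search are the beam width and the pool of attributes.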

The selection of one of these strategies depends on the data. If you suspect that the concepts in the data are described by longer rules, you should use the first variant; otherwise you might want to go with the second. And if you believe that there are many "dependent" concepts, you should probably also try the third option.