The tab-delimited data format is intuitive and easy to handle, but in some special cases a sparse representation may be more suitable. For example, in text classification there may be thousands of features, of which only a subset is non-zero. LibSVM, for instance, uses the format below:

0.0 1:4.236298e+00 2:2.198210e+01 3:-3.503797e-01 4:9.752163e+01

Translating from a LibSVM-like format into Orange's tab-delimited format is not a big problem with Python, but a lot of "zeros" would end up stored on disk...
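To illustrate the translation mentioned above, here is a minimal Python sketch that expands a LibSVM-style sparse line into a dense, tab-delimited row. The function names and the fixed `n_features` parameter are my own assumptions; LibSVM feature indices are 1-based.

```python
def libsvm_line_to_dense(line, n_features):
    """Parse one 'label idx:val idx:val ...' line into (label, dense row)."""
    parts = line.split()
    label = parts[0]
    row = [0.0] * n_features
    for item in parts[1:]:
        idx, val = item.split(":")
        row[int(idx) - 1] = float(val)  # LibSVM indices start at 1
    return label, row

def to_tab_delimited(line, n_features):
    """Render one sparse line as a dense tab-delimited row, label last."""
    label, row = libsvm_line_to_dense(line, n_features)
    return "\t".join(str(v) for v in row) + "\t" + label
```

As the post notes, every absent feature becomes an explicit 0.0 in the output, which is exactly where the disk-space cost comes from.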

And another question: I want to use Orange for text classification. What about its scalability? Will it handle several thousand training samples with several thousand features? Or several thousand training samples with several hundred features? I am doing preprocessing tonight; maybe I can answer this myself tomorrow.

Orange does not have a specialized data representation for text mining like the one you describe above. Perhaps the closest is its basket format: http://www.ailab.si/orange/doc/reference/basket.htm
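For a rough picture of what the basket format looks like (a sketch based on the linked documentation, not copied from it): each line is one example, listing only the features that are actually present, separated by commas, optionally with an `=value` weight; anything absent is implicitly zero.

```
nobody, expects, the, spanish, inquisition=5
spanish, armada=2
```

This keeps sparse data sparse on disk, unlike the dense tab-delimited translation discussed above.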

Matrices of several thousand by several thousand should in principle not be a problem. We have recently been working with cancer microarray data sets that have several hundred columns and several tens of thousands of rows. Still, let us know if you encounter problems with your particular data sets.

Thanks for the help. Now, with 6000 samples * 100 features, kNN needs 10 seconds per learning run, and SVM needs about 1 minute per learning run. Bayes is still learning, much more slowly than kNN and SVM...

Today, Bayes finished: it needed 20388 seconds, with just the simplest script. Is BayesLearner written in pure Python? Is that why it is so much slower than SVM, which is written in a combination of C++ and Python?

The naive Bayes classifier is written in pure C++ (nothing in Python). Your features are most probably continuous; for these, NBC in Orange uses a LOESS approximation to estimate probabilities (see http://www.ailab.si/orange/doc/referenc ... earner.htm). If you discretize your data, NBC should be the fastest of all learners.
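To show why discretization helps, here is a plain-Python sketch (deliberately not Orange's API, which would handle this for you) of equal-width discretization: once each continuous feature is reduced to a handful of bin indices, a naive Bayes learner can estimate probabilities from simple bin counts instead of fitting a LOESS curve per feature.

```python
def equal_width_bins(values, k):
    """Map each continuous value to a bin index in 0..k-1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against constant features
    # Clamp to k-1 so the maximum value lands in the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]
```

Counting co-occurrences of bin index and class label is O(n) per feature, which is why discretized NBC is typically among the fastest learners.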