I'm Sumith, a postgraduate research student in the area of data mining and text clustering techniques. My research focuses on SOM-based algorithms and text clustering. I would like to suggest two ideas to improve Orange's capabilities in these areas.

Idea 1 : In SOM clustering, the user has to specify the map size in advance. To address the issues with this fixed, static architecture of SOM, the Growing Self-Organizing Map (GSOM) has been proposed and widely used in many different domains, including bioinformatics and text mining. Since I am working with both SOM and GSOM, I think implementing GSOM in Orange would be really advantageous to a lot of people using the tool.
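To make the growing idea concrete, here is a minimal sketch of the standard GSOM growth rule, where the growth threshold is GT = -D * ln(SF) for data dimensionality D and a user-chosen spread factor SF; the function names are hypothetical, not Orange API:

```python
import math

def growth_threshold(dim, spread_factor):
    """Standard GSOM growth threshold GT = -dim * ln(SF).
    A smaller spread factor yields a larger threshold and a smaller map."""
    return -dim * math.log(spread_factor)

def should_grow(accumulated_error, dim, spread_factor):
    """A boundary node spawns new neighbours once its accumulated
    quantization error exceeds the growth threshold."""
    return accumulated_error > growth_threshold(dim, spread_factor)
```

The key point for users is that only the spread factor must be chosen, not the map size itself; the map grows wherever the data demands it.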

Idea 2 : Including a text pre-processing module in Orange. I think having a text pre-processing module in Orange would really enhance its use among text mining researchers. Different frequency-based term-weighting schemes could be implemented in the tool, producing a file which can be fed directly into any of the clustering algorithms.
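As a sketch of the simplest such weighting scheme, the following hypothetical helper turns raw documents into per-document relative term frequencies, the kind of output a clustering algorithm could consume directly:

```python
from collections import Counter

def term_frequencies(documents):
    """For each document, map every term to its relative frequency
    (count / total tokens) -- the simplest frequency-based weighting."""
    weights = []
    for doc in documents:
        tokens = doc.lower().split()
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({t: c / total for t, c in counts.items()})
    return weights
```

More elaborate schemes (TF-IDF, binary weights, log-scaled counts) would slot into the same interface.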

These are my general ideas to improve Orange's capabilities as a data mining tool. I would be really happy to hear your input on the above ideas and their suitability for GSoC 2011. Since I have worked in both of the above areas, I have a strong theoretical and practical background for their implementation. Please let me know your thoughts on this.

Personally, I like both ideas. We were thinking of adding the text mining idea to the collation of ideas anyway. The text mining module is a bit old and it has not been touched for a while, so anything to improve it would be most welcome.

I would suggest that you go with whichever of the two ideas you like. Both would be welcome additions to Orange.

I have the following suggestions for improvements to Orange's text clustering. Please give me your valuable feedback on their suitability, and on any other things you would be interested in improving.

1) Improvements to file format support - It only supports XML and SGM file formats at the moment, but many text sources are delimited files such as CSV, tab-separated, etc. Also, sometimes individual documents reside in individual files, so supporting browsing for a folder containing all the files would be advantageous.
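A folder-based loader could be as simple as the sketch below; the function name and signature are hypothetical, meant only to illustrate the "one file per document" case:

```python
import os

def load_corpus_from_folder(folder, extensions=(".txt",)):
    """Read every matching file in a folder as one document,
    keyed by filename -- a sketch of the suggested folder loader."""
    corpus = {}
    for name in sorted(os.listdir(folder)):
        if name.endswith(extensions):
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8") as f:
                corpus[name] = f.read()
    return corpus
```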

2) Preprocessing - It would be advantageous to customize the stop word list based on user requirements. A full list of features (words) can be listed together with the stop word list, allowing the user to customize and finalize the word set as necessary.
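A customizable stop-word filter might look like this sketch, where the default list, the extra stop words, and the exceptions are all user-editable (the names here are illustrative, not existing Orange API):

```python
# A tiny default list for illustration; a real module would ship a full one.
DEFAULT_STOPWORDS = {"the", "a", "an", "is", "of"}

def filter_tokens(tokens, extra_stopwords=(), keep=()):
    """Drop stop words from a token list, letting the user both add
    extra stop words and rescue words from the default list."""
    stop = (DEFAULT_STOPWORDS | set(extra_stopwords)) - set(keep)
    return [t for t in tokens if t.lower() not in stop]
```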

3) Bag of words - It seems the TF-IDF value is currently calculated as log(1/frequency). But the text mining literature suggests that term frequency multiplied by inverse document frequency, as defined below, would be a better option:

TF-IDF = TF * IDF, where
TF = number of occurrences of a particular term / total number of terms in the document
IDF = log(D / (1 + d)), where D is the total number of documents in the text corpus and d is the number of documents containing the term
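The proposed weighting, implemented directly from the definition above (note that with this particular IDF variant, a term appearing in every document gets a weight of zero or below):

```python
import math
from collections import Counter

def tf_idf(documents):
    """TF-IDF as defined above:
    TF  = term count / total terms in the document
    IDF = log(D / (1 + d)), D = corpus size, d = document frequency."""
    docs = [doc.lower().split() for doc in documents]
    D = len(docs)
    df = Counter()                      # in how many documents each term appears
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            t: (c / total) * math.log(D / (1 + df[t]))
            for t, c in counts.items()
        })
    return weights
```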

4) Also, dimension reduction techniques such as Latent Semantic Indexing can be integrated to reduce the feature space to a low-dimensional feature vector. This would definitely help text clustering, given the high dimensionality of text data. (This might fit under matrix factorizations as well.)
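The core of LSI is a truncated SVD of the document-term matrix; a minimal NumPy sketch (the function name is hypothetical):

```python
import numpy as np

def lsi_reduce(doc_term_matrix, k):
    """Project a documents-by-terms matrix onto its top k latent
    dimensions via truncated SVD -- the core of LSI."""
    U, s, Vt = np.linalg.svd(doc_term_matrix, full_matrices=False)
    # Each row is a document in the k-dimensional latent semantic space.
    return U[:, :k] * s[:k]
```

Documents with similar term distributions end up close together in the reduced space, which is exactly what helps the clustering step.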

Please let me know your feedback on the above, and also the new ideas you would like to have in Text Clustering Module.

Hi there,

There might also be a demand for token substitution, e.g. a number of tokens/words are collected under a superior category (like: dog, cat -> domestic_animal), and during the analysis the substituted values must be counted. It can happen that the text and the category list (say, a YAMLed hash) are loaded separately. It might also be useful, for reducing elements, to filter out dialectical variants using a similar dictionary.

The question is, of course, what the design policy is for the text mining/processing module set.
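The substitution step itself is just a dictionary lookup applied before counting; a sketch (the mapping shown is the dog/cat example from above, and in practice it could be loaded from a separate YAML file):

```python
def substitute_tokens(tokens, categories):
    """Replace each token by its superior category if one is defined,
    leaving unmapped tokens unchanged; counting then happens on the result."""
    return [categories.get(t, t) for t in tokens]
```

The same mechanism would cover the dialect case: a dictionary mapping each variant to its canonical form.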