Text Classification

Text Classification

Hi there!

I have tried to find something which would help me on this forum but couldn't. Hopefully, someone will answer me and I would be able to solve the issue.

Let me first a bit describe the task. I have 2 datasets, which contain 2 columns: sentence and label. There are 2 possible labels - true or false. I also have 3 dictionaries of phrases (they can be unigrams, bigram, 3-grams,...).

What I want to do:

1) To train SVM classifier on dataset1 and test it on the same dataset (I did it sucessfully with cross-validation).

2) To train SVM classifier on dataset2 and apply the model on dataset1.

3) Use dictionary of phrases as features to dataset1.

My questions:

1) As far as I understand, if I want to train model on one dataset and test it on another, I have to use the same set of features. So I am trying to use the operator "Process documents from data" with the same staff inside (tokenizer, stemming, filtering out stopwords,...) than I take the wordlist of dataset2 and trying to add it as an input to the next "Process documents from data" as a wordlist.

But while running I get this error message:

In WikiTraining I have 10000 sentences, in debates 2000.

But I don't get the problem. Can someone please explain me and how can I avoid it?

2) How can I use separate CSV-files with phrases (let's call it dictionaries) as my features in a dataset? Let's say that my dictionary contains only triggers, which says that this sentence is of class TRUE. How can I do that?

Re: Text Classification

Ok let me understand a bit better here. Do you want to train a model on those sentences? So you would have a data set with an attribute column of "in a new direction" or "this is terrible" and have the corresponding label "positive" and "negative" respectively associated with it? If yes, you might want to change the parameter on the tokenizer from non-letters to liguistic sentences, and try again.