Text Classification

ar4o Member Posts: 8 Contributor I
edited September 2020 in Help

Hi there!

 

I have searched this forum for something that would help me but couldn't find anything. Hopefully someone can answer so that I can solve the issue.

 

Let me first describe the task briefly. I have 2 datasets, each containing 2 columns: sentence and label. There are 2 possible labels: true or false. I also have 3 dictionaries of phrases (they can be unigrams, bigrams, 3-grams, ...).

 

What I want to do:

1) Train an SVM classifier on dataset1 and test it on the same dataset (I did this successfully with cross-validation).

2) Train an SVM classifier on dataset2 and apply the model to dataset1.

3) Use the dictionaries of phrases as features for dataset1.

 

My questions:

1) As far as I understand, if I want to train a model on one dataset and test it on another, I have to use the same set of features. So I am using the "Process Documents from Data" operator with the same steps inside (tokenizing, stemming, filtering out stopwords, ...), then I take the wordlist of dataset2 and try to feed it as the wordlist input to the next "Process Documents from Data" operator.

 

[Screenshot 2017-04-29 at 14.55.01.png]

But while running I get this error message:

[Screenshot 2017-04-29 at 14.56.31.png]

In WikiTraining I have 10000 sentences, in debates 2000. 

But I don't understand the problem. Can someone please explain it and tell me how to avoid it?

 

2) How can I use separate CSV files with phrases (let's call them dictionaries) as features in a dataset? Let's say my dictionary contains only triggers, which indicate that a sentence is of class TRUE. How can I do that?

 

Thank you in advance!
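
For question 1, the idea of reusing one wordlist for both datasets can be sketched outside RapidMiner. This is a minimal plain-Python illustration of the concept, not the operator's actual implementation: the toy sentences and the helper names `tokenize`, `build_wordlist` and `vectorize` are hypothetical, and stemming and stopword filtering are left out.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase and split on non-letters (a rough stand-in for a tokenizer)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def build_wordlist(sentences):
    """Collect the sorted vocabulary of the training sentences."""
    return sorted({tok for s in sentences for tok in tokenize(s)})

def vectorize(sentence, wordlist):
    """Count only words present in the wordlist; unseen words are dropped."""
    counts = Counter(tokenize(sentence))
    return [counts[w] for w in wordlist]

# Hypothetical toy data standing in for dataset2 (training) and dataset1 (test).
train_sentences = ["Some experts disagree", "The experts agree"]
test_sentences = ["Experts agree on new challenges"]

# The wordlist is built from the training data only and reused for the test data,
# so both matrices have exactly the same columns in the same order.
wordlist = build_wordlist(train_sentences)
train_matrix = [vectorize(s, wordlist) for s in train_sentences]
test_matrix = [vectorize(s, wordlist) for s in test_sentences]

print(wordlist)     # ['agree', 'disagree', 'experts', 'some', 'the']
print(test_matrix)  # [[1, 0, 1, 0, 0]]
```

Words that occur only in the test data ("on", "new", "challenges") get no column, which is why a model trained on one dataset can be applied to the other.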

Best Answer

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    Solution Accepted

    OK, let me make sure I understand. Do you want to train a model on those sentences? That is, you would have a data set with an attribute column containing "in a new direction" or "this is terrible", with the corresponding label "positive" or "negative" respectively? If yes, you might want to change the tokenizer's mode parameter from non letters to linguistic sentences and try again.

     

    If not, and you want it to be part of a dictionary, you should use the approach that Martin took here: http://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/How-to-Build-a-Dictionary-Based-Sentiment-Model-in-RapidMiner/ta-p/36067

     

    You would have to put the phrases into a CSV file, delimited by a comma or a similar separator.
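
As a rough illustration of the CSV-dictionary approach, the sketch below loads phrases from CSV text and counts how many of them occur in a sentence, giving one numeric "trigger" feature per sentence. It is plain Python rather than a RapidMiner process, and everything in it is an assumption made for the example: the `phrase` header, the inline CSV contents, and the helper names `load_phrases` and `trigger_count`. Matching is on whitespace-separated words only, so punctuation attached to a word is not handled.

```python
import csv
import io

# Hypothetical dictionary contents; in practice this would be read from a file.
DICTIONARY_CSV = "phrase\nsome experts\nin a new direction\nsome challenges\n"

def load_phrases(csv_text):
    """Read one phrase per row from a CSV with a 'phrase' header column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["phrase"].strip().lower() for row in reader if row["phrase"].strip()]

def trigger_count(sentence, phrases):
    """Number of dictionary phrases that occur as whole words in the sentence."""
    # Normalise whitespace and pad with spaces so only whole-word matches count.
    padded = " " + " ".join(sentence.lower().split()) + " "
    return sum(1 for p in phrases if " " + p + " " in padded)

phrases = load_phrases(DICTIONARY_CSV)
print(trigger_count("Some experts point in a new direction", phrases))  # 2
print(trigger_count("Nothing relevant here", phrases))                  # 0
```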

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    This error means that whatever you called your label column ended up being a word that was tokenized in your sentences/documents. Rename your label column to something like "_label" and try again.
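
The name clash described in this reply can be checked for before running the process. The sketch below is a plain-Python illustration under stated assumptions: the sentences are made up, `find_colliding_columns` is a hypothetical helper, and the case-insensitive comparison is a guess rather than documented RapidMiner behaviour.

```python
import re

def tokens(text):
    """Lowercase and split on non-letters."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def find_colliding_columns(sentences, special_columns):
    """Return special column names that also occur as tokens in the text."""
    vocabulary = {tok for s in sentences for tok in tokens(s)}
    return sorted(name for name in special_columns if name.lower() in vocabulary)

# Hypothetical sentences; the first one happens to contain the word "label".
sentences = ["The label on the bottle is wrong", "Nothing to see here"]

print(find_colliding_columns(sentences, ["label", "sentence"]))    # ['label']
print(find_colliding_columns(sentences, ["_label", "_sentence"]))  # []
```

A name starting with an underscore can never equal a token produced by splitting on non-letters, which is why the suggested rename avoids the clash.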
  • ar4o Member Posts: 8 Contributor I

    Thank you Thomas_Ott!

     

    I didn't even consider that this could cause a problem, but sure! Thank you!

     

    And can anyone give any advice regarding the second question?

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    Take a look at the knowledge base article on creating your own sentiment dictionary. I'm currently AFK, so this is a short reply.
  • ar4o Member Posts: 8 Contributor I

    I couldn't find anything helpful, only information about using existing dictionaries, and most of the advice is based on installing the extension for a specific dictionary.

  • ar4o Member Posts: 8 Contributor I

    Does anyone else have any advice or links? I'm not asking for solutions.

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    The Wordnet extension (free in the Marketplace) has an operator that allows you to use a custom sentiment dictionary in the SentiWordnet format.  See that extension for more details.  

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • ar4o Member Posts: 8 Contributor I

    Thank you for your reply.

     

    One last question. 

    The WordNet dictionary is basically... a dictionary where 1 observation is 1 word.

     

    What I need is a bit different: I want to see, let's say, "some experts", "in a new direction", "some challenges". So 2 or more words as one observation.

     

    As a result, each feature of my SVM classifier should be one of the quoted phrases above.

     

    Do you have any hint/idea on it as well?

  • ar4o Member Posts: 8 Contributor I

    I haven't tried your proposal yet, but it sounds like what I have been looking for!

    Thank you very much!
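
The multi-word features asked about in this thread (one feature per phrase such as "some experts") can be illustrated with token n-grams. This is a minimal plain-Python sketch, not a RapidMiner process: the example sentence and the helpers `tokenize`, `ngrams` and `phrase_features` are hypothetical, and the phrases are assumed to be lowercase and space-separated.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, max_n):
    """All contiguous token sequences of length 1..max_n, joined by spaces."""
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }

def phrase_features(sentence, phrases):
    """One 0/1 feature per dictionary phrase, in dictionary order."""
    max_n = max(len(p.split()) for p in phrases)
    grams = ngrams(tokenize(sentence), max_n)
    return {p: int(p in grams) for p in phrases}

phrases = ["some experts", "in a new direction", "some challenges"]
features = phrase_features("Some experts see some challenges ahead.", phrases)
print(features)
# {'some experts': 1, 'in a new direction': 0, 'some challenges': 1}
```

Each dictionary phrase becomes its own column, so the resulting feature set has the same width for every sentence regardless of which dataset it comes from.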
