
"Remove all lines with text occurrence smaller than 10 from a certain column"

Moritz Member Posts: 3 Contributor I
edited June 2019 in Help

Hi, 

I'm trying to refer to a certain column of the sample set and remove all lines where its value is smaller than 10. What's the way to do that?

e.g. 

Process Documents from Files >> Filter Stopwords >> Tokenize >> Transform Cases >> Stem >> ??? now remove all lines where the column "text occurrence" is lower than 10 ???

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Your question is a bit confusing.  Do you want to get rid of tokens that occur fewer than 10 times, or sentences (lines) that have fewer than 10 tokens?  In either case, RapidMiner can do it.  In the first case, just use the pruning options in Process Documents and set an absolute threshold of 10.  In the second case, split each sentence into a separate document (you can use "Cut Documents" for this), then use "Extract Token Number", and then filter out any document (sentence) that has fewer than 10 tokens.
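For readers who want to see the second case outside of RapidMiner, here is a minimal Python sketch of the same idea: split the text into sentences, count the tokens in each, and keep only sentences with at least 10 tokens. It is not how the RapidMiner operators are implemented; the function names (`split_sentences`, `token_count`, `drop_short_sentences`) and the naive regex-based splitting are assumptions made purely for illustration.

```python
import re

def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def token_count(sentence):
    """Count word tokens, here taken to be runs of letters, digits, or underscores."""
    return len(re.findall(r"\w+", sentence))

def drop_short_sentences(text, min_tokens=10):
    """Keep only the sentences that contain at least `min_tokens` tokens."""
    return [s for s in split_sentences(text) if token_count(s) >= min_tokens]

text = (
    "Short one. "
    "This second sentence is deliberately long enough to contain "
    "more than ten tokens in total."
)
kept = drop_short_sentences(text)
print(kept)
```

In this toy input the first sentence has 2 tokens and is dropped, while the second has 15 and is kept.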

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Moritz Member Posts: 3 Contributor I

    Thanks for the fast answer. 

     

    Let me try to rephrase a bit: The task is to remove all words from our document with a total occurrence smaller than 10. I already tried the pruning option, but since there is no way to refer to the column "total occurrence", I don't have the opportunity to prune on it and remove all words with an occurrence smaller than 10.



  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Ah, got it.  "Wordlist to Data" will let you take the word list and turn it into an ExampleSet, and then you will be able to filter on the "Total Occurrences" column.

  • Moritz Member Posts: 3 Contributor I

    Okay. I did the first part, but I still can't filter on a column. Where do I apply the filter, and which filter do I apply?

     

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Use "Filter Examples" and then set your condition to keep values where the Total Occurrences column is greater than or equal to 10, so that every word occurring fewer than 10 times is removed.
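The logic of this word list workflow can be sketched in plain Python, for anyone who wants to check the expected result by hand. This is only an illustration under stated assumptions, not RapidMiner code: the row layout, the `total_occurrences` key, and the helper names `word_table` and `filter_rows` are hypothetical. The sketch uses `>=` so that a word occurring exactly 10 times is kept, matching the original request to remove only words that occur fewer than 10 times.

```python
from collections import Counter

def word_table(documents):
    """Turn tokenized documents into rows of word and total occurrence count."""
    totals = Counter(token for doc in documents for token in doc)
    return [{"word": w, "total_occurrences": n} for w, n in totals.items()]

def filter_rows(rows, min_total=10):
    """Keep rows whose total occurrence count is at least `min_total`."""
    return [r for r in rows if r["total_occurrences"] >= min_total]

# Two already-tokenized toy documents (made-up data).
documents = [
    ["data"] * 6 + ["mining"] * 3,
    ["data"] * 4 + ["mining"] * 2 + ["rare"],
]

rows = word_table(documents)   # data: 10, mining: 5, rare: 1
frequent = filter_rows(rows)   # only "data" reaches the threshold
print(frequent)
```

Note that the count here is the total across all documents, which is what the filter on the word list table operates on.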
