Filter Tokens by Content (more than one expression)

ArmMiner Member Posts: 35 Contributor II
edited June 2019 in Help
Hi all

I want to filter the tokens of my example set with more than one expression.
For example:
keep those examples which contain the words "fast", "delivery", or "again".

Is this possible? If yes, with which operator?
Thanks!

Answers

  • Skirzynski Member Posts: 164 Maven
    The operator you are looking for is "Filter Examples" with the condition class "attribute_value_filter". In the parameter string you can use regular expressions. Here is a process with just this operator; it assumes that the attribute holding the text to filter is named "text".


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
        <process expanded="true" height="568" width="587">
          <operator activated="true" class="filter_examples" compatibility="5.3.000" expanded="true" height="76" name="Filter Examples" width="90" x="45" y="30">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="text = .*again.*|.*delivery.*|.*fast.*"/>
          </operator>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>
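For readers who want to try an expression outside RapidMiner first, here is a minimal Python sketch of the same row filter. The sample rows and the helper `filter_examples` are made up for illustration, and it assumes the filter keeps a row only when the expression matches the whole attribute value (which the surrounding `.*` in the process above suggests), hence `re.fullmatch`.

```python
import re

# Hypothetical example set: each row has one attribute named "text".
rows = [
    {"text": "fast shipping, thanks"},
    {"text": "item arrived broken"},
    {"text": "would order again"},
    {"text": "delivery took two weeks"},
]

# Same expression as in the parameter string above (without the "text = " part).
pattern = re.compile(r".*again.*|.*delivery.*|.*fast.*")


def filter_examples(examples, attribute, regex):
    """Keep the rows whose attribute value matches the regex as a whole."""
    return [row for row in examples if regex.fullmatch(row[attribute])]


kept = filter_examples(rows, "text", pattern)
for row in kept:
    print(row["text"])
```

Only the row without any of the three words is dropped.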
  • ArmMiner Member Posts: 35 Contributor II
    Thanks for the reply!
    Does the name "text" have to be the name of my Excel file?

  • Skirzynski Member Posts: 164 Maven
    It is the name of the attribute that holds the text content.
  • ArmMiner Member Posts: 35 Contributor II
    Actually, I'm getting an error. Instead of "text" I wrote "Bewertung", which is the column name of the reviews in my data.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="341" width="756">
          <operator activated="true" class="read_database" compatibility="5.2.008" expanded="true" height="60" name="Read Database" width="90" x="45" y="75">
            <parameter key="connection" value="sqlserver"/>
            <parameter key="query" value="SELECT `Bewertung`&#10;FROM `training_schnell`"/>
            <enumeration key="parameters"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="5.2.008" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="75"/>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="75">
            <parameter key="prunde_below_percent" value="5.0"/>
            <parameter key="prune_above_percent" value="100.0"/>
            <list key="specify_weights"/>
            <process expanded="true" height="345" width="774">
              <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30">
                <parameter key="mode" value="specify characters"/>
                <parameter key="characters" value=".:,:;:!:?:|:"/>
              </operator>
              <operator activated="true" class="text:filter_by_length" compatibility="5.2.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="45" y="120">
                <parameter key="max_chars" value="9999"/>
              </operator>
              <operator activated="true" class="text:filter_stopwords_german" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (German)" width="90" x="45" y="210"/>
              <operator activated="true" class="text:stem_german" compatibility="5.2.004" expanded="true" height="60" name="Stem (German)" width="90" x="179" y="30"/>
              <operator activated="false" class="text:filter_tokens_by_content" compatibility="5.2.004" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="447" y="165">
                <parameter key="string" value="schnell "/>
                <parameter key="regular_expression" value="(schnell)"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
              <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Stopwords (German)" to_port="document"/>
              <connect from_op="Filter Stopwords (German)" from_port="document" to_op="Stem (German)" to_port="document"/>
              <connect from_op="Stem (German)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="false" class="text:wordlist_to_data" compatibility="5.2.004" expanded="true" height="76" name="WordList to Data" width="90" x="313" y="210"/>
          <operator activated="true" class="filter_examples" compatibility="5.2.008" expanded="true" height="76" name="Filter Examples" width="90" x="514" y="30">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="Bewertung = .*wieder.*|.*lieferung.*|.*schnell.*"/>
          </operator>
          <operator activated="true" class="write_excel" compatibility="5.2.008" expanded="true" height="76" name="Write Excel" width="90" x="514" y="165">
            <parameter key="excel_file" value="C:\Users\MP-TEST\Desktop\Rapid_Test\Klein.xls"/>
          </operator>
          <connect from_op="Read Database" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Write Excel" to_port="input"/>
          <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    After the Process Documents operator there is no text attribute anymore! If you had set a breakpoint after that operator, you would have seen that. If you want to filter on a token basis (which is not exactly what you described in the first post), you have to use Filter Tokens (by Content) inside Process Documents.

    I have been following this post and your other post for quite some time, and I get the feeling that it may be a good idea to take a step back, leave the rather complicated text processing aside, and get used to the common concepts of RapidMiner and of data mining with RapidMiner with the help of our tutorials. That will make it much easier for you to assemble your processes and to debug them if anything does not work. There is also a good book available which can even be downloaded for free: search for "Data Mining for the Masses" by Matt North. In it the author explains many concepts of data mining using simple but realistic examples, starting at a very basic level and advancing to more and more complex topics. Most of the chapters use RapidMiner as the platform for the exercises.

    If you have any further questions you are of course invited to ask for help here on the forums!

    All the best,
    Marius
  • ArmMiner Member Posts: 35 Contributor II
    Your help is really appreciated, and I will get that book. Thanks a lot to this forum; it's really nice when experienced users are ready to help.
    I have tried Filter Tokens (by Content), but I couldn't figure out how to specify more than one expression in the corresponding field. I searched for sample syntax, but found nothing.
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    You can "combine" several regular expressions with the vertical bar, e.g.
    .*dog.*|.*cat.*|.*fish.*
    would match doghouse, catfish, fish food, and anything else containing one of the words.
    Regular expressions are quite complex, and there are entire books on this topic alone. The basic syntax, however, can be learned quickly from tutorials on the internet.

    RapidMiner contains a regular expression dialog where you can directly test the expressions you entered. It is available for many parameters that accept regular expressions, e.g. Select Attributes (with attribute_filter_type = regular_expression). Since the dialog is quite new, not all fields have been ported, so it's not yet in Filter Tokens. There are also many free regex testers on the internet.

    Happy Mining!
    ~Marius
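To see why each alternative is wrapped in `.*`, here is a small Python sketch. It is only an approximation of the operator: the token list is invented, and `re.fullmatch` stands in for a condition that requires the expression to match the whole token.

```python
import re

# Hypothetical tokens, including the examples mentioned above.
tokens = ["doghouse", "catfish", "fish food", "dog", "bird"]

with_wildcards = re.compile(r".*dog.*|.*cat.*|.*fish.*")
without_wildcards = re.compile(r"dog|cat|fish")


def keep(token_list, regex):
    """Return the tokens that the regex matches from start to end."""
    return [t for t in token_list if regex.fullmatch(t)]


contains_word = keep(tokens, with_wildcards)    # tokens containing one of the words
exact_word = keep(tokens, without_wildcards)    # tokens that ARE one of the words

print(contains_word)
print(exact_word)
```

Without the wildcards only the token "dog" survives, because a whole-token match leaves no room for extra characters around the word.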
  • ArmMiner Member Posts: 35 Contributor II
    Hi Marius

    Thanks for the help. I will try and give the feedback! :)

    Best regards
    Armen
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    I forgot to set the "condition" in Filter Tokens to "matches". Only if you do that can you use regular expressions, and then you also get the dialog oO
  • rajbanokhan Member Posts: 29 Maven
    Hi,
    I used this example and it works, but can you tell me, in this expression
    .*dog.*|.*cat.*|.*fish.*
    what is the meaning of the * and the . that are used in it?
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    In regex, the "." is the wildcard character: it matches any single character. The "*" is a quantifier that means 'zero or more of the preceding character', so ".*" is basically the expression for anything. So the expression above looks for anything that contains the string "dog" or "cat" or "fish" anywhere in the token.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
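The two symbols can be tried out in isolation with a few lines of Python. This is just a sketch with invented sample strings; the helper `whole_match` is hypothetical and uses `re.fullmatch` so that each pattern has to cover the entire string.

```python
import re


def whole_match(pattern, text):
    """True if the pattern matches the text from start to end."""
    return re.fullmatch(pattern, text) is not None


results = {
    # "." stands for exactly one character, whatever it is.
    "d.g vs dog": whole_match(r"d.g", "dog"),
    "d.g vs dg": whole_match(r"d.g", "dg"),
    # "*" repeats the preceding character zero or more times.
    "do*g vs dg": whole_match(r"do*g", "dg"),
    "do*g vs dooog": whole_match(r"do*g", "dooog"),
    # ".*" therefore means "any run of characters, possibly empty".
    ".* vs empty string": whole_match(r".*", ""),
    ".*dog.* vs hotdog stand": whole_match(r".*dog.*", "hotdog stand"),
    ".*dog.* vs cat": whole_match(r".*dog.*", "cat"),
}

for label, matched in results.items():
    print(f"{label}: {matched}")
```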
  • rajbanokhan Member Posts: 29 Maven
    Thank you so much, I got it.
    Can you suggest a book on regular expressions?
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    "Regular Expressions in 10 Minutes" by Ben Forta is a good introduction, and it is available on Amazon at low cost.

    Brian T.
  • rajbanokhan Member Posts: 29 Maven
    Hi sir,
    I have a problem when I use this regular expression with the "matches" condition:
    .*dog.*|.*cat.*|.*fish.*
    In the result only dog and cat came up; the third one (fish) did not show up.

  • kayman Member Posts: 662 Unicorn
    edited December 2018
    To rule out the obvious: are there cases that contain fish at all, or do you have cases that contain, for instance, both cat and fish?

    The regular expression used is a bit on the greedy side, meaning you can get a lot of results, but not necessarily the right ones, depending on how your text is structured.

    In the given example it will only match the exact case, so if you have, for instance, Fish (with a capital F), it will not match. It will also capture fishing, hotdog, category, and so on, and while that might be useful for some scenarios, it can also lead to unexpected results.

    There are ways to improve this, using some of the more advanced yet cool options of regular expressions.

    You could use groups to start with; that reduces the number of wildcards and makes the expression more readable and less error-prone.

    The above then becomes
    .*(cat|dog|fish).*

    It does exactly the same. It reads as 'take whatever you want (the dot), as many times as you like (the asterisk), followed by either cat, dog or fish, and then again followed by whatever, as much as you want'.

    This is what we call a greedy pattern: we don't care what we get or how much of it. This is typically no problem when dealing with short sentences, but it can cost you a lot of memory when you have long content.

    So one small improvement already:

    ^.*?(cat|dog|fish).*

    OK, two small changes. The first is the 'hat' (^), which means 'begin at the start of the sentence'; the second is the question mark, which makes the preceding .* lazy, so that it takes as few characters as possible. So ^.*? is short for 'begin at the start, and stop as soon as you reach the first match'. This can save quite some time with large texts, as the original expression first runs to the end of the sentence and then has to work its way back until it finds a match.

    Now, we can still only match lower case, and while it is good practice to convert all of your text to either lower or upper case in a text analysis workflow, there are of course occasions where we need the difference. Anyway, to ignore case we use the i flag, placed at the start of the expression:

    (?i)^.*?(cat|dog|fish).*

    So now it will find cat, Cat, CAT, and whatever else. Should that be a requirement of course.

    You can combine several flags. While the i flag means ignore case, the m flag switches on multiline mode, in which ^ matches at the start of every line instead of only at the start of the whole text. Combining them as below means that every line, when your text contains line breaks, gets the same treatment.

    (?im)^.*?(cat|dog|fish).*

    The order doesn't matter; (?mi) works exactly the same.

    Now, we still have the problem that we can get things like category or hotdog in the results, so the final part is to use the word boundary (\b), which ensures that we only get a match when it is exactly the same word. A word boundary is the position between a word character and anything that is not one, such as a comma, a dot, a space, or the beginning or end of the sentence. So the expression below gives you an exact word match, stopping at the first match and looking at every line you have.

    (?im)^.*?\b(cat|dog|fish)\b.*

    As an alternative, you could use the s flag when you have a lot of line breaks. It lets the dot match line breaks as well, so your text is treated as one single line.

    (?is)^.*?\b(cat|dog|fish)\b.*

    FINAL EDIT: it seems the code block garbles the content a bit; all of the symbols of an expression need to be on one single line.
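The effect of the case flag, the word boundary, and the s flag can be checked with a short Python sketch. The sample tokens are invented, the inline flags sit at the start of each expression, and `re.fullmatch` is an assumption standing in for a condition that must match the whole text. RapidMiner uses Java regular expressions, but these particular constructs behave the same way in both dialects.

```python
import re

# Hypothetical tokens to test against.
tokens = ["cat", "Cat", "CAT", "category", "hotdog", "my Fish tank", "fishing"]

# Case-insensitive, but still matches the words inside longer words.
loose = re.compile(r"(?i)^.*?(cat|dog|fish).*")
# Case-insensitive AND restricted to whole words via \b.
strict = re.compile(r"(?i)^.*?\b(cat|dog|fish)\b.*")

loose_hits = [t for t in tokens if loose.fullmatch(t)]
strict_hits = [t for t in tokens if strict.fullmatch(t)]

# With line breaks in the text, "." only crosses them when the s flag is set.
two_lines = "first line\nthe dog barks"
without_s = re.fullmatch(r"(?i)^.*?\b(cat|dog|fish)\b.*", two_lines) is not None
with_s = re.fullmatch(r"(?is)^.*?\b(cat|dog|fish)\b.*", two_lines) is not None

print(loose_hits)
print(strict_hits)
print(without_s, with_s)
```

The loose expression keeps every token in the list, while the strict one drops category, hotdog and fishing.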

    
    

  • rajbanokhan Member Posts: 29 Maven
    Thank you so much for the nice advice. I tried it and it works.
  • rajbanokhan Member Posts: 29 Maven
    Hi sir,
    I have a question about the operator name, Filter Tokens (by Content). Can I say that we use this operator to search for words, or to find something specific? Would "searching" or "finding something specific" be the right words to describe it?
  • rajbanokhan Member Posts: 29 Maven
    Hi sir,
    How do I write a regular expression for matching all tokens? For example, I have two documents, and the word list of document 1 should be matched against document 2, so that the words in document 2 that do not match don't appear in the result.

  • kayman Member Posts: 662 Unicorn
    If you have different word lists, you might try the join operators. Convert your word lists to data and link the word attributes: an inner join will return the words you have in both, and with the Set Minus operator you can filter on the words that appear in one set but not in the other.

    Regex is probably not going to work here if that is what you want to achieve.
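The logic behind the join and Set Minus suggestion can be sketched in plain Python. This is not RapidMiner code; the two word lists are invented, and simple set membership stands in for joining on the word attribute.

```python
# Hypothetical word lists extracted from two documents.
words_doc1 = ["fast", "delivery", "again", "price"]
words_doc2 = ["delivery", "slow", "again", "refund"]

lookup = set(words_doc1)

# Inner join on the word: words of document 2 that also occur in document 1.
in_both = [w for w in words_doc2 if w in lookup]

# Set minus: words of document 2 that do not occur in document 1.
only_in_doc2 = [w for w in words_doc2 if w not in lookup]

print(in_both)
print(only_in_doc2)
```

No regular expression is involved: the comparison is an exact lookup of each word in the other list.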
  • rajbanokhan Member Posts: 29 Maven
    Hi sir,
    Thank you for the suggestion. Actually, I am working on regular expressions, which is why I am asking about them. I am using Filter Tokens (by Content) with the "matches" condition. If there is a way to do this with a regular expression, please tell me; if not, that's OK.
  • kayman Member Posts: 662 Unicorn
    Would you have some examples you can share? This might make it easier to understand the actual problem.
  • rajbanokhan Member Posts: 29 Maven
    In Filter Tokens (by Content) I used a regular expression with the "matches" option. It works for selecting specific words, but my list of specific words is too large (200-300 words), so the regular expression above doesn't fit. That is why I am trying to match one list against another. I hope I have conveyed my message.
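If the word list has to stay a regular expression, one option is to generate the expression instead of typing it by hand. The following Python sketch does that for an invented keyword list; whether the resulting string can be pasted into the operator unchanged is an assumption, since `re.escape` follows Python's escaping rules rather than RapidMiner's.

```python
import re

# Hypothetical keyword list; in practice it could hold a few hundred words.
keywords = ["fast", "delivery", "again", "3.5mm"]

# re.escape protects characters such as "." that have a special meaning in regex.
alternation = "|".join(re.escape(word) for word in keywords)
expression = ".*(" + alternation + ").*"
print(expression)

# Quick check of the generated expression against some made-up tokens.
tokens = ["fast", "delivery", "slow", "3.5mm", "3x5mm", "again!"]
matched = [t for t in tokens if re.fullmatch(expression, t)]
print(matched)
```

Escaping matters here: without it the "." in "3.5mm" would act as a wildcard and "3x5mm" would match as well.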
  • rajbanokhan Member Posts: 29 Maven
    Hi sir,
    I have a question: what is the box where we write a regular expression called in RapidMiner?
  • kayman Member Posts: 662 Unicorn
    All replace operators support regex.