
whitespace regular expression filter tokens by content

student24 Member Posts: 7 Contributor II
Hello everybody,

I want to search for words in documents. I use the operator Filter Tokens (by Content) with a regular expression. To search for more than one word I use word1|word2|...|wordn. My question is: how can I search for an expression that contains whitespace, for example "Research and Development|Word2|Word3" etc.? Is there a wildcard for whitespace?

Thanks for your help

Answers

  • RalfKlinkenberg Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member, Unconfirmed, University Professor Posts: 68 RM Founder
    You can use
    • \s  as a placeholder for a whitespace character,
    • \s+  for one or more whitespace characters, and
    • \s*  for zero, one, or more whitespace characters.
    • \t  is a placeholder for a tab character.
    RapidMiner regular expressions use the Java syntax for regular expressions. If you search for "Java regular expressions" with Google or another search engine, you will find a lot of documentation.

    Best wishes,
    Ralf
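
To make the placeholders above concrete, here is a minimal sketch that uses java.util.regex directly, outside RapidMiner. The class name, helper method, and sample strings are made up for illustration; only the pattern syntax comes from the reply. Note that in Java source code each backslash has to be doubled, while in a RapidMiner parameter field the pattern is typed with single backslashes.

```java
import java.util.regex.Pattern;

public class WhitespaceRegexDemo {
    // A multi-word phrase plus two single words, joined by alternation.
    // "\\s+" in Java source is the regex \s+ (one or more whitespace characters).
    static final Pattern PHRASES =
            Pattern.compile("Research\\s+and\\s+Development|Word2|Word3");

    /** Returns true if the whole input matches one of the alternatives. */
    static boolean matchesWhole(String input) {
        return PHRASES.matcher(input).matches();
    }

    /** Convenience wrapper: does the whole input match the given regex? */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("Research and Development"));     // single spaces
        System.out.println(matchesWhole("Research \t and  Development")); // tab and double space
        System.out.println(matchesWhole("ResearchandDevelopment"));       // no whitespace at all
        System.out.println(matches("a\\sb", "a b"));   // exactly one whitespace character
        System.out.println(matches("a\\s*b", "ab"));   // zero whitespace characters allowed
        System.out.println(matches("a\\tb", "a\tb"));  // a literal tab
    }
}
```

The phrase with \s+ accepts any run of spaces or tabs between the words, but it still requires at least one whitespace character, so the version without spaces is rejected.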
  • student24 Member Posts: 7 Contributor II
    Thank you very much for your reply.

    I have tried these before, but it doesn't work. There are no results in the word list although the expression is in the document. I don't know what I'm doing wrong. Do you know whether it works when I'm examining PDF files?

    Thanks
  • RalfKlinkenberg Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member, Unconfirmed, University Professor Posts: 68 RM Founder
    If you post the XML code of your RapidMiner process here, there is a chance that someone in the forum may be able to help.

    Without being able to see the RapidMiner process, we can only guess where the problem in your RapidMiner process might be.  ;)
  • student24 Member Posts: 7 Contributor II
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.015">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="112" y="30">
           <list key="text_directories">
             <parameter key="test" value="C:"/>
           </list>
           <parameter key="keep_text" value="true"/>
           <process expanded="true">
             <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
             <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="380" y="30"/>
             <operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="514" y="30">
               <parameter key="condition" value="matches"/>
               <parameter key="regular_expression" value="research\sand\sdevelopment"/>
             </operator>
             <connect from_port="document" to_op="Tokenize" to_port="document"/>
             <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
             <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
             <connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <connect from_port="input 1" to_op="Process Documents from Files" to_port="word list"/>
         <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
         <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="source_input 2" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>

  • fras Member Posts: 93 Contributor II
    For some reason your XML is not valid, but the important line is this:
      <parameter key="regular_expression" value="research\sand\sdevelopment"/>
    If you search for "Research...", this regex will fail because upper/lower case
    matters unless you ignore it by applying the regex switch "i".
    It will also fail if there is more than one whitespace character between the words.
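
The two failure modes described here can be checked with a short sketch in plain Java. It assumes that the "i" switch corresponds to Java's inline flag (?i), which is how case-insensitive matching is written inside a Java pattern; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class CaseAndSpacingDemo {
    // The pattern from the posted process: lower case only, exactly one whitespace each.
    static final Pattern STRICT =
            Pattern.compile("research\\sand\\sdevelopment");

    // Same phrase with the inline case-insensitive flag (?i) and \s+ between words.
    static final Pattern RELAXED =
            Pattern.compile("(?i)research\\s+and\\s+development");

    static boolean strictMatch(String input) {
        return STRICT.matcher(input).matches();
    }

    static boolean relaxedMatch(String input) {
        return RELAXED.matcher(input).matches();
    }

    public static void main(String[] args) {
        String capitalized = "Research and Development";
        String doubleSpace = "research  and development";
        System.out.println(strictMatch(capitalized));   // case differs from the pattern
        System.out.println(strictMatch(doubleSpace));   // two spaces where \s allows one
        System.out.println(relaxedMatch(capitalized));
        System.out.println(relaxedMatch(doubleSpace));
    }
}
```

The strict pattern rejects both sample strings, while the relaxed one accepts them. This only shows how the regex itself behaves; it says nothing about what text the operator actually passes to the regex.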
  • student24 Member Posts: 7 Contributor II
    I thought I ignore upper/lower case by using the operator "Transform Cases" with the option "lower case". For more than one whitespace character I could use \s+, but that doesn't work either.
    Why is my XML not valid? :)
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Corrected XML above.
  • student24 Member Posts: 7 Contributor II
    OK, thank you. I copied it, but the problem with the whitespace isn't solved. I don't know what I'm doing wrong.
    Is there maybe another operator or another way to search for an expression in documents?
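
One possible explanation, offered as a guess rather than a confirmed diagnosis: in the posted process the filter runs after Tokenize, so the regex is probably tested against one token at a time. If Tokenize splits on non-letter characters (an assumption about its settings here), no single token can contain whitespace, and a pattern with \s has nothing to match. The sketch below imitates that situation in plain Java; the tokenizer is a rough stand-in, not RapidMiner's implementation, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class TokenLevelMatchDemo {
    static final Pattern PHRASE =
            Pattern.compile("research\\s+and\\s+development");

    /** Rough stand-in for a tokenizer: lower-cases, then splits on non-letters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Keeps only the tokens that match the phrase pattern in full. */
    static List<String> filterTokens(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (PHRASE.matcher(t).matches()) {
                kept.add(t);
            }
        }
        return kept;
    }

    /** Searches the untokenized text for the phrase instead. */
    static boolean foundInText(String text) {
        return PHRASE.matcher(text.toLowerCase()).find();
    }

    public static void main(String[] args) {
        String text = "Our Research and Development team grew.";
        System.out.println(tokenize(text));               // single words only
        System.out.println(filterTokens(tokenize(text))); // nothing survives the filter
        System.out.println(foundInText(text));            // phrase is present in the raw text
    }
}
```

If this is what happens, the phrase would have to be matched before the text is split into words, or the words would have to be joined back into multi-word terms before filtering. Whether either route fits this process is not something the thread confirms.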