Cannot filter tokens (sentences) by content using regular expressions

KateSh Member Posts: 2 Learner I
edited September 2021 in Help
Hello everyone!
I'm new to text mining, and a seemingly simple task has turned out to be unsolvable for me :(

I have 50 PDF documents in English, and I need to extract the sentences that contain at least one modal verb (for further analysis).
Inside the "Process Documents from Files" operator I added the "Tokenize" (mode: linguistic sentences) and "Filter Tokens (by Content)" operators. In "Filter Tokens (by Content)" I entered the verbs separated by a vertical bar with no spaces, but it doesn't work: the results are empty. It works fine if I enter only one verb, but not if I enter several verbs separated by vertical bars. I tried every condition the operator offers, and none of them makes it work.
I would be very grateful for any help!
Here is my process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="313" y="120">
        <list key="text_directories">
          <parameter key="pdf" value="D:\Все\УЧЁБА\ВКР\Материал\Оригинальные"/>
        </list>
        <parameter key="file_pattern" value="*pdf"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="179" y="120">
            <parameter key="mode" value="linguistic sentences"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="380" y="120">
            <parameter key="condition" value="matches"/>
            <parameter key="string" value="can|could|may|might"/>
            <parameter key="regular_expression" value="can|could|may|might"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
          <connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>


Best Answers

  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Solution Accepted
    Hi @KateSh,

    Did you try setting the "condition" parameter to "contains" instead of "matches"?

    Regards,

    Lionel
  • lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Solution Accepted
    Hi again @KateSh,

    Otherwise, did you try the Filter Tokens Using Example Set operator? Check the tutorial process for this operator.

    Regards,

    Lionel
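
For readers wondering why "matches" returned nothing: a likely explanation (an assumption here, not confirmed in this thread) is that "matches" requires the whole token to match the expression. With sentence-level tokens, no full sentence equals just "can" or "might". The sketch below uses Python's `re` module as a stand-in for the regex engine RapidMiner uses; the sample sentences are made up for illustration, and the word-boundary variant is an extra suggestion rather than something from the answers above.

```python
import re

# Hypothetical sentence tokens, standing in for the output of
# Tokenize (mode: linguistic sentences).
sentences = [
    "You can submit the form online.",
    "The report was published last year.",
    "Results might differ between runs.",
    "We scanned every page.",
]

# The same alternation used in the original process.
pattern = re.compile(r"can|could|may|might")

# Whole-token matching: the entire sentence must equal one alternative,
# so no sentence passes the filter.
full = [s for s in sentences if pattern.fullmatch(s)]

# Substring matching: the pattern may occur anywhere in the sentence.
# Note that "scanned" also passes, because it contains "can".
found = [s for s in sentences if pattern.search(s)]

# Word boundaries (\b) restrict hits to standalone words.
bounded = re.compile(r"\b(?:can|could|may|might)\b")
strict = [s for s in sentences if bounded.search(s)]

print(len(full), len(found), len(strict))
```

If the operator's regular-expression conditions behave the same way, a pattern such as `.*\b(can|could|may|might)\b.*` would be the whole-token equivalent, but that should be checked against the operator's documentation.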

Answers

  • KateSh Member Posts: 2 Learner I
    Thank you so much, it helped!
    (Sorry I didn't answer earlier, I was very busy yesterday)