
"Difficulties using Filter Tokens (by Region) operator"

Troader Member Posts: 1 Contributor I
edited June 2019 in Help
I am using the Text Processing extension to extract information from patent files. If I use tokenization and some other filters (like the Stopword Filter), it works fine.
If I work with the Filter Tokens (by Region) operator, I get zero results. The condition is: contains "Klebstoff", not case sensitive. This expression appears many times in the documents that are read. Interestingly, if I select the option "contains", the program complains that the regular expression must be specified. In my understanding I need this regular expression only if I select the match condition. Am I wrong here?
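To make clear what I mean by "contains, not case sensitive": here is a small standalone check (outside RapidMiner, just plain java.util.regex, my own illustration rather than the operator's implementation) of the kind of pattern I would expect to be equivalent to that condition:

import java.util.regex.Pattern;

public class RegexCheck {
    public static void main(String[] args) {
        // (?i) makes the pattern case-insensitive; .* on both sides turns it
        // into a "contains" match instead of a full-token match.
        Pattern contains = Pattern.compile("(?i).*klebstoff.*");
        System.out.println(contains.matcher("Klebstoffzusammensetzung").matches()); // true
        System.out.println(contains.matcher("KLEBSTOFF").matches());                // true
        System.out.println(contains.matcher("Polymer").matches());                  // false
    }
}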

My idea is the automatic extraction of content around a given subject from patent files. Any help is welcome, I am working on my master thesis. :)
For the test I have put the same expression into both the regular expression and the search string condition. Without defining the regular expression the filter does not work.

<process expanded="true" height="251" width="614">
     <operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
       <list key="text_directories">
         <parameter key="B24" value="D:\Test_Information_Extraktion2\Deutsch"/>
       </list>
       <parameter key="file_pattern" value="*.pdf"/>
       <process expanded="true" height="466" width="882">
         <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30">
           <parameter key="language" value="German"/>
         </operator>
         <operator activated="true" class="text:transform_cases" compatibility="5.2.004" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="30">
           <parameter key="transform_to" value="upper case"/>
         </operator>
         <operator activated="true" class="text:filter_tokens_by_regions" compatibility="5.2.004" expanded="true" height="60" name="Filter Tokens (by Region)" width="90" x="332" y="30">
           <parameter key="string" value="KLEBSTOFF"/>
           <parameter key="regular_expression" value="KLEBSTOFF"/>
         </operator>
         <connect from_port="document" to_op="Tokenize" to_port="document"/>
         <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
         <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Region)" to_port="document"/>
         <connect from_op="Filter Tokens (by Region)" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="90"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <connect from_port="input 1" to_op="Process Documents from Files" to_port="word list"/>
     <connect from_op="Process Documents from Files" from_port="word list" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="source_input 2" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>

Answers

Skirzynski Member Posts: 164 Maven
Posting the XML of your process is a good idea to get help, but unfortunately the XML is not valid. :'(