
Using "Cut Document" Operator neglects numbers and punctuation in HTML text

Limegreenman900 Member Posts: 6 Contributor II
edited November 2018 in Help
Hi everyone,

I am currently using the "Cut Document" operator with query type "Regular Region" to extract specific text from locally stored HTML files.
This works well so far, but it seems that all numbers in the text are being dropped.

For example, the original text:
<td style=" width:100.00%; text-align:justify; " class="ta_10"><span class="ta_10">Companies Act 2006. Our audit work has been undertaken so that we might state to the company's members those</span></td>
<td style=" width:100.00%; text-align:justify; " class="ta_10"><span class="ta_10">concerning the cost of the fixed asset investment, stated at £51,925 in note 6  to the financial statements.</span></td>

Text after extraction:
Companies Act Our audit work has been undertaken so that we might state to the company s members those
concerning the cost of the fixed asset investment stated at  in note to the financial statements

Punctuation characters such as , and . are dropped as well. Does anyone know whether there is a setting that keeps both punctuation characters and numbers?

My process currently looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="text:read_document" compatibility="5.3.002" expanded="true" height="60" name="Read Document" width="90" x="112" y="30">
       <parameter key="file" value="C:\Users\Independent Auditors Report\Prod224_0010_00178176_20131231.html"/>
       <parameter key="extract_text_only" value="false"/>
     </operator>
     <operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document" width="90" x="246" y="30">
       <parameter key="query_type" value="Regular Region"/>
       <list key="string_machting_queries"/>
       <list key="regular_expression_queries"/>
       <list key="regular_region_queries">
         <parameter key="Independent Report" value="(?i)(&gt;[^&gt;]+Independent Auditors(')? to[^&lt;]+&lt;).name=&quot;[^&quot;]+NameSeniorStatutoryAuditor&quot;"/>
       </list>
       <list key="xpath_queries"/>
       <list key="namespaces"/>
       <list key="index_queries"/>
       <process expanded="true">
         <operator activated="true" class="web:extract_html_text_content" compatibility="5.3.002" expanded="true" height="60" name="Extract Content (2)" width="90" x="112" y="30">
           <parameter key="minimum_text_block_length" value="3"/>
         </operator>
         <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="313" y="30"/>
         <operator activated="true" class="text:extract_token_number" compatibility="5.3.002" expanded="true" height="60" name="Extract Token Number" width="90" x="514" y="30"/>
         <connect from_port="segment" to_op="Extract Content (2)" to_port="document"/>
         <connect from_op="Extract Content (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
         <connect from_op="Tokenize (2)" from_port="document" to_op="Extract Token Number" to_port="document"/>
         <connect from_op="Extract Token Number" from_port="document" to_port="document 1"/>
         <portSpacing port="source_segment" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="112" y="120">
       <list key="text_directories">
         <parameter key="test" value="C:\Users\ndependent Auditors Report\Teil 1"/>
       </list>
       <process expanded="true">
         <operator activated="true" class="web:extract_html_text_content" compatibility="5.3.002" expanded="true" height="60" name="Extract Content" width="90" x="45" y="30"/>
         <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
         <connect from_port="document" to_op="Extract Content" to_port="document"/>
         <connect from_op="Extract Content" from_port="document" to_op="Tokenize" to_port="document"/>
         <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <connect from_op="Read Document" from_port="output" to_op="Cut Document" to_port="document"/>
     <connect from_op="Cut Document" from_port="documents" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>

Answers

  • Limegreenman900 Member Posts: 6 Contributor II
    OK, it looks like this was caused by the "Tokenize" operator I used inside "Cut Document". If I run my process without it, I get plain text with punctuation and numbers.

    If I use "linguistic tokens - english" as the setting in the Tokenize operator, it works perfectly.
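
For anyone applying the same fix in the process XML rather than in the GUI, the inner "Tokenize (2)" operator might look roughly like the sketch below. The parameter keys "mode" and "language" and their values are assumptions based on how the Tokenize operator is usually serialized; they do not appear in the process posted above, so check them against your installed Text Processing extension.

```xml
<!-- Sketch only: the "mode" and "language" parameters are assumed, not taken from the posted process -->
<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="313" y="30">
  <!-- Assumed: switches away from the default mode, which appears to split on every non-letter character -->
  <parameter key="mode" value="linguistic tokens"/>
  <!-- Assumed: language used by the linguistic tokenizer -->
  <parameter key="language" value="English"/>
</operator>
```

With no parameters set, as in the original process, Tokenize falls back to its default mode, which would explain why digits, the apostrophe in "company's" and the punctuation disappeared from the extracted text.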
