Removing HTTP Headers

hawle087 Member Posts: 1 Contributor I
edited November 2018 in Help
I'm trying to do some text analytics on a set of pre-downloaded HTML files, but unfortunately they also include the HTTP headers (e.g. Content-type: text/html). I've tried using Remove Document Parts with regular expressions to strip out the headers before passing the document to Extract Content, but for some reason the Extract Content operator ignores the removals. To test this, I set up a simple process that takes as input a text file containing the words "one two three". Remove Document Parts removes the word "one" (checked via breakpoint), but the final output still includes it. Can anyone help me understand why Extract Content is ignoring the prior removal, or suggest workarounds or alternative methods of removing HTTP headers from files?

Thanks.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
   <process expanded="true" height="460" width="899">
     <operator activated="true" class="text:process_document_from_file" compatibility="5.2.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
       <list key="text_directories">
         <parameter key="test" value="C:\Users\XXX\test_files"/>
       </list>
       <process expanded="true" height="460" width="899">
         <operator activated="true" class="text:remove_document_parts" compatibility="5.2.001" expanded="true" height="60" name="RM One" width="90" x="45" y="30">
           <parameter key="deletion_regex" value="one"/>
         </operator>
         <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="179" y="30">
           <parameter key="minimum_text_block_length" value="3"/>
         </operator>
         <connect from_port="document" to_op="RM One" to_port="document"/>
         <connect from_op="RM One" from_port="document" to_op="Extract Content" to_port="document"/>
         <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
     <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
     <portSpacing port="sink_result 3" spacing="0"/>
   </process>
 </operator>
</process>
Updated:

As a workaround I used Replace Tokens after the Extract Content operator, though this is less than ideal for pattern matching.

Answers

  • Nils_Woehler Member Posts: 463 Maven
    Hi,

    If you place a 'Combine Documents' operator after 'Remove Document Parts', it works for me.


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.003">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
       <process expanded="true" height="341" width="413">
         <operator activated="true" class="text:process_document_from_file" compatibility="5.2.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="313" y="75">
           <list key="text_directories">
             <parameter key="test" value="C:\Users\XXX\test"/>
           </list>
           <process expanded="true" height="461" width="889">
             <operator activated="true" class="text:remove_document_parts" compatibility="5.2.001" expanded="true" height="60" name="RM One" width="90" x="179" y="30">
               <parameter key="deletion_regex" value="one"/>
             </operator>
             <operator activated="true" class="text:combine_documents" compatibility="5.2.001" expanded="true" height="76" name="Combine Documents" width="90" x="313" y="30"/>
             <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="514" y="30">
               <parameter key="minimum_text_block_length" value="3"/>
             </operator>
             <connect from_port="document" to_op="RM One" to_port="document"/>
             <connect from_op="RM One" from_port="document" to_op="Combine Documents" to_port="documents 1"/>
             <connect from_op="Combine Documents" from_port="document" to_op="Extract Content" to_port="document"/>
             <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
       </process>
     </operator>
    </process>
    Best,
    Nils
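
As an alternative to doing the removal inside the process, the headers could be stripped from the files before RapidMiner reads them. The following is only a minimal Python sketch under stated assumptions, not something from the posts above: it assumes each file is a saved HTTP response in which the header block ends at the first blank line, and the function name `strip_http_headers` is hypothetical. Folded (continuation) header lines are not handled.

```python
import re

# The header block of an HTTP message ends at the first blank line (CRLF or LF).
_HEADER_END = re.compile(rb"\r?\n\r?\n")

# A line counts as header-like if it is a status line ("HTTP/1.1 200 OK")
# or a "Name: value" field such as "Content-type: text/html".
_HEADER_LINE = re.compile(rb"^(HTTP/\d|[A-Za-z0-9-]+:)")


def strip_http_headers(raw: bytes) -> bytes:
    """Return the body of a saved HTTP response.

    The input is returned unchanged when no leading header block is detected,
    so files that were saved without headers are left alone.
    """
    match = _HEADER_END.search(raw)
    if match is None:
        return raw
    lines = raw[:match.start()].splitlines()
    # Only strip when every line before the blank line looks like a header.
    if lines and all(_HEADER_LINE.match(line) for line in lines):
        return raw[match.end():]
    return raw
```

One way to use this would be to read each file in binary mode, write the cleaned bytes to a separate folder, and point Process Documents from Files at that folder, so Extract Content only ever sees the HTML.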