Removing HTTP Headers

hawle087 · April 2012

I'm trying to do some text analytics on a set of pre-downloaded HTML files, but unfortunately the files also include the HTTP headers (e.g. Content-type: text/html). I've tried using Remove Document Parts with regular expressions to strip out the headers before passing the document to Extract Content, but for some reason the Extract Content operator ignores the removals.

To test this I set up a simple process that takes as input a text file containing the words "one two three". Remove Document Parts removes the word "one" (checked via breakpoint), but the final output still includes it. Can anyone help me understand why Extract Content is ignoring the prior removal, or suggest a workaround or an alternate method of removing HTTP headers from files?

Thanks.


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
    <process expanded="true" height="460" width="899">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.2.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
        <list key="text_directories">
          <parameter key="test" value="C:\Users\XXX\test_files"/>
        </list>
        <process expanded="true" height="460" width="899">
          <operator activated="true" class="text:remove_document_parts" compatibility="5.2.001" expanded="true" height="60" name="RM One" width="90" x="45" y="30">
            <parameter key="deletion_regex" value="one"/>
          </operator>
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="179" y="30">
            <parameter key="minimum_text_block_length" value="3"/>
          </operator>
          <connect from_port="document" to_op="RM One" to_port="document"/>
          <connect from_op="RM One" from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Update:

As a workaround, I used Replace Tokens after the Extract Content operator, though this is less than ideal for pattern matching.
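
Another option for the "alternate methods" part of the question is to strip the headers outside RapidMiner, before the files are loaded. The following is only a minimal Python sketch, not something taken from this thread: the function names `strip_http_headers` and `strip_headers_in_directory` are made up for illustration, the files are assumed to be UTF-8 text, and the header block is assumed to end at the first blank line. It is a heuristic, so a file whose own first lines look like "Name: value" would also be trimmed.

```python
import re
from pathlib import Path

# Assumption: a status line looks like "HTTP/1.1 200 OK" and a header
# line looks like "Name: value" (name made of HTTP token characters).
_STATUS_LINE = re.compile(r"^HTTP/\d(\.\d)? \d{3}")
_HEADER_LINE = re.compile(r"^[A-Za-z0-9!#$%&'*+.^_`|~-]+:")


def strip_http_headers(text: str) -> str:
    """Return text without a leading HTTP header block, if one is present."""
    # The header block ends at the first blank line (CRLF or LF endings).
    match = re.search(r"\r?\n\r?\n", text)
    if match is None:
        return text  # no blank line, so nothing that looks like headers

    lines = text[:match.start()].splitlines()
    if not lines:
        return text

    first, rest = lines[0], lines[1:]
    # Only strip when the leading block really looks like HTTP headers.
    if not (_STATUS_LINE.match(first) or _HEADER_LINE.match(first)):
        return text
    for line in rest:
        is_continuation = line.startswith((" ", "\t"))
        if not (_HEADER_LINE.match(line) or is_continuation):
            return text

    return text[match.end():]


def strip_headers_in_directory(src_dir: str, dst_dir: str) -> None:
    """Write header-free copies of every file in src_dir to dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in src.iterdir():
        if path.is_file():
            # Assumption: files are UTF-8; undecodable bytes are replaced.
            raw = path.read_text(encoding="utf-8", errors="replace")
            cleaned = strip_http_headers(raw)
            (dst / path.name).write_text(cleaned, encoding="utf-8")
```

Process Documents from Files could then be pointed at the cleaned output directory, so that Extract Content only ever sees the HTML body.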

Nils_Woehler · April 2012

Hi,

Placing a 'Combine Documents' operator after 'Remove Document Parts' worked for me.



<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
    <process expanded="true" height="341" width="413">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.2.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="313" y="75">
        <list key="text_directories">
          <parameter key="test" value="C:\Users\XXX\test"/>
        </list>
        <process expanded="true" height="461" width="889">
          <operator activated="true" class="text:remove_document_parts" compatibility="5.2.001" expanded="true" height="60" name="RM One" width="90" x="179" y="30">
            <parameter key="deletion_regex" value="one"/>
          </operator>
          <operator activated="true" class="text:combine_documents" compatibility="5.2.001" expanded="true" height="76" name="Combine Documents" width="90" x="313" y="30"/>
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="514" y="30">
            <parameter key="minimum_text_block_length" value="3"/>
          </operator>
          <connect from_port="document" to_op="RM One" to_port="document"/>
          <connect from_op="RM One" from_port="document" to_op="Combine Documents" to_port="documents 1"/>
          <connect from_op="Combine Documents" from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>

Best,
Nils
