Replace token and extract information inconsistencies

kayman · January 2015

I'm trying to extract content from a message board that contains pretty ugly HTML. I manage to clean it up, but what I see is not what I get, it seems.

I'm interested in getting the content within the div with the lia-message-body-content class, but I fail to get it working.

The process seems fairly straightforward: I'm using a few replace operators to clean up the mess, and I load the final cleaned result into an extract information operator. However, there seems to be a difference between what comes out of the final replace operator and what is used within the extract operator.

ORIGINAL INPUT


<html>
<head><title>my Title</title></head>
<body>

<div class="lia-message-body lia-component-body" id="messagebodydisplay" class="lia-message-body">
<div class="lia-message-body-content">
<!-- result but bad markup start -->
<HTML><HEAD></HEAD><BODY>
<P>Some content</P>
<P></P>
<P>following reply by e-mail :-</P>
<P></P>
<P>"some more content with an ugly break</P>
<P><BR />and rest of the content"</P>
<P></P>
<P>*****Some string with additional characters*****</P>
<P></P>
<P>Message was edited by: Somebody</P>
</BODY></HTML>
<!--bad markup end -->
</div>
</div>
</body>
</html>

The bad tags (the all-uppercase ones) are the ones I want to remove from the document. That seems to work fine, as shown by the output from the final replace operator:

OUTPUT FROM FINAL REPLACE


<html>
<head><title>my Title</title></head>
<body>
<div class="lia-message-body lia-component-body" id="messagebodydisplay_0" class="lia-message-body">

<div class="lia-message-body-content">
<p>Some content</p>
<p>following reply by e-mail :-</p>
<p>"some more content with an ugly break</p>
<p>and rest of the content"</p>
<p>*****Some string with additional characters*****</p>
<p>Message was edited by: Somebody</p>
</div>

</div>
</body>
</html>

So at first glance all looks OK: the bad tags are gone, and it looks like proper and valid HTML. This content is fed into the extract operator using the following XPath:

//h:div[@class="lia-message-body-content"]

OUTPUT FROM EXTRACT


<div xmlns="http://www.w3.org/1999/xhtml" class="lia-message-body-content" />

So no data apart from the actual div itself.

If I modify my source data and remove the bad tags upfront, I get the same output from my final replace operator, but a totally different result from my extract operator:
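
For comparison, this is roughly what I would expect the query to select if it really ran against the cleaned markup. It is only a minimal Python sketch using the standard library's ElementTree, not what RapidMiner does internally. It assumes the h: prefix is bound to the XHTML namespace that shows up in the extract output; the shortened `cleaned` string and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: "h" maps to the XHTML namespace seen in the extract output.
NAMESPACES = {"h": "http://www.w3.org/1999/xhtml"}

# Shortened, hypothetical stand-in for the cleaned document.
cleaned = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="lia-message-body-content">'
    '<p>Some content</p>'
    '<p>following reply by e-mail :-</p>'
    '</div>'
    '</body></html>'
)

root = ET.fromstring(cleaned)

# Same query as in the process, made relative to the root element.
matches = root.findall('.//h:div[@class="lia-message-body-content"]', NAMESPACES)

# Collect the text of the paragraphs directly under each matching div.
paragraphs = [p.text for div in matches for p in div.findall("h:p", NAMESPACES)]
print(paragraphs)
```

On cleaned markup like this the div comes back with its paragraphs, so the query itself does not look like the problem.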

NEW INPUT (pre-modified) :


<html>
<head><title>my Title</title></head>
<body>
<div class="lia-message-body lia-component-body" id="messagebodydisplay_0" class="lia-message-body">	
	<div class="lia-message-body-content">
		<P>Some content</P>
		<P></P>
		<P>following reply by e-mail :-</P>
		<P></P>
		<P>"some more content with an ugly break</P>
		<P><BR />and rest of the content"</P>
		<P></P>
		<P>*****Some string with additional characters*****</P>
		<P></P>
		<P>Message was edited by: Somebody</P>
	</div>
</div>
</body>
</html>

OUTPUT FROM FINAL REPLACE


<html>
<head><title>my Title</title></head>
<body>

<div class="lia-message-body lia-component-body" id="messagebodydisplay_0" class="lia-message-body">
<div class="lia-message-body-content">
<p>Some content</p><p>following reply by e-mail :-</p>
<p>"some more content with an ugly break</p>
<p>and rest of the content"</p>
<p>*****Some string with additional characters*****</p>
<p>Message was edited by: Somebody</p>
</div>

</div>
</body>
</html>

So it looks exactly the same as with the original data.

OUTPUT FROM EXTRACT


<div xmlns="http://www.w3.org/1999/xhtml" class="lia-message-body-content">
  <p>Some content</p>
  <p />
  <p>following reply by e-mail :-</p>
  <p />
  <p>"some more content with an ugly break</p>
  <p>
    <br clear="none" />
    and rest of the content"
  </p>
  <p />
  <p>*****Some string with additional characters*****</p>
  <p />
  <p>Message was edited by: Somebody</p>
</div>

Hmm... Now I'm getting a result, but the empty tags and breaks I supposedly removed in the previous steps are suddenly back. So there seems to be no relation to my cleaned data at all. Apparently the extract operator is using the source data (from before the cleaning took place), and therefore the original input provides no output, as the page gets broken down into separate html containers.

Any idea where it goes wrong, or what step I am missing?

This is the XML as used in the project:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.005">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.005" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="246" y="210">
        <list key="text_directories">
          <parameter key="tv" value="D:\mining\TV"/>
        </list>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="use_file_extension_as_type" value="false"/>
        <parameter key="encoding" value="UTF-8"/>
        <parameter key="create_word_vector" value="false"/>
        <process expanded="true">
          <operator activated="true" class="text:replace_tokens" compatibility="5.3.002" expanded="true" height="60" name="Clean Up Step 1" width="90" x="112" y="30">
            <list key="replace_dictionary">
              <parameter key="[\t\r\n]" value=" "/>
              <parameter key="&lt;script\b[^&gt;]*&gt;(.*?)&lt;/script&gt;" value=" "/>
              <parameter key="&lt;span\b[^&gt;]*&gt;\s?&lt;/span&gt;" value=" "/>
              <parameter key="&lt;i\b[^&gt;]*&gt;\s?&lt;/i&gt;" value=" "/>
              <parameter key="(?i)&lt;p\b[^&gt;]*&gt;\s?&lt;/p&gt;" value=" "/>
              <parameter key="&lt;link\b[^&gt;]*&gt;(.*?)&lt;/link&gt;" value=" "/>
              <parameter key="&lt;BR /&gt;" value=" "/>
            </list>
          </operator>
          <operator activated="true" class="text:replace_tokens" compatibility="5.3.002" expanded="true" height="60" name="fix bad para tags" width="90" x="112" y="120">
            <list key="replace_dictionary">
              <parameter key="&lt;(/?)P&gt;" value="&lt;$1p&gt;"/>
            </list>
          </operator>
          <operator activated="true" class="text:replace_tokens" compatibility="5.3.002" expanded="true" height="60" name="remove bad tags" width="90" x="112" y="210">
            <list key="replace_dictionary">
              <parameter key="&lt;/?[A-Z]+/?&gt;" value=" "/>
            </list>
          </operator>
          <operator activated="true" class="text:replace_tokens" compatibility="5.3.002" expanded="true" height="60" name="trim spaces" width="90" x="112" y="300">
            <list key="replace_dictionary">
              <parameter key="\s{2,}" value=" "/>
              <parameter key="&gt;\s&lt;" value="&gt;&lt;"/>
              <parameter key="&gt;\s" value="&gt;"/>
            </list>
          </operator>
          <operator activated="true" class="text:extract_information" compatibility="5.3.002" expanded="true" height="60" name="Extract Information" width="90" x="246" y="300">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="response" value="//h:div[@class=&quot;lia-message-body-content&quot;]"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_port="document" to_op="Clean Up Step 1" to_port="document"/>
          <connect from_op="Clean Up Step 1" from_port="document" to_op="fix bad para tags" to_port="document"/>
          <connect from_op="fix bad para tags" from_port="document" to_op="remove bad tags" to_port="document"/>
          <connect from_op="remove bad tags" from_port="document" to_op="trim spaces" to_port="document"/>
          <connect from_op="trim spaces" from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
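
To rule out the regular expressions themselves, the four replace steps can be approximated outside RapidMiner. The following is a minimal Python sketch, assuming each replace dictionary is applied top to bottom, one rule at a time (I have not verified that this is how the operator orders them). The `clean` function and the shortened `raw` sample are hypothetical; the patterns are copied from the process above, with Java's `$1` written as `\g<1>`.

```python
import re

# Replace rules copied from the four Replace Tokens operators.
# Assumption: rules are applied in the order listed, one after another.
STEPS = [
    [  # "Clean Up Step 1"
        (r"[\t\r\n]", " "),
        (r"<script\b[^>]*>(.*?)</script>", " "),
        (r"<span\b[^>]*>\s?</span>", " "),
        (r"<i\b[^>]*>\s?</i>", " "),
        (r"(?i)<p\b[^>]*>\s?</p>", " "),
        (r"<link\b[^>]*>(.*?)</link>", " "),
        (r"<BR />", " "),
    ],
    [(r"<(/?)P>", r"<\g<1>p>")],  # "fix bad para tags"
    [(r"</?[A-Z]+/?>", " ")],     # "remove bad tags"
    [(r"\s{2,}", " "), (r">\s<", "><"), (r">\s", ">")],  # "trim spaces"
]


def clean(text):
    """Apply every rule of every step in sequence and return the result."""
    for rules in STEPS:
        for pattern, replacement in rules:
            text = re.sub(pattern, replacement, text)
    return text


# Shortened sample of the bad markup from the original input.
raw = (
    '<div class="lia-message-body-content">\n'
    '<HTML><HEAD></HEAD><BODY>\n'
    '<P>Some content</P>\n'
    '<P></P>\n'
    '<P><BR />and rest of the content"</P>\n'
    '</BODY></HTML>\n'
    '</div>'
)

cleaned = clean(raw)
print(cleaned)
# <div class="lia-message-body-content"><p>Some content</p><p>and rest of the content"</p></div>
```

If this sketch matches what the replace operators do, the expressions are fine and the cleaned string is what I expect, which leaves the question of what the Extract Information operator actually reads.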
