(Solved) Removing tags from extracted data

jx820 · July 2012

I'm very new and am starting out with a scraping process. It doesn't serve any real purpose; I'm just experimenting to learn. My process was originally based on Neil McGuigan's tutorials on the Vancouver Data Blog, but it has grown a bit as I try new things.

Currently I'm crawling with the Process Documents from Web operator and using Extract Information as a subprocess. I'm querying 9 attributes with XPath queries. Finally, I use Write Excel to output the data into a spreadsheet. All of that works fine.

The problem is that the extracted information contains HTML tags, specifically H1 and TD tags, and I can't find a way to remove them. I've tried the Extract Content, Remove Document Parts, and Replace operators. So far nothing has worked.

This is what a typical result looks like:

<td xmlns="http://www.w3.org/1999/xhtml" colspan="1" rowspan="1">33</td>

But all I need is the 33.

Here's the XML behind my process:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <parameter key="logfile" value="C:\Users\Public\Documents\Rapidminer Repository\logfile"/>
    <parameter key="resultfile" value="C:\Users\Public\Documents\Rapidminer Repository\resultfile"/>
    <process expanded="true" height="620" width="435">
      <operator activated="true" class="web:process_web" compatibility="5.2.001" expanded="true" height="60" name="Process Documents from Web" width="90" x="45" y="30">
        <parameter key="url" value="http://www.mixedmartialarts.com/f/1BC00DA3949506AC/BJ-Penn/"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value="http://www\.mixedmartialarts\.com/f/.*"/>
        </list>
        <parameter key="max_pages" value="6"/>
        <parameter key="max_depth" value="4"/>
        <parameter key="domain" value="server"/>
        <parameter key="delay" value="5000"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120403211507 Firefox/14.0.1"/>
        <parameter key="parallelize_process_webpage" value="true"/>
        <process expanded="true" height="620" width="433">
          <operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information" width="90" x="179" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="Fighter" value="//h:div[@class='Resume']/h:h1"/>
              <parameter key="Pro Record" value="//h:*[contains(.,'Pro Record:')]/../h:td[last()]"/>
              <parameter key="Team" value="//h:*[contains(.,'Team:')]/../h:td[last()]"/>
              <parameter key="Age" value="//h:*[contains(.,'Age:')]/../h:td[last()]"/>
              <parameter key="Sex" value="//h:*[contains(.,'Sex:')]/../h:td[last()]"/>
              <parameter key="Height" value="//h:*[contains(.,'Height:')]/../h:td[last()]"/>
              <parameter key="Weight" value="//h:*[contains(.,'Weight:')]/../h:td[last()]"/>
              <parameter key="Out of" value="//h:*[contains(.,'Out of:')]/../h:td[last()]"/>
              <parameter key="From" value="//h:*[contains(.,'Born:')]/../h:td[last()]"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="write_excel" compatibility="5.2.008" expanded="true" height="76" name="Write Excel" width="90" x="313" y="30">
        <parameter key="excel_file" value="C:\Users\Public\Documents\Rapidminer Repository\Results\Results.xls"/>
      </operator>
      <connect from_op="Process Documents from Web" from_port="example set" to_op="Write Excel" to_port="input"/>
      <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="18"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

I checked the FAQ, the tutorials, and searched the forums, but I haven't found anything. Any suggestions?

Nils_Woehler · July 2012

Hi,

You can append the XPath text() node test to each query, so that it selects the text inside the element instead of the element itself:



<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <parameter key="logfile" value="C:\Users\Public\Documents\Rapidminer Repository\logfile"/>
    <parameter key="resultfile" value="C:\Users\Public\Documents\Rapidminer Repository\resultfile"/>
    <process expanded="true" height="620" width="435">
      <operator activated="true" class="web:process_web" compatibility="5.2.001" expanded="true" height="60" name="Process Documents from Web" width="90" x="45" y="30">
        <parameter key="url" value="http://www.mixedmartialarts.com/f/1BC00DA3949506AC/BJ-Penn/"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value="http://www\.mixedmartialarts\.com/f/.*"/>
        </list>
        <parameter key="max_pages" value="6"/>
        <parameter key="max_depth" value="4"/>
        <parameter key="domain" value="server"/>
        <parameter key="delay" value="5000"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120403211507 Firefox/14.0.1"/>
        <process expanded="true" height="620" width="433">
          <operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information" width="90" x="179" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="Fighter" value="//h:div[@class='Resume']/h:h1/text()"/>
              <parameter key="Pro Record" value="//h:*[contains(.,'Pro Record:')]/../h:td[last()]/text()"/>
              <parameter key="Team" value="//h:*[contains(.,'Team:')]/../h:td[last()]/text()"/>
              <parameter key="Age" value="//h:*[contains(.,'Age:')]/../h:td[last()]/text()"/>
              <parameter key="Sex" value="//h:*[contains(.,'Sex:')]/../h:td[last()]/text()"/>
              <parameter key="Height" value="//h:*[contains(.,'Height:')]/../h:td[last()]/text()"/>
              <parameter key="Weight" value="//h:*[contains(.,'Weight:')]/../h:td[last()]/text()"/>
              <parameter key="Out of" value="//h:*[contains(.,'Out of:')]/../h:td[last()]/text()"/>
              <parameter key="From" value="//h:*[contains(.,'Born:')]/../h:td[last()]/text()"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="write_excel" compatibility="5.2.008" expanded="true" height="76" name="Write Excel" width="90" x="313" y="30">
        <parameter key="excel_file" value="C:\Users\nwoehler\Desktop\Results.xls"/>
      </operator>
      <connect from_op="Process Documents from Web" from_port="example set" to_op="Write Excel" to_port="input"/>
      <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="18"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
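
To see why appending text() helps, here is a minimal sketch outside RapidMiner, using only Python's standard library. The table markup, the "Example Team" value, and the helper name last_cell are made up for illustration and are not taken from the crawled site. Python's ElementTree does not support text() in its path syntax, so the sketch reads the element's .text attribute, which for a cell holding only text gives the same string that td/text() selects.

```python
import xml.etree.ElementTree as ET

# The "h" prefix is bound to the XHTML namespace, as in the queries above.
NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical page fragment; the real site's markup may differ.
PAGE = """<table xmlns="http://www.w3.org/1999/xhtml">
  <tr><td>Age:</td><td colspan="1" rowspan="1">33</td></tr>
  <tr><td>Team:</td><td colspan="1" rowspan="1">Example Team</td></tr>
</table>"""


def last_cell(root, label):
    """Return the last <td> of the first row whose text contains `label`."""
    for row in root.findall(".//h:tr", NS):
        if label in "".join(row.itertext()):
            return row.findall("h:td", NS)[-1]
    return None


root = ET.fromstring(PAGE)
cell = last_cell(root, "Age:")

# Selecting the element: serializing it brings the tag and attributes along.
element_markup = ET.tostring(cell, encoding="unicode").strip()

# Selecting only the text node: just the value.
text_only = cell.text

print(element_markup)
print(text_only)
```

The first print shows the whole cell with its tag and attributes, which corresponds to what the original queries returned. The second shows only the value, which corresponds to the queries ending in /text().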

Best,
Nils

jx820 · July 2012

That worked perfectly, and it was much easier than expected. Thank you.
