(Solved) Removing tags from extracted data

jx820 Member Posts: 7 Contributor II
edited November 2018 in Help
I'm very new and just starting out with a scraping process. It doesn't serve any real purpose; I'm just playing around to learn. The process was originally based on Neil McGuigan's tutorials on the Vancouver Data Blog, but it has grown a bit as I try new things.

Currently I'm crawling with the Process Documents from Web operator and using Extract Information as a subprocess. I'm querying 9 attributes with XPath queries. Finally, I use Write Excel to output the data to a spreadsheet. All of that works fine.

The problem is that the extracted information contains HTML tags, specifically H1 and TD tags, and I can't find a way to remove them. I've tried the Extract Content, Remove Document Parts, and Replace operators. So far nothing has worked.

This is what a typical result looks like:
<td xmlns="http://www.w3.org/1999/xhtml" colspan="1" rowspan="1">33</td>
But all I need is the 33.



Here's the XML behind my process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <parameter key="logfile" value="C:\Users\Public\Documents\Rapidminer Repository\logfile"/>
    <parameter key="resultfile" value="C:\Users\Public\Documents\Rapidminer Repository\resultfile"/>
    <process expanded="true" height="620" width="435">
      <operator activated="true" class="web:process_web" compatibility="5.2.001" expanded="true" height="60" name="Process Documents from Web" width="90" x="45" y="30">
        <parameter key="url" value="http://www.mixedmartialarts.com/f/1BC00DA3949506AC/BJ-Penn/"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value="http://www\.mixedmartialarts\.com/f/.*"/>
        </list>
        <parameter key="max_pages" value="6"/>
        <parameter key="max_depth" value="4"/>
        <parameter key="domain" value="server"/>
        <parameter key="delay" value="5000"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120403211507 Firefox/14.0.1"/>
        <parameter key="parallelize_process_webpage" value="true"/>
        <process expanded="true" height="620" width="433">
          <operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information" width="90" x="179" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="Fighter" value="//h:div[@class='Resume']/h:h1"/>
              <parameter key="Pro Record" value="//h:*[contains(.,'Pro Record:')]/../h:td[last()]"/>
              <parameter key="Team" value="//h:*[contains(.,'Team:')]/../h:td[last()]"/>
              <parameter key="Age" value="//h:*[contains(.,'Age:')]/../h:td[last()]"/>
              <parameter key="Sex" value="//h:*[contains(.,'Sex:')]/../h:td[last()]"/>
              <parameter key="Height" value="//h:*[contains(.,'Height:')]/../h:td[last()]"/>
              <parameter key="Weight" value="//h:*[contains(.,'Weight:')]/../h:td[last()]"/>
              <parameter key="Out of" value="//h:*[contains(.,'Out of:')]/../h:td[last()]"/>
              <parameter key="From" value="//h:*[contains(.,'Born:')]/../h:td[last()]"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="write_excel" compatibility="5.2.008" expanded="true" height="76" name="Write Excel" width="90" x="313" y="30">
        <parameter key="excel_file" value="C:\Users\Public\Documents\Rapidminer Repository\Results\Results.xls"/>
      </operator>
      <connect from_op="Process Documents from Web" from_port="example set" to_op="Write Excel" to_port="input"/>
      <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="18"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
I checked the FAQ and the tutorials and searched the forums, but I haven't found anything. Any suggestions?

Answers

  • Nils_Woehler Member Posts: 463 Maven
    Hi,

    You can append the XPath text() node test to each query so that only the text inside the element is returned:


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <parameter key="logfile" value="C:\Users\Public\Documents\Rapidminer Repository\logfile"/>
        <parameter key="resultfile" value="C:\Users\Public\Documents\Rapidminer Repository\resultfile"/>
        <process expanded="true" height="620" width="435">
          <operator activated="true" class="web:process_web" compatibility="5.2.001" expanded="true" height="60" name="Process Documents from Web" width="90" x="45" y="30">
            <parameter key="url" value="http://www.mixedmartialarts.com/f/1BC00DA3949506AC/BJ-Penn/"/>
            <list key="crawling_rules">
              <parameter key="follow_link_with_matching_url" value="http://www\.mixedmartialarts\.com/f/.*"/>
            </list>
            <parameter key="max_pages" value="6"/>
            <parameter key="max_depth" value="4"/>
            <parameter key="domain" value="server"/>
            <parameter key="delay" value="5000"/>
            <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120403211507 Firefox/14.0.1"/>
            <process expanded="true" height="620" width="433">
              <operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information" width="90" x="179" y="30">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="Fighter" value="//h:div[@class='Resume']/h:h1/text()"/>
                  <parameter key="Pro Record" value="//h:*[contains(.,'Pro Record:')]/../h:td[last()]/text()"/>
                  <parameter key="Team" value="//h:*[contains(.,'Team:')]/../h:td[last()]/text()"/>
                  <parameter key="Age" value="//h:*[contains(.,'Age:')]/../h:td[last()]/text()"/>
                  <parameter key="Sex" value="//h:*[contains(.,'Sex:')]/../h:td[last()]/text()"/>
                  <parameter key="Height" value="//h:*[contains(.,'Height:')]/../h:td[last()]/text()"/>
                  <parameter key="Weight" value="//h:*[contains(.,'Weight:')]/../h:td[last()]/text()"/>
                  <parameter key="Out of" value="//h:*[contains(.,'Out of:')]/../h:td[last()]/text()"/>
                  <parameter key="From" value="//h:*[contains(.,'Born:')]/../h:td[last()]/text()"/>
                </list>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="document" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="write_excel" compatibility="5.2.008" expanded="true" height="76" name="Write Excel" width="90" x="313" y="30">
            <parameter key="excel_file" value="C:\Users\nwoehler\Desktop\Results.xls"/>
          </operator>
          <connect from_op="Process Documents from Web" from_port="example set" to_op="Write Excel" to_port="input"/>
          <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="18"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Best,
    Nils
  • jx820 Member Posts: 7 Contributor II
    That worked perfectly, and it was much easier than expected. Thank you.
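
Why appending /text() helps: a query such as //h:td[last()] selects the td element node, so the whole element, markup included, ends up in the attribute, whereas /text() selects the text node inside it. The sketch below illustrates that difference outside RapidMiner, using only Python's standard library. The XHTML fragment and the helper name last_cell are made up for illustration, and because ElementTree's XPath subset has no text() step, the element's .text attribute stands in for it in this simple case.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment, modelled on the result shown in the question.
XHTML = (
    '<table xmlns="http://www.w3.org/1999/xhtml">'
    '<tr><td colspan="1" rowspan="1">Age:</td>'
    '<td colspan="1" rowspan="1">33</td></tr>'
    '</table>'
)

# Prefix "h" bound to the XHTML namespace, as in the queries above.
NS = {"h": "http://www.w3.org/1999/xhtml"}


def last_cell(xhtml: str):
    """Return the last <td> element of the first row (an element node)."""
    root = ET.fromstring(xhtml)
    return root.find(".//h:tr/h:td[last()]", NS)


cell = last_cell(XHTML)

# Serializing the element node reproduces the tag, its attributes and the
# namespace declaration, which is the kind of value the original queries returned.
print(ET.tostring(cell, encoding="unicode"))

# The text inside the element, which is what a trailing /text() selects here.
print(cell.text)  # prints 33
```

Keep in mind that in XPath 1.0 /text() selects only the text nodes directly under the element, so a cell with nested markup (a link, for example) may need a different query.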