
(Solved) Removing tags from extracted data

jx820 Member Posts: 7 Contributor II
edited November 2018 in Help
I'm very new and am starting with a scraping process. It doesn't serve any real purpose; I'm just playing around to learn. My process was originally based on Neil McGuigan's tutorials on the Vancouver Data Blog, but it has grown a bit as I try new things.

Currently I'm crawling with the Process Documents from Web operator and using Extract Information as a subprocess. I'm querying 9 attributes with XPath queries. Finally, I use Write Excel to output the data to a spreadsheet. All of that works fine.

The problem is that the extracted information contains HTML tags, specifically H1 and TD tags, and I can't find a way to remove them. I've tried the Extract Content, Remove Document Parts, and Replace operators. So far nothing has worked.

This is what a typical result looks like:
<td xmlns="http://www.w3.org/1999/xhtml" colspan="1" rowspan="1">33</td>
But all I need is the 33.



Here's the XML behind my process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <parameter key="logfile" value="C:\Users\Public\Documents\Rapidminer Repository\logfile"/>
    <parameter key="resultfile" value="C:\Users\Public\Documents\Rapidminer Repository\resultfile"/>
    <process expanded="true" height="620" width="435">
      <operator activated="true" class="web:process_web" compatibility="5.2.001" expanded="true" height="60" name="Process Documents from Web" width="90" x="45" y="30">
        <parameter key="url" value="http://www.mixedmartialarts.com/f/1BC00DA3949506AC/BJ-Penn/"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value="http://www\.mixedmartialarts\.com/f/.*"/>
        </list>
        <parameter key="max_pages" value="6"/>
        <parameter key="max_depth" value="4"/>
        <parameter key="domain" value="server"/>
        <parameter key="delay" value="5000"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120403211507 Firefox/14.0.1"/>
        <parameter key="parallelize_process_webpage" value="true"/>
        <process expanded="true" height="620" width="433">
          <operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information" width="90" x="179" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="Fighter" value="//h:div[@class='Resume']/h:h1"/>
              <parameter key="Pro Record" value="//h:*[contains(.,'Pro Record:')]/../h:td[last()]"/>
              <parameter key="Team" value="//h:*[contains(.,'Team:')]/../h:td[last()]"/>
              <parameter key="Age" value="//h:*[contains(.,'Age:')]/../h:td[last()]"/>
              <parameter key="Sex" value="//h:*[contains(.,'Sex:')]/../h:td[last()]"/>
              <parameter key="Height" value="//h:*[contains(.,'Height:')]/../h:td[last()]"/>
              <parameter key="Weight" value="//h:*[contains(.,'Weight:')]/../h:td[last()]"/>
              <parameter key="Out of" value="//h:*[contains(.,'Out of:')]/../h:td[last()]"/>
              <parameter key="From" value="//h:*[contains(.,'Born:')]/../h:td[last()]"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="write_excel" compatibility="5.2.008" expanded="true" height="76" name="Write Excel" width="90" x="313" y="30">
        <parameter key="excel_file" value="C:\Users\Public\Documents\Rapidminer Repository\Results\Results.xls"/>
      </operator>
      <connect from_op="Process Documents from Web" from_port="example set" to_op="Write Excel" to_port="input"/>
      <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="18"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
I checked the FAQ and the tutorials and searched the forums, but I haven't found anything. Any suggestions?

Answers

  • Nils_Woehler Member Posts: 463 Maven
    Hi,

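    you can append the XPath text() node test to each query, so that only the text inside the matched element is returned instead of the element itself: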


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <parameter key="logfile" value="C:\Users\Public\Documents\Rapidminer Repository\logfile"/>
        <parameter key="resultfile" value="C:\Users\Public\Documents\Rapidminer Repository\resultfile"/>
        <process expanded="true" height="620" width="435">
          <operator activated="true" class="web:process_web" compatibility="5.2.001" expanded="true" height="60" name="Process Documents from Web" width="90" x="45" y="30">
            <parameter key="url" value="http://www.mixedmartialarts.com/f/1BC00DA3949506AC/BJ-Penn/"/>
            <list key="crawling_rules">
              <parameter key="follow_link_with_matching_url" value="http://www\.mixedmartialarts\.com/f/.*"/>
            </list>
            <parameter key="max_pages" value="6"/>
            <parameter key="max_depth" value="4"/>
            <parameter key="domain" value="server"/>
            <parameter key="delay" value="5000"/>
            <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120403211507 Firefox/14.0.1"/>
            <process expanded="true" height="620" width="433">
              <operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information" width="90" x="179" y="30">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="Fighter" value="//h:div[@class='Resume']/h:h1/text()"/>
                  <parameter key="Pro Record" value="//h:*[contains(.,'Pro Record:')]/../h:td[last()]/text()"/>
                  <parameter key="Team" value="//h:*[contains(.,'Team:')]/../h:td[last()]/text()"/>
                  <parameter key="Age" value="//h:*[contains(.,'Age:')]/../h:td[last()]/text()"/>
                  <parameter key="Sex" value="//h:*[contains(.,'Sex:')]/../h:td[last()]/text()"/>
                  <parameter key="Height" value="//h:*[contains(.,'Height:')]/../h:td[last()]/text()"/>
                  <parameter key="Weight" value="//h:*[contains(.,'Weight:')]/../h:td[last()]/text()"/>
                  <parameter key="Out of" value="//h:*[contains(.,'Out of:')]/../h:td[last()]/text()"/>
                  <parameter key="From" value="//h:*[contains(.,'Born:')]/../h:td[last()]/text()"/>
                </list>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="document" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="write_excel" compatibility="5.2.008" expanded="true" height="76" name="Write Excel" width="90" x="313" y="30">
            <parameter key="excel_file" value="C:\Users\nwoehler\Desktop\Results.xls"/>
          </operator>
          <connect from_op="Process Documents from Web" from_port="example set" to_op="Write Excel" to_port="input"/>
          <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="18"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Best,
    Nils
  • jx820 Member Posts: 7 Contributor II
    That worked perfectly, and it was much easier than expected. Thank you.
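
For readers who want to see the same idea outside RapidMiner, here is a minimal sketch in Python using only the standard library. It contrasts selecting an element (which serializes with its tags) with selecting only its text. The sample table and the names PAGE, select_cell, cell_markup, and cell_text are invented for this illustration and are not part of the original process. ElementTree's limited XPath subset has no text() step, so the element's .text attribute stands in for it here.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping, mirroring the "h:" prefix used in the queries above.
NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical XHTML fragment shaped like the cell shown in the question.
PAGE = (
    '<table xmlns="http://www.w3.org/1999/xhtml">'
    '<tr><td>Age:</td><td colspan="1" rowspan="1">33</td></tr>'
    '</table>'
)


def select_cell(page):
    """Return the last <td> of the row, i.e. the element node itself."""
    root = ET.fromstring(page)
    return root.find(".//h:tr/h:td[last()]", NS)


def cell_markup(page):
    """Serialize the selected element: the result still carries the tags."""
    return ET.tostring(select_cell(page), encoding="unicode")


def cell_text(page):
    """Return only the text inside the element, as /text() would in full XPath."""
    return select_cell(page).text


if __name__ == "__main__":
    print(cell_markup(PAGE))  # element with tags, namespace, and attributes
    print(cell_text(PAGE))    # just the value
```

Selecting the element corresponds to the original queries ending in h:td[last()], while reading only its text corresponds to the corrected queries ending in /text().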