XPath returns empty values

OlliSchulz Member Posts: 2 Contributor I
edited November 2018 in Help
Hello everyone,

I just started using RapidMiner and so far it's doing everything I want it to do. However, I have run into a problem that I can't solve by myself.

I crawled a lot of HTML files and want to extract certain data from them using XPath. I am using the "Process Documents from Files" operator combined with the "Extract Information" operator.
I want to extract data for the attributes "Datum", "Zeit", "Titel" and "Link". I receive correct values for 3 of the 4 attributes, but I don't receive any values for the attribute "Titel".
I tried different XPath expressions, but none of them works.

I hope you can help me with this small problem.

Please find my RapidMiner settings and the structure of the html file I want to extract data from below:

RapidMiner settings
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
    <process expanded="true" height="374" width="434">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.1.002" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="208">
        <list key="text_directories">
          <parameter key="63DU" value="C:\Users\Admin\Desktop\RapidMiner\63DU"/>
          <parameter key="ADS" value="C:\Users\Admin\Desktop\RapidMiner\ADS"/>
          <parameter key="ALV" value="C:\Users\Admin\Desktop\RapidMiner\ALV"/>
          <parameter key="BAS" value="C:\Users\Admin\Desktop\RapidMiner\BAS"/>
          <parameter key="BAY" value="C:\Users\Admin\Desktop\RapidMiner\BAY"/>
          <parameter key="BEI" value="C:\Users\Admin\Desktop\RapidMiner\BEI"/>
          <parameter key="BMW" value="C:\Users\Admin\Desktop\RapidMiner\BMW"/>
          <parameter key="CBK" value="C:\Users\Admin\Desktop\RapidMiner\CBK"/>
          <parameter key="DAI" value="C:\Users\Admin\Desktop\RapidMiner\DAI"/>
          <parameter key="DBK" value="C:\Users\Admin\Desktop\RapidMiner\DBK"/>
          <parameter key="DPW" value="C:\Users\Admin\Desktop\RapidMiner\DPW"/>
          <parameter key="DTE" value="C:\Users\Admin\Desktop\RapidMiner\DTE"/>
          <parameter key="EOAN" value="C:\Users\Admin\Desktop\RapidMiner\EOAN"/>
          <parameter key="FME" value="C:\Users\Admin\Desktop\RapidMiner\FME"/>
          <parameter key="FRE" value="C:\Users\Admin\Desktop\RapidMiner\FRE"/>
          <parameter key="HEI" value="C:\Users\Admin\Desktop\RapidMiner\HEI"/>
          <parameter key="HEN3" value="C:\Users\Admin\Desktop\RapidMiner\HEN3"/>
          <parameter key="IFX" value="C:\Users\Admin\Desktop\RapidMiner\IFX"/>
          <parameter key="LHA" value="C:\Users\Admin\Desktop\RapidMiner\LHA"/>
          <parameter key="LIN" value="C:\Users\Admin\Desktop\RapidMiner\LIN"/>
          <parameter key="MAN" value="C:\Users\Admin\Desktop\RapidMiner\MAN"/>
          <parameter key="MEO" value="C:\Users\Admin\Desktop\RapidMiner\MEO"/>
          <parameter key="MRK" value="C:\Users\Admin\Desktop\RapidMiner\MRK"/>
          <parameter key="MUV2" value="C:\Users\Admin\Desktop\RapidMiner\MUV2"/>
          <parameter key="RWE" value="C:\Users\Admin\Desktop\RapidMiner\RWE"/>
          <parameter key="SAP" value="C:\Users\Admin\Desktop\RapidMiner\SAP"/>
          <parameter key="SDF" value="C:\Users\Admin\Desktop\RapidMiner\SDF"/>
          <parameter key="SIE" value="C:\Users\Admin\Desktop\RapidMiner\SIE"/>
          <parameter key="TKA" value="C:\Users\Admin\Desktop\RapidMiner\TKA"/>
        </list>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="create_word_vector" value="false"/>
        <process expanded="true" height="392" width="452">
          <operator activated="true" class="text:extract_information" compatibility="5.1.002" expanded="true" height="60" name="Extract Information" width="90" x="112" y="165">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="Datum" value="//h:td[@class='DATUM']/text()"/>
              <parameter key="Zeit" value="//h:td[@class='ZEIT']/text()"/>
              <parameter key="Titel" value="//h:td[@class='ARTIKEL_TITEL']/text()"/>
              <parameter key="Link" value="//h:td[@class='ARTIKEL_TITEL']/h:a/@href"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
Extract from html file
<table>
  <colgroup>
  <col class="DATUM" />
  <col class="ZEIT" />
  <col class="NEWS" />
  </colgroup>
  <thead>
    <tr>
      <th class="DATUM"> Datum </th>
      <th class="ZEIT"> Zeit </th>
      <th class="ARTIKEL_TITEL"> News </th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td class="DATUM"> 07.09. </td>
      <td class="ZEIT"> 18:06 </td>
      <td class="ARTIKEL_TITEL"><a href="http://www.onvista.de/news/unternehmensberichte/artikel/07.09.2011-18:06:10-roundup-aktien-frankfurt-schluss-sehr-fest-dax-profitiert-von-bvg-entscheidung?suche=496b0ceba408ca796b867195c2b6dfe5"  title="ROUNDUP/Aktien Frankfurt Schluss: Sehr fest; Dax profitiert von BVG-Entscheidung"> ROUNDUP/Aktien Frankfurt Schluss: Sehr fest; Dax profitiert von BVG-En... </a></td>
    </tr>
    <tr class="HERVORGEHOBEN">
      <td class="DATUM"> 07.09. </td>
      <td class="ZEIT"> 15:58 </td>
      <td class="ARTIKEL_TITEL"><a href="http://www.onvista.de/news/unternehmensberichte/artikel/07.09.2011-15:58:08-roundup-4-saab-beantragt-glaeubigerschutz-das-aus-rueckt-immer-naeher?suche=496b0ceba408ca796b867195c2b6dfe5" > ROUNDUP 4: Saab beantragt Gläubigerschutz: Das Aus rückt immer näher </a></td>
    </tr>
  </tbody>
</table>
Many thanks in advance!

Greetings,
Olli

Answers

    OlliSchulz Member Posts: 2 Contributor I
    Problem solved!

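    The correct XPath string for the "Titel" attribute is //h:td[@class='ARTIKEL_TITEL']/h:a/text(). The title text sits inside the <a> element, so text() applied to the <td> itself finds no text node.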
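
    For anyone who wants to check this outside RapidMiner, here is a minimal sketch in Python. It uses the standard library's xml.etree.ElementTree as a stand-in, which is an assumption: it is not the XPath engine RapidMiner uses, and it does not support the text() node test, so the element's .text property plays that role here. The h: prefix is left out because the pasted sample declares no namespace, the href is shortened to a placeholder, and direct_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of one table row from the question.
# Assumption: the href is replaced by a placeholder URL for brevity.
HTML = """
<table>
  <tbody>
    <tr>
      <td class="DATUM"> 07.09. </td>
      <td class="ZEIT"> 15:58 </td>
      <td class="ARTIKEL_TITEL"><a href="http://example.invalid/artikel"> ROUNDUP 4: Saab beantragt Gläubigerschutz: Das Aus rückt immer näher </a></td>
    </tr>
  </tbody>
</table>
""".strip()

root = ET.fromstring(HTML)


def direct_text(elem):
    """Return the text that sits directly inside elem, before its first child element."""
    return (elem.text or "").strip()


datum_td = root.find(".//td[@class='DATUM']")
titel_td = root.find(".//td[@class='ARTIKEL_TITEL']")
titel_a = root.find(".//td[@class='ARTIKEL_TITEL']/a")

print(repr(direct_text(datum_td)))  # the date is a direct child of its <td>
print(repr(direct_text(titel_td)))  # empty: this <td> only contains the <a> element
print(repr(direct_text(titel_a)))   # the title is a direct child of the <a>
print(titel_a.get("href"))          # the link is an attribute of the same <a>
```

    The "Datum" and "Zeit" cells hold their text directly, which is why the original queries worked for them, while the "ARTIKEL_TITEL" cell wraps its text in a link, so the query has to step down into the <a> element first.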