XPath with "Cut Document" or "Extract Information" with "?"-result

miner · January 2018

Dear RM-experts,

I'm struggling to extract certain information from websites I crawled.

My process is as follows:

I have a "Crawl web" operator crawling websites in a loop. This process works fine (tested with up to 17 iterations).

The crawled web pages are stored as html-files (one file for each site).

Now I want to extract a specific piece of information from these pages. I have an XPath expression for it that works fine in a Google spreadsheet but not in RM. I tried both the recommended "Cut Document" operator and the "Extract Information" operator inside a "Process Documents from Files" operator.

I already searched the forum and tried all possible variations of "//h:" and "assume html", knowing that the syntax in RM is slightly different, but with no success.

Does anybody have a solution for this issue?

Here is my current process:

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <parameter key="logverbosity" value="all"/>
    <process expanded="true">
      <operator activated="false" class="concurrency:loop" compatibility="7.5.003" expanded="true" height="82" name="Loop" width="90" x="246" y="34">
        <parameter key="number_of_iterations" value="2"/>
        <parameter key="reuse_results" value="true"/>
        <parameter key="enable_parallel_execution" value="false"/>
        <process expanded="true">
          <operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="246" y="34">
            <parameter key="url" value="https://jobs.meinestadt.de/deutschland/suche?words=Zollabwicklung&amp;amp;page=%{iteration}#ms-jobs-result-list"/>
            <list key="crawling_rules">
              <parameter key="store_with_matching_url" value=".+standard.+"/>
              <parameter key="follow_link_with_matching_url" value=".+standard.*"/>
            </list>
            <parameter key="retrieve_as_html" value="true"/>
            <parameter key="write_pages_to_disk" value="true"/>
            <parameter key="output_dir" value="\\xxx\homes\xxx\Tools\RapidMiner\jobs.meinestadt\Zollabwicklung\Sites"/>
            <parameter key="output_file_extension" value="%{iteration}.html"/>
            <parameter key="max_pages" value="20"/>
            <parameter key="max_page_size" value="100"/>
            <parameter key="delay" value="1000"/>
            <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"/>
          </operator>
          <connect from_op="Crawl Web" from_port="example set" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_document_from_file" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="246" y="136">
        <list key="text_directories">
          <parameter key="all" value="\\xxx\homes\xxx\Tools\RapidMiner\jobs.meinestadt\Zollabwicklung\Sites"/>
        </list>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="use_file_extension_as_type" value="false"/>
        <parameter key="content_type" value="html"/>
        <parameter key="encoding" value="UTF-8"/>
        <parameter key="create_word_vector" value="false"/>
        <process expanded="true">
          <operator activated="true" class="text:extract_information" compatibility="7.5.000" expanded="true" height="68" name="Extract Information" width="90" x="246" y="34">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="Branche" value="//*[@id=&amp;quot;ms-maincontent&quot;]/div[1]/div[1]/div/div//h4[contains(text(),'Arbeitgeber')]/following-sibling::p[2]/text()"/>
            </list>
            <list key="namespaces"/>
            <parameter key="assume_html" value="false"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_port="input 1" to_op="Process Documents from Files" to_port="word list"/>
      <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="source_input 2" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Thanks for your support.

kayman · January 2018

To get only the text without the surrounding tag, just add text() at the end, like this:

//*[@id='ms-maincontent']/h:div[1]/h:div[1]/h:div/h:div//h:h4[contains(.,'Arbeitgeber')]/../h:p[2]/text()
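
The difference between selecting the element and selecting only its text can also be seen outside RapidMiner. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module and a made-up one-element document. ElementTree does not implement the text() step itself, so the .text attribute stands in for it here; it is not how RapidMiner evaluates the query.

```python
import xml.etree.ElementTree as ET

# Made-up one-element document in the XHTML namespace (an assumption).
p = ET.fromstring(
    '<p xmlns="http://www.w3.org/1999/xhtml">'
    "Befristete Überlassung von Arbeitskräften</p>"
)

# Serialising the element node brings the tag and its namespace along.
as_markup = ET.tostring(p, encoding="unicode")

# The text content alone is what a trailing /text() step asks for in XPath.
as_text = p.text

print(as_markup)
print(as_text)
```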

kayman · January 2018

I've done a quick test with the following page:

https://jobs.meinestadt.de/deutschland/suche?words=Zollabwicklung&page=1#ms-jobs-result-list

This is basically the first page you would grab using your logic. On this page there is no h4 that contains the text Arbeitgeber, which is why you get no results.

Apart from that, you need to add the h: prefix to every element, since all of them are in the same HTML namespace. The example below shows the match up to the 4th div; from there on, your XPath no longer matches anything. This may be because of the page I used, so it could still work for you.

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="all"/>
    <process expanded="true">
      <operator activated="true" class="web:get_webpage" compatibility="7.3.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="34">
        <parameter key="url" value="https://jobs.meinestadt.de/deutschland/suche?words=Zollabwicklung&amp;page=1#ms-jobs-result-list"/>
        <list key="query_parameters"/>
        <list key="request_properties"/>
      </operator>
      <operator activated="true" class="text:html_to_xml" compatibility="7.5.000" expanded="true" height="68" name="HTML to XML" width="90" x="246" y="34"/>
      <operator activated="true" class="text:extract_information" compatibility="7.5.000" expanded="true" height="68" name="Extract Information" width="90" x="380" y="34">
        <parameter key="query_type" value="XPath"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries"/>
        <list key="xpath_queries">
          <parameter key="Branche" value="//*[@id='ms-maincontent']/h:div[1]/h:div[1]/h:div/h:div"/>
        </list>
        <list key="namespaces"/>
        <parameter key="assume_html" value="false"/>
        <list key="index_queries"/>
        <list key="jsonpath_queries"/>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data" width="90" x="514" y="34">
        <parameter key="text_attribute" value="txt"/>
      </operator>
      <connect from_op="Get Page" from_port="output" to_op="HTML to XML" to_port="document"/>
      <connect from_op="HTML to XML" from_port="document" to_op="Extract Information" to_port="document"/>
      <connect from_op="Extract Information" from_port="document" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="source_input 2" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
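
The namespace point can also be checked outside RapidMiner. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module and a made-up XHTML fragment (not the real page markup). The prefix name h is only a label mapped to the XHTML namespace, and ElementTree wants a leading .// where RapidMiner accepts //.

```python
import xml.etree.ElementTree as ET

# Made-up XHTML fragment (an assumption, not the real page markup).
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="ms-maincontent">
      <div><h4>Arbeitgeber</h4></div>
    </div>
  </body>
</html>"""

# Prefix-to-namespace mapping; "h" is just a label chosen here.
NS = {"h": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)

# Without a prefix, div and h4 are looked up in no namespace: nothing matches.
unprefixed = root.findall(".//*[@id='ms-maincontent']/div/h4")

# With h: on every element step, the same path matches.
prefixed = root.findall(".//*[@id='ms-maincontent']/h:div/h:h4", NS)

print(len(unprefixed), len(prefixed))
```

The * step needs no prefix because it matches elements in any namespace, and the id attribute needs none because unprefixed attributes are not in a namespace.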

Hope this helps.

miner · January 2018

Dear Kayman,

Thanks for the quick response.

You're right about the results page you tested, but I am working on the individual job pages, like this one:

https://jobs.meinestadt.de/deutschland/standard?id=200880935

I want to judge whether a job is posted directly by a company or by a personnel leasing agency.

On that detail page the XPath //*[@id="ms-maincontent"]/div[1]/div[1]/div/div//h4[contains(text(),'Arbeitgeber')]/following-sibling::p[2] returns the value "Befristete Überlassung von Arbeitskräften", so this is a personnel leasing job posting.

I now tried with

//*[@id="ms-maincontent"]/h:div[1]/h:div[1]/h:div/h:div//h4[contains(text(),'Arbeitgeber')]/following-sibling::p[2]

//*[@id="ms-maincontent"]/h:div[1]/h:div[1]/h:div/h:div//h4[contains(text(),'Arbeitgeber')]/h:following-sibling::h:p[2]

//*[@id="ms-maincontent"]/h:div[1]/h:div[1]/h:div/h:div//h:h4[contains(text(),'Arbeitgeber')]/following-sibling::p[2]

//*[@id="ms-maincontent"]/h:div[1]/h:div[1]/h:div/h:div//h:h4[contains(text(),'Arbeitgeber')]/h:following-sibling::h:p[2]

but with no success. Where did I go wrong?

You used the "HTML to XML" operator. How can I use this operator for stored HTML files?

Thanks

kayman · January 2018

Aaah, found it. Try this:

//*[@id='ms-maincontent']/h:div[1]/h:div[1]/h:div/h:div//h:section[h:h4[contains(.,'Arbeitgeber')]]/h:p[2]

It's a bit hard to explain, but what you did was select the h4 and then try to travel further to the second p within this node; your h4 has no child nodes, so the sibling step has nothing to select. Instead, you have to select the element that contains the h4 (in this case the section) and get the second p in that section.

Another way would be to go one step up once you have selected the h4, and then get the second p, as below:

//*[@id='ms-maincontent']/h:div[1]/h:div[1]/h:div/h:div//h:h4[contains(.,'Arbeitgeber')]/../h:p[2]

The double dot takes you back to the parent level, but this may be less reliable than the first variation.

Don't mind the HTML to XML operator, by the way. I typically use it because I load the XML into another editor, and this way I can be sure the HTML is proper XHTML.
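
The two XPath variants above can be compared on a small example. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module and a made-up XHTML fragment; the company name and the second section are invented for illustration. ElementTree has no contains() function, so an exact text match stands in for it here.

```python
import xml.etree.ElementTree as ET

# Made-up XHTML fragment (an assumption; the real page has more markup).
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="ms-maincontent">
      <section>
        <h4>Arbeitgeber</h4>
        <p>Beispiel GmbH</p>
        <p>Befristete Überlassung von Arbeitskräften</p>
      </section>
      <section>
        <h4>Kontakt</h4>
        <p>erste Zeile</p>
        <p>zweite Zeile</p>
      </section>
    </div>
  </body>
</html>"""

NS = {"h": "http://www.w3.org/1999/xhtml"}
root = ET.fromstring(XHTML)

# Variant 1: pick the section that has a matching h4 child, then its second p.
via_section = root.findall(".//h:section[h:h4='Arbeitgeber']/h:p[2]", NS)

# Variant 2: pick the h4 itself, step up to its parent with "..",
# then take the second p under that parent.
via_parent = root.findall(".//h:h4[.='Arbeitgeber']/../h:p[2]", NS)

for elem in via_section + via_parent:
    print(elem.text)
```

On this fragment both variants land on the same p element. The second section is skipped in both cases because its h4 text does not match.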

miner · January 2018

Dear Kayman,

Many thanks for that. It now works fine with your first suggestion.

Just one tiny detail: the result is now:

"<p xmlns='http://www.w3.org/1999/xhtml'>Befristete Überlassung von Arbeitskräften</p>"

Any idea how I can get only "Befristete Überlassung von Arbeitskräften"?

miner · January 2018

Great support!!

I'm now working through the other elements using your sample.

Thank you very much.

17900713r · April 2018

Hello,

Please, may I know how you obtained your process code?
