"Can I crawl JavaScript websites using RapidMiner?"

ArnoG Member Posts: 22 Contributor II
edited June 2019 in Help
I have a problem crawling a website. I believe the problem is that the website is built with JavaScript. Is it possible to crawl such a page using RapidMiner?

For example: http://www.booking.com/hotel/nl/easyhotel-amsterdam.nl.html?sid=9fc05dc001129cc3698397a2efbfba2f;dcid=1#hash-blockdisplay4

When I use the Crawl Web operator, it creates only two files. Both files lead to the start page of the hotel, not the review page, even though I use the review page as the URL in the operator.

How can I crawl this website?

Thanks, Arno

Answers

  • ArnoG Member Posts: 22 Contributor II
    The process I created so far leads me to the starting page of a specific hotel and not to the review page.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="web:crawl_web" compatibility="5.3.000" expanded="true" height="60" name="Crawl Web" width="90" x="112" y="30">
            <parameter key="url" value="http://www.booking.com/hotel/nl/easyhotel-amsterdam.nl.html?sid=9fc05dc001129cc3698397a2efbfba2f;dcid=1#hash-blockdisplay4"/>
            <list key="crawling_rules">
              <parameter key="store_with_matching_url" value=".+/easyhotel-amsterdam.nl..+"/>
              <parameter key="follow_link_with_matching_text" value=".+/easyhotel-amsterdam.nl..+|#hash-blockdisplay4"/>
            </list>
            <parameter key="output_dir" value="C:\Improve Your Business\Qing\Pilot\test\crawlbooking.com"/>
            <parameter key="extension" value="html"/>
            <parameter key="max_pages" value="1000"/>
            <parameter key="max_depth" value="18"/>
            <parameter key="max_page_size" value="100000"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • Nils_Woehler Member Posts: 463 Maven
    Hi Arno,

    Unfortunately, this is not possible at the moment.

    Best,
    Nils
  • ArnoG Member Posts: 22 Contributor II
    Hi Nils,
    Thanks for your answer. Maybe this functionality could be added in a future release, since more and more websites use JavaScript.
    I crawled the websites using 'Mozenda', and it works perfectly!

    Regards, Arno
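
One likely contributing factor to the behaviour described in this thread: the part of the URL after `#` (here `hash-blockdisplay4`) is a fragment, and HTTP clients do not send fragments to the server. A crawler that does not execute JavaScript therefore requests the plain hotel page, and any tab switched on by a script never appears in the saved file. The Python snippet below is only a small illustration of that split, not part of the RapidMiner process; `split_for_request` is a hypothetical helper name, and whether the review tab on this site really depends on the fragment is an assumption.

```python
from urllib.parse import urldefrag

# URL from the question, including the "#hash-blockdisplay4" fragment.
URL = (
    "http://www.booking.com/hotel/nl/easyhotel-amsterdam.nl.html"
    "?sid=9fc05dc001129cc3698397a2efbfba2f;dcid=1#hash-blockdisplay4"
)


def split_for_request(url):
    """Return (url_as_requested, fragment).

    The first value is what an HTTP client actually asks the server for;
    the fragment stays on the client side and is only meaningful to a
    browser (or a script running in it).
    """
    without_fragment, fragment = urldefrag(url)
    return without_fragment, fragment


requested, fragment = split_for_request(URL)
print("Requested from server:", requested)
print("Kept on client side:  ", fragment)
```

Because both the start page URL and the review URL reduce to the same request once the fragment is dropped, a non-JavaScript crawler would be expected to store the same start page in both cases.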