How can I crawl more than one web page?

pix123 · December 2018

Hi there, I am looking to collect the text of the reviews for a movie. The reviews span several pages and I would like to collect the first 10. I have set up a very basic web crawler because I want to store the data as text files for text pre-processing and mining, instead of crawling each time. However, I only seem to pick up the first page of reviews. Please can you take a look and advise?

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<context>
    <input/>
    <output/>
    <macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="9.0.000" expanded="true" height="68" name="Crawl Web" width="90" x="112" y="75">
        <parameter key="url" value="https://www.rottentomatoes.com/m/chef_2014/reviews/"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".*chef_2014.*"/>
          <parameter key="follow_link_with_matching_url" value=".*chef_2014.*"/>
        </list>
        <parameter key="output_dir" value="C:\rottentomatoes reviews &amp; Clustering\Rapidminer Output"/>
        <parameter key="max_pages" value="10"/>
        <parameter key="max_depth" value="4"/>
        <parameter key="max_page_size" value="1000"/>
        <parameter key="user_agent" value="test"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
</operator>
</process>

Telcontar120 · December 2018

I think this is a problem with the Crawl Web operator. I've noticed similar issues before myself.
Here is a way of doing it that uses Get Page inside a loop, and that works just fine.

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="false" class="web:crawl_web_modern" compatibility="9.0.000" expanded="true" height="68" name="Crawl Web" width="90" x="112" y="34">
        <parameter key="url" value="https://www.rottentomatoes.com/m/chef_2014/reviews/"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value="https://www.rottentomatoes.com/m/chef_2014/reviews/?page=."/>
        </list>
        <parameter key="max_crawl_depth" value="4"/>
        <parameter key="retrieve_as_html" value="true"/>
        <parameter key="add_content_as_attribute" value="true"/>
        <parameter key="max_pages" value="10"/>
      </operator>
      <operator activated="true" class="concurrency:loop" compatibility="9.0.003" expanded="true" height="82" name="Loop" width="90" x="112" y="136">
        <parameter key="number_of_iterations" value="10"/>
        <process expanded="true">
          <operator activated="true" class="web:get_webpage" compatibility="9.0.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="136">
            <parameter key="url" value="https://www.rottentomatoes.com/m/chef_2014/reviews/?page=%{iteration}"/>
            <list key="query_parameters"/>
            <list key="request_properties"/>
          </operator>
          <connect from_op="Get Page" from_port="output" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Loop" from_port="output 1" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

@sgenzer can you flag this issue with Crawl Web for the developers to look into? I've reported similar problems in the past and was hoping they would have been fixed in the recently released version of the Web Mining Extension, but that doesn't appear to be the case. Take a look at the attached process, in which the disabled Crawl Web operator should return the same results as the Loop/Get Page combination, but instead just returns an empty set.

pix123 · December 2018

@Telcontar120 Thank you for this

pix123 · December 2018

@Telcontar120 Is there a way to export the IOObjectCollection files to CSV?

sgenzer · December 2018

@Telcontar120 - AFAIK there are no plans to update the web mining extension in the near-to-medium future BUT there is a newly-certified RapidMiner Expert I know who lives in a gloriously beautiful country called Chile who may be coerced into porting Selenium into a RapidMiner extension..

@rfuentealba

Scott

Telcontar120 · December 2018

I'd loooove a Selenium extension for RapidMiner! That would be epic @rfuentealba !
FYI, @sgenzer I did confirm with Helge that this is a bug with the Crawl Web operator. It looks like it is related to https pages (which is a shame since that is like 90% of the web these days).

rfuentealba · December 2018

I have good news: the Selenium extension for RapidMiner is in pre-alpha stage. We at Pegasus are investing a lot of time in building something that can, for now, duplicate the behavior of what already exists in RapidMiner. As I discussed with @sgenzer at RapidMiner Wisdom, the extension will have blocks that perform certain actions, such as "clicking on a certain button", "retrieving the content from a certain element, class or id", "storing the content in a document", "waiting for the site to be ready before doing such things", and "going back to the previous page". That way you can build your own navigation flow.

It is taking me ages because of my current travel plans (guess what? more delays!), but I have plans to release it at some point in January.

KPL · December 2018

@Telcontar120 - tried out your workaround above with latest RM Studio 9.0.003 Large Ed. Process failed with:

Exception: java.lang.NoClassDefFoundError

Please ignore - solved the problem, unrelated to new RM version

Funk · April 2019

Hello!

I have two questions regarding @Telcontar120's solution. How do I set it up if I use [Get Pages] instead of [Get Page]?

1) If I use a .txt file with links, e.g. from the above process,

(edited with the correct link)

Links

https://www.rottentomatoes.com/m/chef_2014/reviews/?page=%{iteration}

Unfortunately, the Loop operator does not increase the page number, and the [Data to Documents] operator just returns the first page, crawled twice. Note that I put "%{iteration}" in the .txt file, but this seems to be ignored by the Loop operator.

2) As already asked by pix123, how do I export the results of the [Data to Documents]-Operator into a .txt, .csv or Excel file?

My process:

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
<context>
    <input/>
    <output/>
    <macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="concurrency:loop" compatibility="9.2.001" expanded="true" height="82" name="Loop" width="90" x="112" y="136">
        <parameter key="number_of_iterations" value="2"/>
        <parameter key="iteration_macro" value="iteration"/>
        <parameter key="reuse_results" value="false"/>
        <parameter key="enable_parallel_execution" value="false"/>
        <process expanded="true">
          <operator activated="true" class="read_csv" compatibility="9.2.001" expanded="true" height="68" name="Read CSV" width="90" x="112" y="187">
            <parameter key="csv_file" value="C:\Users\Funk\Desktop\pages_rotten.txt"/>
            <parameter key="column_separators" value=";"/>
            <parameter key="trim_lines" value="false"/>
            <parameter key="use_quotes" value="true"/>
            <parameter key="quotes_character" value="&quot;"/>
            <parameter key="escape_character" value="\"/>
            <parameter key="skip_comments" value="true"/>
            <parameter key="comment_characters" value="#"/>
            <parameter key="starting_row" value="1"/>
            <parameter key="parse_numbers" value="true"/>
            <parameter key="decimal_character" value="."/>
            <parameter key="grouped_digits" value="false"/>
            <parameter key="grouping_character" value=","/>
            <parameter key="infinity_representation" value=""/>
            <parameter key="date_format" value=""/>
            <parameter key="first_row_as_names" value="true"/>
            <list key="annotations"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="locale" value="English (United States)"/>
            <parameter key="encoding" value="windows-1252"/>
            <parameter key="read_all_values_as_polynominal" value="false"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="LINKS.true.polynominal.attribute"/>
            </list>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="web:retrieve_webpages" compatibility="9.0.000" expanded="true" height="68" name="Get Pages" width="90" x="246" y="187">
            <parameter key="link_attribute" value="LINKS"/>
            <parameter key="random_user_agent" value="false"/>
            <parameter key="connection_timeout" value="10000"/>
            <parameter key="read_timeout" value="10000"/>
            <parameter key="follow_redirects" value="true"/>
            <parameter key="accept_cookies" value="none"/>
            <parameter key="cookie_scope" value="global"/>
            <parameter key="request_method" value="GET"/>
            <parameter key="delay" value="none"/>
            <parameter key="delay_amount" value="1000"/>
            <parameter key="min_delay_amount" value="0"/>
            <parameter key="max_delay_amount" value="1000"/>
          </operator>
          <operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="380" y="187">
            <parameter key="select_attributes_and_weights" value="false"/>
            <list key="specify_weights"/>
          </operator>
          <connect from_op="Read CSV" from_port="output" to_op="Get Pages" to_port="Example Set"/>
          <connect from_op="Get Pages" from_port="Example Set" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Data to Documents" from_port="documents" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Loop" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
</operator>
</process>

sgenzer · April 2019

hi @Funk - just boosted you to Contributor I so you will have no more posting issues. Feel free to add your links.

Scott

Funk · April 2019

Thanks sgenzer!

The link is the same as in Telcontar120's post, only modified at the end and included in a .txt-file having links as attribute so it can be read from the [Read CSV]-Operator (see my process above):

Links

https://www.rottentomatoes.com/m/chef_2014/reviews/?page=%{iteration}

Does anyone know how to iterate with the [Get Pages] operator correctly?

Telcontar120 · April 2019

You don't need to use Get Pages and put this in a text file--you can simply put the Get Page inside a Loop operator and then use the iteration macro to fill in the page number you need at the end of the URL.
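For readers who want to see the logic outside of RapidMiner, here is a minimal Python sketch of what the Loop plus Get Page combination amounts to: the `%{iteration}` placeholder is filled in once per pass, giving one URL per page. The function name `build_page_urls` is made up for illustration, the sketch assumes the iteration macro counts from 1, and it only builds the URLs without fetching anything. It may also hint at why the same placeholder inside a text file is not expanded: there it is just a value in the data rather than an operator parameter (an assumption about how macros are resolved, not something confirmed in this thread).

```python
def build_page_urls(template, iterations, placeholder="%{iteration}"):
    """Return one URL per loop pass, replacing the placeholder with 1..iterations."""
    return [template.replace(placeholder, str(i)) for i in range(1, iterations + 1)]


# URL template taken from the process posted above.
template = "https://www.rottentomatoes.com/m/chef_2014/reviews/?page=%{iteration}"

# Ten passes, mirroring number_of_iterations="10" in the Loop operator.
urls = build_page_urls(template, 10)
for url in urls:
    print(url)
```

The same idea could be used to generate the URL list for a text file up front, if Get Pages is preferred over Get Page.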

Funk · April 2019

Thanks, Telcontar120. However, I thought that if you crawl multiple pages at once, [Get Pages] would be more practical, since you'd only need a .txt file containing the links. But I guess setting up multiple [Loop] operators with [Get Page] inside will do it too, albeit a bit more cumbersome.

Okay, I'm making progress with [Process Documents] and [Write CSV] on the question of how to export the IOObjectCollection.
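As a rough illustration of the end result being aimed for, here is a small Python sketch that flattens a collection of page texts into CSV, one row per page. This is not the RapidMiner route described above; the function `documents_to_csv` and the two sample strings are hypothetical stand-ins for the documents in the collection.

```python
import csv
import io


def documents_to_csv(documents, out):
    """Write one row per document: its position in the collection and its text."""
    writer = csv.writer(out)
    writer.writerow(["page", "text"])
    for page, text in enumerate(documents, start=1):
        writer.writerow([page, text])


# Hypothetical stand-ins for the text of two retrieved pages.
documents = ["First page of reviews", 'Second page, with a comma and "quotes"']

# Write to an in-memory buffer; a real file opened with newline="" works the same way.
buffer = io.StringIO()
documents_to_csv(documents, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The csv module takes care of quoting, so commas and quotation marks inside review text do not break the columns.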

joeanalytica · August 2019

Hi Telcontar120: Thank you for the contribution.
I was wondering how to apply the same approach to a job posting site like Indeed, as I'm trying to follow along with one of the Academy lessons on text analytics. My scenario would be to crawl a job posting site for a job title, say "Data Scientist". Since the Crawl Web operator in RapidMiner has issues, I thought maybe you could step in and help out. Much appreciated. Thanks

Telcontar120 · August 2019

@joeanalytica Here's a quick summary of the current state as far as I know it for web mining in RapidMiner. I hope this is helpful.

Crawling where the desired page addresses change in simple and predictable ways (e.g., the example given above where you just have to update a page number) can be done easily with Loop and Get Page.
Crawling where you have a specific set of pages you want to get can be done easily with Get Pages but you need to have a text file with all the URLs you want to retrieve.
Crawling based on search criteria where you don't know in advance the specific URLs that will satisfy your criteria, or where you want to dynamically follow links from one page to another, is difficult right now using the web mining extension because of the problem described above with the Crawl Web operator.
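To make the third case more concrete, here is a minimal Python sketch of the "follow links that match a rule" step, run on a made-up HTML snippet rather than a live page. The class `LinkCollector`, the function `matching_links`, and the sample HTML are all hypothetical; whether Crawl Web applies its matching rules in exactly this way is an assumption. Note that in a regular expression the `?` before `page=` has to be escaped to match a literal question mark.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def matching_links(html, base_url, pattern):
    """Return absolute URLs of links on the page that fully match the pattern."""
    collector = LinkCollector()
    collector.feed(html)
    absolute = [urljoin(base_url, link) for link in collector.links]
    return [url for url in absolute if re.fullmatch(pattern, url)]


# Made-up page content with one pagination link and two unrelated links.
sample_html = """
<a href="/m/chef_2014/reviews/?page=2">Next</a>
<a href="/m/chef_2014/">Movie page</a>
<a href="/about">About</a>
"""
base = "https://www.rottentomatoes.com/m/chef_2014/reviews/"
found = matching_links(sample_html, base, r".*chef_2014/reviews/\?page=\d+")
print(found)
```

A full crawler would repeat this step on each retrieved page, keep track of URLs already visited, and stop at a page or depth limit.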

Altair RapidMiner Community

Declined · Last Updated October 2019