How can I crawl more than one web page?

pix123 Member Posts: 27 Contributor II
Hi there, I am looking to collect the text of a movie's reviews. There are several pages of reviews and I would like to collect the first 10. I have set up a very basic web crawler because I want to save the data as text files for some text pre-processing and mining later, instead of crawling each time. However, I only seem to pick up the first page of reviews. Please can you take a look and advise?

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="9.0.000" expanded="true" height="68" name="Crawl Web" width="90" x="112" y="75">
        <parameter key="url" value="https://www.rottentomatoes.com/m/chef_2014/reviews/"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".*chef_2014.*"/>
          <parameter key="follow_link_with_matching_url" value=".*chef_2014.*"/>
        </list>
        <parameter key="output_dir" value="C:\rottentomatoes reviews &amp; Clustering\Rapidminer Output"/>
        <parameter key="max_pages" value="10"/>
        <parameter key="max_depth" value="4"/>
        <parameter key="max_page_size" value="1000"/>
        <parameter key="user_agent" value="test"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>



Declined · Last Updated

Unfortunately, updating the Web Mining extension is not on the roadmap for the foreseeable future. Please comment and cc sgenzer if this is a pressing issue. WE-43

Comments

  • pix123 Member Posts: 27 Contributor II
    @Telcontar120 Thank you for this
  • pix123 Member Posts: 27 Contributor II
    @Telcontar120 Is there a way to export the IOObject Collection files to CSV?
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    @Telcontar120 - AFAIK there are no plans to update the web mining extension in the near-to-medium future BUT there is a newly-certified RapidMiner Expert I know who lives in a gloriously beautiful country called Chile who may be coerced into porting Selenium into a RapidMiner extension... :wink: @rfuentealba

    Scott

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    I'd loooove a Selenium extension for RapidMiner! That would be epic, @rfuentealba!
    FYI, @sgenzer I did confirm with Helge that this is a bug with the Crawl Web operator.  It looks like it is related to https pages (which is a shame since that is like 90% of the web these days).

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    I have good news: the Selenium extension for RapidMiner is in pre-alpha stage. We at Pegasus are investing a lot of time in building something that can, for now, duplicate the behavior of what already exists in RapidMiner. As we discussed with @sgenzer at RapidMiner Wisdom, the extension will have blocks that perform certain actions such as "clicking on a certain button", "retrieving the content from a certain element, class or id", "storing the content into a document", "waiting for the site to be ready before doing such things", and "going back to the previous page", so you can build your own navigation flow.
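    As a rough illustration of that kind of navigation flow in plain Python with the selenium package (not the extension itself; the URL and CSS selectors below are only placeholders I made up for the example):

    # Sketch of the navigation-flow idea: wait for the page, read some elements,
    # store them, click a button, go back. Selectors are hypothetical.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    try:
        driver.get("https://www.rottentomatoes.com/m/chef_2014/reviews/")

        # Wait for the site to be ready before doing such things.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".review-text"))
        )

        # Retrieve the content from a certain element, class or id.
        reviews = [el.text for el in driver.find_elements(By.CSS_SELECTOR, ".review-text")]

        # Store the content into a document.
        with open("reviews_page_1.txt", "w", encoding="utf-8") as f:
            f.write("\n\n".join(reviews))

        # Click on a certain button (a hypothetical "next page" button).
        driver.find_element(By.CSS_SELECTOR, ".next-page").click()

        # Go back to the previous page.
        driver.back()
    finally:
        driver.quit()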

    It is taking me ages because of my current travel plans (guess what? more delays!), but I have plans to release it at some point in January.
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    :smiley::smiley::smiley::smiley:
  • KPL RapidMiner Certified Analyst, Member Posts: 9 Contributor II
    edited December 2018
    @Telcontar120 - I tried out your workaround above with the latest RM Studio 9.0.003 Large Ed. The process failed with:
    • Exception: java.lang.NoClassDefFoundError
    Please ignore - solved the problem, unrelated to new RM version
  • Funk Member Posts: 3 Contributor I
    edited April 2019
    Hello!

    I have two questions regarding @Telcontar120's solution. How do I set it up if I use [Get Pages] instead of [Get Page]?

    1) If I use a .txt file with links, e.g. from the above process,

    (edit: the correct link is posted below)

    the Loop operator unfortunately does not increase the page number, and the [Data to Documents] operator only returns the first page, crawled twice. Note that in the .txt file I put in "%{iteration}", but this seems to be ignored by the Loop operator.

    2) As already asked by pix123, how do I export the results of the [Data to Documents] operator into a .txt, .csv or Excel file?

    My process:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:loop" compatibility="9.2.001" expanded="true" height="82" name="Loop" width="90" x="112" y="136">
            <parameter key="number_of_iterations" value="2"/>
            <parameter key="iteration_macro" value="iteration"/>
            <parameter key="reuse_results" value="false"/>
            <parameter key="enable_parallel_execution" value="false"/>
            <process expanded="true">
              <operator activated="true" class="read_csv" compatibility="9.2.001" expanded="true" height="68" name="Read CSV" width="90" x="112" y="187">
                <parameter key="csv_file" value="C:\Users\Funk\Desktop\pages_rotten.txt"/>
                <parameter key="column_separators" value=";"/>
                <parameter key="trim_lines" value="false"/>
                <parameter key="use_quotes" value="true"/>
                <parameter key="quotes_character" value="&quot;"/>
                <parameter key="escape_character" value="\"/>
                <parameter key="skip_comments" value="true"/>
                <parameter key="comment_characters" value="#"/>
                <parameter key="starting_row" value="1"/>
                <parameter key="parse_numbers" value="true"/>
                <parameter key="decimal_character" value="."/>
                <parameter key="grouped_digits" value="false"/>
                <parameter key="grouping_character" value=","/>
                <parameter key="infinity_representation" value=""/>
                <parameter key="date_format" value=""/>
                <parameter key="first_row_as_names" value="true"/>
                <list key="annotations"/>
                <parameter key="time_zone" value="SYSTEM"/>
                <parameter key="locale" value="English (United States)"/>
                <parameter key="encoding" value="windows-1252"/>
                <parameter key="read_all_values_as_polynominal" value="false"/>
                <list key="data_set_meta_data_information">
                  <parameter key="0" value="LINKS.true.polynominal.attribute"/>
                </list>
                <parameter key="read_not_matching_values_as_missings" value="false"/>
                <parameter key="datamanagement" value="double_array"/>
                <parameter key="data_management" value="auto"/>
              </operator>
              <operator activated="true" class="web:retrieve_webpages" compatibility="9.0.000" expanded="true" height="68" name="Get Pages" width="90" x="246" y="187">
                <parameter key="link_attribute" value="LINKS"/>
                <parameter key="random_user_agent" value="false"/>
                <parameter key="connection_timeout" value="10000"/>
                <parameter key="read_timeout" value="10000"/>
                <parameter key="follow_redirects" value="true"/>
                <parameter key="accept_cookies" value="none"/>
                <parameter key="cookie_scope" value="global"/>
                <parameter key="request_method" value="GET"/>
                <parameter key="delay" value="none"/>
                <parameter key="delay_amount" value="1000"/>
                <parameter key="min_delay_amount" value="0"/>
                <parameter key="max_delay_amount" value="1000"/>
              </operator>
              <operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="380" y="187">
                <parameter key="select_attributes_and_weights" value="false"/>
                <list key="specify_weights"/>
              </operator>
              <connect from_op="Read CSV" from_port="output" to_op="Get Pages" to_port="Example Set"/>
              <connect from_op="Get Pages" from_port="Example Set" to_op="Data to Documents" to_port="example set"/>
              <connect from_op="Data to Documents" from_port="documents" to_port="output 1"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Loop" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>





  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    hi @Funk - just boosted you to Contributor I so you will have no more posting issues. Feel free to add your links.

    Scott

  • Funk Member Posts: 3 Contributor I
    edited April 2019
    Thanks sgenzer!

    The link is the same as in Telcontar120's post, only modified at the end and included in a .txt file with the links as an attribute so it can be read by the [Read CSV] operator (see my process above):

    Links
    https://www.rottentomatoes.com/m/chef_2014/reviews/?page=%{iteration}

    Does anyone know how to iterate correctly via the [Get Pages] operator?
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    You don't need to use Get Pages with a text file; you can simply put Get Page inside a Loop operator and then use the iteration macro to fill in the page number you need at the end of the URL.
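    If it helps to see the same pagination idea outside RapidMiner, here is a minimal sketch in plain Python with the requests library (the URL pattern is the one posted above; the 10-page limit just matches the original question):

    # Fetch review pages 1..10 by filling the page number into the URL,
    # mirroring the Loop + Get Page + iteration-macro approach.
    import requests

    BASE_URL = "https://www.rottentomatoes.com/m/chef_2014/reviews/?page={page}"

    pages = []
    for page in range(1, 11):  # first 10 pages
        response = requests.get(BASE_URL.format(page=page), headers={"User-Agent": "test"})
        response.raise_for_status()
        pages.append(response.text)

    # Each entry in pages now holds the raw HTML of one review page.
    print(len(pages), "pages retrieved")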
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Funk Member Posts: 3 Contributor I
    edited April 2019
    Thanks, Telcontar120. However, I thought that if you crawl multiple pages at once, [Get Pages] would be more practical since you'd only need a .txt file containing them. But I guess setting up multiple [Loop] operators with [Get Page] inside will do it too, albeit a bit more cumbersome.

    Okay, I'm making progress via [Process Documents] and [Write CSV] regarding the question of how to extract IOObjectCollection.
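    For the [Get Pages] route, the .txt file itself can also be generated up front with the page numbers already filled in, so nothing has to rely on %{iteration} being expanded inside the file. A small sketch in plain Python, assuming the same pagination pattern as the link above:

    # Write a URL list that Get Pages can read: a header line ("LINKS", the
    # attribute name used in my process) followed by one expanded URL per page.
    BASE_URL = "https://www.rottentomatoes.com/m/chef_2014/reviews/?page={page}"

    with open("pages_rotten.txt", "w", encoding="utf-8") as f:
        f.write("LINKS\n")
        for page in range(1, 11):
            f.write(BASE_URL.format(page=page) + "\n")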
  • joeanalytica Member Posts: 7 Contributor II
    edited August 2019
    Hi Telcontar120: Thank you for the contribution. 
    I was wondering how to apply the same approach to a job posting site like Indeed, as I'm trying to follow along with one of the Academy lessons (regarding text analytics). My scenario would be to crawl a job posting site for a job title, say "Data Scientist". Since the Crawl Web operator in RapidMiner has issues, I thought maybe you could step in and help out. Much appreciated. Thanks
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    @joeanalytica Here's a quick summary of the current state as far as I know it for web mining in RapidMiner. I hope this is helpful.
    1. Crawling where the desired page addresses change in simple and predictable ways (e.g., the example given above where you just have to update a page number) can be done easily with Loop and Get Page.
    2. Crawling where you have a specific set of pages you want to get can be done easily with Get Pages but you need to have a text file with all the URLs you want to retrieve.
    3. Crawling based on search criteria where you don't know in advance the specific URLs that will satisfy your criteria, or where you want to dynamically follow links from one page to another, is difficult right now using the web mining extension because of the problem described above with the Crawl Web operator.
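    For the third case, while Crawl Web remains broken for https sites, a rule-based crawl can still be approximated outside RapidMiner. Here is a rough sketch in plain Python with requests and BeautifulSoup; the start URL and the .*chef_2014.* follow rule are taken from the original process, everything else is my own assumption:

    # Start from one page, follow only links whose URL matches a pattern,
    # and stop after max_pages, mimicking the Crawl Web crawling rules.
    import re
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    START_URL = "https://www.rottentomatoes.com/m/chef_2014/reviews/"
    FOLLOW_PATTERN = re.compile(r".*chef_2014.*")
    MAX_PAGES = 10

    to_visit = [START_URL]
    visited = set()
    pages = {}

    while to_visit and len(pages) < MAX_PAGES:
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.add(url)

        response = requests.get(url, headers={"User-Agent": "test"}, timeout=10)
        if response.status_code != 200:
            continue
        pages[url] = response.text

        # Queue further links that match the follow rule.
        soup = BeautifulSoup(response.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if FOLLOW_PATTERN.match(link) and link not in visited:
                to_visit.append(link)

    print("Retrieved", len(pages), "pages")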

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts