How can I crawl more than one web page?

pix123 Member Posts: 27 Contributor I
edited January 15 in Help
Hi there, I am looking to collect the text of the reviews for a movie. There are several pages of reviews and I would like to collect the first 10. I have set up a very basic web crawler because I want to save the data as .txt files for text pre-processing and mining, instead of crawling each time. However, it only seems to pick up the first page of reviews. Please can you take a look and advise?

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="9.0.000" expanded="true" height="68" name="Crawl Web" width="90" x="112" y="75">
        <parameter key="url" value=""/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".*chef_2014.*"/>
          <parameter key="follow_link_with_matching_url" value=".*chef_2014.*"/>
        </list>
        <parameter key="output_dir" value="C:\rottentomatoes reviews &amp; Clustering\Rapidminer Output"/>
        <parameter key="max_pages" value="10"/>
        <parameter key="max_depth" value="4"/>
        <parameter key="max_page_size" value="1000"/>
        <parameter key="user_agent" value="test"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
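
For comparison, here is a minimal Python sketch of the same goal outside RapidMiner: fetch the first 10 review pages once and store each as a .txt file for later text mining. It is only an illustration under stated assumptions. The base URL, the `page` query parameter, and the helper names `build_page_urls`, `fetch` and `crawl_pages` are all hypothetical; the real site may paginate differently, and the process above leaves the `url` parameter empty.

```python
from pathlib import Path
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import Request, urlopen


def build_page_urls(base_url, pages=10, param="page"):
    """Return one URL per review page by setting a page-number query parameter.

    Assumption: the site exposes pagination as ?page=N (hypothetical).
    """
    parts = urlsplit(base_url)
    urls = []
    for n in range(1, pages + 1):
        query = dict(parse_qsl(parts.query))
        query[param] = str(n)
        urls.append(urlunsplit(parts._replace(query=urlencode(query))))
    return urls


def fetch(url, user_agent="test", timeout=30):
    """Download one page and return its body as text."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl_pages(base_url, output_dir, pages=10, fetcher=fetch):
    """Fetch each page once and store it as page_NN.txt in output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for n, url in enumerate(build_page_urls(base_url, pages), start=1):
        target = out / f"page_{n:02d}.txt"
        target.write_text(fetcher(url), encoding="utf-8")
        saved.append(target)
    return saved
```

Calling `crawl_pages("https://example.com/m/chef_2014/reviews", "reviews_out")` with a real base URL would write `page_01.txt` to `page_10.txt`. The `fetcher` argument is there so the download step can be swapped out, for example to add a delay between requests.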


Best Answer


  • pix123 Member Posts: 27 Contributor I
    @Telcontar120 Thank you for this
  • pix123 Member Posts: 27 Contributor I
    @Telcontar120 Is there a way to export the IOObject Collection files to CSV?
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor Posts: 2,048  Community Manager
    @Telcontar120 - AFAIK there are no plans to update the web mining extension in the near-to-medium future, BUT there is a newly certified RapidMiner Expert I know, who lives in a gloriously beautiful country called Chile, who may be coerced into porting Selenium into a RapidMiner extension... :wink: @rfuentealba


  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 979   Unicorn
    I'd loooove a Selenium extension for RapidMiner! That would be epic, @rfuentealba!
    FYI @sgenzer, I did confirm with Helge that this is a bug in the Crawl Web operator. It looks like it is related to HTTPS pages (which is a shame, since that is like 90% of the web these days).

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • rfuentealba Moderator, RapidMiner Certified Analyst, Member Posts: 252   Unicorn
    I have good news: the Selenium extension for RapidMiner is in the pre-alpha stage. We at Pegasus are investing a lot of time in building something that can, for now, duplicate the behavior of what already exists in RapidMiner. As we discussed with @sgenzer at RapidMiner Wisdom, the extension will have blocks that perform certain actions, such as "clicking on a certain button", "retrieving the content from a certain element, class or id", "storing the content into a document", "waiting for the site to be ready before doing such things", and "going back to the previous page". So you can build your navigation flow.

    It is taking me ages because of my current travel plans (guess what? more delays!), but I have plans to release it at some point in January.
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor Posts: 2,048  Community Manager
    :smiley: :smiley: :smiley: :smiley:
  • KPL RapidMiner Certified Analyst, Member Posts: 7 Contributor II
    edited December 2018
    @Telcontar120 - tried out your workaround above with latest RM Studio 9.0.003 Large Ed. Process failed with:
    • Exception: java.lang.NoClassDefFoundError
    Please ignore - solved the problem, unrelated to new RM version
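
On the CSV export question raised earlier in the thread, one option outside RapidMiner is sketched below. It assumes the crawled documents were already stored as individual .txt files in one directory, and it writes one CSV row per file. The function name `documents_to_csv` and the column names `file` and `text` are hypothetical choices for illustration, not part of any RapidMiner API.

```python
import csv
from pathlib import Path


def documents_to_csv(input_dir, csv_path, pattern="*.txt"):
    """Write one CSV row (file name, full text) per stored document.

    Assumption: every document is a separate text file in input_dir.
    Returns the number of documents written.
    """
    files = sorted(Path(input_dir).glob(pattern))
    # newline="" lets the csv module handle line endings inside quoted fields.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "text"])
        for path in files:
            text = path.read_text(encoding="utf-8", errors="replace")
            writer.writerow([path.name, text])
    return len(files)
```

Because the `csv` module quotes fields as needed, review text containing commas, quotes or line breaks should survive the round trip into a spreadsheet or another tool.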