Problems with webcrawl - SOLVED

oju987 Member Posts: 5 Contributor II
edited March 23 in Help
I used RapidMiner a few months back and I am starting to use it again with the new version. I used to work with the Crawl Web operator and, even though there were some limitations, it worked most of the time.

Right now I am working on a project to extract data from a real estate site, but I am really frustrated because I can't make it work. I have tried different sites, but I never get a single page out of any of them.

Am I doing something really wrong? Has something changed from the previous version that I am not considering?

I tried this:

<?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="web:crawl_web_modern" compatibility="9.3.001" expanded="true" height="68" name="Crawl Web" width="90" x="112" y="34">
        <list key="crawling_rules"/>
        <parameter key="max_crawl_depth" value="5"/>
        <parameter key="retrieve_as_html" value="true"/>
        <parameter key="enable_basic_auth" value="false"/>
        <parameter key="add_content_as_attribute" value="false"/>
        <parameter key="write_pages_to_disk" value="false"/>
        <parameter key="include_binary_content" value="false"/>
        <parameter key="output_file_extension" value="txt"/>
        <parameter key="max_pages" value="1000"/>
        <parameter key="max_page_size" value="30000"/>
        <parameter key="delay" value="200"/>
        <parameter key="max_concurrent_connections" value="100"/>
        <parameter key="max_connections_per_host" value="50"/>
        <parameter key="user_agent" value="rapidminer-web-mining-extension-crawler"/>
        <parameter key="ignore_robot_exclusion" value="true"/>
      </operator>
      <connect from_op="Crawl Web" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>


I also tried the basic URL: https://mapainmueble.com/

But same result: no pages extracted. Thanks for any help!
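A quick way to see whether the site returns any crawlable HTML at all is to fetch the start page outside RapidMiner. A minimal sketch, assuming Python 3 with the requests package installed (my own sanity check, not what the Crawl Web operator does internally):

# Check whether the start page returns plain HTML with links in it.
# Assumption: the URL is the one from the post; requests + a link regex
# is only a rough check, not a replica of the Crawl Web operator.
import re
import requests

url = "https://mapainmueble.com/"
headers = {"User-Agent": "rapidminer-web-mining-extension-crawler"}

response = requests.get(url, headers=headers, timeout=30)
print("HTTP status:", response.status_code)

# Zero links in the raw HTML would explain an empty crawl, e.g. if the
# page body is built by JavaScript after loading.
links = re.findall(r'href="(https?://[^"]+)"', response.text)
print("Links found in raw HTML:", len(links))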

Best Answer

Answers

  • oju987 Member Posts: 5 Contributor II
    Thanks for your answer. You just confirmed my worst fears. It is over 2,000 links, so I used Octoparse, which is pretty simple to use, to obtain the list of URLs, and then I will still use RapidMiner to extract and cleanse the data (rough sketch of the workflow below).
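    The rough idea: export the URL list from Octoparse (e.g. as a CSV), then fetch each page for the extraction and cleansing step. A minimal sketch, assuming Python 3 with requests installed; the file name urls.csv and the column name "url" are made up for the example, and inside RapidMiner the equivalent would presumably be Read CSV followed by the Web Mining extension's page-retrieval operator:

    # Read the URL list exported from Octoparse and download each page
    # for later extraction/cleansing. File and column names are assumptions.
    import csv
    import time
    import requests

    pages = {}
    with open("urls.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = row["url"]
            try:
                response = requests.get(url, timeout=30)
                pages[url] = response.text
            except requests.RequestException as err:
                print("failed:", url, err)
            time.sleep(0.2)  # small pause between requests, like the operator's 200 ms delay

    print("downloaded", len(pages), "pages")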
