
"Using Regex in the web crawler"

guitarslinger Member Posts: 12 Contributor II
edited June 2019 in Help
Hi there,

I am struggling with the setup of the crawlers in the web mining extension:

I can't figure out how to set the crawling rules so that the crawler produces any results.
Leaving the rules empty does not work either.

Can I find an example for crawling rules somewhere?

Thx in advance

GS

Answers

  • B_Miner Member Posts: 72 Contributor II
    Post what you are trying to do (the process XML) along with a description, and maybe someone can help. I have used the crawler successfully, but I am not sure what you are aiming for.
  • guitarslinger Member Posts: 12 Contributor II
    Hi B_Miner, good point:

    Here is the XML. The crawler is simply connected to the main process and has two rules:
    1. follow every link ".*"
    2. store every page ".*"
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="673" width="1094">
          <operator activated="true" class="web:crawl_web" expanded="true" height="60" name="Crawl Web" width="90" x="109" y="144">
            <parameter key="url" value="http://www.aol.com"/>
            <list key="crawling_rules">
              <parameter key="3" value=".*"/>
              <parameter key="1" value=".*"/>
            </list>
            <parameter key="write_pages_into_files" value="false"/>
            <parameter key="output_dir" value="C:\Users\Martin\Desktop\crawltest"/>
            <parameter key="max_depth" value="10"/>
            <parameter key="max_page_size" value="1000"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
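
    For comparison, a narrower rule set might look like the fragment below. This is only a sketch under several assumptions: that key "3" is the follow rule and key "1" is the store rule (as the order of the list above suggests), that both are matched against the page URL, and that the values are Java-style regular expressions that must match the whole string. The two patterns are made-up examples.

    ```xml
    <list key="crawling_rules">
      <!-- assumed follow rule: only links that stay on aol.com (hypothetical pattern) -->
      <parameter key="3" value="https?://([^/]+\.)?aol\.com(/.*)?"/>
      <!-- assumed store rule: only URLs ending in .html or .htm (hypothetical pattern) -->
      <parameter key="1" value=".*\.html?"/>
    </list>
    ```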

  • guitarslinger Member Posts: 12 Contributor II
    Problem solved: I had no value in the "max. pages" parameter.

    I thought this parameter was optional and that leaving it blank would simply not limit the number of pages, but without a value the crawler does not crawl at all.
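
    In XML terms, the fix amounts to one extra parameter on the Crawl Web operator. A minimal sketch, assuming the "max. pages" field is stored under the key max_pages (the key name is not shown in this thread) and using 100 as an arbitrary limit:

    ```xml
    <operator activated="true" class="web:crawl_web" expanded="true" height="60" name="Crawl Web" width="90" x="109" y="144">
      <parameter key="url" value="http://www.aol.com"/>
      <list key="crawling_rules">
        <parameter key="3" value=".*"/>
        <parameter key="1" value=".*"/>
      </list>
      <!-- assumed key name for "max. pages"; 100 is an arbitrary example value -->
      <parameter key="max_pages" value="100"/>
      <parameter key="max_depth" value="10"/>
      <parameter key="max_page_size" value="1000"/>
    </operator>
    ```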

    Works now, I am happy!

    Regards GS
    ;D
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Well,
    it should be optional. ****. I will make sure it is optional in the future :)
    Good thing you got it to work, though.

    Greetings,
      Sebastian