"Using Regex in the web crawler"

guitarslinger · April 2010

Hi there,

I am struggling with the setup of the crawlers in the web mining extension:

I can't figure out how to set the crawling rules so that the crawler produces any results.
Leaving the rules empty does not work either.

Can I find an example for crawling rules somewhere?

Thx in advance

GS

B_Miner · April 2010

Post what you are trying to do (the XML) along with a description. Maybe someone can help. I have used it successfully, but I'm not sure what your aim is.

guitarslinger · April 2010

Hi B_Miner, good point:

Here is the XML. The crawler is simply connected to the main process and has two rules:
1. follow every link ".*"
2. store every page ".*"

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="673" width="1094">
      <operator activated="true" class="web:crawl_web" expanded="true" height="60" name="Crawl Web" width="90" x="109" y="144">
        <parameter key="url" value="http://www.aol.com"/>
        <list key="crawling_rules">
          <parameter key="3" value=".*"/>
          <parameter key="1" value=".*"/>
        </list>
        <parameter key="write_pages_into_files" value="false"/>
        <parameter key="output_dir" value="C:\Users\Martin\Desktop\crawltest"/>
        <parameter key="max_depth" value="10"/>
        <parameter key="max_page_size" value="1000"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

guitarslinger · April 2010

Problem solved: I had no value in the parameter "max. pages".

I thought this parameter was optional and that leaving it blank would simply not limit the number of pages, but without any value the crawler actually does not crawl at all.

Works now, I am happy!

Regards GS
;D
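
For anyone hitting the same issue, here is a minimal sketch of the crawler operator from the process above with a page limit set. Note the assumptions: the key name "max_pages" is my guess at how the "max. pages" field is stored in the XML, and the value 100 is an arbitrary example. Check the parameter panel of your own installation before relying on it.

```xml
<operator activated="true" class="web:crawl_web" expanded="true" name="Crawl Web">
  <parameter key="url" value="http://www.aol.com"/>
  <list key="crawling_rules">
    <!-- Same two rules as in the process posted above -->
    <parameter key="3" value=".*"/>
    <parameter key="1" value=".*"/>
  </list>
  <!-- Assumption: "max. pages" is stored under the key "max_pages"; 100 is an arbitrary example value -->
  <parameter key="max_pages" value="100"/>
  <parameter key="max_depth" value="10"/>
  <parameter key="max_page_size" value="1000"/>
</operator>
```

Everything except the "max_pages" line is copied from the original process, so the only change is that the page limit is no longer empty.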

land · April 2010

Well,
it should be optional. ****. I will make sure it's optional in the future.

Good thing you got it to work, though.

Greetings,
Sebastian
