WEB crawler rules

Contributor II

Hi!

I'm new to RapidMiner and I must say I like it. I have in-depth knowledge of MS SQL, but I'm a complete beginner in RapidMiner.
So I've started to use the Crawl Web operator.

I'm using it on a Slovenian real estate webpage, and I'm having trouble setting the web crawler rules.

I know that there are two important rule types: which URLs to follow and which URLs to store.

I would like to store URLs of the form http://www.realestate-slovenia.info/nepremicnine.html?id=something.
For example, this is a URL I want to store: http://www.realestate-slovenia.info/nepremicnine.html?id=5725280

What about the URL rule to follow? It doesn't seem to work. I tried something like this: .+pg.+|.+id.+
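
To rule out the expression itself, I also tested it outside RapidMiner with plain Java regular expressions (a quick sketch; I'm only assuming here that the crawler applies each rule as a full-match java.util.regex pattern on every URL it finds):

import java.util.regex.Pattern;

public class FollowRuleCheck {
    public static void main(String[] args) {
        // the follow rule I tried in the Crawl Web operator
        String rule = ".+pg.+|.+id.+";
        // a result page I want the crawler to follow
        System.out.println(Pattern.matches(rule,
                "http://www.realestate-slovenia.info/nepremicnine.html?q=sale&pg=6")); // true
        // an ad detail page I want the crawler to store
        System.out.println(Pattern.matches(rule,
                "http://www.realestate-slovenia.info/nepremicnine.html?id=5725280")); // true
    }
}

Both URLs match, so I suspect the problem is in one of the other parameters.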

Any help would be appreciated!

U.
4 REPLIES
Super Contributor

Re: WEB crawler rules

Hey U,

On a quick check I got some pages with the following settings:
url: http://www.realestate-slovenia.info/
both rules: .+id.+

And I also increased the max page size to 10000.
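
In case it helps to see what that rule does and does not match, here is a quick check with plain java.util.regex (I am assuming the crawler applies each rule as a full-match pattern on the URL):

import java.util.regex.Pattern;

public class IdRuleCheck {
    public static void main(String[] args) {
        String rule = ".+id.+"; // used as both the follow rule and the store rule
        // an ad detail page contains "id", so it is followed and stored
        System.out.println(Pattern.matches(rule,
                "http://www.realestate-slovenia.info/nepremicnine.html?id=5725280")); // true
        // the start page contains no "id", so it does not match the rule
        System.out.println(Pattern.matches(rule,
                "http://www.realestate-slovenia.info/")); // false
    }
}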

As always I have to ask this: did you check that the site policy/copyright note allows you to machine-crawl that page?

Best regards,
Marius
Contributor II

Re: WEB crawler rules

Marius,

The web page allows robots.

Your example stores only the real estate ads on the first page. The web crawler doesn't go to the second, third, ... page.

Thanks for helping.
Super Contributor

Re: WEB crawler rules

Then you probably have to increase the max_depth and adapt your rules. Please note that you should not add more than one follow rule; instead, add all expressions to one single rule, separated by a vertical bar, as you did in your first post.
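
For example, the .+id.+ follow rule from my last post never matches the pagination URLs, which is why the crawler never gets past the first page. A single combined rule fixes that (again a quick java.util.regex check; I am assuming the crawler applies each rule as a full match on the URL):

import java.util.regex.Pattern;

public class CombinedRuleCheck {
    public static void main(String[] args) {
        String pgUrl = "http://www.realestate-slovenia.info/nepremicnine.html?q=sale&pg=2";
        // the pagination URL contains no "id", so this rule alone does not follow it
        System.out.println(Pattern.matches(".+id.+", pgUrl)); // false
        // one single follow rule with both alternatives, separated by a vertical bar
        System.out.println(Pattern.matches(".+id.+|.+pg.+", pgUrl)); // true
    }
}

With .+id.+|.+pg.+ as the only follow rule and .+id.+ as the store rule, the crawler can walk the result pages but store only the ads.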

Best regards,
Marius
Contributor II

Re: WEB crawler rules

Marius,

I put the web crawler problem aside for a while. Today I started to deal with it again. I still have a problem with the crawling rules. All the other web crawler attributes are clear.

This is my Web crawler process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:crawl_web" compatibility="5.3.000" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="120">
        <parameter key="url" value="http://www.realestate-slovenia.info/nepremicnine.html?q=sale"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value="http://www.realestate-slovenia.info/nepremicnine.html?(q=sale| q=sale[&amp;]pg=.+ | id=.+)"/>
          <parameter key="store_with_matching_url" value="http://www.realestate-slovenia.info/nepremicnine.html?id=.+"/>
        </list>
        <parameter key="output_dir" value="C:\RapidMiner\RealEstate"/>
        <parameter key="extension" value="html"/>
        <parameter key="max_depth" value="4"/>
        <parameter key="domain" value="server"/>
        <parameter key="max_page_size" value="10000"/>
        <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

As you can see, I try to follow three types of URLs, for example:

http://www.realestate-slovenia.info/nepremicnine.html?q=sale
http://www.realestate-slovenia.info/nepremicnine.html?q=sale&pg=6
http://www.realestate-slovenia.info/nepremicnine.html?id=5744923

And I want to store only one type of URL:

http://www.realestate-slovenia.info/nepremicnine.html?id=5469846

So for the first task my rule is:

http://www.realestate-slovenia.info/nepremicnine.html?(q=sale | q=sale&pg=.+ | id=.+)

For the second task the rule is:
http://www.realestate-slovenia.info/nepremicnine.html?id=.+

The rules seem to be valid, but no output documents are returned. I've tried many different combinations, for example .+pg.+ | .+id.+ for the first task and .+id.+ for the second task, but the latter returns many pages that are not my focus.
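
To see whether the first rule even matches my example URLs, I checked it outside RapidMiner with plain Java regular expressions (a quick sketch; I am assuming the crawler applies each rule as a full-match java.util.regex pattern on the URL):

import java.util.regex.Pattern;

public class StoreRuleCheck {
    public static void main(String[] args) {
        String url = "http://www.realestate-slovenia.info/nepremicnine.html?id=5744923";
        // the rule as I wrote it: the unescaped ? makes the preceding "l" optional
        // instead of matching a literal question mark, and the spaces around the
        // vertical bars become part of the alternatives
        String asWritten = "http://www.realestate-slovenia.info/nepremicnine.html?(q=sale| q=sale&pg=.+ | id=.+)";
        System.out.println(Pattern.matches(asWritten, url)); // false
        // the same rule with the dots and the ? escaped and the spaces removed
        String escaped = "http://www\\.realestate-slovenia\\.info/nepremicnine\\.html\\?(q=sale|q=sale&pg=.+|id=.+)";
        System.out.println(Pattern.matches(escaped, url)); // true
    }
}

Could the unescaped question mark and the spaces inside the alternation be the reason nothing is returned?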

I would really like this process to work because the gathered data are the basis for my article.

Thanks.