"issues with web crawling [UPDATED]"

kayman · April 2015

UPDATE :

It seems that the scenario below is common to any website that uses query parameters in the URL.

In other words, any URL that looks like 'http://domain.com/some_blabla?param1=something&param2=somethingelse' is not crawled.

Some examples :

https://www.worten.pt/inicio/imagem-e-som/tv.html -> no problem
https://www.worten.pt/inicio/imagem-e-som/tv.html?p=3 -> not crawled when using above page as starting point, works fine if entered directly

These are my crawling rules, which are pretty basic:

            
<parameter key="url" value="https://www.worten.pt/inicio/imagem-e-som/tv.html"/>
<list key="crawling_rules">
<parameter key="store_with_matching_url" value=".*imagem-e-som/tv.html.*"/>
<parameter key="follow_link_with_matching_url" value=".*imagem-e-som/tv.html.*"/>
</list>

The same problem occurs with this site:

http://www.fnac.com/Tous-les-televiseurs/Televiseur/nsh75822/w-4 -> no problem
http://www.fnac.com/Tous-les-televiseurs/Televiseur/nsh75822/w-4?PageIndex=3#3 -> not crawled when using above page as starting point, works fine if entered directly

            
<parameter key="url" value="http://www.fnac.com/Tous-les-televiseurs/Televiseur/nsh75822/w-4"/>
<list key="crawling_rules">
<parameter key="store_with_matching_url" value="http://www.fnac.com/Tous-les-televiseurs/Televiseur/.*"/>
<parameter key="follow_link_with_matching_url" value="http://www.fnac.com/Tous-les-televiseurs/Televiseur/.*"/>
</list>

The same goes for the full example below:

https://www.otto.de/multimedia/fernseher/led-fernseher/ -> no problem
https://www.otto.de/multimedia/fernseher/led-fernseher/?p=2&ps=30 -> not crawled when using above page as starting point, works fine if entered directly

I'm using similar logic on other sites that all work fine, but as soon as a question mark appears in the URL the logic breaks.
Is this a bug, or am I overlooking something? Is the same issue seen in version 6?
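
As a sanity check outside RapidMiner, the minimal Python sketch below tests the worten.pt and fnac.com rules above against their paginated URLs. It assumes the crawler applies each rule to the whole URL (the way Java's `String.matches` does), which I have not confirmed; `CASES` and `rule_matches` are names made up for this illustration.

```python
import re

# (rule pattern, paginated URL) pairs copied from the examples above
CASES = [
    (r".*imagem-e-som/tv.html.*",
     "https://www.worten.pt/inicio/imagem-e-som/tv.html?p=3"),
    (r"http://www.fnac.com/Tous-les-televiseurs/Televiseur/.*",
     "http://www.fnac.com/Tous-les-televiseurs/Televiseur/nsh75822/w-4?PageIndex=3#3"),
]


def rule_matches(pattern: str, url: str) -> bool:
    """Return True if the pattern matches the entire URL.

    Assumption: whole-string matching, like Java's String.matches.
    """
    return re.fullmatch(pattern, url) is not None


# Evaluate every rule against its paginated URL
results = [rule_matches(pattern, url) for pattern, url in CASES]

for (pattern, url), ok in zip(CASES, results):
    print(ok, url)
```

Both checks come back true, so the patterns themselves accept URLs containing a question mark. If the whole-URL assumption holds, the regex does not look like the culprit.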

[ORIGINAL QUESTION]

I'm creating a process to compare prices from different retailers. For most of them it works fine, but some are really driving me nuts when it comes to following links.

This is an example :

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="subprocess" compatibility="5.3.015" expanded="true" height="60" name="crawl_and_store" width="90" x="45" y="30">
        <parameter key="parallelize_nested_chain" value="true"/>
        <process expanded="true">
          <operator activated="true" class="web:crawl_web" compatibility="5.3.002" expanded="true" height="60" name="Crawl_tv" width="90" x="45" y="30">
            <parameter key="url" value="https://www.otto.de/multimedia/fernseher/led-fernseher/"/>
            <list key="crawling_rules">
              <parameter key="store_with_matching_url" value=".*/led-fernseher/.*"/>
              <parameter key="follow_link_with_matching_url" value=".*/led-fernseher/.*"/>
            </list>
            <parameter key="add_pages_as_attribute" value="true"/>
            <parameter key="output_dir" value="D:\mining\test"/>
            <parameter key="extension" value="html"/>
            <parameter key="max_pages" value="2000"/>
            <parameter key="max_depth" value="3"/>
            <parameter key="max_threads" value="4"/>
            <parameter key="max_page_size" value="2000"/>
            <parameter key="obey_robot_exclusion" value="false"/>
            <parameter key="really_ignore_exclusion" value="true"/>
          </operator>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
        </process>
      </operator>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>

I can only get one page, namely the base URL https://www.otto.de/multimedia/fernseher/led-fernseher/

The page does contain links of the following form: https://www.otto.de/multimedia/fernseher/led-fernseher/?p=2&ps=30 but I can't get them crawled. I've tried several regex patterns; all of them match when tested, but the page still does not get crawled. Any idea what the problem could be?

Thanks in advance !
