'Crawl Web' not following certain links

kludikovsky · August 2016

I am new to RM and trying to explore its capabilities.

When I try to specify the links to follow, the required links are not followed as expected.

I have finally removed all 'limitations' and the links are still not followed.

One conclusion was that relative URLs are not handled properly. But this proved wrong with a test on the site http://www.formel1.de/rennergebnisse/2016/grosser-preis-von-deutschland/rennen with 'rennergebnisse

I use this simple process:

<?xml version="1.0" encoding="UTF-8"?><process version="7.2.000">
 <context>
 <input/>
 <output/>
 <macros/>
 </context>
 <operator activated="true" class="process" compatibility="7.2.000" expanded="true" name="Process">
 <parameter key="logverbosity" value="init"/>
 <parameter key="random_seed" value="2001"/>
 <parameter key="send_mail" value="never"/>
 <parameter key="notification_email" value=""/>
 <parameter key="process_duration_for_mail" value="30"/>
 <parameter key="encoding" value="SYSTEM"/>
 <process expanded="true">
 <operator activated="true" class="web:crawl_web" compatibility="7.2.000" expanded="true" height="68" name="Crawl Web (2)" width="90" x="45" y="34">
 <parameter key="url" value="https://firmen.wko.at/Web/Ergebnis.aspx?StandortID=123&amp;StandortName=Innsbruck+Land&amp;Branche=3852&amp;BranchenName=Industrie&amp;CategoryID=0"/>
 <list key="crawling_rules">
 <parameter key="follow_link_with_matching_url" value=".*"/>
 </list>
 <parameter key="write_pages_into_files" value="false"/>
 <parameter key="add_pages_as_attribute" value="true"/>
 <parameter key="output_dir" value="C:\Users\Administrator\Documents\traRM"/>
 <parameter key="extension" value="txt"/>
 <parameter key="max_pages" value="5"/>
 <parameter key="max_depth" value="999"/>
 <parameter key="domain" value="server"/>
 <parameter key="delay" value="1000"/>
 <parameter key="max_threads" value="1"/>
 <parameter key="max_page_size" value="99000"/>
 <parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36 OPR/38.0.2220.41"/>
 <parameter key="obey_robot_exclusion" value="false"/>
 <parameter key="really_ignore_exclusion" value="true"/>
 </operator>
 <operator activated="false" class="store" compatibility="7.2.000" expanded="true" height="68" name="Store" width="90" x="514" y="34">
 <parameter key="repository_entry" value="../data/WKO_Test"/>
 </operator>
 <connect from_op="Crawl Web (2)" from_port="Example Set" to_port="result 1"/>
 <portSpacing port="source_input 1" spacing="0"/>
 <portSpacing port="sink_result 1" spacing="0"/>
 <portSpacing port="sink_result 2" spacing="0"/>
 </process>
 </operator>
</process>

The result is just one page and the log shows:

Aug 4, 2016 10:51:32 AM INFO: Process //Local Repository/processes/WKO_Retrieve_Only starts
Aug 4, 2016 10:51:34 AM INFO: Storing page https://firmen.wko.at/Web/Ergebnis.aspx?StandortID=123&StandortName=Innsbruck+Land&Branche=3852&BranchenName=Industrie&CategoryID=0
Aug 4, 2016 10:51:34 AM INFO: Saving results.
Aug 4, 2016 10:51:34 AM INFO: Process //Local Repository/processes/WKO_Retrieve_Only finished successfully after 2 s

If I set the domain parameter from 'server' to 'web', links to other sites are followed, but links within this site are still not followed.
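To see which links a crawler could follow at all, it can help to list the links of the start page outside of RapidMiner. The following is only a minimal sketch in Python, not part of the RM process: it resolves every href against the page URL and flags whether it stays on the same host. It assumes that 'server' roughly means "same host as the start URL", which I have not verified for the operator. The sample HTML, the name `classify_links` and the shortened base URL are hypothetical and only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the raw href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify_links(base_url, html):
    """Return (absolute_url, same_host) pairs for each link in html."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlsplit(base_url).hostname
    result = []
    for href in collector.hrefs:
        # Relative links are resolved against the page URL.
        absolute = urljoin(base_url, href)
        parts = urlsplit(absolute)
        # Only http(s) links on the start host count as "same host";
        # other schemes (javascript:, mailto:) have no host to compare.
        same_host = parts.scheme in ("http", "https") and parts.hostname == base_host
        result.append((absolute, same_host))
    return result


# Hypothetical page snippet, not taken from the real site.
sample_html = """
<a href="Detail.aspx?id=1">relative</a>
<a href="https://example.org/other">external</a>
<a href="javascript:__doPostBack('next','')">script handler</a>
"""
base = "https://firmen.wko.at/Web/Ergebnis.aspx?StandortID=123"
links = classify_links(base, sample_html)
for url, same_host in links:
    print(same_host, url)
```

Feeding the saved HTML of the real start page into `classify_links` would show whether the links of interest are plain same-host URLs or something a simple link follower may not treat as a URL.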

What am I doing wrong?

Thomas_Ott · August 2016

Hi kludikovsky,

Are you trying to crawl all the links on the site?
