'Crawl Web' not following certain links
kludikovsky
I am new to RM and trying to explore its capabilities.
When I try to specify the links to follow, the required links are not followed as expected.
I have finally removed all 'limitations' and the links are still not followed.
One conclusion was that relative URLs are not handled properly. But this proved wrong with a test on the site http://www.formel1.de/rennergebnisse/2016/grosser-preis-von-deutschland/rennen with 'rennergebnisse
I use this simple process:
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.2.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="web:crawl_web" compatibility="7.2.000" expanded="true" height="68" name="Crawl Web (2)" width="90" x="45" y="34">
<parameter key="url" value="https://firmen.wko.at/Web/Ergebnis.aspx?StandortID=123&amp;StandortName=Innsbruck+Land&amp;Branche=3852&amp;BranchenName=Industrie&amp;CategoryID=0"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value=".*"/>
</list>
<parameter key="write_pages_into_files" value="false"/>
<parameter key="add_pages_as_attribute" value="true"/>
<parameter key="output_dir" value="C:\Users\Administrator\Documents\traRM"/>
<parameter key="extension" value="txt"/>
<parameter key="max_pages" value="5"/>
<parameter key="max_depth" value="999"/>
<parameter key="domain" value="server"/>
<parameter key="delay" value="1000"/>
<parameter key="max_threads" value="1"/>
<parameter key="max_page_size" value="99000"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36 OPR/38.0.2220.41"/>
<parameter key="obey_robot_exclusion" value="false"/>
<parameter key="really_ignore_exclusion" value="true"/>
</operator>
<operator activated="false" class="store" compatibility="7.2.000" expanded="true" height="68" name="Store" width="90" x="514" y="34">
<parameter key="repository_entry" value="../data/WKO_Test"/>
</operator>
<connect from_op="Crawl Web (2)" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
The result is just one page and the log shows:
Aug 4, 2016 10:51:32 AM INFO: Process //Local Repository/processes/WKO_Retrieve_Only starts
Aug 4, 2016 10:51:34 AM INFO: Storing page https://firmen.wko.at/Web/Ergebnis.aspx?StandortID=123&StandortName=Innsbruck+Land&Branche=3852&BranchenName=Industrie&CategoryID=0
Aug 4, 2016 10:51:34 AM INFO: Saving results.
Aug 4, 2016 10:51:34 AM INFO: Process //Local Repository/processes/WKO_Retrieve_Only finished successfully after 2 s
If I set the domain parameter from 'server' to 'web', links to other sites are followed, but still not the links within this site.
What am I doing wrong?
Answers
Hi kludikovsky,
Are you trying to crawl all the links on the site?
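If so, it may help to first check, outside RapidMiner, which anchors on a saved copy of the start page are ordinary same-server links at all. The following is only a rough diagnostic sketch in Python, not a description of how Crawl Web works internally. It assumes that 'server' means "same host as the start URL" and that the rule `.*` must match the whole absolute URL. The sample HTML, the `example.com` address and the names `LinkCollector` and `classify_links` are made up for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the raw href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def classify_links(html, base_url, pattern=".*"):
    """Split the page's links into (followable, skipped) absolute URLs.

    Assumption: a link counts as followable if it is http(s), sits on the
    same host as base_url, and the pattern matches the whole URL.
    """
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    rule = re.compile(pattern)
    followable, skipped = [], []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolves relative links
        parts = urlparse(absolute)
        ok = (
            parts.scheme in ("http", "https")
            and parts.netloc == base_host
            and rule.fullmatch(absolute) is not None
        )
        (followable if ok else skipped).append(absolute)
    return followable, skipped


# Hypothetical page content: one relative link, one script link, one external link.
sample = """
<a href="Detail.aspx?id=1">relative</a>
<a href="javascript:__doPostBack('next','')">script</a>
<a href="https://example.org/other">external</a>
"""
base = "https://example.com/list/results.aspx?page=1"
follow, skip = classify_links(sample, base)
print("followable:", follow)
print("skipped:", skip)
```

To try it on the real page, save the page source from the browser and pass its text as `html` together with the start URL from the process. If the links you expect show up under `skipped`, or not at all, they are probably not plain same-server `href` links, and a URL rule alone would not reach them.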