"Problem with RapidMiner Crawler"

lexusboy Member Posts: 22 Maven
edited May 23 in Help
Hello,

I recently started using RapidMiner to crawl web sites, but I have been running into problems with some sites. I really like RapidMiner's performance and the ease with which it can be configured to suit your needs, and I want to stick with it, so any help would be appreciated.

Here is a snapshot from my log file

P May 20, 2009 3:29:59 PM: Initialising process setup
P May 20, 2009 3:29:59 PM: [NOTE] No filename given for result file, using stdout for logging results!
P May 20, 2009 3:29:59 PM: Checking properties...
P May 20, 2009 3:29:59 PM: Properties are ok.
P May 20, 2009 3:29:59 PM: Checking process setup...
P May 20, 2009 3:29:59 PM: Inner operators are ok.
P May 20, 2009 3:29:59 PM: Checking i/o classes...
P May 20, 2009 3:29:59 PM: i/o classes are ok. Process output: ExampleSet, NumericalMatrix.
P May 20, 2009 3:29:59 PM: Process ok.
P May 20, 2009 3:29:59 PM: Process initialised
P May 20, 2009 3:29:59 PM: [NOTE] Process starts
P May 20, 2009 3:29:59 PM: Process:
  Root[0] (Process)
  +- Crawler[0] (Crawler)
G May 20, 2009 3:29:59 PM: [Fatal] ArrayIndexOutOfBoundsException occured in 1st application of Crawler (Crawler)
G May 20, 2009 3:29:59 PM: [Fatal] Process failed: operator cannot be executed (0). Check the log messages...
          Root[1] (Process)
here ==> +- Crawler[1] (Crawler)

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,525 Unicorn
    Hi,
    unfortunately I cannot tell anything from this log besides the fact that there is an error :) If you could post your process containing the crawler here, I would be able to reproduce the error and try to resolve it.

    Greetings,
      Sebastian
  • lexusboy Member Posts: 22 Maven
    Hello Sebastian,

    Here is my process in XML structure, hope this is what you meant :)

    <operator name="Root" class="Process" expanded="yes">
        <operator name="Crawler" class="Crawler">
            <parameter key="url" value="http://www.triathlon-szene.de/forum/"/>
            <list key="crawling_rules">
              <parameter key="visit_content" value="&quot;new balance&quot;"/>
              <parameter key="follow_url" value="laufforum.de"/>
            </list>
            <parameter key="delay" value="0"/>
            <parameter key="max_threads" value="3"/>
            <parameter key="output_dir" value="C:\Documents and Settings\Bhavya\My Documents\RapidMiner\laufforum"/>
        </operator>
    </operator>

    Thanks in advance!

    Regards,
    Bhavya
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,525 Unicorn
    Hi,
    thank you. I will take a look at it as soon as possible, but the error seems to be in the wvtool, which makes debugging a lot more complex :)

    Greetings,
      Sebastian
  • lexusboy Member Posts: 22 Maven
    Hello,

    Thank you Sebastian, I hope you can find a solution for this problem :)

    Best Regards,
    Bhavya
  • miwahattori Member Posts: 3 Contributor I
    Hello all,

    I'm wondering if there was ever any resolution to this problem. I'm using v.4.6 of the plug-in and getting the exact same error. My process is essentially the same as Bhavya's in this thread, so rather than starting a new post I'm looking for any follow-up here. The error occurs only with some starting URLs.

    Any guidance will be appreciated!
    Miwa
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,525 Unicorn
    Hi,
    sorry for the late answer. I suspect this is not a bug in the RapidMiner crawler; instead, the forums simply forbid robots from crawling their pages in their robots.txt. The crawler obeys this rule as long as obey_robot_exclusion is checked. This setting should not be changed as long as the website owner does not allow you to crawl their pages.
    Another possibility is that the forum only allows user agents that identify themselves as browsers.

    Greetings,
      Sebastian
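
    One way to check Sebastian's robots.txt hypothesis outside RapidMiner is to query the site's exclusion rules directly. Below is a minimal sketch using Python's standard urllib.robotparser; the Disallow rule and the "MyCrawler" user-agent name are illustrative assumptions, not the forum's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly; RobotFileParser can also fetch a
# live file via set_url(...) followed by read().
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /forum/",  # hypothetical rule: forum closed to all robots
])

# A crawler that obeys robot exclusion must skip disallowed paths.
allowed_forum = rp.can_fetch("MyCrawler", "http://www.triathlon-szene.de/forum/")
allowed_other = rp.can_fetch("MyCrawler", "http://www.triathlon-szene.de/news/")
print(allowed_forum)  # False: /forum/ is disallowed for every user agent
print(allowed_other)  # True: no rule matches /news/
```

    If the starting URL comes back disallowed, the crawler's empty result is expected behavior rather than a bug.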
  • miwahattori Member Posts: 3 Contributor I
    Sebastian,

    Thanks for your response. I had the obey_robot_exclusion rule unchecked because I was running a test retrieval on our own organization's homepage, and I was still getting the error. However, there is a workaround: our intuition is that the homepage I was trying to crawl had too many links, causing an array index overflow. Our homepage housed a list of 1000+ links. After splitting the list into smaller partitions (multiple HTML files to use as starting URLs, each containing about 200 links), the crawler ran without error on each. If you could confirm whether an excessive number of links is a known issue in the crawler, I would very much appreciate it, but in the meantime we are able to continue with this workaround.

    Best regards,
    Miwa
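
    The splitting step Miwa describes can be automated. The sketch below is one possible implementation, not part of the thread: partition a list of URLs into chunks of roughly 200 and render each chunk as a small HTML page; each page can then be saved (e.g. as seed_0.html, seed_1.html, ...) and used as a separate starting URL for the crawler. The chunk size and example URLs are arbitrary assumptions.

```python
def build_seed_pages(urls, chunk_size=200):
    """Partition urls into chunks of chunk_size and render each chunk
    as a minimal HTML page of links. Returns one HTML string per page."""
    pages = []
    for i in range(0, len(urls), chunk_size):
        chunk = urls[i:i + chunk_size]
        links = "\n".join(f'<a href="{u}">{u}</a>' for u in chunk)
        pages.append(f"<html><body>\n{links}\n</body></html>\n")
    return pages

# Example: a 1000-link list becomes five seed pages of 200 links each.
urls = [f"http://example.com/page{i}" for i in range(1000)]
pages = build_seed_pages(urls)
print(len(pages))  # 5
```

    Writing each string in `pages` to its own file reproduces the workaround of giving the crawler several small starting pages instead of one oversized one.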

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,525 Unicorn
    Hi,
    I have never heard of such problems. We will revise the crawler shortly; I will take this hint into account then.

    Greetings,
      Sebastian