WebCrawl with Login

Poman · September 2009

Hi

Before I do any data mining, I need to crawl and grab some documents from the web. I am using the TextPlugin for crawling, but the site whose pages I am trying to fetch requires a login. Is there a way to log in through RapidMiner and then crawl the pages? I am currently getting:

[Fatal] ArrayIndexOutOfBoundsException occured in 1st application of Crawler (Crawler)
[Fatal] Process failed: operator cannot be executed (0). Check the log messages...

I assume this error occurs because I am not logged in. I am able to crawl any other public page.

Is there a way to configure the crawler to log in?

Thanks!

land · September 2009

Hi,
you might try to encode the login data into the URL. This might work if a standard form is used to transmit the information, and it should work if standard HTTP authentication is required. A Google search should help you find the correct format.
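For illustration, the two variants could look roughly like the following Java sketch. Everything specific in it is an assumption: the host `example.com`, the form field names `user` and `pass`, and the class and method names are made up, and the real field names have to be taken from the login form of the site. It assumes Java 10 or newer. Whether a given crawler actually honors credentials placed in the URL is something to verify, so the sketch also builds the equivalent `Authorization` header for HTTP Basic authentication.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class LoginUrlSketch {

    // Form login sent via GET: the credentials travel as query parameters.
    // "user" and "pass" are hypothetical field names.
    static String formLoginUrl(String base, String user, String pass) {
        return base
                + "?user=" + URLEncoder.encode(user, StandardCharsets.UTF_8)
                + "&pass=" + URLEncoder.encode(pass, StandardCharsets.UTF_8);
    }

    // HTTP authentication: the credentials go into the userinfo part of the URL,
    // i.e. http://user:password@host/path
    static String userInfoUrl(String host, String path, String user, String pass) {
        try {
            return new URI("http", user + ":" + pass, host, -1, path, null, null)
                    .toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // The same credentials as an explicit header for HTTP Basic authentication.
    static String basicAuthHeader(String user, String pass) {
        byte[] raw = (user + ":" + pass).getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        System.out.println(formLoginUrl("http://example.com/login", "alice", "secret"));
        System.out.println(userInfoUrl("example.com", "/docs/", "alice", "secret"));
        System.out.println("Authorization: " + basicAuthHeader("alice", "secret"));
    }
}
```

Neither variant helps if the site only accepts the login data in the body of a POST request.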

Greetings,
Sebastian

Poman · September 2009

Thanks Sebastian!

Unfortunately, the login for the web pages I am trying to crawl is more complex, so this will not work. The login information is sent with the HTTP POST method instead of the HTTP GET method, hence nothing is put into the URL. But there is an open-source Java library called HTTPClient that can help with this task. It basically helps to log in to the website and hold a persistent session with it. Once the session is created, I can pass control to the crawler to do its work.

But to do this, I will need to modify the crawler and recompile it so that the HTTP requests the crawler sends go through my functions.
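The login-then-crawl idea can be sketched as follows. This is only a minimal sketch under several assumptions: it uses the `java.net.http.HttpClient` that ships with Java 11 and newer instead of the HTTPClient library mentioned above, the class and method names are hypothetical, and the form field names must match the real login form. It does not touch the WVToolCrawler at all; it only shows a POST login whose session cookie is reused for later page requests.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

public class SessionLoginSketch {

    // Encode the form fields as application/x-www-form-urlencoded for the POST body.
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8) + "="
                    + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // One client for login and crawling: its cookie store keeps the session alive.
    static HttpClient newSessionClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ALL))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // POST the credentials; cookies set by the server end up in the client's cookie store.
    static int login(HttpClient client, String loginUrl, Map<String, String> fields)
            throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(loginUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
        return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
    }

    // Fetch a protected page with the same client so the session cookie is sent along.
    static String fetch(HttpClient client, String pageUrl) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(pageUrl)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

A crawler would call `newSessionClient()` once, then `login(...)` with the form fields, and afterwards route every page request through `fetch(...)` on that same client.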

So far I have developed the crawler (without the login functionality) with the RapidMiner API, using the class "edu.udo.cs.wvtool.crawler.WVToolCrawler", which comes as part of rapidminer-text-4.5.jar. I chose this class to write the web crawler because it is used as an example in "The Word Vector Tool and RapidMiner Text Plugin: User Guide, Operator Reference, and Developer Tutorial" (the guide created on July 19th, 2009). The web crawler code example is given in Chapter 4 of "Advanced Topics".

Since I need to modify this class so that it supports logging in, I downloaded the source for rapidminer-text-4.5.jar. Unfortunately, it does not seem to include the source for "edu.udo.cs.wvtool.crawler.WVToolCrawler".

So two questions:
1) Is there another place to get the source for "edu.udo.cs.wvtool.crawler.WVToolCrawler"?
2) Is there another way to write and use the crawler through other functions in the API for which the source is available?

Thanks,
--Pritesh

land · September 2009

Hi Pritesh,
the TextPlugin is based on the Word Vector Tool, which is available on SourceForge. You should take a look there and check out the sources.
There are no other API functions, but you could use the crawler API itself.

Anyway, I find your approach quite interesting. If you like, you could send me some details about your solution. We might include it in the next version of the web crawler, which is currently in the planning stage.

Greetings,
Sebastian
