
"Text Mining-Crawler problem"

sijusony Member Posts: 5 Contributor II
edited May 2019 in Help
hi everyone,
    I am facing a problem while using the Crawler. I tried the following code:

<operator name="Root" class="Process" expanded="yes">
    <parameter key="logfile" value="C:\Documents and Settings\284561\Desktop\rapid\logfile.log"/>
    <parameter key="resultfile" value="C:\Documents and Settings\284561\Desktop\rapid\result.res"/>
    <operator name="Crawler" class="Crawler">
        <list key="crawling_rules">
          <parameter key="follow_url" value="spreeblick"/>
          <parameter key="visit_content" value="google"/>
        </list>
        <parameter key="output_dir" value="C:\Documents and Settings\284561\Desktop\rapid"/>
        <parameter key="url" value="http://www.spreeblick.com/"/>
    </operator>
</operator>


If I run this, I get a message saying the process was successful, but I am not able to see the HTML pages in the specified output directory.

Can anyone tell me what the problem is? I am also attaching my log file.


P Dec 15, 2008 2:01:44 PM: Logging: log file is 'logfile.log'...
P Dec 15, 2008 2:01:44 PM: Initialising process setup
P Dec 15, 2008 2:01:44 PM: Checking properties...
P Dec 15, 2008 2:01:44 PM: Properties are ok.
P Dec 15, 2008 2:01:44 PM: Checking process setup...
P Dec 15, 2008 2:01:44 PM: Inner operators are ok.
P Dec 15, 2008 2:01:44 PM: Checking i/o classes...
P Dec 15, 2008 2:01:44 PM: i/o classes are ok. Process output: ExampleSet.
P Dec 15, 2008 2:01:44 PM: Process ok.
P Dec 15, 2008 2:01:44 PM: Process initialised
P Dec 15, 2008 2:01:44 PM: [NOTE] Process starts
P Dec 15, 2008 2:01:44 PM: Process:
  Root[1] (Process)
  +- Crawler[1] (Crawler)
Last message repeated 1 times.
P Dec 15, 2008 2:02:05 PM: Produced output:
IOContainer (2 objects):
SimpleExampleSet:
0 examples,
2 regular attributes,
no special attributes

(created by Crawler)
com.rapidminer.operator.crawler.LinkMatrix@13ddd13
(created by Crawler)

P Dec 15, 2008 2:02:05 PM: [NOTE] Process finished successfully after 21 seconds

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    probably your crawling rules forbid storing any page that is found. The parameters have the following meaning:
    The following condition types are supported to specify which links to follow:
    follow_url A link is only followed if the target URL contains all terms stated in this parameter.
    link_text A link is only followed if the link text contains all terms stated in this parameter.

    The conditions that state whether or not to store a page allow for the following expressions:
    visit_url A page is only stored if its URL contains all terms stated in this parameter.
    visit_content A page is only stored if its content contains all terms stated in this parameter.
    For more information see http://nemoz.org/joomla/content/view/64/53/lang,de/
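
    In your process above a page is only stored if its content contains "google", which is most likely why nothing ends up in the output directory. As a rough, untested sketch, a rule set that stores every page whose URL contains "spreeblick" could look like this:

        <list key="crawling_rules">
          <parameter key="follow_url" value="spreeblick"/>
          <parameter key="visit_url" value="spreeblick"/>
        </list>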

    Greetings,
      Sebastian
  • sijusony Member Posts: 5 Contributor II
    hi Sebastian,

    I tried the Crawler on an intranet site and it works fine, but when I try to crawl internet sites it gives me problems.
    The user agent I am using is rapid-miner-crawler. For accessing internet sites, do I have to use any other user agents?

    Thank you for your quick reply.
    greetings ,
    Siju Sony Mathew
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    perhaps they forbid this type of user agent on their site, or have even excluded crawlers in their robots.txt.

    Greetings,
      Sebastian
  • sijusony Member Posts: 5 Contributor II
    hi,

    Is there any other user agent with which the crawler can access the web pages?


    greetings,
    Siju
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    the parameter user_agent in the Crawler specifies the string used to identify the client to the HTTP server. You may put in arbitrary values, for example the values used by Internet Explorer, Firefox or anything else. If it is your own web page, you could even turn off "obey_robot_exclusion", causing the crawler to ignore bans within the robots.txt. But do this only if it is your own page!
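
    For example (just a sketch; the exact user agent string is up to you, the value below is only a placeholder), the relevant parameters of the Crawler operator could be set like this:

        <parameter key="user_agent" value="Mozilla/5.0 (compatible; MyCrawler)"/>
        <parameter key="obey_robot_exclusion" value="false"/>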

    Greetings,
      Sebastian