
"Text Mining-Crawler problem"

sijusony Member Posts: 5 Contributor II
edited May 2019 in Help
hi everyone,
    I am facing a problem while using the Crawler. I tried the following code:

<operator name="Root" class="Process" expanded="yes">
    <parameter key="logfile" value="C:\Documents and Settings\284561\Desktop\rapid\logfile.log"/>
    <parameter key="resultfile" value="C:\Documents and Settings\284561\Desktop\rapid\result.res"/>
    <operator name="Crawler" class="Crawler">
        <list key="crawling_rules">
          <parameter key="follow_url" value="spreeblick"/>
          <parameter key="visit_content" value="google"/>
        </list>
        <parameter key="output_dir" value="C:\Documents and Settings\284561\Desktop\rapid"/>
        <parameter key="url" value="http://www.spreeblick.com/"/>
    </operator>
</operator>


If I run this, I get a message saying the process was successful, but I am not able to see the HTML pages in the specified output directory.

Can anyone tell me what the problem is? I am also attaching my log file.


P Dec 15, 2008 2:01:44 PM: Logging: log file is 'logfile.log'...
P Dec 15, 2008 2:01:44 PM: Initialising process setup
P Dec 15, 2008 2:01:44 PM: Checking properties...
P Dec 15, 2008 2:01:44 PM: Properties are ok.
P Dec 15, 2008 2:01:44 PM: Checking process setup...
P Dec 15, 2008 2:01:44 PM: Inner operators are ok.
P Dec 15, 2008 2:01:44 PM: Checking i/o classes...
P Dec 15, 2008 2:01:44 PM: i/o classes are ok. Process output: ExampleSet.
P Dec 15, 2008 2:01:44 PM: Process ok.
P Dec 15, 2008 2:01:44 PM: Process initialised
P Dec 15, 2008 2:01:44 PM: [NOTE] Process starts
P Dec 15, 2008 2:01:44 PM: Process:
  Root[1] (Process)
  +- Crawler[1] (Crawler)
Last message repeated 1 times.
P Dec 15, 2008 2:02:05 PM: Produced output:
IOContainer (2 objects):
SimpleExampleSet:
0 examples,
2 regular attributes,
no special attributes

(created by Crawler)
com.rapidminer.operator.crawler.LinkMatrix@13ddd13
(created by Crawler)

P Dec 15, 2008 2:02:05 PM: [NOTE] Process finished successfully after 21 seconds

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    probably your crawling rules forbid storing any page that is found. The parameters have the following meaning:
    The following condition types are supported to specify which links to follow:
    follow_url A link is only followed if the target URL contains all terms stated in this parameter.
    link_text A link is only followed if the link text contains all terms stated in this parameter.

    The conditions that state whether or not to store a page allow for the following expressions:
    visit_url A page is only stored if its URL contains all terms stated in this parameter.
    visit_content A page is only stored if its content contains all terms stated in this parameter.
    For more information see http://nemoz.org/joomla/content/view/64/53/lang,de/
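
    In your process above a page is only stored if its content contains "google", which is most likely why nothing ends up in the output directory. As a rough, untested sketch, a rule set that stores every page whose URL contains "spreeblick" could look like this:

        <list key="crawling_rules">
          <parameter key="follow_url" value="spreeblick"/>
          <parameter key="visit_url" value="spreeblick"/>
        </list>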

    Greetings,
      Sebastian
  • sijusony Member Posts: 5 Contributor II
    hi Sebastian,

    I tried the Crawler on an intranet site and it works fine, but when I try to crawl internet sites it gives me problems.
    The user agent I am using is rapid-miner-crawler. For accessing internet sites, do I have to use any other user agents?

    Thank you for your quick reply.
    greetings ,
    Siju Sony Mathew
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    perhaps they forbid this type of user agent on their site, or have even excluded crawlers in their robots.txt.

    Greetings,
      Sebastian
  • sijusony Member Posts: 5 Contributor II
    hi,

    Is there any other user agent with which the crawler can access the web pages?


    greetings,
    Siju
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    the parameter user_agent in the Crawler specifies the string used to identify the client to the HTTP server. You may put in arbitrary values, for example the values used by Internet Explorer, Firefox or anything else. If it is your own web page, you could even turn off "obey_robot_exclusion", causing the crawler to ignore bans within the robots.txt. But do this only if it is your own page!
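
    For example (just a sketch; the exact user agent string is up to you, the value below is only a placeholder), the relevant parameters of the Crawler operator could be set like this:

        <parameter key="user_agent" value="Mozilla/5.0 (compatible; MyCrawler)"/>
        <parameter key="obey_robot_exclusion" value="false"/>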

    Greetings,
      Sebastian