"First Steps in Webmining"

JibJabJab Member Posts: 3 Contributor I
edited May 23 in Help
So I decided to get a bit deeper into RapidMiner and defined my first challenge.
I want the crawler to fetch every posting of a blog that mentions a certain word.

At first I started with the wizard, but it seems to expect an existing database or file to work with.
So instead I took the "naked" Root process, added the Crawler operator,
and configured the rules to something like this:

<operator name="Root" class="Process" expanded="yes">
    <parameter key="logfile" value="C:\Dokumente und Einstellungen\pjh\Eigene Dateien\rm_workspace\randomlogfile.log"/>
    <parameter key="resultfile" value="C:\Dokumente und Einstellungen\pjh\Eigene Dateien\rm_workspace\randomresultfile.res"/>
    <operator name="Crawler" class="Crawler">
        <list key="crawling_rules">
          <parameter key="follow_url" value="spreeblick"/>
          <parameter key="visit_content" value="google"/>
        </list>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="output_dir" value="C:\Dokumente und Einstellungen\pjh\Eigene Dateien\rm_workspace\nsv"/>
        <parameter key="url" value="http://www.spreeblick.com/"/>
    </operator>
</operator>

So the crawler should go to spreeblick.com, follow only URLs which include the letters "spreeblick",
and save only those pages which contain the letters "google" somewhere.
Now, the funny thing is that it does start crawling, but ONLY if "obey_robot_exclusion" is
active. If I deactivate it, I get a "Process failed, RuntimeException caught: JOptionPane: parentComponent does
not have a valid parent." error.
Just to make sure so far: what am I doing wrong to get this strange robot exclusion error?

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,702  RM Founder
    Hi,

    If I deactivate it, I get a "Process failed, RuntimeException caught: JOptionPane: parentComponent does not have a valid parent." error.
    There was a bug in the crawling operator, which we have just fixed. Normally, a dialog should appear asking the user whether crawling without obeying "robots.txt" should really be performed, since this might not be legal or appropriate in all cases.
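
    For readers unfamiliar with robot exclusion, the check itself can be sketched in Python with the standard library's urllib.robotparser. This is only an illustration and not the crawler operator's code; the robots.txt rules, the "ExampleCrawler" user agent, and the example.com URLs are all made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(url, user_agent="ExampleCrawler", robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

public_ok = allowed("http://www.example.com/blog/post-1")
private_ok = allowed("http://www.example.com/private/page")
print(public_ok, private_ok)
```

    A crawler that obeys robot exclusion would skip any URL for which such a check returns False.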

    You can get the fixed version via CVS (the bug was in the text plugin, formerly known as "wvtool", hence the module name) and the bugfix will of course also be part of the next release.

    Cheers,
    Ingo

  • 296M Member Posts: 3 Contributor I
    Compared to RapidMiner, HTTrack is a much more powerful and faster crawler.

    That's why even textinput provides a tutorial for using HTTrack. ;D
  • Legacy User Member Posts: 0 Newbie
    Hi 296M, Hi All,

    Do you know "Web-Harvest" ( http://web-harvest.sourceforge.net/ ; I do not remember whether I have already mentioned it)? It is a kind of high-level scripting language that looks like XML and is aimed at specifying which type of harvesting task you want to perform. Assuming that you could call a Web-Harvest script from RapidMiner, you could do exactly what you want...

    @Ingo & Steffen:
    Might a "scripting box" for Web-Harvest be an interesting feature request?

    Cheers,
      Jean-Charles.
  • rajbanokhan Member Posts: 28  Maven

    Hi,

    I have a question: how can I get data from the pages behind the hyperlinks on a web site? That is, how do we get the data from the hyperlinks that are used on every web page?

    From Raj

