"First Steps in Webmining"

JibJabJab Member Posts: 3 Contributor I
edited May 2019 in Help
So I decided to dig a bit deeper into RapidMiner and set myself a first challenge:
I want the crawler to fetch every post of a blog that mentions a certain word.

First I started with the wizard, but it seems to expect an existing database or file to work with.
So instead I took the bare Root process, added the Crawler operator,
and configured the rules roughly like this:

<operator name="Root" class="Process" expanded="yes">
    <parameter key="logfile" value="C:\Dokumente und Einstellungen\pjh\Eigene Dateien\rm_workspace\randomlogfile.log"/>
    <parameter key="resultfile" value="C:\Dokumente und Einstellungen\pjh\Eigene Dateien\rm_workspace\randomresultfile.res"/>
    <operator name="Crawler" class="Crawler">
        <list key="crawling_rules">
          <parameter key="follow_url" value="spreeblick"/>
          <parameter key="visit_content" value="google"/>
        </list>
        <parameter key="obey_robot_exclusion" value="false"/>
        <parameter key="output_dir" value="C:\Dokumente und Einstellungen\pjh\Eigene Dateien\rm_workspace\nsv"/>
        <parameter key="url" value="http://www.spreeblick.com/"/>
    </operator>
</operator>

So the crawler should go to spreeblick.com, follow only URLs that contain the string "spreeblick",
and save only those pages that contain the string "google" somewhere.
The funny thing is that it does start crawling, but ONLY if "obey_robot_exclusion" is
active. If I deactivate it, I get the error "Process failed, RuntimeException caught: JOptionPane: parentComponent does
not have a valid parent."
Just to make sure so far: what am I doing wrong to get this strange robot exclusion error?
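
To make the intended behaviour concrete, here is a minimal Python sketch of the two rules. It is not how the RapidMiner Crawler is implemented. The function name `crawl`, the in-memory `pages` dictionary that stands in for the web, and the example.org URL are all assumptions for illustration, and both patterns are treated as plain, case-sensitive substrings.

```python
from collections import deque


def crawl(start_url, pages, follow_pattern, content_pattern):
    """Breadth-first crawl over an in-memory 'web'.

    pages maps url -> (page_text, [outgoing_links]); it is a hypothetical
    stand-in for real HTTP requests so the sketch runs offline.
    Returns the URLs whose text contains content_pattern.
    """
    seen = {start_url}
    queue = deque([start_url])
    stored = []
    while queue:
        url = queue.popleft()
        text, links = pages.get(url, ("", []))
        # "visit_content" rule: keep the page only if the word occurs in it.
        if content_pattern in text:
            stored.append(url)
        # "follow_url" rule: follow a link only if its URL contains the pattern.
        for link in links:
            if follow_pattern in link and link not in seen:
                seen.add(link)
                queue.append(link)
    return stored


# Hypothetical sample data, not taken from the real site.
SAMPLE_PAGES = {
    "http://www.spreeblick.com/": (
        "welcome",
        ["http://www.spreeblick.com/a", "http://example.org/x"],
    ),
    "http://www.spreeblick.com/a": (
        "a post about google",
        ["http://www.spreeblick.com/b"],
    ),
    "http://www.spreeblick.com/b": ("nothing here", []),
    "http://example.org/x": ("google everywhere", []),
}
```

With the sample data, calling `crawl("http://www.spreeblick.com/", SAMPLE_PAGES, "spreeblick", "google")` never visits the example.org page, because its URL fails the follow rule, even though its text mentions the word.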


    IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

    JibJabJab wrote: If I deactivate it, I get an "Process failed, RuntimeException caught: JOptionPane: parentComponent does not have a valid parent." error.

    There was a bug in the crawling operator, which we have just fixed. Normally, a dialog should ask the user whether crawling without obeying "robots.txt" should really be performed, since this might not be legal or appropriate in all cases.

    You can get the fixed version via CVS (the bug was in the text plugin, formerly known as "wvtool", hence the module name) and the bugfix will of course also be part of the next release.
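
    For readers unfamiliar with what obeying "robots.txt" means in practice, the following is a minimal Python sketch using the standard library's `urllib.robotparser`. It is unrelated to the RapidMiner text plugin's own code. The helper name `allowed_by_robots`, the user agent string and the sample rules are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url.

    The robots.txt content is passed in as a string so the sketch runs
    without network access; a real crawler would download it from the site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, not the real robots.txt of any site.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
```

    A crawler that obeys robot exclusion would call such a check before each request and skip every URL for which it returns False.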

    296M Member Posts: 3 Contributor I
    Compared to RapidMiner, HTTrack is a much more powerful and faster crawler.

    That's why even textinput provides a tutorial for using HTTrack.  ;D

    Legacy User Member Posts: 0 Newbie
    Hi 296M, Hi All,

    Do you know Web-Harvest ( http://web-harvest.sourceforge.net/ ; I do not remember whether I have already mentioned it)? It is a kind of high-level scripting language that looks like XML and is aimed at specifying the type of harvesting task you want to perform. Assuming that you could call a Web-Harvest script from RapidMiner, you could do exactly what you want...

    @Ingo & Steffen:
    Might a "scripting box" for Web-Harvest be an interesting feature request?

    rajbanokhan Member Posts: 29 Maven


    I have a question: how can I get data from the pages behind the hyperlinks on a web site? That is, given a web page, how do we get data from the hyperlinks that appear on it?

    From raj
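
    One generic way to reach the pages behind hyperlinks is to collect every link target on a page first and then fetch each of them in turn. The sketch below shows only the link-collection step, in plain Python with the standard library. It is not a RapidMiner operator. The names `LinkCollector` and `extract_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href targets of all <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/post/1" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all hyperlinks found in the HTML text."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

    Each returned URL can then be downloaded and processed like the starting page, which is essentially what a crawler does.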
