"Process Documents from Web"

BuncBunc Member Posts: 3 Contributor I
edited May 2019 in Help
I am a new user, so please excuse me if this is a simple problem!
I am having a problem with the "Process Documents from Web" operator.

As I understand it, this operator is basically the same as the Crawl Web operator, but it allows an inner process to handle the crawled content before it is stored.

I have the following very simple start for my process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Process">
   <process expanded="true" height="673" width="1094">
     <operator activated="true" class="web:process_web" compatibility="5.0.3" expanded="true" height="60" name="Process Documents from Web" width="90" x="179" y="120">
       <parameter key="url" value="http://www.aol.com"/>
       <list key="crawling_rules">
         <parameter key="follow_link_with_matching_text" value=".*"/>
         <parameter key="store_with_matching_content" value=".*"/>
       </list>
       <parameter key="add_pages_as_attribute" value="true"/>
       <parameter key="max_pages" value="10"/>
       <process expanded="true" height="385" width="432">
         <connect from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <connect from_op="Process Documents from Web" from_port="example set" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>
When I run this I get the error "Process failed for input string follow link with matching text." As a result I can't get past this step to build a fuller process.

I can't see what I have specified wrong. I'm sure it's something simple. I don't have any inner operators specified at the moment. I tried it with inner operators, but that made no difference, and the problem is clearly at the start of the process, with the crawl and store rules.

Answers

  • BuncBunc Member Posts: 3 Contributor I
    Update:
    This may be related to the problem above.

    I was using the Crawl Web operator and encountered the same problem after editing the crawling rules. The problem remained when I re-edited them and inserted the previously working rule.
    When I examined the XML, I noted that in the version that reported an error the parameters are coded as follows:

    <list key="crawling_rules">
             <parameter key="follow_link_with_matching_url" value=".*(blogspot.com\/2010)"/>
             <parameter key="1" value=".+"/>
           </list>
    In the working version, however, the parameters were coded as follows:

    <list key="crawling_rules">
             <parameter key="2" value=".*(blogspot.com\/2010)"/>
             <parameter key="1" value=".+"/>
           </list>
    Notice that the parameter key in the non-working version is given as "follow_link_with_matching_url",
    whereas in the second, working version the parameter key is "2".

    I did not change these keys myself, so I assume the change was introduced when the program handled my edit of the rule. Is this a bug in the program?

    I notice that the parameter keys in my first post (about the Process Documents from Web operator) are also given as text strings rather than numbers. I now wonder whether manually changing these to numerical key values would fix the problem. I must try this.
  • BuncBunc Member Posts: 3 Contributor I
    Further Update:
    The same behaviour is apparent when using the Process Documents from Web operator.
    For information: this seems to be triggered by changing the selected "rule application selection" in the crawling rules pop-up window.
    Changing the crawling rule inserts text as the key for the parameter, and this then throws an error. If the text is replaced by a number, as per my previous post, the operator works correctly.
    I will report this as a bug.
  • landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    thank you for this bug report. This misbehavior was caused by our fix for another bug: some programmer had adapted to the previously wrong behavior instead of fixing it...

    This will be solved with the next version of the Web Mining Extension. Until then you could downgrade your RapidMiner to .008 or lower, where the problem should not occur.

    Greetings,
      Sebastian
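
For anyone who wants to check an exported process for this problem before running it, here is a minimal sketch in Python that lists every crawling rule whose key is text rather than a number. It relies only on the XML layout shown in the posts above (a `list` element with `key="crawling_rules"` directly inside an `operator`). The function name `find_text_rule_keys` and the `SAMPLE` process are made up for illustration and are not part of RapidMiner.

```python
import xml.etree.ElementTree as ET


def find_text_rule_keys(process_xml):
    """Return (operator name, key, value) for each crawling rule with a non-numeric key.

    Assumes the layout seen in this thread: <list key="crawling_rules">
    sits directly inside an <operator> element.
    """
    root = ET.fromstring(process_xml)
    hits = []
    for op in root.iter("operator"):
        # Only look at lists that are direct children of this operator,
        # so nested operators are not counted twice.
        for rule_list in op.findall("list[@key='crawling_rules']"):
            for param in rule_list.findall("parameter"):
                key = param.get("key", "")
                if not key.isdigit():
                    hits.append((op.get("name"), key, param.get("value")))
    return hits


# Hypothetical, cut-down process: one text key (reported) and one numeric key (ignored).
SAMPLE = """
<process version="5.0">
  <operator class="web:process_web" name="Process Documents from Web">
    <list key="crawling_rules">
      <parameter key="follow_link_with_matching_text" value=".*"/>
      <parameter key="1" value=".+"/>
    </list>
  </operator>
</process>
"""

if __name__ == "__main__":
    for op_name, key, value in find_text_rule_keys(SAMPLE):
        print(f"{op_name}: rule key {key!r} (value {value!r}) is not numeric")
```

The sketch only reports the offending keys and does not rewrite them. The thread shows a single working process in which the "follow link with matching URL" rule was stored under key "2"; the numbers for the other rule types are not given here, so the replacement number still has to be taken from a process that is known to work.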