Operator 'Get Pages' not running on AI Hub

methusimethusi Member Posts: 5 Learner I
edited December 2021 in Help
Hi

I have a process running on AI Hub that contains the operator 'Get Pages' (Web Mining extension).
When I run the process in RM Studio, everything is fine.
When I run the process on AI Hub but start it from RM Studio ('Run Process on AI Hub'), everything is fine as well.

But when I call the web service I created for it, the operator 'Get Pages' seems to cause trouble. My other web services run fine, and when I disable 'Get Pages' this web service runs as well. So I strongly believe it has something to do with how the process is executed on AI Hub.

This is the error message I get when running the web service:
de.rapidanalytics.ejb.service.ServiceDataSourceException 
Error executing process /home/bot/test_pages for service test_pages: 
com.rapidminer.operator.web.io.MultiThreadedCookieManager cannot be cast to 
com.rapidminer.operator.web.io.MultiThreadedCookieManager

The funny thing is that if I run the process directly from the repository on AI Hub, it runs successfully. But if I test the web service, it does not work.

This is the process I used for testing. When I disable the operator 'Get Pages' everything works fine.
<?xml version="1.0" encoding="UTF-8"?><process version="9.10.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.10.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.10.001" expanded="true" height="68" name="Retrieve step_3_urls_after_python_short" width="90" x="112" y="136">
        <parameter key="repository_entry" value="/home/user/some_table_with_urls"/>
      </operator>
      <operator activated="true" class="web:retrieve_webpages" compatibility="9.7.000" expanded="true" height="68" name="Get Pages" width="90" x="447" y="136">
        <parameter key="link_attribute" value="links"/>
        <parameter key="random_user_agent" value="true"/>
        <parameter key="connection_timeout" value="10000"/>
        <parameter key="read_timeout" value="10000"/>
        <parameter key="follow_redirects" value="true"/>
        <parameter key="accept_cookies" value="original server"/>
        <parameter key="cookie_scope" value="global"/>
        <parameter key="request_method" value="GET"/>
        <parameter key="delay" value="none"/>
        <parameter key="delay_amount" value="1000"/>
        <parameter key="min_delay_amount" value="0"/>
        <parameter key="max_delay_amount" value="500"/>
      </operator>
      <connect from_op="Retrieve step_3_urls_after_python_short" from_port="output" to_op="Get Pages" to_port="Example Set"/>
      <connect from_op="Get Pages" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>



I don't know how to proceed.

Thanks for all the help!

Best
Mathis

Best Answer

  • methusimethusi Member Posts: 5 Learner I
    Solution Accepted
    For anyone wondering: I fixed my problem by taking another route. Instead of calling a web service, I schedule the process with the schedule API:
    POST to server/executions/schedule with the corresponding headers and body

    In the body, I do not set an execution time and I set force=true, which starts the execution immediately.
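
A rough sketch of what such a call could look like in Python, for readers who want a starting point. Only the path `executions/schedule`, the missing execution time, and `force=true` come from the post above. The `job`, `location`, and `queueName` fields, the bearer-token header, and the host name are assumptions made for illustration, so check them against the API documentation of your AI Hub version. The function only builds the request and does not send it.

```python
import json
import urllib.request


def build_schedule_request(server_url, token, process_location, queue="DEFAULT"):
    """Build (but do not send) a POST request that schedules a process run."""
    # Assumed body layout: only `force` is named in the post above.
    # `job`, `location` and `queueName` are hypothetical field names.
    body = {
        "force": True,  # no execution time is set, so the run should start right away
        "job": {"location": process_location, "queueName": queue},
    }
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/executions/schedule",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed authentication scheme; use whatever your server expects.
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder host and token; the process path is the one from the error message.
    req = build_schedule_request(
        "https://aihub.example.com", "<token>", "/home/bot/test_pages"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```

Keeping the request construction separate from sending it makes it easy to inspect the URL and body before anything hits the server.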

Answers

  • JEdwardJEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    I suspect the issue might be that the extension containing "Get Pages" is installed on the AI Hub JobAgent, but not on the Server itself.

    If I recall the architecture diagram correctly, when you schedule a job or run it on the Server from Studio, it executes on a JobAgent.
    However, if it is run as a web service, it doesn't run on a JobAgent, but on the Server itself.

    Check

    [docker volumes path]/prod_rm-server-home-vol/_data/resources/extensions and see if you can spot it in there.  You can compare it against

    [docker volumes path]/prod_rm-server-ja-extensions and see if they match. 
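
If the two folders contain many files, the comparison can be scripted. This is a minimal sketch that assumes both extension folders are readable from the machine where you run it. The function names are made up for this example, and the two folder paths are passed on the command line because the docker volumes path differs per installation. It only compares file names, so two different builds saved under the same name would not be detected.

```python
import sys
from pathlib import Path


def list_files(directory):
    """Return the names of all regular files directly inside `directory`."""
    return {p.name for p in Path(directory).iterdir() if p.is_file()}


def compare_extension_dirs(server_dir, job_agent_dir):
    """Return (files only on the Server, files only on the JobAgent), both sorted."""
    server = list_files(server_dir)
    agent = list_files(job_agent_dir)
    return sorted(server - agent), sorted(agent - server)


def main(argv):
    if len(argv) != 3:
        print("usage: compare_extensions.py <server extensions dir> <job agent extensions dir>")
        return
    only_server, only_agent = compare_extension_dirs(argv[1], argv[2])
    print("Only on the Server:  ", only_server)
    print("Only on the JobAgent:", only_agent)


if __name__ == "__main__":
    main(sys.argv)
```

If both lists come back empty, the same file names are present on both sides.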

