
Operator 'Get Pages' not running on AI Hub

methusimethusi Member Posts: 5 Learner I
edited December 2021 in Help
Hi

I have a process running on AI Hub that contains the operator 'Get Pages' (Web Mining extension).
When I run the process in RM Studio, everything is fine.
When I run the process on AI Hub but start it from RM Studio ('Run Process on AI Hub'), everything is fine.

But when I call the web service I created from it, the operator 'Get Pages' causes trouble. Other web services run without problems, and when I disable 'Get Pages' this web service runs as well. So I strongly believe it has something to do with how the process is executed on AI Hub.

This is the error message I get when running the web service:
de.rapidanalytics.ejb.service.ServiceDataSourceException 
Error executing process /home/bot/test_pages for service test_pages: 
com.rapidminer.operator.web.io.MultiThreadedCookieManager cannot be cast to 
com.rapidminer.operator.web.io.MultiThreadedCookieManager

The funny thing is that if I run the process directly from the repository on AI Hub, it runs successfully. But if I test the web service, it does not work.

This is the process I used for testing. When I disable the operator 'Get Pages' everything works fine.
<?xml version="1.0" encoding="UTF-8"?><process version="9.10.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.10.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.10.001" expanded="true" height="68" name="Retrieve step_3_urls_after_python_short" width="90" x="112" y="136">
        <parameter key="repository_entry" value="/home/user/some_table_with_urls"/>
      </operator>
      <operator activated="true" class="web:retrieve_webpages" compatibility="9.7.000" expanded="true" height="68" name="Get Pages" width="90" x="447" y="136">
        <parameter key="link_attribute" value="links"/>
        <parameter key="random_user_agent" value="true"/>
        <parameter key="connection_timeout" value="10000"/>
        <parameter key="read_timeout" value="10000"/>
        <parameter key="follow_redirects" value="true"/>
        <parameter key="accept_cookies" value="original server"/>
        <parameter key="cookie_scope" value="global"/>
        <parameter key="request_method" value="GET"/>
        <parameter key="delay" value="none"/>
        <parameter key="delay_amount" value="1000"/>
        <parameter key="min_delay_amount" value="0"/>
        <parameter key="max_delay_amount" value="500"/>
      </operator>
      <connect from_op="Retrieve step_3_urls_after_python_short" from_port="output" to_op="Get Pages" to_port="Example Set"/>
      <connect from_op="Get Pages" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>



I don't know how to proceed.

Thanks for all the help!

Best
Mathis

Best Answer

  • methusimethusi Member Posts: 5 Learner I
    Solution Accepted
    For anyone wondering: I worked around my problem by taking another route. Instead of calling a web service, I schedule the process with the schedule API:
    POST to server/executions/schedule with the corresponding headers and body

    In the body, I do not set an execution time and I set force=true. This starts the execution immediately.
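    The workaround above could be sketched roughly as follows in Python. This is only an illustration, not the exact request from the post: the endpoint path `executions/schedule` and `force=true` come from the post, while the base URL, the bearer-token header, and the body key `process` are hypothetical placeholders that should be checked against the API documentation of your AI Hub version. The function only builds the request, so nothing is sent until you uncomment the last lines.

```python
import json
import urllib.request


def build_schedule_request(base_url, token, process_location):
    """Build (but do not send) a POST request to the schedule endpoint."""
    # No execution time is set; together with force=true this is what the
    # post describes as starting the execution immediately.
    body = {
        "process": process_location,  # ASSUMPTION: hypothetical key name
        "force": True,                # from the post: force=true
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/executions/schedule",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # ASSUMPTION: bearer-token auth; use whatever headers your server expects
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )


if __name__ == "__main__":
    # All three values below are placeholders.
    request = build_schedule_request(
        "https://aihub.example.com/server", "MY_TOKEN", "/home/bot/test_pages"
    )
    print(request.get_method(), request.full_url)
    # To actually send it:
    # with urllib.request.urlopen(request) as response:
    #     print(response.status, response.read().decode("utf-8"))
```

    Keeping request construction separate from sending makes it easy to inspect the URL, headers, and body before anything reaches the server.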

Answers

  • JEdwardJEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    I suspect the issue might be that the extension containing "Get Pages" is installed on the AI Hub JobAgent, but not on the Server itself.

    If I recall the architecture diagram correctly, when you schedule a job or run it on the Server from Studio, it executes on a JobAgent.
    However, if it is run as a web service, it doesn't run on a JobAgent but on the Server itself.

    Check

    [docker volumes path]/prod_rm-server-home-vol/_data/resources/extensions and see if you can spot it in there.  You can compare it against

    [docker volumes path]/prod_rm-server-ja-extensions and see if they match. 
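    If comparing the two folders by eye is tedious, a small script can list the differences. The sketch below assumes that extensions are stored as `.jar` files directly inside each folder (an assumption, not something confirmed above); the function names are made up for illustration, and the two folder paths are passed on the command line rather than hard-coded.

```python
import sys
from pathlib import Path


def list_extension_jars(directory):
    """Return the set of .jar file names found directly in `directory`."""
    return {path.name for path in Path(directory).glob("*.jar")}


def compare_extension_dirs(server_dir, job_agent_dir):
    """Return (only_on_server, only_on_job_agent) as sorted lists of file names."""
    server = list_extension_jars(server_dir)
    agent = list_extension_jars(job_agent_dir)
    return sorted(server - agent), sorted(agent - server)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: compare_extensions.py SERVER_EXT_DIR JOBAGENT_EXT_DIR")
    else:
        only_server, only_agent = compare_extension_dirs(sys.argv[1], sys.argv[2])
        print("Only on Server:  ", only_server)
        print("Only on JobAgent:", only_agent)
```

    Because file names usually carry the version number, a jar that shows up in both lists under slightly different names would point to a version mismatch rather than a missing extension.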

