Load intensive processes and operators - RM Server autoscaling testing

Nikouy Member Posts: 22 Contributor II
edited March 2020 in Help

I have set up a Kubernetes cluster for RM-Server using EKS, and I need to run a series of tests for horizontal, vertical and cluster scaling. To do that I need to generate a lot of load, and I would like to use some real-world processes to generate it.

- What kind of processes/operators would exhaust the memory?
- What kind of processes/operators are heavier on the CPU?
- Is there any process publicly available that I can use, either for prediction, classification or something else?

I do not really care about what I am processing, as long as I can exhaust memory and/or CPU while using a real data set.




    hbajpai Member Posts: 102 Unicorn
    Hey @Nikouy ,

    I feel loops are one of the easiest ways to exhaust memory in RM, especially if we deactivate parallel execution.

    Try the process below. Also, please share the results; I am interested in understanding the autoscaling aspect too.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
      <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <!-- Generate 1,000,000 random examples with 50 attributes -->
          <operator activated="true" class="generate_data" compatibility="9.6.000" expanded="true" height="68" name="Generate Data" width="90" x="112" y="34">
            <parameter key="target_function" value="random"/>
            <parameter key="number_examples" value="1000000"/>
            <parameter key="number_of_attributes" value="50"/>
            <parameter key="attributes_lower_bound" value="-10.0"/>
            <parameter key="attributes_upper_bound" value="10.0"/>
            <parameter key="gaussian_standard_deviation" value="10.0"/>
            <parameter key="largest_radius" value="10.0"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <!-- Store the number of examples in the macro total_i -->
          <operator activated="true" class="extract_macro" compatibility="9.6.000" expanded="true" height="68" name="Extract Macro" width="90" x="380" y="34">
            <parameter key="macro" value="total_i"/>
            <parameter key="macro_type" value="number_of_examples"/>
            <parameter key="statistics" value="average"/>
            <parameter key="attribute_name" value=""/>
            <list key="additional_macros"/>
          </operator>
          <!-- One sequential iteration per example (parallel execution disabled) -->
          <operator activated="true" class="concurrency:loop" compatibility="9.6.000" expanded="true" height="82" name="Loop" width="90" x="581" y="34">
            <parameter key="number_of_iterations" value="%{total_i}"/>
            <parameter key="iteration_macro" value="i"/>
            <parameter key="reuse_results" value="false"/>
            <parameter key="enable_parallel_execution" value="false"/>
            <process expanded="true">
              <operator activated="true" class="filter_example_range" compatibility="9.6.000" expanded="true" height="82" name="Filter Example Range" width="90" x="380" y="34">
                <parameter key="first_example" value="%{i}"/>
                <parameter key="last_example" value="%{i}"/>
                <parameter key="invert_filter" value="false"/>
              </operator>
              <operator activated="true" class="generate_attributes" compatibility="9.6.000" expanded="true" height="82" name="Generate Attributes" width="90" x="581" y="34">
                <list key="function_descriptions">
                  <parameter key="junk" value="att1+att10+att11"/>
                </list>
                <parameter key="keep_all" value="true"/>
              </operator>
              <connect from_port="input 1" to_op="Filter Example Range" to_port="example set input"/>
              <connect from_op="Filter Example Range" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_port="output 1"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Extract Macro" to_port="example set"/>
          <connect from_op="Extract Macro" from_port="example set" to_op="Loop" to_port="input 1"/>
          <connect from_op="Loop" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="147"/>
        </process>
      </operator>
    </process>

    Nikouy Member Posts: 22 Contributor II
    Hey @hbajpai.

    Thanks for your input. I tried this process on my laptop and it almost fried it! I'll give it a try on my cluster and share my findings here. I'm still trying to figure out how to fix the Kubernetes DNS so that the load balancer redirects requests to multiple servers (for high availability).

    Community, is there any way I can do something similar using supervised algorithms? I am thinking of using a large data set from UCI.


    MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist
    Hi @Nikouy,
    can you maybe explain why you are doing this? We run some tests like this internally, of course, but what are you trying to get out of it?

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
    Nikouy Member Posts: 22 Contributor II
    edited April 2020

    I am currently undertaking a research project as part of my MSc dissertation. RapidMiner is the focus of my project, which answers a call from the scientific and big data community to “develop scalable higher-level models” (Elshawi et al., 2018) and thus help those who need to automate the flexible scaling of infrastructure (Zhao et al., 2015).
    Therefore, I am exploring how to deploy an auto-scalable RapidMiner fleet in the cloud using Kubernetes, and to provide a reference architecture.
    Obviously, I will need to test the system after its implementation to demonstrate high availability and scalability, and I would like to do so using real data sets and various algorithms in order to understand how it behaves under different circumstances or test cases.


    Nikouy Member Posts: 22 Contributor II

    Thank you for taking the time to write such a detailed reply and for highlighting the differences between HPC and Blackboard Systems. Parallel processing is something I consider key, which is why I was asking which algorithms (either supervised or unsupervised) make good use of it, so that I can focus on one or two processes at most.

    I didn't quite get point number 3 that you made, so I'd appreciate it if you could expand on it. What would I be achieving with this?

    Thanks again,
    rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn


    I understand that you are launching more agents with Kubernetes on demand depending on the process, am I right?

    When you use a local process that requires parallel work, RapidMiner launches these parallel processes on the same machine. Which operators can do that?

    - Loop with “enable parallel execution” switched on.
    - Cross Validation.
    - Feature selection.

    When you run such a process on RapidMiner Server, the behaviour is the same: the parallel work still runs on a single machine.
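    To make the first item concrete, here is a minimal sketch of a Loop operator with parallel execution switched on. It reuses the Loop from the process posted earlier in this thread and only flips `enable_parallel_execution` to `true`; the fixed iteration count of 1000 is an assumption for illustration (the original process read it from the `total_i` macro), and the fragment is meant to be pasted into a surrounding process rather than run on its own.

```xml
<!-- Sketch only: the Loop from the earlier process, with parallel execution enabled. -->
<operator activated="true" class="concurrency:loop" compatibility="9.6.000" expanded="true" height="82" name="Loop" width="90" x="581" y="34">
  <!-- Assumption: a fixed iteration count instead of the %{total_i} macro -->
  <parameter key="number_of_iterations" value="1000"/>
  <parameter key="iteration_macro" value="i"/>
  <parameter key="reuse_results" value="false"/>
  <!-- The only functional change: iterations may run concurrently, on the same machine -->
  <parameter key="enable_parallel_execution" value="true"/>
  <process expanded="true">
    <operator activated="true" class="filter_example_range" compatibility="9.6.000" expanded="true" height="82" name="Filter Example Range" width="90" x="380" y="34">
      <parameter key="first_example" value="%{i}"/>
      <parameter key="last_example" value="%{i}"/>
      <parameter key="invert_filter" value="false"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="9.6.000" expanded="true" height="82" name="Generate Attributes" width="90" x="581" y="34">
      <list key="function_descriptions">
        <parameter key="junk" value="att1+att10+att11"/>
      </list>
      <parameter key="keep_all" value="true"/>
    </operator>
    <connect from_port="input 1" to_op="Filter Example Range" to_port="example set input"/>
    <connect from_op="Filter Example Range" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_port="output 1"/>
  </process>
</operator>
```

    With this setting the extra work should show up as higher CPU use on one node, which makes it a candidate for exercising vertical rather than horizontal scaling.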

    But if you are talking about horizontal scaling (adding more machines), your processes need to be ready to send data to other RapidMiner agents, and that is done by creating a process that can be scheduled through the server. For horizontal scaling, you should invoke “Schedule Process” in a loop; note that Cross Validation and Feature Selection cannot be parallelized across multiple servers this way.

    That is why, in my humble opinion, you might want to focus on scoring with a previously trained model: it will make it easier for you to research horizontal and vertical scaling. If you want to discuss this in private, drop me a line.

    All the best,

    Nikouy Member Posts: 22 Contributor II
    Thanks Rodrigo, totally makes sense :). I'll probably be reaching out.

    @hbajpai, I tried executing the process you suggested on the server, but for some reason it looks like Studio ends up picking it up. Please see the screenshot below from my laptop. I did not see any load increase at all on the server.

