Load intensive processes and operators - RM Server autoscaling testing
Hello,
I have set up a Kubernetes cluster for RM-Server using EKS and I need to run a series of tests for Horizontal, Vertical and Cluster scaling. I need to generate a lot of load, and I would like to use some real-world processes to do it.
- What kind of processes/operators would exhaust the memory?
- What kind of processes/operators are heavier on the CPU?
- Is there any process publicly available that I can use, either for prediction, classification or something else?
I do not really care about what I am processing, as long as I can exhaust memory and/or CPU while using a real data set.
Thanks,
Nicolas
Answers
I feel loops are one of the easiest ways to exhaust memory in RM, especially if we deactivate parallel execution.
Try the process below. Also, please share the results; I am interested in understanding the autoscaling aspect too.
Harshit
Can you explain why you are doing this? We run tests like this internally, of course. But what are you trying to get out of it?
Best,
Martin
Dortmund, Germany
Here is something to keep in mind for your research:
(Apologies in advance for the length of this)
RapidMiner is not an HPC (High Performance Computing) system but rather a Blackboard System. The difference between the two is fundamental when choosing the kind of processes you will need to include. Let’s review the differences between the two:
An HPC system is a distributed system that works at the operating system level or the root-enabled service level, depending on the implementation. HPC systems assign processor resources and memory when the service is created. If you have a multiprocessing-based implementation of a chess game engine, it will distribute processes until the resources are exhausted. In that case, vertical scaling (e.g. adding RAM and processors) is difficult, because it typically requires not only physically installing hardware but also some server reconfiguration (on Linux, typically by modifying sysctl values), so it is better to add more nodes (horizontal scaling). You will probably not find data science suites that use this kind of system, because most HPC code is written to make good use of every single processor cycle (those cycles are needed to calculate things as complex as Navier-Stokes equation systems).
A Blackboard system, pattern, or architecture works at the user level or the user-enabled service level. Blackboard systems don’t normally control processor resources or memory; instead they have predefined agents (either thread pools inside the same software or external resources). Blackboard systems distribute processes through a server that maintains a queue for each agent (normally a database that is checked constantly, or a queuing system like Redis). In that case, vertical scaling (e.g. adding RAM and processors) is a matter of changing a few variables in the agent and restarting it. In the case of RapidMiner, the RapidMiner Server controls the job agents and real-time scoring agents through its database. Almost all data science suites use variants of this architecture (there are plenty) because they are easier to maintain (you don’t need a data scientist who is also a super expert senior black belt ninja sensei in parallel processing, which is a dark black unicorn among the unicorns), and since there is a single non-volatile storage available, it is easier to work with large data.
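To make the pattern concrete, here is a minimal Python sketch of a server posting jobs to per-agent queues that the agents poll. It is only an illustration of the general idea, not RapidMiner's implementation: the function name `run_blackboard`, the round-robin assignment, and the in-memory queues (standing in for the database or Redis) are all assumptions.

```python
import queue
import threading


def run_blackboard(jobs, agent_count):
    """Post jobs to per-agent queues and let each agent work through its own queue."""
    # One in-memory queue per agent (assumption: stands in for a database or Redis).
    agent_queues = [queue.Queue() for _ in range(agent_count)]
    results = {}
    lock = threading.Lock()

    def agent(index):
        inbox = agent_queues[index]
        while True:
            job = inbox.get()           # blocks until the server posts something
            if job is None:             # sentinel: the server has no more work
                return
            job_id, payload = job
            outcome = sum(payload)      # placeholder for the real process
            with lock:
                results[job_id] = (f"agent-{index}", outcome)

    workers = [threading.Thread(target=agent, args=(i,)) for i in range(agent_count)]
    for worker in workers:
        worker.start()

    # Server role: assign jobs round-robin (hypothetical policy), then signal the end.
    for job_id, payload in enumerate(jobs):
        agent_queues[job_id % agent_count].put((job_id, payload))
    for inbox in agent_queues:
        inbox.put(None)

    for worker in workers:
        worker.join()
    return results
```

Calling `run_blackboard([[1, 2, 3], [4, 5], [6]], agent_count=2)` sends jobs 0 and 2 to the first agent and job 1 to the second. Note that the server never manages CPU or memory here; it only decides which queue receives each job.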
Now… what does this mean for you?
Deactivating parallel processing on a single computer only means that all the processing is done in a second thread inside RapidMiner Studio (so the GUI stays responsive). Since the process is large, it will probably cause a lot of internal resource blocking, and that is what fries your computer. You should parallelize when you can, for your own sanity. Depending on the version of Studio you have, check how many threads can be opened (each thread gets its own core on your processor). So on my AMD ThreadRipper with 48 cores I can run 46 calculation threads plus one for the program and one for the operating system, and on my i9 with 16 cores I can only run 14 calculation threads plus the two aforementioned ones.
Activating parallel processing on RapidMiner Server won’t assign more resources automatically. Instead, look for tasks that you can deliver to your servers through the operator made for that purpose: Schedule Process. Vertical scaling means you launch more job agents on a given machine, or configure the existing job agents with more processors and RAM, and that’s it; horizontal scaling only means you launch more job agents on different machines.
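The thread arithmetic above (total cores minus one for the program and one for the operating system) can be written down as a small Python helper. The function name and the `RESERVED_THREADS` constant are mine, and the actual limit may also depend on your Studio version, so treat this as a rule-of-thumb sketch only.

```python
import os

# Assumption taken from the paragraph above: one thread is reserved for the
# program and one for the operating system.
RESERVED_THREADS = 2


def calculation_threads(total_cores, reserved=RESERVED_THREADS):
    """Return how many threads are left for calculations, never less than one."""
    return max(1, total_cores - reserved)


print(calculation_threads(48))  # 46
print(calculation_threads(16))  # 14

# os.cpu_count() reports logical CPUs, which can be higher than physical cores.
local_cpus = os.cpu_count() or 1
print(calculation_threads(local_cpus))
```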
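As a rough way to reason about both kinds of scaling before testing them, the following Python sketch estimates how long a batch of equal-length jobs takes for a given number of machines and job agents per machine. It is an idealized model under my own assumptions (every job takes the same time, each agent runs one job at a time, no scheduling or network overhead), and the function name is hypothetical.

```python
import math


def estimated_makespan(n_jobs, seconds_per_job, machines, agents_per_machine):
    """Idealized total run time: jobs are processed in waves of size `capacity`."""
    capacity = machines * agents_per_machine   # jobs that can run at once
    waves = math.ceil(n_jobs / capacity)
    return waves * seconds_per_job


# Baseline: 2 machines with 2 job agents each.
print(estimated_makespan(1000, 2, machines=2, agents_per_machine=2))  # 500
# Vertical scaling: same machines, 4 job agents each.
print(estimated_makespan(1000, 2, machines=2, agents_per_machine=4))  # 250
# Horizontal scaling: 4 machines, 2 job agents each.
print(estimated_makespan(1000, 2, machines=4, agents_per_machine=2))  # 250
```

In this model both options give the same result, so any difference you measure on the real cluster comes from what the model leaves out, such as memory limits per machine or the time needed to start new nodes.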
With that said, I would recommend that you:
1. Take time to train and test a model using RapidMiner Studio. It will be painful if you don’t do it well, but since what you want to test is scaling, it wouldn’t be a problem to use, say, a downsampled dataset.
2. Store the model on RapidMiner Server.
3. Create a process that loops over your data and runs one “Schedule Process” operator per record.
4. See how each node is working. Make sure you measure things using SNMP if you can, because that will give you a broader picture of consumption.
I would recommend this dataset for that:
https://plg.uwaterloo.ca/~gvcormac/treccorpus07/about.html
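The shape of step 3 (one scheduled job per record) can be sketched in Python as follows. This is not RapidMiner code: `schedule_process` is a hypothetical stand-in for the “Schedule Process” operator, the thread pool stands in for the job agents, and the scoring logic is placeholder work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def schedule_process(record):
    """Hypothetical stand-in for one scheduled scoring job on a single record."""
    start = time.perf_counter()
    # Placeholder work: a real job would apply the stored model to the record.
    score = sum(len(str(value)) for value in record.values())
    elapsed = time.perf_counter() - start
    return {"record_id": record["id"], "score": score, "seconds": elapsed}


def fan_out(records, agent_count):
    """Submit one job per record to a pool that stands in for the job agents."""
    with ThreadPoolExecutor(max_workers=agent_count) as pool:
        # pool.map returns results in the same order as the input records.
        return list(pool.map(schedule_process, records))


records = [{"id": 1, "text": "hello"}, {"id": 2, "text": "spam spam"}]
for result in fan_out(records, agent_count=2):
    print(result["record_id"], result["score"])
```

The per-job `seconds` value only covers the job itself; node-level consumption still needs external monitoring, as step 4 says.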
That’s it, my two cents.
Hope this helps,
Rod.
Sure!
I understand that you are launching more agents with Kubernetes on demand depending on the process, am I right?
When you run a local process that requires parallel work, RapidMiner launches the parallel processes on the same machine. Which operators can do that?
· Looping with “use parallel execution”.
· Cross validation.
· Feature selection.
When you do this on RapidMiner Server, the behavior is the same: the parallel processes run on the same machine, and the same operators apply.
But if you are talking about horizontal scaling (adding more machines), your processes need to be ready to send data to other RapidMiner agents, and that is done by creating a process that can be scheduled through the server. For horizontal scaling, you should invoke “Schedule Process” in a loop; Cross Validation and Feature Selection cannot be parallelized across multiple servers.
That is basically why, in my humble opinion, you might want to focus on scoring with a previously trained model: it will make horizontal and vertical scaling easier to research. If you want to discuss this in private, drop me a line.
All the best,
Rod