
Which operator allows me to use all four logical processors for parallel computing?

jsramirezgo Member Posts: 6 Learner I
I got a RapidMiner Enterprise Medium license (which lets me use four logical processors), but I don't know how to use it. Specifically, I don't know which operator in RapidMiner enables parallel computing on more than one logical processor.

I would really appreciate the help, because this is part of an academic research project and I urgently need it to continue my experiment.

Thanks!

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    hi @jsramirezgo - most of the main modeling operators (and many of the ETL ones) are already optimized for parallelization and there is literally nothing you need to do. When you run processes (or Auto Model), you should see your processors all kick in. If you do NOT see them kick in, please let us know.

    Scott

  • jsramirezgo Member Posts: 6 Learner I
    Hi sgenzer, thanks a lot for your reply.

    I have two questions about it:

    1. How can I see whether all the processors kick in?

    2. Can the "Execute Process" operator, which runs multiple processes, be considered a parallel computing operator?

    Thanks!
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    hi @jsramirezgo -

    1. How you check CPU usage depends on your computer. On my Mac I look at Activity Monitor. I think there are more technical ways to do this, but I will let others who know more chime in. Maybe @Telcontar120? @rfuentealba? @lionelderkrikor?

    2. The "Execute Process" operator will simply execute a process somewhere else in RapidMiner. If that process is optimized for parallelization, it will run parallelized.

    Hope that helps?

    Scott



  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @jsramirezgo,

    In answer to Scott's question: I'm on Windows, and I use a Rainmeter skin on my desktop that displays CPU usage (% Core 1, % Core 2, % Core 3, etc.) and RAM usage in %.

    Regards,

    Lionel
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    You can also use the Log operator to capture CPU execution time and memory utilization.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    edited April 2019

    🤦‍♂️ #occamsrazor thx @Telcontar120 @lionelderkrikor

  • jsramirezgo Member Posts: 6 Learner I
    Hello guys.

    Thanks a lot for your replies. I figured out how to check the processors, thank you!

    Regarding Scott's answer, I have a new question that I think would finally resolve this:

    How can I know if a process is optimized for parallelization?

    thanks!
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    oh that's easier - just change this parameter (in preferences) and look at the execution time :smile:


  • jsramirezgo Member Posts: 6 Learner I
    Excellent information Scott. I finally solved my questions.

    Thank you very much! =)
  • jsramirezgo Member Posts: 6 Learner I
    Sorry Scott, I realised it didn't work for me. I understood in theory what you said, and I set up parallel execution in the preferences. My hardware has 8 logical processors. However, when I ran my test with the parameter "worker threads for active process=0" I got a certain execution time, and when I tested with "worker threads for active process=4" I got the same execution time.

    I think the idea of parallel computing is to reduce execution time, but both scenarios took the same time. What went wrong? Do I need a specific operator?

    I really appreciate your help. Sorry for being persistent; I just want to clear up my doubts about RM.
  • jczogalla Employee, Member Posts: 144 RM Engineering
    Setting the preference to 0 means that all available/allowed processors should be used. If you want to test multiple cores versus one core, set the preference to 0 (or 4) and 1, respectively.
    Also, the parallelized loop operators have an option "enable parallel execution", which lets you decide whether to execute the loop iterations in parallel.

    Hope this helps!
    Jan
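
One way to read Jan's explanation, as a rough sketch rather than RapidMiner's actual logic: treat 0 as "use everything available and allowed", and cap any explicit value at both the hardware and license limits. The function name `effective_workers` and its parameters are hypothetical, invented here for illustration only.

```python
def effective_workers(setting: int, hardware_cores: int, license_cores: int) -> int:
    """Hypothetical model of how a worker-thread preference might resolve.

    Assumption (not confirmed RapidMiner behavior): 0 means "use all
    available/allowed processors", and any explicit value is capped by
    both the hardware and the license.
    """
    if setting < 0:
        raise ValueError("setting must be 0 or a positive integer")
    ceiling = min(hardware_cores, license_cores)
    return ceiling if setting == 0 else min(setting, ceiling)


# The situation described in the thread: 8 logical processors, 4 licensed.
for setting in (0, 4, 1):
    print(setting, "->", effective_workers(setting, 8, 4))
```

Under these assumptions, settings 0 and 4 resolve to the same four workers, which would explain the identical execution times; the single-core baseline for comparison is 1.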
  • jsramirezgo Member Posts: 6 Learner I
    Hey Jan!

    thanks a lot! Quite clear. Questions solved.

    regards.
  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Hello,

    I am late to this thread, so I'll add a few more things that are not closely related to parallel execution of operators, but are related to parallel execution of processes.

    I have a MacBook Pro for RapidMiner Studio, and I really don't care about parallel execution on it. However, for my world domination projects, I use 4 MacBook Pros with 12-core i9-9900, each one with an agent configured to run up to 11 parallel tasks. If you have such a setup, use Nagios. It uses the SNMP protocol to monitor the status of the machines, and due to the nature of that protocol, it has little effect on network throughput.

    Note that I'm not running one task in parallel across all these computers (is it feasible? I need it badly) but many different enqueued processes. More often than not, when I need this kind of power, I split my processes and use the Schedule Process operator to cascade them, or send the data to an API through RapidMiner Server.

    A real case for this: let's say I have 32000 pages from a website that need NLP applied. I convert these to examples, run Loop Examples over them, pass the entire data in a POST to the API, and finish the process. This creates 32000 requests to the RapidMiner Server, and they are handled by 44 processes. In my last development project, 42 tasks served by all 4 computers could process nearly 1200 pages per minute, taking only 30 minutes. I ran the same task on my good old MacBook Air and it took 7 hours to complete.

    If anyone has a better suggestion for me, I'm all ears.

    Too bad the MacBooks aren't mine :(

    Just my two cents.

    All the best,

    Rodrigo.
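
As a back-of-the-envelope check on the figures in Rodrigo's post (a sketch using only the numbers quoted there; the variable names are made up for illustration):

```python
# Figures quoted in the post above.
pages = 32_000
pages_per_minute = 1_200           # "nearly 1200 pages per minute"
reported_cluster_minutes = 30      # "taking only 30 minutes"
laptop_minutes = 7 * 60            # "it took 7 hours"

# Time implied by the quoted throughput, and the overall speedup.
implied_cluster_minutes = pages / pages_per_minute
speedup = laptop_minutes / reported_cluster_minutes

print(f"implied cluster time: {implied_cluster_minutes:.1f} min")  # 26.7 min
print(f"speedup vs. single laptop: {speedup:.0f}x")                # 14x
```

The implied 26.7 minutes is consistent with the reported "only 30 minutes", and the setup comes out roughly 14 times faster than the single laptop.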
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    holy cow @rfuentealba looks like you have quite the rig!
  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    Without knowing much about the use case, I would try to reduce the number of calls to the server. Each call generates a tremendous overhead (if the number of calls remains large, consider using the scoring agent).

    One option is to do the web crawling on a separate process (possibly with an external tool), save the pages to a file or in the repository and then have RM Server process the files/dataset on one or more scheduled processes.

    Let me know if this helps. If you tell us more, maybe we can come up with more ideas.

    Regards,
    Sebastian
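
The idea of reducing the number of server calls can be sketched as simple batching. This is an illustration only: the `batched` helper and the batch size of 500 are assumptions, not something from the thread, and no RapidMiner Server API is called here.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Placeholder page identifiers standing in for the 32000 crawled pages.
pages = [f"page-{i}" for i in range(32_000)]

# One request per batch instead of one request per page.
batches = list(batched(pages, 500))
print(len(pages), "pages ->", len(batches), "requests")
```

With an assumed batch size of 500, the 32000 per-page requests would collapse into 64, so the per-call overhead is paid far less often. The right batch size depends on page size and server limits.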

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn


    IMHO the parallelism inside a given process is handled quite well by RapidMiner; you don't need to do anything (that is a great advantage compared to doing data science in a programming language). The kind of parallelism that would be most useful to you is running several processes at the same time.

    Did you know that you can run processes in the background?



    That way you can keep working while your experiments run, pretty neat!