V6.3 parallelize problems

corkie Member Posts: 10 Contributor II
edited November 2018 in Help
I'm running into execution problems with the process below. Without parallel execution enabled, it takes 1 min 27 sec to complete.
The process handles 7.73 million examples and 26 attributes.

If I enable parallel execution on the process, the application crashes frequently.
I thought that enabling the parallel option on the process and its subprocesses would help processing in cases like this?

Are there recommendations for configuring the number-of-threads parameters?
Preferences - Miscellaneous - Max no of threads = 0?
Preferences - Parallel - Number of threads = 4?
Both values are at their defaults.

Are there guidelines on the usage of these parameters, or values that should not be set?
Is it possible to increase these values to get better process execution times?

I'd appreciate any feedback or other users' experience.

Kind regards,
Daryl.

System:
MacBook Pro
Dual-core i7 3 GHz processor
8 GB 1600 MHz DDR3 memory
SSD
RapidMiner Studio Pro 6.3 license (8 GB RAM).
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.3.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.3.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="6.3.000" expanded="true" height="60" name="Get AWOR Dataset" width="90" x="45" y="30">
        <description>get the Adverse weather outage dataset</description>
        <parameter key="repository_entry" value="../../Data_V4/AWOR/AWOR_Cooked-Dataset-V4"/>
      </operator>
      <operator activated="true" class="replace" compatibility="6.3.000" expanded="true" height="76" name="Replace" width="90" x="179" y="30">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="OutageCountyid"/>
        <parameter key="replace_what" value="NL"/>
        <parameter key="replace_by" value="DL"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="6.3.000" expanded="true" height="60" name="Get hly2075 dataset" width="90" x="45" y="165">
        <description>get the donegal hourly dataset</description>
        <parameter key="repository_entry" value="../../Data_V4/HourlyA/hly2075"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="6.3.000" expanded="true" height="60" name="ReGet SynopticStationsByCounty-V2" width="90" x="45" y="255">
        <description>get the synoptic station dataset</description>
        <parameter key="repository_entry" value="../../Data_V4/CountyLocations/SynopticStationsByCounty-V4"/>
      </operator>
      <operator activated="true" class="subprocess" compatibility="6.3.000" expanded="true" height="112" name="Merge Datasets" width="90" x="246" y="165">
        <process expanded="true">
          <operator activated="true" class="join" compatibility="6.3.000" expanded="true" height="76" name="Hly + county" width="90" x="45" y="120">
            <description>Join the hourly Donegal meteorological information with the synoptic station information necessary to make the forecast
</description>
            <parameter key="join_type" value="left"/>
            <parameter key="use_id_attribute_as_key" value="false"/>
            <list key="key_attributes">
              <parameter key="StationName" value="SStationName"/>
            </list>
          </operator>
          <operator activated="true" class="join" compatibility="6.3.000" expanded="true" height="76" name="AWOR + hly'" width="90" x="179" y="30">
            <description>Join the AWOR dataset with the result of the data merger of the donegal hourly with the synoptic station data</description>
            <parameter key="join_type" value="left"/>
            <parameter key="use_id_attribute_as_key" value="false"/>
            <list key="key_attributes">
              <parameter key="OutageCountyid" value="SSCountyid"/>
            </list>
          </operator>
          <connect from_port="in 1" to_op="AWOR + hly'" to_port="left"/>
          <connect from_port="in 2" to_op="Hly + county" to_port="left"/>
          <connect from_port="in 3" to_op="Hly + county" to_port="right"/>
          <connect from_op="Hly + county" from_port="join" to_op="AWOR + hly'" to_port="right"/>
          <connect from_op="AWOR + hly'" from_port="join" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="72"/>
          <portSpacing port="source_in 3" spacing="0"/>
          <portSpacing port="source_in 4" spacing="54"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="subprocess" compatibility="6.3.000" expanded="true" height="76" name="Make Forecast" width="90" x="380" y="120">
        <description>Make a forecast</description>
        <process expanded="true">
          <operator activated="true" class="generate_attributes" compatibility="6.3.000" expanded="true" height="76" name="Create Forecast" width="90" x="45" y="30">
            <description>create the forecast
using the following expression:
"if(OutageHour==DateUTC,"OutageForecasted","NoOutageForecasted")"</description>
            <list key="function_descriptions">
              <parameter key="ForecastedOutageHour" value="if(OutageHour==DateUTC,&quot;OutageForecasted&quot;,&quot;NoOutageForecasted&quot;)"/>
            </list>
          </operator>
          <operator activated="true" class="set_role" compatibility="6.3.000" expanded="true" height="76" name="Set Role" width="90" x="179" y="30">
            <description>give the forecast attribute the special role 'label'</description>
            <parameter key="attribute_name" value="ForecastedOutageHour"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="6.3.000" expanded="true" height="76" name="Generate ID" width="90" x="45" y="210">
            <description>create an id for the forecast</description>
          </operator>
          <operator activated="true" class="rename" compatibility="6.3.000" expanded="true" height="76" name="Rename id" width="90" x="179" y="210">
            <parameter key="old_name" value="id"/>
            <parameter key="new_name" value="ForecastID"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <connect from_port="in 1" to_op="Create Forecast" to_port="example set input"/>
          <connect from_op="Create Forecast" from_port="example set output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Rename id" to_port="example set input"/>
          <connect from_op="Rename id" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="180"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="store" compatibility="6.3.000" expanded="true" height="60" name="st forecast" width="90" x="514" y="120">
        <description>store the donegal forecast</description>
        <parameter key="repository_entry" value="../../Data_V4/Forecast/CookedHlyForecast_DS-stn2075-V4a"/>
      </operator>
      <connect from_op="Get AWOR Dataset" from_port="output" to_op="Replace" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_op="Merge Datasets" to_port="in 1"/>
      <connect from_op="Get hly2075 dataset" from_port="output" to_op="Merge Datasets" to_port="in 2"/>
      <connect from_op="ReGet SynopticStationsByCounty-V2" from_port="output" to_op="Merge Datasets" to_port="in 3"/>
      <connect from_op="Merge Datasets" from_port="out 1" to_op="Make Forecast" to_port="in 1"/>
      <connect from_op="Make Forecast" from_port="out 1" to_op="st forecast" to_port="input"/>
      <connect from_op="st forecast" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="90"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist
    Hi,

    As far as I know, the Parallel extension is no longer actively supported by RapidMiner. I do not know whether it works correctly with all operators.

    Personally, I only use it to parallelize an X-Val. That is reasonably "safe".

    Cheers,

    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • corkie Member Posts: 10 Contributor II
    Martin, cheers for the feedback.
    - Do you notice any performance boost from applying the parallel option to the X-Val operator?
    rgds,
    daryl.
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Hi Daryl,

    I'm on a similar system to yours (Mac Pro, 3.5 GHz Intel E5 6-core, 16 GB RAM but only a 4 GB RM license, SSD, etc.), and I too notice erratic performance when benchmarking CPU-intensive processes, both locally and on RM Cloud. I also had no idea that the parallelization extension is no longer supported. Hmm.

    One thing I do is watch the CPU usage in Activity Monitor very closely while processes run. Watch it with parallelization both on and off (especially during X-validation) and check whether the CPU load actually increases. I was on a MacBook Pro before and it just could not handle it. Hence the upgrade to a Mac Pro, which made an enormous difference in performance.

    Also, I don't know if you pull in data sets along the way, but it is MUCH faster to "remember" and "recall" than to "store" and "read" (and that in turn is much faster than "read CSV" and "write CSV", for obvious reasons). I don't know why, but it's true in my experience. Don't forget to turn off the "remove from store" checkbox if you recall the same object over and over again.
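
    To illustrate, here is a rough sketch of what a Remember/Recall pair might look like in the process XML. The object name "merged_data" and the operator names are made up for the example, and the parameter keys are written from memory, so compare them with what Studio generates when you drop the operators into your own process.

    ```xml
    <!-- Sketch only: "merged_data" is a hypothetical object name. -->
    <operator activated="true" class="remember" compatibility="6.3.000" expanded="true" name="Remember merged data">
      <parameter key="name" value="merged_data"/>
      <parameter key="io_object" value="ExampleSet"/>
    </operator>
    <!-- ... later in the same process ... -->
    <operator activated="true" class="recall" compatibility="6.3.000" expanded="true" name="Recall merged data">
      <parameter key="name" value="merged_data"/>
      <parameter key="io_object" value="ExampleSet"/>
      <!-- Assumed key for the "remove from store" checkbox: false keeps the object available for further recalls. -->
      <parameter key="remove_from_store" value="false"/>
    </operator>
    ```

    Remembered objects only live in memory while the process runs, so this would replace intermediate store/read round-trips rather than the final Store at the end of a process.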

    Scott
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist
    Oh, I did not mean to sound official!

    I just don't know of any of our developers working on the Parallel extension, so I called it "unsupported". Maybe I used the wrong word.

    This does not mean that RapidMiner is dropping this extension.
  • corkie Member Posts: 10 Contributor II
    Scott,
    I had come across Remember and Recall and have started to use them.
    There are all sorts of tips and tricks to learn within RapidMiner.
    I had thought the parallel option might have given me more benefit.
    In reality, I need some decent research funding so that I can work completely in the cloud or on the server edition.

    rgds,
    daryl.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Hi Daryl,

    If you're in academia, you can apply for an RM academic license now, which is MUCH cheaper than retail. See the website.

    Scott