How long is too long for a process to run? (details in post)

petter0619 Member Posts: 4 Contributor I
edited November 2018 in Help
Hey,

I'm running the "23_Transactional2Basket" process with the following changes:
  • "Replace Operator" was switched to "Import Excel"
  • "Breakpoint After" was unchecked on the "Example2AttributePivoting" operator
The Excel sheet that I am importing contains ca. 143,000 rows of data in 2 columns
(Transaction ID and Items). Right now the "Numerical2Polynominal" operator has been
running for over 20 hours. I'm wondering whether that is a normal amount of time for a dataset
of this size or whether something is wrong.

If something is wrong, is there some change I should make so that the entire set is run or
should I simply decrease the size of the Excel sheet?

Note that I am new to RapidMiner and I've only ever used it to run the "23_Transactional2Basket"
process.

Thanks in advance for any answers.
//Petter

Answers

  • Marco_Boeck Team Lead Software Engineering Administrator, Moderator, Employee, Member, University Professor Posts: 1,954   RM Engineering
    Hi,

    No, that is not normal. The "Numerical to Polynominal" operator should finish a dataset of this size in less than a second on a modern machine. I just created a dummy dataset with 145,000 entries and ran it through the sample process; it took just over 4 minutes, with the "Pivot" operator taking most of the time.

    What might be happening is that Studio does not have enough memory available, in which case it will be busy trying to clean up after itself to free some memory. In extreme cases, almost 99% of the computation power can go into cleaning up memory, and then everything else grinds to a halt. Can you open the "System Monitor" by clicking "View" -> "Show View" -> "System Monitor" in the top menu bar and check what happens when you execute the process?
    You can also import the Excel data beforehand and store it in the repository (click "File" -> "Import Data" -> "Import Excel Sheet" in the top menu bar and follow the instructions), then restart Studio and simply use a "Retrieve" operator to get the data now stored in your repository. That should use quite a bit less memory because the Excel import is quite expensive.

    Regards,
    Marco
  • petter0619 Member Posts: 4 Contributor I
    First of all, thanks for the reply.

    I imported the Excel sheet, stored it in my Local Repository under Data and then
    used the "Retrieve" operator in the "23_Transactional2Basket" process. The process still
    got stuck on the Numerical2Polynominal operator.

    I ran the earlier version and then the version you suggested, both with the System Monitor
    open. When it came to the Numerical2Polynominal operator, in both processes RapidMiner
    seemed to hit a ceiling on the amount of memory available; the System Monitor line/area
    turned red and the text read "7.4 GB used. Will use up to 7.5" (image of this can be seen below).

    I'm assuming this means that my computer does not have enough memory to run RapidMiner Studio?

    The computer I'm running it on has 8GB of memory. Do I need a computer with more memory?

    Again, thanks in advance for your reply.
    //Petter


    [Image: System Monitor showing "7.4 GB used. Will use up to 7.5"]
  • Marco_Boeck Team Lead Software Engineering Administrator, Moderator, Employee, Member, University Professor Posts: 1,954   RM Engineering
    Hi,

    The whole process above used about 500 MB of memory for me, so 7.5 GB seems quite excessive.
    It is probably the reason for the slow process, as Java now frantically tries to free memory instead of doing work.

    1) Can you close Studio, reopen it and check what the System Monitor says directly after startup?
    2) What OS are you using?

    Regards,
    Marco
  • petter0619 Member Posts: 4 Contributor I
    To answer your questions:

    1) I closed & restarted Studio and right on startup the System Monitor says: "128 MB used. Will use up to 7.5". Says the same after process has been created (although not started).

    2) I have a Mac with OS X Mountain Lion v10.8.5.

    By the way, for simplicity's sake I uploaded a screen capture video (ca. 4 min 30 sec) of what I do
    when I run the process. The System Monitor is visible, and I also included the app "Memory Clean" to show
    its approximation of how much free memory I have during different parts of the process. Maybe seeing this will
    be of some help to you.

    Link to the screen capture: http://screencast.com/t/59WK4QmUS

    Also, the errors shown in the process are as follows:
    • Operator "Retrieve" - "Parameter 'repository entry' accesses a repository by name (//Local Repository/data/Eleven - MBA2 Excel Sheet). This may not be portable when sharing processes."
    • Operator "AttributeSubsetPreprocessing" - "The attribute called X exists already in original table" (contains ca 100 of these with X being the different product names in the dataset)
    • Operator "FPGrowth" - "Regular attributes must be of type binominal."
    Regards,
    Petter
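
For readers unfamiliar with what the process is doing at the pivot and mapping steps, here is a minimal Python sketch of the same idea outside RapidMiner: (transaction ID, item) rows are turned into one row per transaction with one true/false column per item, which is the kind of two-valued input the FPGrowth error message above asks for. The function name `transactions_to_baskets` and the sample rows are made up for illustration; this is not RapidMiner code and says nothing about how Studio implements the operators internally.

```python
from collections import defaultdict


def transactions_to_baskets(rows):
    """Pivot (transaction_id, item) pairs into one true/false row per transaction.

    `rows` must be a list (it is iterated twice). Hypothetical helper for
    illustration only.
    """
    # Collect the set of items bought in each transaction.
    items_by_tid = defaultdict(set)
    for tid, item in rows:
        items_by_tid[tid].add(item)

    # One column per distinct item, in a stable order.
    all_items = sorted({item for _, item in rows})

    # An item missing from a transaction becomes False, everything else True.
    return {
        tid: {item: (item in bought) for item in all_items}
        for tid, bought in items_by_tid.items()
    }


# Made-up sample data: two columns, as in the Excel sheet described above.
rows = [(1, "bread"), (1, "milk"), (2, "milk"), (3, "beer"), (3, "bread")]
baskets = transactions_to_baskets(rows)
for tid, basket in baskets.items():
    print(tid, basket)
```

Note that the result has one cell for every (transaction, item) combination, so the table grows with the number of transactions times the number of distinct items, not with the number of input rows.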
  • Marco_Boeck Team Lead Software Engineering Administrator, Moderator, Employee, Member, University Professor Posts: 1,954   RM Engineering
    Hi,

    thanks for that!
    I have now tested this again and my dummy process takes about 4:30 mins and actually uses about 6GB of memory on my machine. So it might actually very well be true that you simply need a bit more memory for your data. Dunno why my first test showed such skewed results, might be because I ran my first test from within my development environment.
    Note that I did my tests with RapidMiner Studio version 6.0.005 which will be released either today or early next week.

    On a side note, FPGrowth can be very slow depending on the data and the settings, just a heads-up.
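
As a rough illustration of why the settings matter for frequent itemset mining, the sketch below counts frequent itemsets by brute force on a toy dataset at two minimum support thresholds. It is deliberately not FP-Growth (it enumerates every candidate, which only works for tiny data), and the function name `frequent_itemsets`, the `min_support` parameter and the sample baskets are assumptions made for this example, not RapidMiner's API.

```python
from itertools import combinations


def frequent_itemsets(baskets, min_support):
    """Brute-force search for itemsets contained in at least `min_support`
    (a fraction between 0 and 1) of the baskets.

    NOT FP-Growth: every possible candidate is checked, so this is only
    feasible for toy data.
    """
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Count the baskets that contain every item of the candidate.
            count = sum(1 for basket in baskets if set(candidate) <= basket)
            if count / len(baskets) >= min_support:
                result[candidate] = count
    return result


# Made-up toy baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "beer"},
    {"milk", "beer"},
    {"bread", "beer"},
]

strict = frequent_itemsets(baskets, min_support=0.75)
loose = frequent_itemsets(baskets, min_support=0.25)
print(len(strict), "itemsets at 75% support")
print(len(loose), "itemsets at 25% support")
```

With only three distinct items the difference is small, but the number of possible itemsets doubles with every additional item, so a low support threshold on data with many distinct items can produce a very large result.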

    This was the process I used to test with generated dummy data:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.005">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="6.0.005" expanded="true" name="Root">
       <process expanded="true">
         <operator activated="true" class="generate_data" compatibility="6.0.005" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
           <parameter key="number_examples" value="150000"/>
           <parameter key="number_of_attributes" value="2"/>
           <parameter key="attributes_lower_bound" value="1.0"/>
           <parameter key="attributes_upper_bound" value="100.0"/>
         </operator>
         <operator activated="true" class="select_attributes" compatibility="6.0.005" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="30">
           <parameter key="attribute_filter_type" value="subset"/>
           <parameter key="attributes" value="att1|att2"/>
           <parameter key="include_special_attributes" value="true"/>
         </operator>
         <operator activated="true" class="rename" compatibility="6.0.005" expanded="true" height="76" name="Rename" width="90" x="313" y="30">
           <parameter key="old_name" value="att1"/>
           <parameter key="new_name" value="ITEM"/>
           <list key="rename_additional_attributes">
             <parameter key="att2" value="TID"/>
           </list>
         </operator>
         <operator activated="true" class="generate_id" compatibility="6.0.005" expanded="true" height="76" name="IdTagging" width="90" x="45" y="165"/>
         <operator activated="true" class="set_role" compatibility="6.0.005" expanded="true" height="76" name="IdToRegular" width="90" x="179" y="165">
           <parameter key="attribute_name" value="id"/>
           <list key="set_additional_roles"/>
         </operator>
         <operator activated="true" class="pivot" compatibility="6.0.005" expanded="true" height="76" name="Example2AttributePivoting" width="90" x="313" y="165">
           <parameter key="group_attribute" value="TID"/>
           <parameter key="index_attribute" value="ITEM"/>
         </operator>
         <operator activated="true" class="numerical_to_polynominal" compatibility="6.0.005" expanded="true" height="76" name="Numerical2Polynominal" width="90" x="45" y="300"/>
         <operator activated="true" class="work_on_subset" compatibility="6.0.005" expanded="true" height="76" name="AttributeSubsetPreprocessing" width="90" x="179" y="300">
           <parameter key="attribute_filter_type" value="regular_expression"/>
           <parameter key="regular_expression" value="TID"/>
           <parameter key="invert_selection" value="true"/>
           <parameter key="remove_roles" value="true"/>
           <process expanded="true">
             <operator activated="true" class="map" compatibility="6.0.003" expanded="true" height="76" name="Mapping" width="90" x="45" y="30">
               <parameter key="attribute_filter_type" value="regular_expression"/>
               <parameter key="regular_expression" value=".*"/>
               <list key="value_mappings"/>
               <parameter key="replace_what" value="?"/>
               <parameter key="replace_by" value="false"/>
               <parameter key="add_default_mapping" value="true"/>
               <parameter key="default_value" value="true"/>
             </operator>
             <connect from_port="exampleSet" to_op="Mapping" to_port="example set input"/>
             <connect from_op="Mapping" from_port="example set output" to_port="example set"/>
             <portSpacing port="source_exampleSet" spacing="0"/>
             <portSpacing port="sink_example set" spacing="0"/>
             <portSpacing port="sink_through 1" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="fp_growth" compatibility="6.0.005" expanded="true" height="76" name="FPGrowth" width="90" x="313" y="300">
           <parameter key="positive_value" value="true"/>
         </operator>
         <connect from_op="Generate Data" from_port="output" to_op="Select Attributes" to_port="example set input"/>
         <connect from_op="Select Attributes" from_port="example set output" to_op="Rename" to_port="example set input"/>
         <connect from_op="Rename" from_port="example set output" to_op="IdTagging" to_port="example set input"/>
         <connect from_op="IdTagging" from_port="example set output" to_op="IdToRegular" to_port="example set input"/>
         <connect from_op="IdToRegular" from_port="example set output" to_op="Example2AttributePivoting" to_port="example set input"/>
         <connect from_op="Example2AttributePivoting" from_port="example set output" to_op="Numerical2Polynominal" to_port="example set input"/>
         <connect from_op="Numerical2Polynominal" from_port="example set output" to_op="AttributeSubsetPreprocessing" to_port="example set"/>
         <connect from_op="AttributeSubsetPreprocessing" from_port="example set" to_op="FPGrowth" to_port="example set"/>
         <connect from_op="FPGrowth" from_port="example set" to_port="result 1"/>
         <connect from_op="FPGrowth" from_port="frequent sets" to_port="result 2"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="180"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>
    Regards,
    Marco
  • petter0619 Member Posts: 4 Contributor I
    Hi,

    I tried your process with the generated dummy data and got the same result as you.
    It ran in about 4-5 mins and the System Monitor showed that it took ca. 150 MB for
    most of it (increased at the end).

    But I will try and run my data on a computer with more memory.

    Thanks for all your help!

    Regards,
    //Petter