
Predictive Analysis of Gym Members

renegadeZH Member Posts: 3 Contributor I
edited November 2018 in Help

Hi

 

For a Big Data university class we have to do a predictive analysis of a data file. It is about people visiting a gym, and my task is to build a model that can predict when the gym is too crowded.

 

I am at the very beginning of this class, so we only work with Nested Holdout Testing, Cross Validation and Random Forests.

 

We have to answer the following questions:

 

1.) Which usage patterns of the gym can you identify based on a visual analysis of the data set?

2.) What is the generalisation performance of your "best" model? Does it tend towards strong overfitting?

3.) Which differences can you observe between a decision tree and a random forest?

4.) When would you recommend that someone go to the gym?

 

Description of the data:

• number_people (count of people in the gym)
• timestamp (number of seconds since the beginning of the day)
• day_of_week (0 - 6)
• is_weekend (0 or 1)
• is_holiday (0 or 1)
• apparent_temperature (degrees Fahrenheit)
• temperature (degrees Fahrenheit)
• is_start_of_semester (0 or 1)

 

I know how to build the models (more or less), but I'm having a hard time reading the essential information out of them.

 

Any help is kindly appreciated.

 

Attached you'll find the CSV file.

 

 

Best Answer

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    Solution Accepted

    If overfitting is your worry (which is the usual worry), then toggle on the "Leave One Out" parameter on the Cross Validation. 

     

    Something like this example:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.4.000" expanded="true" height="68" name="Retrieve Iris" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//Samples/data/Iris"/>
    </operator>
    <operator activated="true" class="split_data" compatibility="7.4.000" expanded="true" height="103" name="Split Data" width="90" x="179" y="238">
    <enumeration key="partitions">
    <parameter key="ratio" value="0.5"/>
    <parameter key="ratio" value="0.5"/>
    </enumeration>
    </operator>
    <operator activated="true" class="concurrency:cross_validation" compatibility="7.4.000" expanded="true" height="145" name="Validation" width="90" x="380" y="34">
    <parameter key="leave_one_out" value="true"/>
    <parameter key="sampling_type" value="stratified sampling"/>
    <process expanded="true">
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.4.000" expanded="true" height="82" name="Decision Tree" width="90" x="45" y="34"/>
    <connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
    <connect from_op="Decision Tree" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    <description align="left" color="green" colored="true" height="80" resized="true" width="248" x="37" y="137">In the training phase, a model is built on the current training data set. (90 % of data by default, 10 times)</description>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.4.000" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance" compatibility="7.4.000" expanded="true" height="82" name="Performance" width="90" x="179" y="34"/>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
    <connect from_op="Performance" from_port="example set" to_port="test set results"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    <description align="left" color="blue" colored="true" height="103" resized="true" width="315" x="38" y="137">The model created in the Training step is applied to the current test set (10 %).&lt;br/&gt;The performance is evaluated and sent to the operator results.</description>
    </process>
    <description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a decision tree model.</description>
    </operator>
    <operator activated="true" class="apply_model" compatibility="7.4.000" expanded="true" height="82" name="Apply Model (2)" width="90" x="581" y="238">
    <list key="application_parameters"/>
    </operator>
    <connect from_op="Retrieve Iris" from_port="output" to_op="Split Data" to_port="example set"/>
    <connect from_op="Split Data" from_port="partition 1" to_op="Validation" to_port="example set"/>
    <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
    <connect from_op="Validation" from_port="performance 1" to_port="result 1"/>
    <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Do you have any screenshots to share or a process that you've built? Are you getting hung up on the performance results?

  • renegadeZH Member Posts: 3 Contributor I

    Hi T-Bone

     

    Attached you'll find the operators I used to build my models. (It's only for the cross-validation; I only change one operator for the other approach.)

     

    In general, I started by generating an attribute that declares whether the gym is crowded or not: crowded = if(number_people >= 30, 1, 0).
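
    For illustration, here is the same idea sketched in Python/pandas (just a sketch: the file name gym.csv is a placeholder, and the column names follow the data description above):

    import pandas as pd

    # Load the gym visit data (the file name is a placeholder).
    df = pd.read_csv("gym.csv")

    # Binary label: crowded when at least 30 people are in the gym,
    # mirroring the RapidMiner expression if(number_people >= 30, 1, 0).
    df["crowded"] = (df["number_people"] >= 30).astype(int)

    # Quick look at the usage pattern: share of crowded rows per weekday and hour.
    df["hour"] = df["timestamp"] // 3600  # timestamp = seconds since start of day
    print(df.groupby(["day_of_week", "hour"])["crowded"].mean())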

     

    Yes, I'm getting hung up on the results. The performance indicators tell me whether my model's predictions are good enough. (I double-check with the stored data afterwards --> out-of-sample accuracy.)

    But, for example, I can't tell when someone should go to the gym; there are so many influences, and the decision tree is huge.

     

    What would be your approach?

     

    Regards

     

    renegadeZH

     

    PS: I also have problems with the "Optimize Parameters" operator. My PC freezes all the time.

     

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    OK, a few things I have questions about right off the bat. Why are you using a Split Data operator right before the Cross Validation (CV) operator? The CV operator automatically handles the split into training and testing data, so I'm not sure what that's doing there.

     

    Also, why are you saving the model on the training side? CV means you split up the data into random k folds, so each model you save will be overwritten k times. If you want to save the overall trained model, it's best to do it on the MOD port of the CV operator on the outside.
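
    To make that concrete outside of RapidMiner, here is a rough Python/scikit-learn sketch of the same idea (not your actual process; Iris is a stand-in data set, like in the sample process above, and the parameter values are made up):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Iris as a stand-in; swap in your gym attributes and the crowded label.
    X_train, y_train = load_iris(return_X_y=True)

    tree = DecisionTreeClassifier(max_depth=5, random_state=42)

    # 10-fold cross-validation: ten temporary models are trained and discarded;
    # only the averaged performance estimate is kept.
    scores = cross_val_score(tree, X_train, y_train, cv=10, scoring="accuracy")
    print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

    # The model you actually keep is trained once on all of the training data,
    # which is the analogue of taking the model from the MOD port of the CV operator.
    final_model = tree.fit(X_train, y_train)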

     

    Why is the Performance operator giving you a warning? Check that message to make sure everything is alright. 

     

    Optimize Parameters is a memory-heavy operator; it will freeze if you don't have enough memory to power through all the combinations. An alternative is to use the Evolutionary Optimize Parameters operator.
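
    The same principle of not brute-forcing every combination, sketched in Python/scikit-learn (this uses randomized search rather than evolutionary optimization, and all parameter ranges here are made up):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    X, y = load_iris(return_X_y=True)  # stand-in data set

    # Instead of exhaustively testing every combination, sample only a limited
    # number of candidates to keep memory use and runtime in check.
    param_distributions = {
        "n_estimators": [50, 100, 200],
        "max_depth": [5, 10, None],
        "min_samples_leaf": [1, 5, 10],
    }
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=42),
        param_distributions=param_distributions,
        n_iter=10,  # try only 10 of the 27 possible combinations
        cv=5,
        random_state=42,
    )
    search.fit(X, y)
    print(search.best_params_)
    print(search.best_score_)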

  • renegadeZH Member Posts: 3 Contributor I

    Because after running the process and getting the results, we want to compare how accurate our prediction model is.

    We retrieve the data from the Store operator after the Split Data operator, plus the data from the Store operator on the training side. Then we add an Apply Model operator and a Performance operator to check whether there is overfitting. (See attached file.)
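
    That logic, sketched in Python/scikit-learn rather than in our actual RapidMiner process (Iris and the 70/30 split are placeholders):

    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)  # stand-in data set

    # Nested holdout: keep part of the data completely out of model building.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)

    tree = DecisionTreeClassifier(random_state=42)

    # In-sample view: cross-validated accuracy on the training partition.
    cv_acc = cross_val_score(tree, X_train, y_train, cv=10).mean()

    # Out-of-sample view: accuracy on the untouched holdout partition.
    tree.fit(X_train, y_train)
    holdout_acc = accuracy_score(y_test, tree.predict(X_test))

    # A large gap between the two numbers is a hint of overfitting.
    print("CV accuracy: %.3f, holdout accuracy: %.3f" % (cv_acc, holdout_acc))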

    The warning appears because the Performance operator cannot accept polynominal attributes, but it shouldn't affect my data.

     

    regards
