How to create confidence intervals for numeric prediction?

Björn Member Posts: 2 Contributor I
edited February 2020 in Help

Hey Community,

I would like to know how I can generate confidence intervals around the numeric predictions I get from different algorithms. I have around three years of data with values for my label, a shipment amount that has to be processed. Until now I have only used the RMSE from the Performance operator to compare different algorithms or their parameters.

Can you please give me some advice on how I could create confidence intervals, for example a 95% confidence interval around the predicted values, so I can show the users the expected range of shipments?

Thanks in advance.

 

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Dear Björn,

     

    I think this is hardly possible, at least in a model-agnostic fashion. I do not know of an algorithm that does this for every model. There are some tricks, such as using simulations, but they all make strong assumptions about the underlying distributions.
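
    One such trick can be sketched in a few lines of Python: treat the RMSE from the Performance operator as the standard deviation of normally distributed, zero-centred errors and put a symmetric band around each prediction. This is only a sketch under those strong assumptions; the function name and the numbers are made up for illustration, and strictly speaking the result is a prediction interval rather than a confidence interval.

```python
from statistics import NormalDist


def naive_interval(predictions, rmse, level=0.95):
    """Return (lower, upper) bounds as prediction +/- z * RMSE.

    Assumes errors are normally distributed, centred on zero, and equally
    spread for every row; RMSE stands in for the error standard deviation.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for level=0.95
    lower = [p - z * rmse for p in predictions]
    upper = [p + z * rmse for p in predictions]
    return lower, upper


# Made-up shipment predictions and a made-up cross-validated RMSE.
lower, upper = naive_interval([1200.0, 950.0], rmse=80.0)
```

    If the errors are skewed or grow with the size of the shipment, this band will be too narrow for some records and too wide for others.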

     

    What model are you using?

     

    cheers,

    martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Björn Member Posts: 2 Contributor I

    Dear Martin,

    thanks for your response. Right now the model I am using is Gradient Boosted Trees.

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    @mschmitz

     

    I had a think about how to do something similar based on the LIME approach. What about measuring the difference between the prediction and the actual value for each record, discretizing that into bands, and then building a second model to predict how accurate the first model is likely to be for a given record (that is, which error band it falls into)?

     

    It's still a work in progress, and I think it needs a bit more thought about upper and lower bounds rather than just the difference between the prediction and reality. But I'm posting the idea here to get some feedback.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data" compatibility="7.1.001" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
    <parameter key="target_function" value="polynomial"/>
    <parameter key="number_examples" value="10000"/>
    </operator>
    <operator activated="true" class="split_data" compatibility="7.6.001" expanded="true" height="103" name="Split Data into Training and Testing and Scoring" width="90" x="179" y="38">
    <enumeration key="partitions">
    <parameter key="ratio" value="0.7"/>
    <parameter key="ratio" value="0.3"/>
    </enumeration>
    </operator>
    <operator activated="true" class="concurrency:cross_validation" compatibility="7.6.001" expanded="true" height="145" name="Cross Validation" width="90" x="313" y="34">
    <process expanded="true">
    <operator activated="true" class="support_vector_machine" compatibility="7.6.001" expanded="true" height="124" name="SVM" width="90" x="112" y="34"/>
    <connect from_port="training set" to_op="SVM" to_port="training set"/>
    <connect from_op="SVM" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_regression" compatibility="7.6.001" expanded="true" height="82" name="Performance" width="90" x="246" y="34"/>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
    <connect from_op="Performance" from_port="example set" to_port="test set results"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="447" y="391">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (2)" width="90" x="715" y="289">
    <parameter key="attribute_name" value="label"/>
    <list key="set_additional_roles">
    <parameter key="prediction(label)" value="regular"/>
    </list>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.6.001" expanded="true" height="82" name="Generate Attributes" width="90" x="447" y="136">
    <list key="function_descriptions"/>
    </operator>
    <operator activated="true" class="generate_aggregation" compatibility="7.6.001" expanded="true" height="82" name="Generate Aggregation" width="90" x="581" y="136">
    <parameter key="attribute_name" value="Diff"/>
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attribute" value="prediction(label)"/>
    <parameter key="attributes" value="prediction(label)|label"/>
    <parameter key="include_special_attributes" value="true"/>
    <parameter key="aggregation_function" value="standard_deviation"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="715" y="136">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="label|prediction(label)"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="849" y="136">
    <parameter key="attribute_name" value="Diff"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="rmx_stat:discretize_quantiles" compatibility="2.1.692" expanded="true" height="124" name="Discretize by Quantiles (2)" width="90" x="983" y="85">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Diff"/>
    <parameter key="include_special_attributes" value="true"/>
    <parameter key="quantiles" value="10"/>
    </operator>
    <operator activated="true" class="concurrency:cross_validation" compatibility="7.6.001" expanded="true" height="145" name="Cross Validation (2)" width="90" x="1117" y="85">
    <process expanded="true">
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="182" y="34"/>
    <connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
    <connect from_op="Decision Tree" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model (3)" width="90" x="45" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance (3)" width="90" x="179" y="187">
    <list key="class_weights"/>
    </operator>
    <connect from_port="model" to_op="Apply Model (3)" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model (3)" to_port="unlabelled data"/>
    <connect from_op="Apply Model (3)" from_port="labelled data" to_op="Performance (3)" to_port="labelled data"/>
    <connect from_op="Performance (3)" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="apply_model" compatibility="7.1.001" expanded="true" height="82" name="Apply Model on Test Set" width="90" x="983" y="340">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (3)" width="90" x="1117" y="340">
    <parameter key="attribute_name" value="label"/>
    <list key="set_additional_roles">
    <parameter key="prediction(label)" value="regular"/>
    </list>
    </operator>
    <connect from_op="Generate Data" from_port="output" to_op="Split Data into Training and Testing and Scoring" to_port="example set"/>
    <connect from_op="Split Data into Training and Testing and Scoring" from_port="partition 1" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Split Data into Training and Testing and Scoring" from_port="partition 2" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Cross Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
    <connect from_op="Cross Validation" from_port="test result set" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Set Role (2)" to_port="example set input"/>
    <connect from_op="Set Role (2)" from_port="example set output" to_op="Apply Model on Test Set" to_port="unlabelled data"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Generate Aggregation" to_port="example set input"/>
    <connect from_op="Generate Aggregation" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Discretize by Quantiles (2)" to_port="example set input"/>
    <connect from_op="Discretize by Quantiles (2)" from_port="example set output" to_op="Cross Validation (2)" to_port="example set"/>
    <connect from_op="Cross Validation (2)" from_port="model" to_op="Apply Model on Test Set" to_port="model"/>
    <connect from_op="Cross Validation (2)" from_port="performance 1" to_port="result 1"/>
    <connect from_op="Apply Model on Test Set" from_port="labelled data" to_op="Set Role (3)" to_port="example set input"/>
    <connect from_op="Apply Model on Test Set" from_port="model" to_port="result 3"/>
    <connect from_op="Set Role (3)" from_port="example set output" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="210"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    </process>
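
    For readers who do not want to import the process, the labelling step at its core can be sketched in plain Python: take the absolute difference between label and prediction for each record and cut it into equal-frequency bands, which then serve as the label for the second model. This is a rough sketch, not a translation of the process; `error_bands` and the sample numbers are hypothetical, and training the second model on these bands is left out. The process appears to derive Diff as the per-row standard deviation of label and prediction, which for two values is proportional to the absolute difference used here.

```python
import bisect
import statistics


def error_bands(labels, predictions, n_bands=10):
    """Label each row with the quantile band of its absolute error."""
    abs_errors = [abs(p - y) for y, p in zip(labels, predictions)]
    # n_bands - 1 cut points that split the errors into quantile bands.
    cuts = statistics.quantiles(abs_errors, n=n_bands)
    # Band 0 holds the smallest errors, band n_bands - 1 the largest.
    bands = [bisect.bisect_right(cuts, e) for e in abs_errors]
    return cuts, bands


# Made-up labels and out-of-fold predictions, for illustration only.
labels = [10.0, 12.0, 9.0, 15.0, 11.0, 14.0, 8.0, 13.0]
predictions = [10.5, 11.0, 9.1, 18.0, 11.2, 13.0, 8.4, 13.1]
cuts, bands = error_bands(labels, predictions, n_bands=4)
```

    The band of each record becomes the class a second model learns to predict from the regular attributes, so a new record can be flagged as "likely small error" or "likely large error".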
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    @JEdward,

    clever. I know a PhD student I worked with earlier in my career who uses a DL model with TensorFlow to predict abs(prediction-label), with some success. One needs to keep in mind what this does and does not include: this approach covers neither the standard deviations of the model parameters nor the measurement uncertainty of the input attributes. The latter is tough to cover at all.

     

    But, neat trick :)

     

    ~Martin
