Predicting Unknowns from Knowns via Supervised

sunnyal Member Posts: 44 Contributor II
edited December 2018 in Help

Hi,

 

We are trying to build a revenue assurance predictive model to identify possible electricity theft. Our approach is to take the hourly reads of meters already known to be involved in theft and predict whether any other meters follow similar usage patterns (anomalies and pattern matching to fraud).

 

We have around 400 known theft meters and 110k unknown meters, so the known cases are a very small fraction of the example set we need to match them against. I have tried KNN, GBT and Naive Bayes, tracking performance with "Performance Binominal Classification" (i.e. LABEL = FRAUD = TRUE/FALSE). I also tried SVM, as recommended by most research papers, and its performance was terrible; I am now trying parameter optimization, and it has been running for 2 days :-(
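
With roughly 400 known theft meters against 110k others, plain accuracy can look excellent while the model finds nothing. A small sketch of that arithmetic follows; the meter counts are the ones above, while the prediction counts are made up purely for illustration, and the unknown meters are treated as "not theft" only for the sake of the calculation.

```python
# Sketch: why accuracy misleads at ~400 theft meters vs ~110k others.
# The meter counts come from the post; the prediction counts are hypothetical.

def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision and recall for the positive (theft) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

KNOWN_THEFT = 400
OTHER_METERS = 110_000

# A useless model that predicts "not theft" for every meter.
always_false = binary_metrics(tp=0, fp=0, fn=KNOWN_THEFT, tn=OTHER_METERS)

# A hypothetical model that catches 300 thefts but raises 5,000 false alarms.
noisy = binary_metrics(tp=300, fp=5_000, fn=100, tn=OTHER_METERS - 5_000)

for name, (acc, prec, rec) in [("always_false", always_false), ("noisy", noisy)]:
    print(f"{name}: accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f}")
```

The "always false" model scores above 99% accuracy with zero recall, which is why precision and recall on the theft class are the more informative numbers to watch here.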

 

Below are my questions

 

(1) What would be the best supervised machine learning algorithms for this kind of classification problem?

(2) How do we feed the confirmed false positive meters back to the model as "not theft", so that the model refines itself, starts treating them as not theft, and yields better predictions? I would appreciate it if you could share a sample process showing how to feed this back to the model.

 

Thx for the valuable input.

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    You may want to try the one-class SVM approach instead and focus on the characteristics of the known fraud cases. There is a related thread you should review, which links to a sample process: https://community.rapidminer.com/t5/Getting-Started-Forum/One-class-label-learning/m-p/44038#M1350

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • sunnyal Member Posts: 44 Contributor II

    Thank you. How different is this one-class type as opposed to C-SVC or radial? The current problem with the other SVM types is that they are terribly slow.

     

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    I suspect the reason the current SVM is so slow is because of the large number of examples of the "unknown" class.  If you are using only the "known" class, which is much smaller, then the SVM algorithm will be much faster.
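
A rough back-of-the-envelope sketch of why the example count matters so much: a kernel SVM works with pairwise kernel values between examples, so the number of pairs grows with the square of the example count. The counts are the ones from the original post; the quadratic cost model is a simplifying assumption, not a measurement of the LibSVM operator.

```python
# Sketch: how the number of pairwise kernel values grows with example count.
# Assumes a full n x n kernel matrix as a simple cost model.

def kernel_pairs(n_examples):
    """Number of entries in a full n x n kernel matrix."""
    return n_examples * n_examples

known_only = kernel_pairs(400)
known_plus_unknown = kernel_pairs(400 + 110_000)

print(f"known only:         {known_only:,} kernel entries")
print(f"known plus unknown: {known_plus_unknown:,} kernel entries")
print(f"ratio:              {known_plus_unknown / known_only:,.0f}x")
```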

     

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    What @Telcontar120 said. Focus on training the 'knowns' and go from there. 

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

  • sunnyal Member Posts: 44 Contributor II

    Thank you, guys. I liked the Rumsfeld analogy :-)

    I trained on the "knowns" (Trues) with C-SVC and then tested on the "unknowns" (False), and it just predicted everything as True. Misery...

    I wanted to try "one-class", but the SVM operator complains that binominal (True/False) and numerical (1/0) labels are not supported.

    How do we define a label as "one class"? See my attached process.

     

  • sunnyal Member Posts: 44 Contributor II

    Attached sample data

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    @sunnyal After loading in your sample data, you can do something like this. With the "one class" application you just train the model on the knowns and exclude the other class completely. Then, when it scores, it reports how far inside or outside each example is relative to what it was trained on.

    (Note this is just a sample template; I think you're going to have to do some feature generation to make it better.) Just make sure to set your Meters to an ID role.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.002">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.6.002" expanded="true" height="68" name="Retrieve Electric Fraud Sample Data" width="90" x="45" y="34">
    <parameter key="repository_entry" value="../data/Electric Fraud Sample Data"/>
    </operator>
    <operator activated="true" class="nominal_to_date" compatibility="7.6.002" expanded="true" height="82" name="Nominal to Date" width="90" x="179" y="34">
    <parameter key="attribute_name" value="DIM_DT_ID"/>
    <parameter key="date_type" value="date_time"/>
    <parameter key="date_format" value="yyyy-MM-dd"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.002" expanded="true" height="82" name="Set Label (2)" width="90" x="313" y="34">
    <parameter key="attribute_name" value="METER"/>
    <parameter key="target_role" value="id"/>
    <list key="set_additional_roles">
    <parameter key="METER" value="id"/>
    </list>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="7.6.002" expanded="true" height="103" name="Filter Examples" width="90" x="514" y="136">
    <parameter key="invert_filter" value="true"/>
    <list key="filters_list">
    <parameter key="filters_entry_key" value="FAULT_INDICATOR.equals.FALSE"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.6.002" expanded="true" height="82" name="Select Attributes (2)" width="90" x="916" y="187">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="FAULT_INDICATOR"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <operator activated="true" class="guess_types" compatibility="7.6.002" expanded="true" height="82" name="Guess Types" width="90" x="648" y="34"/>
    <operator activated="true" class="set_role" compatibility="7.6.002" expanded="true" height="82" name="Set Label (3)" width="90" x="782" y="34">
    <parameter key="attribute_name" value="FAULT_INDICATOR"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles">
    <parameter key="METER" value="id"/>
    </list>
    </operator>
    <operator activated="true" class="support_vector_machine_libsvm" compatibility="7.6.002" expanded="true" height="82" name="SVM" width="90" x="916" y="34">
    <parameter key="svm_type" value="one-class"/>
    <parameter key="gamma" value="0.001"/>
    <list key="class_weights"/>
    </operator>
    <operator activated="true" class="apply_model" compatibility="7.6.002" expanded="true" height="82" name="Apply Model" width="90" x="1117" y="85">
    <list key="application_parameters"/>
    </operator>
    <connect from_op="Retrieve Electric Fraud Sample Data" from_port="output" to_op="Nominal to Date" to_port="example set input"/>
    <connect from_op="Nominal to Date" from_port="example set output" to_op="Set Label (2)" to_port="example set input"/>
    <connect from_op="Set Label (2)" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_op="Guess Types" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="original" to_op="Select Attributes (2)" to_port="example set input"/>
    <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Guess Types" from_port="example set output" to_op="Set Label (3)" to_port="example set input"/>
    <connect from_op="Set Label (3)" from_port="example set output" to_op="SVM" to_port="training set"/>
    <connect from_op="SVM" from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
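
To make the "inside or outside" idea concrete, here is a minimal pure-Python sketch of one-class scoring. It is not the LibSVM one-class algorithm used by the process above; it swaps in a much simpler centroid-distance rule. The feature values, the `fit_one_class` and `score` names, and the 0.95 threshold quantile are all assumptions made for illustration.

```python
import math

def fit_one_class(train_rows, quantile=0.95):
    """Learn a centroid and a distance threshold from the known class only."""
    n_features = len(train_rows[0])
    centroid = [sum(row[j] for row in train_rows) / len(train_rows)
                for j in range(n_features)]
    distances = sorted(math.dist(row, centroid) for row in train_rows)
    # Radius that contains roughly `quantile` of the training examples.
    index = min(len(distances) - 1, int(quantile * len(distances)))
    return centroid, distances[index]

def score(row, model):
    """Positive = inside the learned region, negative = outside."""
    centroid, radius = model
    return radius - math.dist(row, centroid)

# Hypothetical per-meter features: (mean hourly kWh, std of hourly kWh).
known_theft = [(0.20, 0.05), (0.25, 0.04), (0.18, 0.06), (0.22, 0.05), (0.21, 0.03)]
model = fit_one_class(known_theft)

for meter, features in {"M1": (0.21, 0.05), "M2": (1.40, 0.60)}.items():
    s = score(features, model)
    print(meter, "inside" if s >= 0 else "outside", round(s, 3))
```

Only the known theft rows are used for fitting; every other meter is simply scored against the learned region, which mirrors the train-on-knowns, score-the-rest layout of the process.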

     

  • sunnyal Member Posts: 44 Contributor II

    Tom,

     

    Thank you. After modifying my design as per the sample, I get all 400k examples treated as "outside". I guess SVM isn't doing the right thing for me. When I use Naive Bayes or GBT I do get some predictions, but with way too many false positives.

     

    To further refine my other working models, is there a way to feed the confirmed false positive meters back to the model as additional input data (labeled not theft / false positive), so that the model refines itself, starts treating these as not theft, and yields better predictions?
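
One plain way to picture this kind of feedback, sketched below in pure Python, is to add each investigated meter to the training data with its confirmed label and then retrain, e.g. by appending those rows to the training example set. This is an assumption about the workflow rather than a built-in RapidMiner feature. The 1-nearest-neighbour learner, the feature values and the meter IDs are hypothetical stand-ins for whichever learner (NB, GBT) and attributes are actually used.

```python
import math

def predict(features, labeled):
    """1-nearest-neighbour: return the label of the closest labeled meter."""
    nearest = min(labeled.values(), key=lambda item: math.dist(features, item[0]))
    return nearest[1]

# Hypothetical features: (mean hourly kWh, share of hours with a flat reading).
labeled = {
    "T1": ((0.20, 0.90), True),   # known theft
    "T2": ((0.25, 0.85), True),   # known theft
    "N1": ((1.50, 0.10), False),  # assumed normal
    "N2": ((1.30, 0.15), False),  # assumed normal
}
candidate = (0.60, 0.70)  # a new meter with low, fairly flat usage

before = predict(candidate, labeled)

# Field investigation confirms these flagged meters are NOT theft: feed them back.
confirmed_not_theft = {"F1": (0.55, 0.75), "F2": (0.65, 0.70)}
for meter_id, features in confirmed_not_theft.items():
    labeled[meter_id] = (features, False)

# For nearest-neighbour, extending the labeled set is the retraining step.
after = predict(candidate, labeled)
print("before feedback:", before, "| after feedback:", after)
```

The candidate is not one of the investigated meters, yet its prediction changes from theft to not theft, because the confirmed negatives now sit closer to it than any known theft meter.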

     

    Thx

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,509 RM Data Scientist

    Hi,

     

    What you describe is Boosting. This is the technique GBTs are using internally.

     

    Did you run a Grid optimize for GBT and SVMs? What kernels did you try?

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • sunnyal Member Posts: 44 Contributor II

    Hi Martin,

     

    Thanks for your note.

     

    Yes, I tried optimizing parameters for SVM and it didn't yield much benefit. I used the RBF kernel and tried optimizing the gamma and C values, but the optimization has been running for 2 days and is still going. I tried limiting the example set and optimizing for only the actual known theft cases, and yet the results were terrible. I also tried GBT, but the results were no better. Can you suggest which parameters, and which values, one should optimize for GBT? Naive Bayes yielded better results than any other learner, as it flagged a few meters with flat-line power consumption (which are possible candidates). However, all of them turned out to be false positives when we actually investigated those homes. Is there any way we can feed these false positives back to the NB or GBT model so that it does not treat these meters as positives?
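
The flat-line pattern mentioned above can be turned into an explicit attribute for the learner instead of leaving it implicit in the raw reads. A minimal sketch, assuming the hourly reads for one meter arrive as a list of kWh values; the `flat_share` and `meter_features` names, the tolerance and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def flat_share(reads, tolerance=0.01):
    """Share of consecutive hourly reads that differ by at most `tolerance` kWh."""
    if len(reads) < 2:
        return 0.0
    flat = sum(1 for a, b in zip(reads, reads[1:]) if abs(a - b) <= tolerance)
    return flat / (len(reads) - 1)

def meter_features(reads):
    """Summarise one meter's hourly reads as a small feature dictionary."""
    return {
        "mean_kwh": mean(reads),
        "std_kwh": pstdev(reads),
        "flat_share": flat_share(reads),
    }

# Hypothetical hourly reads (kWh) for two meters.
varied = [0.4, 0.9, 1.6, 1.1, 0.5, 0.3, 0.8, 1.4]
flat = [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]

print(meter_features(varied))
print(meter_features(flat))
```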

     

    Thanks for your support

     
