preparing data for mining

grafikbggrafikbg Member Posts: 14 Contributor I
edited December 2018 in Help

Hello, I am wondering if someone is willing to help an absolute novice with data preparation. We have 124 electrical controllers, numbered 1 to 124 for convenience. On each shift some of them switch off and cause trouble. Would you help me through the process of creating an Excel sheet and running a prediction of which controllers are most likely to switch off on the next shift? I can correct the output after each shift to improve the results, but I will need help. Thank you in advance... I voted for RM :)



  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @grafikbg,


    Could you describe exactly what data you have available? What is the structure of the data?

    For example, something like this:






  • grafikbggrafikbg Member Posts: 14 Contributor I

    Thank you very much, Lionel, your support is extremely valuable to me. At the moment the data looks like this: the first column holds the controllers, numbered from 1 to 124, followed by the information from 25 shifts, with an "x" marking each controller that switched off. I can transform it in any way that works.

  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @grafikbg,


    I'm not an electronics expert, so to understand better: when there is one (or more) "x" in a row, did the controller in that row switch off?

    For example, in your first row there is an "x" in column 12. Does that mean controller 1 switched off?

    Is a controller on if and only if all the values in its row are "on"?

    Is "x" a binary variable ("on" or "x"), or a real variable in the range [1, 124]?





  • grafikbggrafikbg Member Posts: 14 Contributor I

    Thank you, Lionel, I will try to explain:

    1. Yes, you are right: an "x" in column 12 means that controller number 1 switched off on the 12th shift. Likewise, controller number 2 switched off on the 4th, 6th, 14th and 23rd shifts. We need to predict which controllers are most likely to switch off on the next shift (27). I probably made a mistake with that "x"; I am using it as a check mark, not as a math symbol. I could easily change it to "off" or whatever is easiest for RM to interpret as data.

    2. If all the values in a row are "on", the controller never switched off during the observed period.

    3. "Is "x" a binary variable ("on" or "x"), or a real variable in the range [1, 124]?" Unfortunately I am not good with math :) ... At the beginning of each shift all controllers are supposed to be "on", but some of them are unexpectedly "off"...

  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @grafikbg,


    Difficult problem ......

    This is how I see things:

    I think it is impossible to forecast which controllers will switch off based only on your data at time t (your Excel file).

    You have to de-pivot and transpose your Excel file to get 3224 attributes (124 x 26). The result looks like this:
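One possible reading of the de-pivot step, as a minimal Python sketch: it turns a wide sheet (one row per controller, one column per shift, "x" for a switch-off) into long rows of (shift, controller, off). The sample marks, the `depivot` name and the 0/1 encoding are assumptions made for illustration and are not taken from the thread; the real sheet would be read from Excel or CSV instead of being typed in.

```python
# Hypothetical wide layout: one entry per controller, one mark per shift.
# "x" marks a controller that switched off at that shift's restart.
wide = {
    1: ["", "", "x", ""],
    2: ["x", "", "x", ""],
    3: ["", "", "", ""],
}


def depivot(wide_rows):
    """Turn {controller: [mark per shift]} into (shift, controller, off) rows."""
    long_rows = []
    for controller, marks in sorted(wide_rows.items()):
        for shift, mark in enumerate(marks, start=1):
            # Encode the check mark as 1 (off) and anything else as 0 (on).
            off = 1 if mark.strip().lower() == "x" else 0
            long_rows.append((shift, controller, off))
    return long_rows


long_rows = depivot(wide)
for row in long_rows[:4]:
    print(row)
```

With the full sheet of 124 controllers and 26 shifts, this layout would hold 124 x 26 = 3224 rows, one per controller and shift.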



    It is a time series problem, so you now have to build a history database of the status of your 3224 attributes (124 x 26), with a time step of 15 minutes, for example.

    After that, using a time series process on the historical data, you can train a model and then apply it to forecast which controllers will switch off.


    I hope it helps,





  • grafikbggrafikbg Member Posts: 14 Contributor I

    Thank you, Lionel... Time doesn't matter. Each shift begins with a restart of the system; a few controllers switch off, we go and switch them on manually, and then they work flawlessly until the next restart. It only happens once, at the beginning of each work shift, during the restart.

  • grafikbggrafikbg Member Posts: 14 Contributor I

    We are hoping, with your valuable support, to achieve an RM output like:

    ...during the next work shift there is an 85% probability that controller number 23 will switch off, 84% that controller number 49 will switch off, etc. Even 50% accuracy would cut our delay time in half... We can fill in the data after each shift so that the algorithm gets smarter and the accuracy increases...

  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Dear all,


    To introduce this post, I would quote the French humorist Pierre Dac:

    "Forecasts are difficult especially when they relate to the future."


    After cogitation, I do not see how to predict (with an associated probability) which controller(s) will switch off during the next workshift, with only the provided dataset.

    I considered a time to work with "association rules", but it's not conclusive.

    So if a predictive maintenance guru has an idea of which method to apply to this case study, I would be curious to hear it.








  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    So first of all, I thank @lionelderkrikor for both his insights and for bringing a sense of culture to this community. I had not heard that quotation before, and it is quite apropos!


    So the "time to work" is an interesting question. I was just working with a customer last week on this exact same issue. Let me poke around and see what I can find.



  • kypexinkypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @grafikbg @lionelderkrikor @sgenzer


    I've been following this interesting thread for some time and haven't found an answer to one crucial (at least in my opinion) question:


    Are those controllers independent, or put into some kind of electrical chain (?) which makes the whole system connected? 


    For example, does a failure in controller #1 directly cause a failure in another controller #X?

    Or does each controller actually work independently of all the others?



  • grafikbggrafikbg Member Posts: 14 Contributor I

    Thank you, Vladimir, the controllers are independent... I forgot to mention that there is an obvious pattern. For example, during the 27 work shifts observed so far, controller number 16 switched off 9 times, but controller 42 never did... Many others switched off five or six times, while others did so once or twice or not at all... At each restart, no fewer than six and no more than eleven of the 124 controllers switched off.
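The pattern described above suggests a very simple baseline: rank controllers by how often they have switched off so far. The sketch below is a hedged illustration in Python, not a method proposed in the thread. It assumes each restart is an independent event with a fixed per-controller rate, and it uses additive (Laplace) smoothing so that a controller that has never switched off still gets a small non-zero estimate. The function names, the smoothing constant `alpha` and the order of the 0/1 flags are hypothetical; only the counts (27 shifts, 9 switch-offs for controller 16, none for controller 42) come from the post.

```python
def off_probabilities(history, alpha=1.0):
    """Estimate a switch-off rate per controller from 0/1 flags per shift.

    history: {controller: [0 or 1 for each observed shift]}
    alpha:   additive smoothing constant (an assumption, tune as needed)
    """
    estimates = {}
    for controller, flags in history.items():
        n_off = sum(flags)
        n_shifts = len(flags)
        estimates[controller] = (n_off + alpha) / (n_shifts + 2 * alpha)
    return estimates


def rank_controllers(history, top=10, alpha=1.0):
    """Return the `top` controllers with the highest estimated rate."""
    estimates = off_probabilities(history, alpha)
    return sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Toy history built from the counts in the post; the flag order is made up.
history = {
    16: [1] * 9 + [0] * 18,  # switched off 9 times in 27 shifts
    42: [0] * 27,            # never switched off
}
ranking = rank_controllers(history)
for controller, estimate in ranking:
    print(controller, round(estimate, 3))
```

With these counts the smoothed estimate is 10/29 (about 34%) for controller 16 and 1/29 (about 3%) for controller 42. Appending one new flag per controller after each shift keeps the estimates up to date, which matches the idea of correcting the data shift by shift.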

  • kypexinkypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @grafikbg


    Well, I am not an electrical engineer by any means, but I'll try to reason using just common sense :)

    Theoretically, if all the controllers were dependent (and connected into some sort of circuit), you could predict one controller's state based on the other 123 controllers, most likely using a time series approach.

    But it seems that if the controllers are independent, then controller #1 switching off at some point in no way causes any other controller #N to switch off. That said, the whole task breaks down into 124 separate tasks of predicting the next state of each controller independently. For this kind of prediction, there is definitely not enough data. I don't think you can efficiently predict each controller's state based ONLY on an observed pattern, at least not in a way that makes practical sense: if controller #X switches off every week, you could expect it to switch off next week as well, but this doesn't take the contributing factors into account; if controller #Y has never switched off, you might expect it to keep working flawlessly... but again, this does not hold in real life.

    To get a meaningful prediction, for each controller you would need at least a few meaningful data points that may directly or indirectly affect its state, such as:


    • total time in service
    • total number of repairs
    • average / max electrical load
    • time since last failure
    • some runtime characteristics (current voltage, resistance, whatever else)
    • etc etc etc

    Hope this reasoning helps.

  • grafikbggrafikbg Member Posts: 14 Contributor I

    Thank you, Vladimir, you are probably right... But as I mentioned before, even 50% accuracy would cut our delay time in half; even two or three out of 10 would give us 20 valuable minutes... So I was thinking of starting from here and adding new data after each shift, to slowly achieve more.

  • MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,453 RM Data Scientist

    Hi @grafikbg,

    Regardless of whether the switches depend on each other, you can still model this with simple windowing. The difference would be the format of the data going into the windowing. In the case of independence, you would create a data set like this:


    TimeStamp Id OffIndicator



    Use Group Into Collection and group by ID. Inside the loop over the collection, use a Windowing operator.

    Is there any chance of getting more data than just "died", e.g. amplitudes?


    Attached is a process showing the idea. It needs the Value Series and Operator Toolbox extensions to run.






    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="concurrency:loop" compatibility="8.2.000" expanded="true" height="82" name="Loop" width="90" x="112" y="34">
    <process expanded="true">
    <operator activated="true" class="generate_data" compatibility="8.2.000" expanded="true" height="68" name="Generate Data" width="90" x="246" y="85"/>
    <operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="380" y="85">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="att1"/>
    <operator activated="true" class="generate_id" compatibility="8.2.000" expanded="true" height="82" name="Generate ID" width="90" x="514" y="85"/>
    <operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes" width="90" x="648" y="85">
    <list key="function_descriptions">
    <parameter key="timestamp" value="date_add(date_now(),id,DATE_UNIT_DAY)"/>
    <parameter key="id" value="%{a}"/>
    <parameter key="att1" value="if(rand()&gt;0.9,&quot;Broken&quot;,&quot;Not Broken&quot;)"/>
    <operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes (2)" width="90" x="849" y="85">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="label"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    <operator activated="true" class="rename" compatibility="8.2.000" expanded="true" height="82" name="Rename" width="90" x="983" y="85">
    <parameter key="old_name" value="att1"/>
    <parameter key="new_name" value="OffIndicator"/>
    <list key="rename_additional_attributes"/>
    <operator activated="true" class="rename" compatibility="8.2.000" expanded="true" height="82" name="Rename (2)" width="90" x="1117" y="85">
    <parameter key="old_name" value="id"/>
    <parameter key="new_name" value="CircuitId"/>
    <list key="rename_additional_attributes"/>
    <connect from_op="Generate Data" from_port="output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
    <connect from_op="Generate ID" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
    <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Rename" to_port="example set input"/>
    <connect from_op="Rename" from_port="example set output" to_op="Rename (2)" to_port="example set input"/>
    <connect from_op="Rename (2)" from_port="example set output" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    <operator activated="true" class="append" compatibility="8.2.000" expanded="true" height="82" name="Append" width="90" x="246" y="34"/>
    <operator activated="true" class="set_role" compatibility="8.2.000" expanded="true" height="82" name="Set Role" width="90" x="380" y="34">
    <parameter key="attribute_name" value="CircuitId"/>
    <parameter key="target_role" value="cid"/>
    <list key="set_additional_roles">
    <parameter key="timestamp" value="id"/>
    <operator activated="true" class="operator_toolbox:group_into_collection" compatibility="1.0.000" expanded="true" height="82" name="Group Into Collection" width="90" x="581" y="34">
    <parameter key="group_by_attribute" value="CircuitId"/>
    <operator activated="true" class="loop_collection" compatibility="8.2.000" expanded="true" height="82" name="Loop Collection" width="90" x="715" y="34">
    <process expanded="true">
    <operator activated="true" class="sort" compatibility="8.2.000" expanded="true" height="82" name="Sort" width="90" x="112" y="34">
    <parameter key="attribute_name" value="timestamp"/>
    <operator activated="true" class="series:windowing" compatibility="7.4.000" expanded="true" height="82" name="Windowing" width="90" x="447" y="34">
    <parameter key="window_size" value="7"/>
    <parameter key="create_label" value="true"/>
    <parameter key="label_attribute" value="OffIndicator"/>
    <connect from_port="single" to_op="Sort" to_port="example set input"/>
    <connect from_op="Sort" from_port="example set output" to_op="Windowing" to_port="example set input"/>
    <connect from_op="Windowing" from_port="example set output" to_port="output 1"/>
    <portSpacing port="source_single" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    <operator activated="true" class="append" compatibility="8.2.000" expanded="true" height="82" name="Append (2)" width="90" x="849" y="34"/>
    <operator activated="true" class="concurrency:cross_validation" compatibility="8.2.000" expanded="true" height="145" name="Validation" width="90" x="1050" y="34">
    <parameter key="sampling_type" value="stratified sampling"/>
    <process expanded="true">
    <operator activated="false" class="concurrency:parallel_decision_tree" compatibility="8.2.000" expanded="true" height="103" name="Decision Tree" width="90" x="45" y="289">
    <parameter key="apply_pruning" value="false"/>
    <parameter key="minimal_gain" value="0.01"/>
    <operator activated="true" class="h2o:logistic_regression" compatibility="8.2.000" expanded="true" height="124" name="Logistic Regression" width="90" x="45" y="34"/>
    <connect from_port="training set" to_op="Logistic Regression" to_port="training set"/>
    <connect from_op="Logistic Regression" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    <description align="left" color="green" colored="true" height="80" resized="true" width="248" x="37" y="137">In the training phase, a model is built on the current training data set. (90 % of data by default, 10 times)</description>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="8.2.000" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
    <list key="application_parameters"/>
    <operator activated="true" class="performance" compatibility="8.2.000" expanded="true" height="82" name="Performance" width="90" x="179" y="34"/>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
    <connect from_op="Performance" from_port="example set" to_port="test set results"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    <description align="left" color="blue" colored="true" height="103" resized="true" width="315" x="38" y="137">The model created in the Training step is applied to the current test set (10 %).&lt;br/&gt;The performance is evaluated and sent to the operator results.</description>
    <description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a decision tree model.</description>
    <connect from_op="Loop" from_port="output 1" to_op="Append" to_port="example set 1"/>
    <connect from_op="Append" from_port="merged set" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Group Into Collection" to_port="exa"/>
    <connect from_op="Group Into Collection" from_port="col" to_op="Loop Collection" to_port="collection"/>
    <connect from_op="Loop Collection" from_port="output 1" to_op="Append (2)" to_port="example set 1"/>
    <connect from_op="Append (2)" from_port="merged set" to_op="Validation" to_port="example set"/>
    <connect from_op="Validation" from_port="model" to_port="result 1"/>
    <connect from_op="Validation" from_port="performance 1" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
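For readers who do not use RapidMiner, the windowing idea in the process above can be sketched in plain Python. This is a rough approximation under stated assumptions, not a translation of the operators: rows are (shift, controller, off) tuples, each controller's history is sorted by shift, and every window of 7 previous flags becomes the features for predicting the next flag. The window size of 7 mirrors the `window_size` parameter in the process; the `make_windows` name and the toy data are hypothetical.

```python
from collections import defaultdict


def make_windows(long_rows, window_size=7):
    """Build (controller, previous flags, next flag) examples per controller.

    long_rows: iterable of (shift, controller, off) tuples, off being 0 or 1.
    """
    # Group the rows by controller, like grouping into a collection by ID.
    by_controller = defaultdict(list)
    for shift, controller, off in long_rows:
        by_controller[controller].append((shift, off))

    examples = []
    for controller, series in by_controller.items():
        series.sort()  # order each controller's history by shift number
        flags = [off for _, off in series]
        # Slide a fixed-size window; the flag right after it is the label.
        for end in range(window_size, len(flags)):
            features = flags[end - window_size:end]
            label = flags[end]
            examples.append((controller, features, label))
    return examples


# Toy data: controller 1 switches off on every third shift, controller 2 never.
rows = [(s, 1, 1 if s % 3 == 0 else 0) for s in range(1, 11)]
rows += [(s, 2, 0) for s in range(1, 11)]
examples = make_windows(rows, window_size=7)
for example in examples:
    print(example)
```

The resulting examples from all controllers can be stacked into one training table for any binary classifier. With only about 27 shifts per controller, each controller contributes roughly 20 windows, so the concern raised earlier in the thread about the amount of data still applies.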


    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • grafikbggrafikbg Member Posts: 14 Contributor I

    Thank you very much for your support, Martin... I am not skilled enough to follow all of that, but it's encouraging...
