How to find out average ratios between more than 2 variables?

CaptainChaos Member Posts: 17  Maven
Hi Guys,

My Data is in the Format shown below:
Vechicleid  Drivendistance  Time   TotalConsumption  Weight  DriverNote  Routdificulty
1           582             39060  143               27      9           3.5
2           478             45980  135               38      9,3         4,4

It's real data from a transport agency where I am currently doing my bachelor thesis. I will first explain the data, even if most of it should be pretty clear. The first attribute, "id", is just the id of the vehicle sending the data. The second attribute, "DrivenDistance", is the distance the truck traveled. The third attribute is the time the truck traveled, in seconds. The fourth attribute is the litres the truck consumed over the traveled distance. The fifth attribute, "Weight", is the average weight of the truck during the journey. The sixth attribute is the grade calculated for the driver based on his driving style. The seventh attribute, "Routdificulty", indicates how hard the route is to drive: for example, driving through the mountains with a lot of weight and speed gives a higher mark.

So what I would like to find out is what the ratio between these variables is on average, in order to check the plausibility of each variable. For example, I would like to draw conclusions like: "Given the DrivenDistance, Time, Weight, Routdificulty and DriverNote, the TotalConsumption should be between x and y".
So I started by calculating the correlation between the attributes. The correlations are pretty weak, with one exception: TotalConsumption is strongly correlated with Drivendistance (0,944), which is pretty logical. But I know from field tests that Weight and Routdificulty should influence it more than the correlations show (0.0177 and 0.22).

So my question is whether there is any way to draw conclusions about the ratios between more than 2 variables. Should I use another method than the correlation matrix? Or should I change my process, listed below?


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Process">
    <process expanded="true" height="446" width="628">
      <operator activated="true" class="read_excel" compatibility="5.1.006" expanded="true" height="60" name="Read Excel" width="90" x="45" y="120">
        <parameter key="excel_file" value="C:\Users\Rojas\Desktop\BA_A-z\Analyse\Rapidminer_Forum.xls"/>
        <parameter key="imported_cell_range" value="A1:G77"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="VEHICLEID.true.integer.id"/>
          <parameter key="1" value="DrivenDistance.true.numeric.attribute"/>
          <parameter key="2" value="Time.true.integer.attribute"/>
          <parameter key="3" value="TotalConsumptio.true.real.attribute"/>
          <parameter key="4" value="Weight.true.numeric.attribute"/>
          <parameter key="5" value="DriverNote.true.real.attribute"/>
          <parameter key="6" value="RoutDificulty.true.real.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="correlation_matrix" compatibility="5.1.006" expanded="true" height="94" name="Correlation Matrix" width="90" x="246" y="120">
        <parameter key="squared_correlation" value="true"/>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Correlation Matrix" to_port="example set"/>
      <connect from_op="Correlation Matrix" from_port="example set" to_port="result 1"/>
      <connect from_op="Correlation Matrix" from_port="matrix" to_port="result 2"/>
      <connect from_op="Correlation Matrix" from_port="weights" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
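
One detail of the process above may be worth checking: the Correlation Matrix operator has `squared_correlation` set to `true`, so the reported values may be squared coefficients (r²) rather than r. The post does not say which they are. A minimal Python sketch, under the assumption that the three quoted numbers are squared values, converts them back to magnitudes; the sign of r cannot be recovered from r².

```python
import math

# Values quoted in the post. ASSUMPTION: they are squared correlations (r^2),
# because the process sets squared_correlation="true".
squared = {"DrivenDistance": 0.944, "Weight": 0.0177, "RoutDificulty": 0.22}

# |r| is the square root of r^2; squaring discards the sign of r.
abs_r = {name: math.sqrt(value) for name, value in squared.items()}

for name, value in abs_r.items():
    print(f"{name}: |r| = {value:.3f}")
```

If that assumption holds, the RoutDificulty value of 0.22 would correspond to an |r| of roughly 0.47, which is less weak than the squared figure suggests.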



Any advice would be highly appreciated (if I didn't explain it in sufficient detail or logically enough, please ask me; English isn't my native language)  ;D

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869   Unicorn
    Aye Captain,

    there are other means of estimating the value of an attribute. Try e.g. the operators in the Attribute Weighting group (Weight by ...). Also, some classifiers create an attribute weighting, e.g. the SVM. If the classifier is capable of detecting attribute interactions, it can deliver more valuable results than the good old correlation, which only looks at one attribute at a time.
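
To illustrate why a one-attribute-at-a-time correlation can understate an attribute's influence, here is a small Python sketch. It is not a RapidMiner process: the data are made up, the coefficients 0.25 and 0.5 are arbitrary, and plain least squares stands in for whichever learner is actually used. Distance dominates the variance of consumption, so the pairwise correlation of weight looks weak, yet a model fitted on both attributes recovers the weight effect.

```python
import math
from itertools import product

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def least_squares(rows, targets):
    """Fit targets ~ b0 + b1*x1 + b2*x2 + ... via the normal equations."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve(xtx, xty)

# MADE-UP data: every combination of six distances and three weights.
rows = list(product([100, 200, 300, 400, 500, 600], [20, 30, 40]))
distance = [d for d, _ in rows]
weight = [w for _, w in rows]
# ASSUMED relationship, chosen only for illustration.
consumption = [0.25 * d + 0.5 * w for d, w in rows]

r_distance = pearson(distance, consumption)
r_weight = pearson(weight, consumption)
intercept, b_distance, b_weight = least_squares(rows, consumption)

print(f"pairwise r: distance={r_distance:.3f}, weight={r_weight:.3f}")
print(f"fitted coefficients: distance={b_distance:.3f}, weight={b_weight:.3f}")
```

In this toy setup the pairwise correlation of weight with consumption is close to 0.1, while the fitted weight coefficient comes out at the 0.5 used to generate the data.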

    Good luck for your thesis!

    -Marius
  • CaptainChaos Member Posts: 17  Maven
    Hi Marius,
    thanks for your answer, but the Captain is still off course...
    I tried out the SVM model but don't know if I used it right, so I am asking you some more questions, sorry for that. If I understood correctly, when using the SVM I have to select one attribute as a label and one as the attribute to be predicted; I hope I am right so far, please correct me if not. The problem I have is that the data set isn't labeled yet, so I chose the id as the label, which is maybe a bit stupid because now each label consists of only one data record. So my next questions are: first, should I try to put the records into classes/labels, and second, would it be useful to put the values of TotalConsumption into intervals like 30-32 and so on, because I think predicting exact values like 30.46 is not very likely to work?
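
For reference, putting values into fixed-width intervals as described here could look like the following Python sketch. The function name `interval_label` and the width of 2 are illustrative assumptions, not part of any RapidMiner operator. Binning turns the task into classification and discards the differences between values within a bin.

```python
import math

def interval_label(value, width=2.0):
    """Return the label of the fixed-width interval [low, low + width) containing value."""
    low = math.floor(value / width) * width
    return f"{low:g}-{low + width:g}"

# 30.46 falls into the interval that starts at 30 and ends before 32.
print(interval_label(30.46))
```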

    Furthermore, I would like to know what the values shown by the Kernel Model (SVM) mean:
    functionvalue  alpha   abs(alpha)  SupportVector  DrivenDistance  Time   Weight  Drivernot
    774.329        -0.200  0.200       SupportVector  1.128           1.010  -0.766  0.988
    I don't understand what the function value, alpha and abs(alpha) stand for, nor why the values under the other attributes are close to 1, both negative and positive. What do these values tell me?

    Last but not least, I am not getting any result in the Performance Vector (SVM); it just shows two messages: "svm_objective_function: -6327798.250" and "no_support_vectors: 76.000".

    My process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.000" expanded="true" name="Process">
        <process expanded="true" height="467" width="815">
          <operator activated="true" class="read_excel" compatibility="5.2.000" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
            <parameter key="excel_file" value="C:\Dokumente und Einstellungen\rrojas\Desktop\Rapidminer_Forum.xls.xls"/>
            <parameter key="imported_cell_range" value="A1:G77"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations">
              <parameter key="0" value="Name"/>
            </list>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="VEHICLEID.true.integer.attribute"/>
              <parameter key="1" value="DrivenDistance.true.numeric.attribute"/>
              <parameter key="2" value="Time.true.integer.attribute"/>
              <parameter key="3" value="TotalConsumptio.true.real.attribute"/>
              <parameter key="4" value="Weight.true.numeric.attribute"/>
              <parameter key="5" value="DriverNote.true.real.attribute"/>
              <parameter key="6" value="RoutDificulty.true.real.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.2.000" expanded="true" height="76" name="Set Role (2)" width="90" x="112" y="210">
            <parameter key="name" value="TotalConsumptio"/>
            <parameter key="target_role" value="prediction"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.2.000" expanded="true" height="76" name="Set Role" width="90" x="313" y="75">
            <parameter key="name" value="VEHICLEID"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="support_vector_machine" compatibility="5.2.000" expanded="true" height="112" name="SVM" width="90" x="514" y="75"/>
          <connect from_op="Read Excel" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
          <connect from_op="Set Role (2)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="SVM" to_port="training set"/>
          <connect from_op="SVM" from_port="model" to_port="result 1"/>
          <connect from_op="SVM" from_port="estimated performance" to_port="result 2"/>
          <connect from_op="SVM" from_port="exampleSet" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>

    Sorry for asking so much, but the manual didn't help me with any of these issues!
    Thanks in advance for any solutions, suggestions or explanations  ;)

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869   Unicorn
    Hi,

    let's start with your dataset: you are right, the SVM needs a label to work correctly. However, this label does not need to be categorical: you can also predict continuous values. This is called regression (in contrast to classification for categorical values).

    Now let's have a look at your process. First of all, you said you are interested in weights, but you did not connect the weights output of the svm - you should do that :) You are probably not interested in all other ports in your context :)

    Defining the id as label does not make any sense, as you correctly stated. But you said that probably the relation of one or several attributes to another attribute might be interesting - set that attribute as label.
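
Once a regression model with the attribute of interest as label exists, a range of the form "TotalConsumption should be between x and y" can be sketched from the model's residuals. The Python example below is only an illustration: the `predict` function with its coefficients stands in for an already fitted model, the records are made up, and the band of two residual standard deviations is one common convention rather than the only choice.

```python
import statistics

def predict(distance, weight):
    """HYPOTHETICAL fitted regression model; the coefficients are made up."""
    return 0.25 * distance + 0.5 * weight

# MADE-UP records: (distance, weight, observed consumption).
records = [
    (600, 28, 165.0),
    (480, 38, 137.5),
    (300, 30, 91.0),
    (650, 25, 174.5),
    (410, 35, 120.0),
]

# Residual = observed value minus model prediction.
residuals = [c - predict(d, w) for d, w, c in records]
# Sample standard deviation of the residuals, used as the typical error size.
spread = statistics.stdev(residuals)

def plausible_range(distance, weight, k=2.0):
    """Return (low, high): the prediction plus/minus k residual standard deviations."""
    center = predict(distance, weight)
    return center - k * spread, center + k * spread

def is_plausible(distance, weight, consumption, k=2.0):
    """Check whether an observed consumption lies inside the plausible range."""
    low, high = plausible_range(distance, weight, k)
    return low <= consumption <= high

print(plausible_range(500, 30))
print(is_plausible(500, 30, 140.0), is_plausible(500, 30, 150.0))
```

The width of the band depends on how well the model fits: a model that also uses route difficulty and driver grade would, if those attributes matter, leave smaller residuals and give a tighter range.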

    Last but not least, I think you would profit from getting a deeper understanding of data mining in general and of RapidMiner in particular. At least for the latter, the video tutorials on our website (also linked from the post in my signature) are a good start.

    Best,
    Marius