Nearest neighbours always gives the same prediction?

traveria Member Posts: 2 Contributor I
edited November 2018 in Help
Hello, I am getting a surprising result:

I run a very simple test with nearest neighbors (see the XML code below), using a training dataset and a test dataset (see the short datasets below).

The strange result is that I always get the same predicted value, regardless of which test example I use  :o

If I use the "ExampleSetGenerator" operators instead of reading the datasets from files (activate them in the process below), I get a different prediction for every new test example, as expected.

Can anyone explain why I always get the same prediction when I read the data from a file?

Any hint or solution is welcome!

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/prova_suma.aml"/>
    </operator>
    <operator name="ExampleSetGenerator" class="ExampleSetGenerator" activated="no">
        <parameter key="number_examples" value="10000"/>
        <parameter key="target_function" value="polynomial"/>
    </operator>
    <operator name="NearestNeighbors" class="NearestNeighbors">
    </operator>
    <operator name="ExampleSource (4)" class="ExampleSource">
        <parameter key="attributes" value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/prova_suma_test.aml"/>
        <parameter key="permutate" value="true"/>
    </operator>
    <operator name="ExampleSetGenerator (2)" class="ExampleSetGenerator" activated="no">
        <parameter key="number_examples" value="10"/>
        <parameter key="target_function" value="polynomial"/>
    </operator>
    <operator name="ExampleRangeFilter" class="ExampleRangeFilter">
        <parameter key="first_example" value="2"/>
        <parameter key="last_example" value="2"/>
    </operator>
    <operator name="ModelApplier" class="ModelApplier">
        <list key="application_parameters">
        </list>
        <parameter key="create_view" value="true"/>
        <parameter key="keep_model" value="true"/>
    </operator>
</operator>

TRAINING DATA

1 1.6 0.84 0 0.76
2 2.17 0.91 0.3 0.96
3 1.61 0.14 0.48 1
4 0.84 -0.76 0.6 1
5 0.74 -0.96 0.7 1
6 1.5 -0.28 0.78 1
7 2.5 0.66 0.85 1
8 2.89 0.99 0.9 1
9 2.37 0.41 0.95 1
10 1.46 -0.54 1 1
11 1.04 -1 1.04 1
12 1.54 -0.54 1.08 1
13 2.53 0.42 1.11 1
14 3.14 0.99 1.15 1
15 2.83 0.65 1.18 1
16 1.92 -0.29 1.2 1
17 1.27 -0.96 1.23 1
18 1.5 -0.75 1.26 1
19 2.43 0.15 1.28 1
20 3.21 0.91 1.3 1

TEST DATA

21 3.16 0.84 1.32 1
22 2.33 -0.01 1.34 1
23 1.52 -0.85 1.36 1
24 1.47 -0.91 1.38 1
25 2.27 -0.13 1.4 1
26 3.18 0.76 1.41 1
27 3.39 0.96 1.43 1
28 2.72 0.27 1.45 1
29 1.8 -0.66 1.46 1
30 1.49 -0.99 1.48 1

Answers

  • haddock Member Posts: 849 Maven
    Hi,

    I'm not sure the result is as surprising as you think. I can replicate your problem on your own data if I simply include the left-hand column as a normal attribute, even though it looks much more like an Id attribute. If you treat it as an Id, your "surprising" result disappears  :o So I think you should check your AML file to see how you've been handling that column.

    Here's some code to illustrate the point. If you leave "1" as the value for "select_which" in the very first operator, all the predictions are the same; if you insert "2" instead, they are not. That is because the second example source marks column one as an Id column, whereas the first does not.
    <operator name="Root" class="Process" expanded="yes">
        <operator name="OperatorSelector" class="OperatorSelector" expanded="yes">
            <operator name="SimpleExampleSource" class="SimpleExampleSource">
                <parameter key="filename" value="C:\Users\CJFP\Documents\rm_workspace\prob.txt"/>
                <parameter key="label_column" value="2"/>
            </operator>
            <operator name="SimpleExampleSource (2)" class="SimpleExampleSource">
                <parameter key="filename" value="C:\Users\CJFP\Documents\rm_workspace\prob.txt"/>
                <parameter key="label_column" value="2"/>
                <parameter key="id_column" value="1"/>
            </operator>
        </operator>
        <operator name="IOMultiplier" class="IOMultiplier">
            <parameter key="io_object" value="ExampleSet"/>
        </operator>
        <operator name="ExampleRangeFilter" class="ExampleRangeFilter">
            <parameter key="first_example" value="1"/>
            <parameter key="last_example" value="20"/>
        </operator>
        <operator name="NearestNeighbors" class="NearestNeighbors">
        </operator>
        <operator name="IOSelector" class="IOSelector">
            <parameter key="io_object" value="ExampleSet"/>
        </operator>
        <operator name="ExampleRangeFilter (2)" class="ExampleRangeFilter">
            <parameter key="first_example" value="21"/>
            <parameter key="last_example" value="30"/>
        </operator>
        <operator name="ModelApplier" class="ModelApplier">
            <parameter key="keep_model" value="true"/>
            <list key="application_parameters">
            </list>
            <parameter key="create_view" value="true"/>
        </operator>
    </operator>
    I've attached the datafile with all 30 examples, you'll need to adjust the path to it in order to run the demo.
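The effect of the Id column can also be reproduced outside RapidMiner. The following is a minimal Python sketch, not the RapidMiner implementation. It assumes a single nearest neighbour (k = 1), plain Euclidean distance without normalization, and that the second column is the label, as in the "label_column" setting above. The rows are copied from the training and test data in the question; the helper names `features` and `predict` are made up for illustration.

```python
import math

# Columns: id, label (assumed to be column 2), then three regular attributes.
TRAIN = [
    (1, 1.6, 0.84, 0, 0.76), (2, 2.17, 0.91, 0.3, 0.96),
    (3, 1.61, 0.14, 0.48, 1), (4, 0.84, -0.76, 0.6, 1),
    (5, 0.74, -0.96, 0.7, 1), (6, 1.5, -0.28, 0.78, 1),
    (7, 2.5, 0.66, 0.85, 1), (8, 2.89, 0.99, 0.9, 1),
    (9, 2.37, 0.41, 0.95, 1), (10, 1.46, -0.54, 1, 1),
    (11, 1.04, -1, 1.04, 1), (12, 1.54, -0.54, 1.08, 1),
    (13, 2.53, 0.42, 1.11, 1), (14, 3.14, 0.99, 1.15, 1),
    (15, 2.83, 0.65, 1.18, 1), (16, 1.92, -0.29, 1.2, 1),
    (17, 1.27, -0.96, 1.23, 1), (18, 1.5, -0.75, 1.26, 1),
    (19, 2.43, 0.15, 1.28, 1), (20, 3.21, 0.91, 1.3, 1),
]
TEST = [
    (21, 3.16, 0.84, 1.32, 1), (22, 2.33, -0.01, 1.34, 1),
    (23, 1.52, -0.85, 1.36, 1), (24, 1.47, -0.91, 1.38, 1),
    (25, 2.27, -0.13, 1.4, 1), (26, 3.18, 0.76, 1.41, 1),
    (27, 3.39, 0.96, 1.43, 1), (28, 2.72, 0.27, 1.45, 1),
    (29, 1.8, -0.66, 1.46, 1), (30, 1.49, -0.99, 1.48, 1),
]


def features(row, use_id):
    """Return the columns used for the distance; the label (index 1) is never included."""
    return ((row[0],) if use_id else ()) + row[2:]


def predict(query, train, use_id):
    """1-NN: return the label of the training row closest to the query."""
    nearest = min(
        train,
        key=lambda row: math.dist(features(row, use_id), features(query, use_id)),
    )
    return nearest[1]


with_id = [predict(q, TRAIN, use_id=True) for q in TEST]
without_id = [predict(q, TRAIN, use_id=False) for q in TEST]

print("Id used as attribute:", with_id)
print("Id ignored:          ", without_id)
```

When the Id takes part in the distance, every test row (Ids 21 to 30) is closest to training row 20, because the Id gap outweighs the differences in the other columns. The first list therefore repeats the label of row 20, while the second list varies.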

    [attachment deleted by admin]
  • traveria Member Posts: 2 Contributor I
    Many thanks Haddock,

    After some investigation I realized that the algorithm produces the same prediction in every case because the training dataset does not have the same data description (the metadata in the AML file) as the test dataset. As a result, the algorithm does not know what to predict and keeps returning the last correct prediction.

    I still do not understand why the two datasets do not have the same structure. Try the minimal process at the end of this message to see it for yourself: what it writes first is not the same as what it writes afterwards.

    After solving this little inconvenience I can run the Nearest Neighbors correctly.
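A mismatch between two data descriptions can be spotted mechanically before a model is applied. The sketch below is a rough illustration only: it does not parse AML files, and the column names, roles, and the helper `describe_mismatch` are hypothetical. Each description is modelled as a list of (name, role) pairs in column order.

```python
def describe_mismatch(train_columns, test_columns):
    """Return a list of differences between two column descriptions."""
    problems = []
    if len(train_columns) != len(test_columns):
        problems.append(
            f"column count differs: {len(train_columns)} vs {len(test_columns)}"
        )
    # Compare position by position; zip stops at the shorter description.
    for position, (train_col, test_col) in enumerate(
        zip(train_columns, test_columns), start=1
    ):
        if train_col != test_col:
            problems.append(f"column {position}: {train_col} vs {test_col}")
    return problems


# Hypothetical descriptions: the first column is an id in one file
# and a regular attribute in the other.
train = [("col1", "id"), ("col2", "label"), ("col3", "attribute"),
         ("col4", "attribute"), ("col5", "attribute")]
test = [("col1", "attribute"), ("col2", "label"), ("col3", "attribute"),
        ("col4", "attribute"), ("col5", "attribute")]

problems = describe_mismatch(train, test)
for problem in problems:
    print(problem)
```

An empty result means the two descriptions agree in column count, names, and roles.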

    Many thanks for your comments anyway ;D

    Miquel

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="4.2">

      <operator name="Root" class="Process" expanded="yes">
          <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
              <parameter key="number_examples" value="1000"/>
              <parameter key="target_function" value="polynomial"/>
          </operator>
          <operator name="ExampleSetWriter (2)" class="ExampleSetWriter">
              <parameter key="attribute_description_file" value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/proves nns/polinomi_set.aml"/>
              <parameter key="example_set_file" value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/proves nns/polinomi_set.dat"/>
              <parameter key="quote_whitespace" value="false"/>
          </operator>
          <operator name="SimpleExampleSource (2)" class="SimpleExampleSource">
              <parameter key="filename" value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/proves nns/polinomi_set.dat"/>
              <parameter key="label_column" value="6"/>
              <parameter key="use_quotes" value="true"/>
          </operator>
          <operator name="FeatureRangeRemoval" class="FeatureRangeRemoval">
              <parameter key="first_attribute" value="6"/>
              <parameter key="last_attribute" value="6"/>
          </operator>
          <operator name="ExampleSetWriter" class="ExampleSetWriter">
              <parameter key="attribute_description_file" value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/proves nns/polinomi_set_test.aml"/>
              <parameter key="example_set_file" value="/home/miquel/Documents/I+D+I/MICROPREDICCIO/RapidMiner WORKSPACE/PFC 2008-2009/proves nns/polinomi_set_test.dat"/>
              <parameter key="quote_whitespace" value="false"/>
          </operator>
      </operator>

    </process>
  • haddockhaddock Member Posts: 849 Maven
    Hi,

    It is rather difficult to comment on this unless you show what you put in "polinomi_set.aml". Perhaps you will oblige us?

    However, there are things that are obvious, whatever you put in that file....

    1. The generator produces 5 attributes and 1 label = 6 columns.

    2. Removing attribute number 6 cannot work, unless there are 6 attributes.

    3. There can only be 6 attributes if the label column is set to 0.

    4. But in your code it is marked as being in column 6!

    5. So this code NEVER could work, whatever is in "polinomi_set.aml".
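The counting argument in the five points above can be written out as a small Python sketch. This is a simplified model of the reasoning, not RapidMiner's actual behaviour: it assumes a file with 6 columns, that a "label_column" of 0 means no label is set, and that attribute indices refer only to the non-label columns. The function names are hypothetical.

```python
def regular_attribute_count(n_columns, label_column):
    """Count the non-label columns; label_column == 0 is taken to mean 'no label'."""
    if label_column == 0:
        return n_columns
    if not 1 <= label_column <= n_columns:
        raise ValueError("label_column is outside the available columns")
    return n_columns - 1


def can_remove(first_attribute, last_attribute, n_attributes):
    """True if the 1-based attribute range lies inside the regular attributes."""
    return 1 <= first_attribute <= last_attribute <= n_attributes


N_COLUMNS = 6  # 5 generated attributes + 1 label

with_label = regular_attribute_count(N_COLUMNS, label_column=6)
without_label = regular_attribute_count(N_COLUMNS, label_column=0)

print(with_label, can_remove(6, 6, with_label))
print(without_label, can_remove(6, 6, without_label))
```

Under these assumptions, marking column 6 as the label leaves 5 regular attributes, so a removal range of 6 to 6 has nothing to act on. The range is only valid when no label is set.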

    Which leaves me with a question, what on earth were you trying to achieve with this post?