Model Applier Output Misassigning Internal Mappings for Nominal Values

martyns Member Posts: 15 Maven
edited November 2018 in Help
Following up on the earlier Model Applier problems with internal nominal mappings, I am still having problems! It seems that RapidMiner's model applier has trouble with nominal values that are not listed first in the .aml files.

Following the workaround, in the first step I load my training data (attached) from an Excel file, write it out with ExampleSetWriter, load it back in with ExampleSource, create a model and then write the model:
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExcelExampleSource" class="ExcelExampleSource">
        <parameter key="excel_file" value="D:\ADDU\Share\RapidMiner\RapidTrain.xls"/>
        <parameter key="first_row_as_names" value="true"/>
        <parameter key="create_label" value="true"/>
        <parameter key="label_column" value="9"/>
        <parameter key="create_id" value="true"/>
        <parameter key="id_column" value="8"/>
    </operator>
    <operator name="ExampleSetWriter" class="ExampleSetWriter">
        <parameter key="example_set_file" value="D:\ADDU\Share\Rapidminer\train.dat"/>
        <parameter key="attribute_description_file" value="D:\ADDU\Share\Rapidminer\train.aml"/>
        <parameter key="overwrite_mode" value="overwrite"/>
    </operator>
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="D:\ADDU\Share\RapidMiner\train.aml"/>
    </operator>
    <operator name="W-J48" class="W-J48">
    </operator>
    <operator name="ModelWriter" class="ModelWriter">
        <parameter key="model_file" value="D:\ADDU\Share\Rapidminer\J48.mod"/>
        <parameter key="output_type" value="XML"/>
    </operator>
</operator>
Next, I read in a test set consisting of a single example from an Excel file (temp.xls) and write it out with ExampleSetWriter. I guess this step isn't strictly necessary, but it helps with what is to come:
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExcelExampleSource" class="ExcelExampleSource" breakpoints="after">
        <parameter key="excel_file" value="D:\ADDU\Share\Rapidminer\temp.xls"/>
        <parameter key="first_row_as_names" value="true"/>
        <parameter key="create_label" value="true"/>
        <parameter key="label_column" value="9"/>
        <parameter key="create_id" value="true"/>
        <parameter key="id_column" value="8"/>
    </operator>
    <operator name="ExampleSetWriter" class="ExampleSetWriter">
        <parameter key="example_set_file" value="D:\ADDU\Share\Rapidminer\temp.dat"/>
        <parameter key="attribute_description_file" value="D:\ADDU\Share\Rapidminer\temp.aml"/>
        <parameter key="overwrite_mode" value="overwrite"/>
    </operator>
</operator>
THIS PART IS THE WORKAROUND: I now manually open train.aml and temp.aml. I copy all of the attribute information from train.aml over the attribute information in temp.aml so that all of the attribute information in both files is exactly the same.

In the third part I apply the model to a new instance of test data; for this run I have used the same temp.xls. This is the process I call for my real-world predictions. I load temp.xls, then use ExampleSetWriter to write out only the temp.dat file, so as to preserve the correct attribute information copied in the workaround above. I have added an IOConsumer just as a control for testing.

I then load the test example using ExampleSource to load temp.aml. A FeatureIterator scrubs out any missing data, which in our set is represented by 999; then I load the model, apply it and write out the prediction.
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExcelExampleSource" class="ExcelExampleSource">
        <parameter key="excel_file" value="D:\ADDU\Share\Rapidminer\temp.xls"/>
        <parameter key="first_row_as_names" value="true"/>
        <parameter key="create_label" value="true"/>
        <parameter key="label_column" value="9"/>
        <parameter key="create_id" value="true"/>
        <parameter key="id_column" value="8"/>
    </operator>
    <operator name="ExampleSetWriter" class="ExampleSetWriter">
        <parameter key="example_set_file" value="D:\ADDU\Share\Rapidminer\temp.dat"/>
        <parameter key="overwrite_mode" value="overwrite"/>
    </operator>
    <operator name="IOConsumer" class="IOConsumer">
        <parameter key="io_object" value="ExampleSet"/>
    </operator>
    <operator name="ExampleSource" class="ExampleSource" breakpoints="after">
        <parameter key="attributes" value="D:\ADDU\Share\RapidMiner\temp.aml"/>
    </operator>
    <operator name="FeatureIterator" class="FeatureIterator" expanded="yes">
        <parameter key="work_on_input" value="false"/>
        <operator name="Mapping" class="Mapping">
            <parameter key="attributes" value="%{loop_feature}"/>
            <list key="value_mappings">
            </list>
            <parameter key="replace_what" value="999"/>
            <parameter key="replace_by" value="?"/>
        </operator>
    </operator>
    <operator name="ModelLoader" class="ModelLoader">
        <parameter key="model_file" value="D:\ADDU\Share\Rapidminer\J48.mod"/>
    </operator>
    <operator name="ModelApplier" class="ModelApplier" breakpoints="after">
        <list key="application_parameters">
        </list>
    </operator>
    <operator name="ExcelExampleSetWriter" class="ExcelExampleSetWriter">
        <parameter key="excel_file" value="D:\ADDU\Share\Rapidminer\RapidminerPrediction.xls"/>
    </operator>
</operator>
Now here is the problem. The output file has only ? where there should be data!

For example,
SEX         MARSTAT         EDUC         EMPLOY         ACCOM     SF36PHY1 GROUP UR      SUCCESS
Female Current Long-Term Senior (Yr 12) Unemployed Own Home         CBT only 191.00


becomes this in the output:
UR    SUCCESS SEX MARSTAT         EDUC  EMPLOY  ACCOM   SF36PHY1 GROUP prediction(SUCCESS)
191.0     ?         Current Long-Term ?    ?            ?                         CBT only Unsuccessful        
                                                                                                               
confidence(Unsuccessful) confidence(Successful)
.7                                         .3

Now, let's take a look at the .aml files. You will notice below that the only nominal values being written out are MARSTAT's Current Long-Term and GROUP's CBT only. They are the only values in this example that appear FIRST in their attribute's list in the aml files. So, at least when writing out after the model applier, only the first nominal values are working.
<?xml version="1.0" encoding="windows-1252"?>
<attributeset default_source="train.dat">

  <attribute
    name         = "SEX"
    sourcecol    = "1"
    valuetype    = "nominal">
       <value>Male</value>
       <value>Female</value>
  </attribute>

  <attribute
    name         = "MARSTAT"
    sourcecol    = "2"
    valuetype    = "nominal">
       <value>Current Long-Term</value>
       <value>Previous Long-Term</value>
       <value>Single</value>
  </attribute>

  <attribute
    name         = "EDUC"
    sourcecol    = "3"
    valuetype    = "nominal">
       <value>Uni</value>
       <value>Senior (Yr 12)</value>
       <value>Junior (Yr 10)</value>
       <value>Primary</value>
       <value>Tertiary (Non-Uni)</value>
  </attribute>

  <attribute
    name         = "EMPLOY"
    sourcecol    = "4"
    valuetype    = "nominal">
       <value>Employed</value>
       <value>Unemployed</value>
       <value>Student</value>
  </attribute>

  <attribute
    name         = "ACCOM"
    sourcecol    = "5"
    valuetype    = "nominal">
       <value>Rent</value>
       <value>Own Home</value>
       <value>Other</value>
  </attribute>

  <attribute
    name         = "SF36PHY1"
    sourcecol    = "6"
    valuetype    = "real"/>

  <attribute
    name         = "GROUP"
    sourcecol    = "7"
    valuetype    = "nominal">
       <value>CBT only</value>
       <value>Combination</value>
       <value>Refuseniks</value>
       <value>Acamprosate</value>
       <value>St Judes</value>
       <value>Naltrexone</value>
  </attribute>

  <id
    name         = "UR"
    sourcecol    = "8"
    valuetype    = "integer"/>

  <label
    name         = "SUCCESS"
    sourcecol    = "9"
    valuetype    = "nominal">
       <value>Unsuccessful</value>
       <value>Successful</value>
  </label>

</attributeset>
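
To make the "first in the list" idea concrete, here is a minimal Python sketch that reads an abbreviated copy of the attribute description above and numbers each nominal value by its position. It assumes that RapidMiner's internal index for a nominal value is simply its position in the .aml file; that is an assumption for illustration, not something confirmed in this thread, and `nominal_mappings` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the attribute description above (three attributes only).
AML = """
<attributeset default_source="train.dat">
  <attribute name="SEX" sourcecol="1" valuetype="nominal">
    <value>Male</value>
    <value>Female</value>
  </attribute>
  <attribute name="EMPLOY" sourcecol="4" valuetype="nominal">
    <value>Employed</value>
    <value>Unemployed</value>
    <value>Student</value>
  </attribute>
  <attribute name="SF36PHY1" sourcecol="6" valuetype="real"/>
</attributeset>
"""

def nominal_mappings(aml_text):
    """Return {attribute name: {value: index}} for every nominal attribute."""
    root = ET.fromstring(aml_text)
    mappings = {}
    for element in root:  # <attribute>, <id> and <label> children
        if element.get("valuetype") != "nominal":
            continue  # numeric attributes carry no value list
        values = [v.text for v in element.findall("value")]
        # Assumption: the internal index is the position in the file.
        mappings[element.get("name")] = {value: i for i, value in enumerate(values)}
    return mappings

mappings = nominal_mappings(AML)
print(mappings["SEX"])
print(mappings["EMPLOY"])
```

Under that assumption, "first in the list" means index 0, which is the only index that also exists in a mapping built from a single example.
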
Now, let's use a test set which consists only of first nominal values (attached as tempfirst; you will have to rename it to temp to use my code above).

It works! Confirming my theory.

UR SEX MARSTAT EDUC EMPLOY ACCOM SF36PHY1 GROUP prediction(SUCCESS) confidence(Unsuccessful) confidence(Successful)
191.0 Male Current Long-Term Uni Employed Rent CBT only Unsuccessful .7 .3

Now take a file in which the first nominal value is never present (attached as tempallnotfirst; rename it to temp to use it). As expected, we get:
UR SEX MARSTAT EDUC EMPLOY ACCOM SF36PHY1 GROUP prediction(SUCCESS) confidence(Unsuccessful) confidence(Successful)
191.0 ? ? ? ? ? ? Successful .4 .6
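
One possible explanation for the pattern above, sketched in Python: suppose the stored index of each value is rewritten to the one from the training mapping, while the display still decodes with a mapping built from the one-row test set. This is a hypothesis that happens to fit the results, not confirmed RapidMiner behaviour, and `displayed_value` is a made-up name for illustration.

```python
# Value order taken from train.aml above.
TRAINING = {
    "SEX": ["Male", "Female"],
    "EDUC": ["Uni", "Senior (Yr 12)", "Junior (Yr 10)", "Primary", "Tertiary (Non-Uni)"],
}

def displayed_value(attribute, value):
    """Hypothetical model: what a one-row test set would show for `value`."""
    application = [value]                      # mapping built from the single row
    stored = TRAINING[attribute].index(value)  # index rewritten to the training one
    # Decoding still uses the one-entry application mapping.
    return application[stored] if stored < len(application) else "?"

print(displayed_value("SEX", "Male"))
print(displayed_value("SEX", "Female"))
print(displayed_value("EDUC", "Senior (Yr 12)"))
```

In this model only values at training index 0 survive; every other index falls outside the one-entry mapping and is shown as ?.
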

Now, going back to our original temp file, we can take a look at the DataTable tab at the end of the experiment. It's a bit messy, but note the EDUC and EMPLOY rows below as examples of data that goes missing. In both cases the statistics column says the mode is unknown, but the information is still available in the range column!

id UR integer avg = 191 +/- 0 [191.000 ; 191.000] 0.0
prediction prediction(SUCCESS) nominal mode = Unsuccessful (1), least = Successful (0) Unsuccessful (1), Successful (0) 0.0
confidence_Unsuccessful confidence(Unsuccessful) real avg = 0.666 +/- 0 [0.666 ; 0.666] 0.0
confidence_Successful confidence(Successful) real avg = 0.334 +/- 0 [0.334 ; 0.334] 0.0
regular SEX nominal mode = unknown Female (0) 0.0
regular MARSTAT nominal mode = Current Long-Term (1), least = Current Long-Term (1) Current Long-Term (1) 0.0
regular EDUC nominal mode = unknown Senior (Yr 12) (0) 0.0
regular EMPLOY nominal mode = unknown Unemployed (0) 0.0
regular ACCOM nominal mode = unknown Own Home (0) 0.0
regular SF36PHY1 real avg = ? +/- ? [∞ ; -∞] 1.0
regular GROUP nominal mode = CBT only (1), least = CBT only (1) CBT only (1) 0.0

Now, the problem could be in producing the output from the model or in the actual model applier itself.

To try and test if the data is going missing in the model applier I ran the model applier process a few times, each time changing one of the suspect variables to a missing value and found the following predictions:

Original:                         1 191.0 Unsuccessful 0.6657894736842105 0.33421052631578946
Female Missing:             1 191.0 Unsuccessful 0.6657894736842105 0.33421052631578946
EDUC Missing:                1 191.0 Unsuccessful 0.6657894736842105 0.33421052631578946
EMPLOY Missing:          1 191.0 Unsuccessful 0.6657894736842105 0.33421052631578946

Well, I think you get the picture there. The data for these variables seems to be treated by the model applier as if it is missing.


Am I going mad? Have I missed something obvious?
How do I attach my data files?

Answers

  • martyns Member Posts: 15 Maven
    As a followup:

    If I enter the data for two instances into the temp file I get the following results:

    UR SEX MARSTAT EDUC EMPLOY ACCOM SF36PHY1 GROUP prediction(SUCCESS) confidence(Unsuccessful) confidence(Successful)
    132.0 Male Current Long-Term Uni Employed Rent 80.0 CBT only Unsuccessful .7 .3
    191.0 Female Current Long-Term Senior (Yr 12) Unemployed Own Home CBT only Unsuccessful .7 .3

    Works beautifully!

    Actually, I think I have tracked this bug a little further now.

    If I enter the first 10 instances all together at once I get the following
    UR SEX MARSTAT EDUC EMPLOY ACCOM SF36PHY1 GROUP prediction(SUCCESS) confidence(Unsuccessful) confidence(Successful)
    132.0 Male         Current Long-Term   Uni                 Employed         Rent             80.0 CBT only         Unsuccessful .7 .3
    191.0 Female Current Long-Term   Senior (Yr 12) Unemployed Own Home CBT only         Unsuccessful .7 .3
    360.0 Female Current Long-Term   Senior (Yr 12) Employed         Own Home  100.0 CBT only         Unsuccessful .7 .3
    1173.0 Female Current Long-Term   Junior (Yr 10) Employed         Own Home  90.0 Combination Unsuccessful .6 .4
    1191.0 Female Current Long-Term   Junior (Yr 10) Unemployed Own Home  50.0 Combination Successful .3 .7
    1193.0 Female Previous Long-Term  Junior (Yr 10) Unemployed Rent             85.0 Combination Successful .3 .7
    13879.0 Male         Current Long-Term   Junior (Yr 10) Employed         Rent             95.0 Refuseniks Unsuccessful .7 .3
    14562.0 Female Previous Long-Term  Junior (Yr 10) Unemployed Rent           100.0 CBT only         Unsuccessful .7 .3
    15655.0 Male         Single                   Senior (Yr 12) Employed         Rent             55.0 Combination Successful .3 .7
    16126.0 Male         Single                   Junior (Yr 10) Employed         Own Home  90.0 Combination Unsuccessful .6 .4

    They all work! But if I enter them individually, one at a time, we see the same behaviour as above: values are only displayed if they are first in the nominal list. Individual results:

    132.0 Male Current Long-Term Uni Employed Rent  80.0 CBT only Unsuccessful .7 .3
    191.0 ? Current Long-Term ? ?         ?         CBT only Unsuccessful .7 .3
    360.0 ? Current Long-Term ? Employed ?   100.0 CBT only Unsuccessful .7 .3
    1173.0 ? Current Long-Term ? Employed ?   90.0 ?         Unsuccessful .6 .4
    1191.0 ? Current Long-Term ? ?         ?   50.0 ?         Successful .3 .7
    1193.0 ? ?                         ? ?         Rent   85.0 ?         Successful .3 .7
    13879.0 Male Current Long-Term ? Employed Rent   95.0 ?         Unsuccessful .7 .3
    14562.0 ? ?                         ? ?         Rent   100.0 CBT only Unsuccessful .7 .3
    15655.0 Male ?                         ? Employed Rent  55.0 ?         Successful .3 .7
    16126.0 Male ?                         ? Employed ?   90.0 ?         Unsuccessful .6 .4

    The predictions made are all the same as above, so I hope that that is an indication that the predictions are being made correctly and using all of the data.

    It gets interesting when you start looking at two instances together.

    If we combine the first two instances then all of the second prints out correctly!
    132.0 Male         Current Long-Term Uni                 Employed         Rent             80.0 CBT only Unsuccessful .7 .3
    191.0 Female Current Long-Term Senior (Yr 12) Unemployed Own Home CBT only Unsuccessful .7 .3

    If we combine the second and third we get something quite horribly wrong. Now both employed and unemployed display BUT it is showing the wrong value for the wrong person!
    191.0 ? Current Long-Term ? Employed         ?         CBT only Unsuccessful .7 .3
    360.0 ? Current Long-Term ? Unemployed ? 100.0 CBT only Unsuccessful .7 .3

    For these two, the ExcelExampleSource breakpoint shows:
    1 191.0 Female Current Long-Term Senior (Yr 12) Unemployed Own Home NaN         CBT only
    2 360.0 Female Current Long-Term Senior (Yr 12) Employed         Own Home 100.0 CBT only

    Then the ExampleSetWriter:
    1 191.0 Female Current Long-Term Senior (Yr 12) Unemployed Own Home NaN         CBT only
    2 360.0 Female Current Long-Term Senior (Yr 12) Employed         Own Home 100.0 CBT only

    At the ExampleSource breakpoint all is still well:
    1 191.0 Female Current Long-Term Senior (Yr 12) Unemployed Own Home NaN         CBT only
    2 360.0 Female Current Long-Term Senior (Yr 12) Employed         Own Home 100.0 CBT only

    However, something goes horribly wrong when we reach the model applier breakpoint!
    Looking at the data view tab (I truncated the first probability)
    1 191.0  Unsuccessful  0.66578947  0.33421052631578946  ?  Current Long-Term ? Employed         ? NaN       CBT only
    2 360.0  Unsuccessful  0.66578947  0.33421052631578946  ?  Current Long-Term ? Unemployed ? 100.0 CBT only

    THE EMPLOY VALUES HAVE SWITCHED INSTANCES!

    The log states the following which seems ok:

    May 24, 2009 2:16:00 PM: [NOTE] ExcelExampleSource: Breakpoint reached
    P May 24, 2009 2:16:58 PM: [NOTE] ExampleSetWriter: Breakpoint reached
    P May 24, 2009 2:17:37 PM: [NOTE] ExampleSource: Breakpoint reached
    P May 24, 2009 2:18:17 PM: [NOTE] ModelLoader: Breakpoint reached
    P May 24, 2009 2:18:31 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'SEX', training: 2, application: 1
    P May 24, 2009 2:18:31 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'MARSTAT', training: 3, application: 1
    P May 24, 2009 2:18:31 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'EDUC', training: 5, application: 1
    P May 24, 2009 2:18:31 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'EMPLOY', training: 3, application: 2
    P May 24, 2009 2:18:31 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'ACCOM', training: 3, application: 1
    P May 24, 2009 2:18:31 PM: [Warning] W-J48: The number of nominal values is not the same for training and application for attribute 'GROUP', training: 6, application: 1
    P May 24, 2009 2:18:31 PM: [NOTE] ModelApplier: Breakpoint reached

    If we combine the first and third instances everything seems ok again
    132.0 Male         Current Long-Term Uni                 Employed Rent                 80.0         CBT only Unsuccessful .7 .3
    360.0 Female Current Long-Term Senior (Yr 12) Employed Own Home 100.0 CBT only Unsuccessful .7 .3

    If I send the last three, the output goes badly wrong: values are output in some sort of order, but not the order in which they were input!
    input:
                    Female Previous Long-Term  Junior (Yr 10)  Unemployed Rent         100         CBT only         14562.00
                    Male         Single                   Senior (Yr 12)  Employed Rent         55         Combination 15655.00
                    Male         Single                   Junior (Yr 10)  Employed Own Home 90         Combination 16126.00

    output:
    14562.0 Male Single                   ?                   Employed Rent               100.0 CBT only         Unsuccessful .7 .3
    15655.0 Female ?                           Senior (Yr 12) Unemployed Rent                 55.0         Combination Successful .3 .7
    16126.0 Female ?         ?                                   Unemployed Own Home 90.0         Combination Unsuccessful .6 .4
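
The scrambled output above is what one would get if the stored indices followed the training mapping while the display decoded them with a mapping built from the test rows in order of first appearance. The Python sketch below models that hypothesis; it is an assumption that fits the rows in this post, not a statement about RapidMiner's actual internals, and `simulate` is a hypothetical name.

```python
# Value order taken from train.aml.
TRAINING = {
    "SEX": ["Male", "Female"],
    "MARSTAT": ["Current Long-Term", "Previous Long-Term", "Single"],
    "EDUC": ["Uni", "Senior (Yr 12)", "Junior (Yr 10)", "Primary", "Tertiary (Non-Uni)"],
    "EMPLOY": ["Employed", "Unemployed", "Student"],
}

def simulate(attribute, column):
    """Hypothetical model of what is displayed after the model applier."""
    # Application mapping: distinct values in order of first appearance.
    application = list(dict.fromkeys(column))
    shown = []
    for value in column:
        stored = TRAINING[attribute].index(value)  # index per the training mapping
        # Decoding uses the (shorter, differently ordered) application mapping.
        shown.append(application[stored] if stored < len(application) else "?")
    return shown

# The three input rows above, column by column.
rows = {
    "SEX": ["Female", "Male", "Male"],
    "MARSTAT": ["Previous Long-Term", "Single", "Single"],
    "EDUC": ["Junior (Yr 10)", "Senior (Yr 12)", "Junior (Yr 10)"],
    "EMPLOY": ["Unemployed", "Employed", "Employed"],
}
for name, column in rows.items():
    print(name, simulate(name, column))
```

For these rows the model gives Male/Female/Female for SEX, Single/?/? for MARSTAT, ?/Senior (Yr 12)/? for EDUC and Employed/Unemployed/Unemployed for EMPLOY, matching the output listed above. It also suggests why larger batches look fine: once the test rows happen to introduce the values in the same order as train.aml, the two mappings agree.
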

    Finally if we add the first case back on top
    132.0 Male         Current Long-Term    Uni                     Employed Rent         80.0         CBT only Unsuccessful .7 .3
    14562.0 Female Previous Long-Term  Senior (Yr 12)  Unemployed Rent                 100.0 CBT only Unsuccessful .7 .3
    15655.0 Male         Single                   Junior (Yr 10)  Employed Rent                 55.0         Combination Successful .3 .7
    16126.0 Male         Single                   Senior (Yr 12)  Employed Own Home 90.0         Combination Unsuccessful .6 .4

    Most of them are correct except that EDUC has been flipped.

    So, in summary: the model applier itself seems to be working, as the predictions are consistent;
    numerical values are fine;
    nominal values are being assigned to the wrong category at the model output stage when instances are entered one by one or in small groups!

  • haddock Member Posts: 849 Maven
    Hi Martyn,

    It is a pain that attachments have been disabled, even on personal messages, so it will be difficult to replicate your problem unless you email me the data.

    One thought which may have relevance is this. Sticking a question mark in to indicate missing data doesn't indicate missing data to RM, like this....
    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="target_function" value="random"/>
        </operator>
        <operator name="BinDiscretization" class="BinDiscretization">
            <parameter key="range_name_type" value="short"/>
        </operator>
        <operator name="Replace" class="Replace">
            <parameter key="attributes" value=".*"/>
            <parameter key="replace_what" value="range1"/>
            <parameter key="replace_by" value="?"/>
        </operator>
    </operator>
    Whereas putting nothing in the "replace_by" slot does. I realise that this may be completely irrelevant, but it is difficult to tell without the data. Still the point to take is that "?" is a nominal value as far as RM sees it.
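
The distinction can be illustrated outside RapidMiner with a small Python analogy: a literal "?" string is just one more category, whereas a genuine missing marker (NaN here) is something a missing-value count can detect. This is only an analogy for the point above and says nothing about how the Replace operator is implemented; `count_missing` is a made-up helper.

```python
import math

def count_missing(values):
    """Count entries that are truly missing (NaN), not the literal string "?"."""
    return sum(1 for v in values if isinstance(v, float) and math.isnan(v))

as_text = ["range2", "?", "range3"]              # "?" is just another category
as_missing = ["range2", float("nan"), "range3"]  # a real missing marker

print(count_missing(as_text))
print(count_missing(as_missing))
print(sorted(set(as_text)))
```

The first list reports no missing entries and three distinct categories, one of which is "?"; only the second list contains a value that counts as missing.
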

    Ooops, time for Sunday lunch and copious grog, better zoom off.


  • martyns Member Posts: 15 Maven
    Thanks for the reply Haddock.

    I am happy to email the data I used for the test to anyone who is interested in taking a look.

    I tried leaving the replace with box empty, but it didn't appear to replace anything; the values just passed through.

    To make a test of this I took the same 10 cases as I have been using for prediction above but replaced all of the numeric values for the only numeric variable with 999 which is being used for missing values.

    For the first run I used question mark in the replace with box and for the second run I left the replace with box empty. Only 2 predictions changed under these 2 conditions.

    For ? we have:
    1173.0 Female  Current Long-Term Junior (Yr 10) Employed Own Home  Combination Successful .5 .5
    13879.0 Male     Current Long-Term Junior (Yr 10) Employed Rent     Refuseniks Successful .5 .5
    (It should be noted here that no values were output for the numeric variable)

    These are the same values that I get when I leave the numeric values empty:
    1173.0 Female Current Long-Term Junior (Yr 10) Employed Own Home  Combination Successful .5 .5
    13879.0 Male Current Long-Term Junior (Yr 10) Employed Rent             Refuseniks Successful .5 .5


    Using an empty replace with
    1173.0 Female  Current Long-Term Junior (Yr 10) Employed Own Home  999.0 Combination Unsuccessful .6 .4
    13879.0 Male     Current Long-Term Junior (Yr 10) Employed Rent             999.0 Refuseniks Successful .0 1.0
    (So you can see that it spits out 999s here, seemingly not replaced and the predictions are different)

    When I use a value of 998 for the numeric variables the predictions are the same as in the 999 case.
    1173.0 Female  Current Long-Term Junior (Yr 10) Employed Own Home  998.0 Combination Unsuccessful .6 .4
    13879.0 Male     Current Long-Term Junior (Yr 10) Employed Rent             998.0 Refuseniks Successful .0 1.0

    So in this case it really looks like the ? is acting as a missing value, but with nothing in the replace with box nothing is replaced, and the 999 is therefore interpreted as the number 999.

    Are you sure that an empty replace with box really works?

    Does it do different things when applied to numeric and nominal values?