Internal mapping issue

inthewoods Member Posts: 9 Contributor II
edited November 2018 in Help
I keep getting the error:

Dec 17, 2010 9:32:54 PM WARNING: SimpleDistribution: The internal nominal mappings are not the same between training and application for attribute 'att6'. This will probably lead to wrong results during model application.
Dec 17, 2010 9:32:54 PM WARNING: SimpleDistribution: The internal nominal mappings are not the same between training and application for attribute 'att7'. This will probably lead to wrong results during model application.

Now I've carefully mapped my training data and my prediction/out-of-sample data. I've looked at the two files and the attributes line up, meaning att6 is the same across both files, and att7 is the same across both files.

Anybody have any idea what's going on here?

Answers

  • inthewoods Member Posts: 9 Contributor II
    I gotta say I'm a bit disappointed that neither the RM community nor support has anything to say about this error - I thought the community was a bit stronger. I understand that Sebastian can't answer every question, but it is frustrating to try and get up and running on a program as complex as RM without a strong support forum.
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,749  RM Founder
    Hi,

    (don't bother with any greetings - why should you?)

    this is said by someone who never bothered to answer any question but only insists on having his own answered. If all our community members had this spirit, it would indeed be a weak one. Until you get more involved, I would appreciate it if you would just stop insulting us, ok?

    Did it come into your mind that a "WARNING" containing the word "probably" might indeed not be a real problem? Maybe this is simply the reason why no one cared enough to interrupt his Christmas festivities just to ensure that you can sleep calmly. By the way: the forum is full of hints about the "internal nominal mapping" issues. Maybe the real community members knew that?

    With best regards,
    Ingo
  • inthewoods Member Posts: 9 Contributor II
    Hi,

    Let's see - "never bother to answer a question" - I've been using the program for about 2 weeks.  I hardly feel qualified to answer any question - so that is a pretty ridiculous statement.  I belong to several technical groups where I am an active participant, and I spend a great deal of time helping others with their issues.  I'm sorry if you're insulted by my comment - you're right that it was unfair, and I apologize.

    No - it did not come into my mind that the word probably might not be the real problem - because I'm just trying to understand what is going on.

    I'll do another search on internal mapping issues - I did one but didn't find anything that matched my case.  If someone has seen this before, maybe they could post a link as a reference.

    Sorry to have bothered you Ingo.
  • haddock Member Posts: 849  Guru
    Greets Chaps!

    Actually the bloke in the woods may have a point - according to the stats the average forum member never answers a question. So have a thought for poor Ingo and his merry men, who provide, for free, pretty rapid support to those forum members, even over Christmas. However, it is unwise for a novice to declare an error from a warning when the Magus is on the dog watch, most unwise...

    So you, bloke in the woods, have common, and correct, cause with Ingo. Forum members should try giving, just for a pleasant change.

    Wouldn't that be cool?

  • steffen Member Posts: 347  Maven
    Hello together,

    Stumbling across ...

    @inthewoods:
    Can you please post the complete process you executed when the warning occurred, and list all the plugins you have installed? (I further assume that you are using RM 5.1.) Then I'll see what I can do.

    @haddock: Wonderful wording and precisely to the point, as always.

    greetings,

    steffen

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,749  RM Founder
    Hi again,

    thanks for the sympathy, haddock, I guess from time to time I feel like the well-known camel having some problems in the area of straw-logistics  :D

    Inthewoods, sorry for hitting on you. I must admit that the technique "I am not angry. I am only veeeeerry disappointed" has probably rung some bells reminding me of my teenager times  ;)

    And finally: Hi Steffen! Nice to hear from you again! Just this morning I stumbled upon your blog entry about recompiling RM 5.1 from the Zip file and sent it to our developers hoping that they will improve the shortcomings soon. Right now almost everybody is out of office (dog watch is a perfect description...) but next week the others are back as well.

    Cheers,
    Ingo
  • inthewoods Member Posts: 9 Contributor II
    @Ingo - no worries - I was out of line and again apologize for the tone of my post.

    @haddock:  Here's my code - it's pretty simple.  Two files with the exact same column names pulled from Excel files.  The second file (for prediction) has blanks in the column I'm trying to predict.  What I'd really like to be using for a model is the adaboost model, but I can't figure out what I should be using for a learning model within the adaboost module - any thoughts on that appreciated.

    More broadly - am I thinking about this correctly in terms of how I'm handling the files?  This is market data, and I've divided my set into a training set and a prediction set (two separate files), with the column I'm trying to predict being empty in the second file.  Or would I be better off actually having one file and using the split function?

    Another question I have is how to do a rolling window of training and prediction. Imagine a total data set of, say, 50 weeks. In the first test, I want to train on weeks 1-9 and predict week 10, then train on weeks 2-10 and predict week 11, etc. Is this possible in RM?
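
The rolling scheme described in this question can be written down as plain index arithmetic. The sketch below is not a RapidMiner operator and does not use any RapidMiner API; the class name `RollingWindows` and the method `windows` are hypothetical, and weeks are assumed to be numbered from 1. Each window is returned as {first training week, last training week, week to predict}.

```java
import java.util.ArrayList;
import java.util.List;

final class RollingWindows {
    /**
     * Enumerates rolling windows over weeks 1..totalWeeks.
     * Each entry is {trainStart, trainEnd, predictWeek}, all inclusive and 1-based.
     */
    static List<int[]> windows(int totalWeeks, int trainWeeks) {
        List<int[]> result = new ArrayList<>();
        // The predicted week is the one right after the training block,
        // so the last usable start is totalWeeks - trainWeeks.
        for (int start = 1; start + trainWeeks <= totalWeeks; start++) {
            result.add(new int[] {start, start + trainWeeks - 1, start + trainWeeks});
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] w : windows(50, 9)) {
            System.out.println("train weeks " + w[0] + "-" + w[1] + ", predict week " + w[2]);
        }
    }
}
```

With 50 weeks and a 9-week training block, the first window trains on weeks 1-9 and predicts week 10, and the last trains on weeks 41-49 and predicts week 50.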

    Thoughts/help greatly appreciated - I look forward to becoming a CONTRIBUTING member of the group.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.001" expanded="true" name="Process">
        <process expanded="true" height="341" width="614">
          <operator activated="true" class="retrieve" compatibility="5.1.001" expanded="true" height="60" name="Retrieve" width="90" x="44" y="31">
            <parameter key="repository_entry" value="SPY_training_data"/>
          </operator>
          <operator activated="true" class="naive_bayes" compatibility="5.1.001" expanded="true" height="76" name="Naive Bayes" width="90" x="282" y="21"/>
          <operator activated="true" class="retrieve" compatibility="5.1.001" expanded="true" height="60" name="Retrieve (2)" width="90" x="112" y="120">
            <parameter key="repository_entry" value="SPY_out_of_sample"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="5.1.001" expanded="true" height="76" name="Apply Model (2)" width="90" x="380" y="120">
            <list key="application_parameters"/>
          </operator>
          <connect from_op="Retrieve" from_port="output" to_op="Naive Bayes" to_port="training set"/>
          <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_op="Retrieve (2)" from_port="output" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

  • haddock Member Posts: 849  Guru
    Hi there,

    My speciality is market dynamics, and I've spent some time with RM. My advice to you is unequivocal - work through the Tutorial, and the samples, religiously. Believe you me, it is quicker in the long run. Let me show you how...

    You don't need to keep training and testing examples apart, check out the notion of Validation in the Tutorials and Examples to see why ( when in doubt look in Wikipedia for datamining terms ). Normally you check your predictions against actuals in a validation process, so blanks for the label would mess things up, perhaps the cause of your original grief. Actually you'll need a sliding window validation for your market work, but do not bother to try that until you understand what validation does, and why. In a similar vein don't rush into boosting until you know why you need it - you may not!

    I know this is probably not what you wanted to hear, but it makes sense. Run before walk, splat!

    Oh yes, you'll need to install the series stuff - but don't, don't, I mean really don't do that until you're ready ( unless of course you have time to waste, like you're in prison or something! ).

    Just my two wotnots...

  • inthewoods Member Posts: 9 Contributor II
    Thanks Haddock - a quick follow-up - by tutorials do you mean the ones put together by Thomas Ott, or other ones?  And when you say tutorial - do you mean the tutorial videos or a document?

    Thanks in advance - all good advice!
  • haddock Member Posts: 849  Guru
    Hi there,

    I was meaning all that stuff in the Help sub-menu - enough for even the most sleepless insomniac...

    Onward through the fog, charge!
  • awchisholm RapidMiner Certified Expert, Member Posts: 458   Unicorn
    Hello there,

    I second what Haddock says; walk first. If I look back, I can see that it took a fair amount of my spare time to get to the point of knowing the tool well enough to be confident in its use. It didn't put me off though, and I'm even doing an MSc in data mining for fun so I can move from the "how" of walking to the "why" of running. Having said that, you can certainly get to a brisk jog pretty quickly without needing extra qualifications if you have a definite problem to solve.

    If you haven't seen it already, this site http://rapidminerresources.com/ provides some stout walking boots and draws together various videos including the ones you already mentioned.

    regards,

    Andrew
  • steffen Member Posts: 347  Maven
    Hello,

    I suppose the "original" mapping problem is now solved, postponed, or meaningless.

    @Ingo:
    I did not drop a note here, because I had discussed a similar issue with Sebastian in the times of 5.0 RC and I didn't want to be too pesky about it. I guessed you already have that on your parsec-long todo-list.

    All the best,

    Steffen

  • fritmore Member Posts: 90  Maven
    Hi there,
    I have gotten this warning too.

    The output seems OK though.
    The metadata for the learning model and the training data are identical.
    (Yes, I know about validation, haddock :) I am using it in the context of this
    thread I started: http://rapid-i.com/rapidforum/index.php/topic,3394.0.html)

    I tried different things; the warning keeps coming, but the results still look good.

    It is interesting, though, that when one learns a model on 3-bin data and tests it with 6-bin data, there is only a similar warning and the output is still created.
    I guess the bins are somehow merged or ignored? I know this is erroneous usage, but still, why are results generated at all?

    cheerz
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531   Unicorn
    Hi,
    well, the mapping issue again. OK, we can talk about it, but please give me a short summary if there's any question left. Given the many open issues here on the forum, I can't afford to read the complete topic with all its posts. So if there is still an example of a problem where no problem should be, please post a demo process here with a short description.

    Greetings,
      Sebastian
  • pathros Member Posts: 9 Contributor II
    Hi everyone.
    Hello, German folks!

    I had this problem a long time ago.
    What I understood is that RapidMiner writes the model learned from the training set in a particular way. For example, say we have a column "transport_type" which may take one of the possible nominal values: {subway, s-bahn, regio, IC, ICE, bus, tramway}.

    your column may look like this (please notice the ORDER):

    transport_type
    1 ICE
    2 ICE
    3 regio
    4 ICE
    5 bus
    6 bus
    7 ICE
    8 subway
    9 ICE
    10 S-bahn
    11 tramway
    12 IC
    13 tramway
    ...


    So, after having learned and gotten the best model, you can open the file and see something LIKE this:
    column 3
    <column name: transport_type>
    <values>
              ICE
              regio
              bus
              subway
              s-bahn
              tramway
              IC
    ...
    </values>

    As you can see, RapidMiner wrote the values into the model in the order in which it found them. Because it read the "ICE" value first, it wrote it first. The second new value was "regio", so RM wrote it after ICE as the second one, and so on.
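
A minimal sketch of the first-seen ordering described above, in plain Java. It does not use RapidMiner's classes; `FirstSeenMapping` and `build` are hypothetical names, and the assumption (taken from this post) is that each new nominal value simply receives the next free index.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FirstSeenMapping {
    /** Assigns each distinct value the next free index, in order of first appearance. */
    static Map<String, Integer> build(List<String> column) {
        Map<String, Integer> mapping = new LinkedHashMap<>();
        for (String value : column) {
            // Only values not seen before get a new index.
            mapping.putIfAbsent(value, mapping.size());
        }
        return mapping;
    }

    public static void main(String[] args) {
        List<String> training = List.of("ICE", "ICE", "regio", "ICE", "bus", "bus", "ICE",
                "subway", "ICE", "S-bahn", "tramway", "IC", "tramway");
        List<String> test = List.of("bus", "regio", "ICE", "regio", "regio", "S-bahn", "bus");

        System.out.println("training: " + build(training));
        System.out.println("test:     " + build(test));
    }
}
```

In this sketch "ICE" gets index 0 from the training column but index 2 from the test column, so the two mappings differ even though both columns draw on the same set of values.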


    Then, you have your Test Set that may look like this:

    transport_type
    bus
    regio
    ICE
    regio
    regio
    s-bahn
    bus
    bus
    ICE
    ICE
    IC
    Tramway ...

    I have noticed that, because your test set does not contain the values in the same order, RapidMiner emits the following warning when applying the model to this test set:
    "Aug 12, 2011 4:20:21 PM WARNING: SimpleDistribution: The internal nominal mappings are not the same between training and application for attribute 'transport_type'. This will probably lead to wrong results during model application."

    What I did to solve this was to create my own .aml file, but this takes lots of time.

    Now I want to apply another model and I really feel lazy. Darn. Anyway.

    Another thing I have to tell you: I compared the predictions of the model applier with warnings against those without them (because I had created the .aml file). What I noticed is that some predictions were different.

    See you.
    Greetings from Mexico.
  • dacamo76 Member Posts: 9 Contributor II
    I had this problem over the weekend, and in reality it is two separate problems.

    1) The warnings that RapidMiner gives.
    2) Predictions are different over different runs of data.

    I will address problem 1 first.
    These warnings are useless. When a new data set is used to apply a model, the mappings are going to be different, because each data set creates a mapping in the order values are seen. So we are always going to have different mappings when we compare data sets directly. This is exactly what the warnings in PredictionModel.java do.
    Therefore you will always get a warning.
    If the mappings don't have the same number of values, you will get the following:
    WARNING: SimpleDistribution: The number of nominal values is not the same for training and application for attribute...
    If the mapping for an attribute just happens to have the same size, then you will get the warning:
    WARNING: SimpleDistribution: The internal nominal mappings are not the same between training and application for attribute 'att_1'. This will probably lead to wrong results during model application.

    You won't get a warning if the input data nominal values are seen in the exact same order as in the training set. Which is highly unlikely.
    You can verify this by applying a model on a data set with one example. Any nominal attributes in the model with more than one mapping value will give you the first warning. If the model nominal attribute mapping contains only one value, and the value is not the same as the value in the input data set example, you will get the second warning.
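
The direct comparison described above can be sketched as follows. This is an illustration of the check as this post describes it, not the actual code in PredictionModel.java; `NaiveMappingCheck`, `compare` and the `Result` values are hypothetical names, and a mapping is modelled as a list whose position is the internal index.

```java
import java.util.List;

final class NaiveMappingCheck {
    enum Result { SAME, DIFFERENT_SIZE, DIFFERENT_MAPPING }

    /** Compares two mappings as-is, without remapping the application data first. */
    static Result compare(List<String> trainingMapping, List<String> applicationMapping) {
        if (trainingMapping.size() != applicationMapping.size()) {
            return Result.DIFFERENT_SIZE;      // the "number of nominal values" warning
        }
        if (!trainingMapping.equals(applicationMapping)) {
            return Result.DIFFERENT_MAPPING;   // the "internal nominal mappings" warning
        }
        return Result.SAME;                    // no warning
    }

    public static void main(String[] args) {
        List<String> training = List.of("ICE", "regio", "bus");
        System.out.println(compare(training, List.of("regio")));               // one example only
        System.out.println(compare(training, List.of("bus", "regio", "ICE"))); // same values, other order
        System.out.println(compare(training, List.of("ICE", "regio", "bus"))); // identical order
    }
}
```

Only the third call reports no difference, which matches the point above: a warning is avoided only when the application data happens to present its values in the training order.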

    Before applying a model to an input data set, RapidMiner remaps the input data so it conforms to the nominal mappings in the model. This is done by way of transformations. So in reality, the input data set still contains its original mapping, which differs from the mapping contained in the model. When the model is applied, RapidMiner applies the transformation to get the correct value for that attribute.

    For the warnings to be meaningful, the transformations must be applied to see if the mappings are actually equal. The way it is done now, you are getting a warning that does not mean the actual mappings will be different when the model is applied. In a deployed production environment, you are likely to always see the warnings, unless you have very strict transformations beforehand to restrict each attribute to a finite set of values equal to the values in the training set. In this case, you will almost always see the second warning, which seems worse coming through your logs.

    If actually doing transformations to check validity of mappings is too expensive for each input data set, and I would imagine it is, then it may be better to turn off the warnings so as to avoid confusion. Either way, this should probably be filed as a bug.
  • dacamo76 Member Posts: 9 Contributor II
    The second problem mentioned in this thread is:

    Predictions are different over different runs of data.

    This is due to a bug in the way transformations are made when applying a model.
    I have seen this in a Bayes model, but it probably applies to all models that call example.getValue(attribute) to get the value of an attribute.
    This call will eventually lead to a call to AttributeTransformationRemapping.transform(attribute, value). This method applies a transformation to get the double value in the original training set mapping which corresponds to the value in the actual mapping. Remember, the double value in the incoming data set does not necessarily correspond to the double value in the training data set.
    The problem with this method is that if the value does not have a corresponding mapping in the training set mapping, it will return the original value instead of -1.

    Let me go into more detail.
    I will use the attribute transport_type  described by @pathros above.

    In the training data set, the model sees the values in this order:
    1 ICE
    2 ICE
    3 regio
    4 ICE
    5 bus
    6 bus
    7 ICE
    8 subway
    9 ICE
    10 S-bahn
    11 tramway
    12 IC
    13 tramway

    So the training set mapping will be the following:
    0 ICE
    1 regio
    2 bus
    3 subway
    4 S-bahn
    5 tramway
    6 IC

    Now, to keep it simple, let's apply the model to a data set with only one example.
    Let's say the value of transport_type is "regio".

    We now have a mapping in the input data set that looks like this:
    0 regio

    (Notice, from my post above, how comparing these two mappings will give you a warning for different sizes.)

    To apply the model, a transformation is made, and the value in the training set mapping is obtained.
    The double value of regio in the input data set is 0. So the following happens in the transform method.

    String nominalValue = attribute.getMapping().mapIndex(0); // returns "regio"
    int index = overlayedMapping.getIndex(nominalValue);  // returns 1
    if (index < 0) {          // FALSE
       return value;      
    } else {
       return index;        // returns 1
    }
    So the transformation is made and the double value of the attribute is 1, corresponding to "regio" in the training example set.
    Here we are ok.

    Now let's examine another data set with only one transaction, but with an unseen value.
    transport_type is "spaceship"
    This transaction has an unseen value in the transport_type attribute.
    The transformation is applied like so:

    String nominalValue = attribute.getMapping().mapIndex(0); // returns "spaceship"
    int index = overlayedMapping.getIndex(nominalValue);  // returns -1
    if (index < 0) {          // TRUE
       return value;        // returns 0

    } else {
       return index;
    }
    So the transformation is made and the double value of the attribute is 0, corresponding to "ICE" in the training example set.
    Here we have a problem. The model thinks the value of the attribute is "ICE".

    Now let's assume the exact same examples from above are passed in as one data set containing two examples.
    Here are the data set values for transport_type:
    regio
    spaceship

    The first example will get transformed like so:

    String nominalValue = attribute.getMapping().mapIndex(0); // returns "regio"
    int index = overlayedMapping.getIndex(nominalValue);  // returns 1
    if (index < 0) {          // FALSE
       return value;      
    } else {
       return index;        // returns 1
    }
    So the transformation is made and the double value of the attribute is 1, corresponding to "regio" in the training example set.
    Here we are ok. The result will be exactly the same as what we saw above.

    The second example will get transformed like so:

    String nominalValue = attribute.getMapping().mapIndex(1); // returns "spaceship"
    int index = overlayedMapping.getIndex(nominalValue);  // returns -1
    if (index < 0) {          // TRUE
       return value;        // returns 1

    } else {
       return index;
    }
    So the transformation is made and the double value of the attribute is 1, corresponding to "regio" in the training example set.
    Here we have another problem.
    This exact same example will return a different result than the one we saw when we only passed in one example.
    In fact, this result will be exactly the same as the results for the first example, since the model sees a 1, and will apply the model as if the value is "regio"

    This simple example shows how the same example can give different results depending on the order in which it is seen in the input data set.
    You can see if we flip the two examples in the last data set, we will get results equivalent to:
    ICE
    regio
    since the first example, "spaceship", will return a double value of 0 when the transformation is applied.
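
The walkthrough above can be reproduced with a small standalone sketch. It imitates the fallback behaviour as described in this post and contrasts it with a variant that returns -1 for unseen values; it is not RapidMiner's AttributeTransformationRemapping code and not the submitted patch. `RemapSketch` and both method names are hypothetical, mappings are modelled as lists indexed by internal value, and a caller of the second variant would have to treat -1 as "unknown".

```java
import java.util.List;

final class RemapSketch {
    /** Behaviour described in the post: an unseen value falls back to its input index. */
    static double transformWithFallback(List<String> trainingMapping,
                                        List<String> inputMapping, double value) {
        String nominalValue = inputMapping.get((int) value);
        int index = trainingMapping.indexOf(nominalValue); // -1 if unseen in training
        return index < 0 ? value : index;
    }

    /** Variant that reports an unseen value as -1 instead of reusing the input index. */
    static double transformUnknownAware(List<String> trainingMapping,
                                        List<String> inputMapping, double value) {
        String nominalValue = inputMapping.get((int) value);
        return trainingMapping.indexOf(nominalValue);
    }

    public static void main(String[] args) {
        List<String> training =
                List.of("ICE", "regio", "bus", "subway", "S-bahn", "tramway", "IC");

        // "spaceship" alone has input index 0, which the model reads as "ICE".
        System.out.println(transformWithFallback(training, List.of("spaceship"), 0));
        // After "regio" it has input index 1, which the model reads as "regio".
        System.out.println(transformWithFallback(training, List.of("regio", "spaceship"), 1));
        // The unknown-aware variant gives the same answer in both positions.
        System.out.println(transformUnknownAware(training, List.of("spaceship"), 0));
        System.out.println(transformUnknownAware(training, List.of("regio", "spaceship"), 1));
    }
}
```

The first two calls return different indices for the same nominal value, which is the order dependence described above; the last two return -1 regardless of position.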


    Edit:
    I filed a bug http://bugs.rapid-i.com/show_bug.cgi?id=847 and supplied a patch to make the SimpleDistributionModel  (Bayes) work correctly.
    I assume other models are affected by this, but can't confirm.
  • pathros Member Posts: 9 Contributor II
    @dacamo76
    Thanks for your comments!

    I have recently loaded my data sets (both training and test) into RapidMiner by importing them into the repository from Excel files.

    Now I no longer have problems with the mappings. Try that.
  • fischer Member Posts: 439  Maven
    Hi,

    we fixed the handling of the remapping, which should solve the issue with other models as well. Please check with tonight's SVN update or the next release and close #857 if it does.

    Best,
    Simon
  • dacamo76 Member Posts: 9 Contributor II
    Simon Fischer wrote:

    Hi,

    we fixed the handling of the remapping, which should solve the issue with other models as well. Please check with tonight's SVN update or the next release and close #857 if it does.

    Best,
    Simon
    Hi Simon,

    As of now (r407) I don't see the changes.
    I'll wait for the SVN synchronization (which seems to be around 20:00 CST) and let you know.

    Thanks