NominalToNumerical inconsistency with different sources

pablo_admig Member Posts: 5 Contributor II
edited November 2018 in Help
The situation can be reproduced with the template "Apply to Test Set", using data that has, e.g., one nominal column, and replacing the kNN model with a Neural Network.

So, in order to use the Neural Network (or any algorithm that does not support nominal attributes), I have to convert that attribute to a numerical one with the NominalToNumerical operator, and RapidMiner creates a "mapping" for each category. For example, the operator reads the category "Sunny" in that column and assigns the number 1, reads the category "Cloudy" and assigns the number 2, and so on.

The problem arises when this mapping is not the same in the Training and Test sets. I need two NominalToNumerical operators (one for the Training set and one for the Test set), and they are not related, so each one converts the categories into numbers following the order in which they appear in its own table. For example, if the first record of the Training set has "Sunny", it is converted into 1. And if the first record of the Test set has "Cloudy", it is converted into 1 as well! So for the neural network Cloudy = Sunny, which turns this into a serious problem.
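
To make the concern concrete, here is a minimal sketch in plain Python (not RapidMiner code). It assumes, as described above, that each operator numbers the categories in the order they first appear in its own table; the function name `first_seen_mapping` and the sample values are hypothetical.

```python
def first_seen_mapping(values):
    """Assign 1, 2, 3, ... to categories in the order they first appear."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


# Two tables converted independently, as with two unrelated operators.
train = ["Sunny", "Cloudy", "Rainy", "Sunny"]
test = ["Cloudy", "Sunny", "Rainy"]

train_map = first_seen_mapping(train)
test_map = first_seen_mapping(test)

print(train_map)  # {'Sunny': 1, 'Cloudy': 2, 'Rainy': 3}
print(test_map)   # {'Cloudy': 1, 'Sunny': 2, 'Rainy': 3}
```

Here the number 1 means "Sunny" in the training table but "Cloudy" in the test table, which is exactly the mismatch in question.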

I would like to know whether there is a solution for this within the RapidMiner environment.

Thanks in advance,
Pablo.

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,642  RM Founder
    Hi Pablo,

    yes, there is a solution: you don't have to worry about this as far as I know  :)

    The neural net model, like all other models, keeps the header information of the input example set used for training. This header also contains the mapping that was used, i.e. the fact that "Sunny" was assigned to "1" and so on. During model application, an incoming test set value like "1" is first translated back to "Cloudy" (since this was the transformation used in the test set), and "Cloudy" is then transformed again, based on the training header information, to "2" before the model is actually applied. So there is no serious problem, at least as long as no bug is preventing this automatic nominal mapping, as one did a couple of years ago  ;)
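
The two-step translation described above can be sketched in plain Python. This is only an illustration of the idea, not RapidMiner's actual implementation; the dictionaries and the helper names `invert` and `remap_to_training` are assumptions made for the example.

```python
def invert(mapping):
    """Turn a name -> code mapping into a code -> name mapping."""
    return {code: name for name, code in mapping.items()}


def remap_to_training(test_codes, test_map, train_map):
    """Translate test codes back to names, then to the training codes."""
    test_names = invert(test_map)
    return [train_map[test_names[code]] for code in test_codes]


# Hypothetical mappings stored with the training header and the test set.
train_map = {"Sunny": 1, "Cloudy": 2}
test_map = {"Cloudy": 1, "Sunny": 2}

test_codes = [1, 2, 1]  # Cloudy, Sunny, Cloudy in the test set's numbering
aligned = remap_to_training(test_codes, test_map, train_map)
print(aligned)  # [2, 1, 2]
```

After the remapping, every code means the same category it meant during training, so the model sees consistent inputs.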

    If you want to transform the values yourself in order to be absolutely sure, without having to rely on the automatic mechanism described above, you could of course first use the operator "Map" to map the nominal values to "nominal" numbers and afterwards use "Parse Numbers" to transform them into real numbers. But I would actually not bother with this.
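
As a rough sketch of that manual route in plain Python (again not RapidMiner code): one fixed mapping, chosen by hand, is applied to both tables, first producing numbers as text and then parsing them. The names `FIXED_MAP` and `map_then_parse` and the chosen codes are hypothetical.

```python
# One hand-written mapping shared by the training and the test data.
FIXED_MAP = {"Sunny": "1", "Cloudy": "2"}


def map_then_parse(values, mapping=FIXED_MAP):
    """Map categories to numeric strings, then parse them into numbers."""
    as_text = [mapping[value] for value in values]  # analogous to "Map"
    return [float(text) for text in as_text]        # analogous to "Parse Numbers"


train_encoded = map_then_parse(["Sunny", "Cloudy", "Sunny"])
test_encoded = map_then_parse(["Cloudy", "Sunny"])

print(train_encoded)  # [1.0, 2.0, 1.0]
print(test_encoded)   # [2.0, 1.0]
```

Because both tables go through the same mapping, the order of the rows no longer influences which number a category receives.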

    Cheers,
    Ingo

  • pablo_admig Member Posts: 5 Contributor II
    Ingo, thanks for the reply.
    I tested this in detail with a simple example, and you are right: the prediction is the same. However, if I look at the outputs of the conversions for the Training and Test sets (with the label from the model), I can still see the "inconsistency". That is, if I look at the numbers instead of the categorical values together with their associated label, the label calculation is consistent, but the input columns (transformed to numerical) shown in the table with the label are not.
    Is it clear?

    Regards,
    Pablo.
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,642  RM Founder
    Hi Pablo,

    yes, I see. But be assured: those "inconsistencies" only exist as long as the model has not been applied, since applying it makes sure that the inconsistency is resolved. So sometimes it's easier not to look into too many details  ;)

    Cheers,
    Ingo