Error when applying a trained model to a new unlabeled data set

Stann Member Posts: 5 Learner I
I want to apply a Naive Bayes model to a new (unlabeled) data set. The model has already been trained and tested via cross-validation. However, when I try to apply the model to a brand-new data set, I get an error message.

Here is an overview of my process and the error I get:


The "Retrieve aggregate" is the new (unlabeled) data set, which I want to predict using my trained model.

"Process Documents from Data" contains a "Tokenize" operator.

The subprocesses within the Cross Validation operator are:


I am new to RapidMiner and I have no clue as to why I get this error :(
I would greatly appreciate your help as I need to carry on with my research :)

Best Answer

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,193 Unicorn
    Solution Accepted
    @Stann,

    Yes, it is possible:

    As said, apply the same preprocessing steps in your test set "branch", and connect the word list output (wor) of the Process Documents from Data operator in your training "branch" to the word list input (wor) of the Process Documents from Data operator in your test set "branch".

    Regards,

    Lionel
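The idea behind sharing the word list can be sketched outside RapidMiner. The following is a minimal Python illustration, not the operator's actual implementation: the function names (`tokenize`, `build_word_list`, `vectorize`) and the sample documents are hypothetical. It assumes the word list is simply the set of tokens seen in training, and that tokens in new documents that are not in that list are dropped, so both sets end up with identical attributes.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter."""
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]


def build_word_list(documents):
    """Collect the sorted vocabulary seen in the training documents."""
    return sorted({tok for doc in documents for tok in tokenize(doc)})


def vectorize(documents, word_list):
    """Count each word-list entry per document; unseen tokens are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in word_list])
    return rows


# Hypothetical sample data, for illustration only.
train_docs = ["cheap pills online", "meeting agenda attached"]
new_docs = ["cheap meeting tomorrow"]

# The word list is built from the training documents only ...
word_list = build_word_list(train_docs)
train_matrix = vectorize(train_docs, word_list)

# ... and reused for the new documents, so the columns line up.
new_matrix = vectorize(new_docs, word_list)
```

Here "tomorrow" never appeared in training, so it contributes no column; the new document is described only by the attributes the model already knows.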

Answers

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,193 Unicorn
    Hi @Stann,

    The attributes must be exactly the same in your training set and in your unlabeled test set.
    You therefore have to apply exactly the same preprocessing steps to your unlabeled test set (that is, the
    Nominal to Text and Process Documents from Data operators). Currently you are applying the raw test set to your model...

    Hope this helps,

    Regards,

    Lionel 
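To see what "the same attributes" means in practice, a quick comparison of the two attribute lists can help. This is a hedged Python sketch rather than anything RapidMiner provides: `compare_attributes` and the attribute names are hypothetical, chosen only to show what a raw, unprocessed test set looks like next to a tokenized training set.

```python
def compare_attributes(train_attributes, new_attributes):
    """Report attributes the model expects but the new set lacks, and vice versa."""
    train_set = set(train_attributes)
    new_set = set(new_attributes)
    return {
        "missing_in_new": sorted(train_set - new_set),
        "unexpected_in_new": sorted(new_set - train_set),
    }


# Hypothetical attribute names, for illustration only.
train_attributes = ["cheap", "meeting", "pills"]  # one attribute per token
raw_new_attributes = ["text"]                     # raw set: a single text column

report = compare_attributes(train_attributes, raw_new_attributes)
```

A non-empty `missing_in_new` list indicates that the new data has not gone through the same preprocessing as the training data.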
  • ceaperez Member Posts: 302 Unicorn
    Hi @Stann,

    It seems that the names of the attributes (columns) in your training data set and your test data set aren't the same.
    Please verify the names and types of the attributes in your test data set.

    Best
  • Stann Member Posts: 5 Learner I
    @lionelderkrikor, @ceaperez thank you for your quick response.

    Having the exact same attributes would be impossible, as each attribute is a token (word) that appeared in the initial text documents. Since the new (unlabeled) data set contains different text documents than the training set, the attributes would always differ, because the text documents in the new data set consist of "new" tokens.

    Having said that, is there still a way to apply the model to a new (unlabeled) set?