[SOLVED] Applying a pre-trained model on new data

monami555 Member Posts: 16 Contributor II
edited November 2018 in Help
Hello

I have the following concern. If I apply a model to data that has a slightly different set of features than the data the model was trained on, what happens to the values of attributes that are present in the model but not in the test data, and vice versa? Does this prevent the model from being applied correctly?

This problem occurs in text classification, where the features are words and the feature set is the word list. When I extract a word list from a set of training documents and then want to classify a new document, the features of the new document will obviously be different. How should this be handled?

I would expect that applying the old model to new data would give the same results as if the features were extracted collectively, since missing values would be assumed to be 0 and were not present in the test data anyway. However, I have compared these two approaches:
1. Extracting features from the whole data set, dividing the data into test and training sets, training the classifier, and measuring the accuracy
2. Dividing the data into test and training sets, extracting features from each set independently, training the classifier, and measuring the accuracy

and I found that in the second case the classification accuracy is much lower (it dropped from 70% to 20%). Is that something I should have expected? Is my logic wrong here? Is there any way to "fix" the new data to match the old model, or to fix the old model to match the new data? Or is my approach totally wrong?

Regards
Monika

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    How missing or additional features are handled depends on the classification algorithm and its implementation in RapidMiner. In general, the behaviour is undefined, though in some cases you may get reasonable results.

    However, in your case of text classification you can guarantee that the same features are generated in both training and testing by connecting the wordlist output of the training Process Documents operator to the wordlist input of the Process Documents operator in the testing branch. Have a look at the attached process.

    Best,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.001" expanded="true" name="Process">
        <process expanded="true" height="370" width="850">
          <operator activated="true" class="generate_nominal_data" compatibility="5.2.001" expanded="true" height="60" name="Generate Training Data" width="90" x="45" y="30">
            <parameter key="number_of_attributes" value="1"/>
            <parameter key="number_of_values" value="50"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="5.2.001" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="att1"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.2.001" expanded="true" height="76" name="Process Documents from Data" width="90" x="313" y="30">
            <list key="specify_weights"/>
            <process expanded="true" height="541" width="969">
              <operator activated="true" class="text:tokenize" compatibility="5.2.001" expanded="true" height="60" name="Tokenize" width="90" x="313" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="naive_bayes" compatibility="5.2.001" expanded="true" height="76" name="Naive Bayes" width="90" x="447" y="30"/>
          <operator activated="true" class="generate_nominal_data" compatibility="5.2.001" expanded="true" height="60" name="Generate Testing Data" width="90" x="45" y="210">
            <parameter key="number_of_attributes" value="1"/>
            <parameter key="number_of_values" value="50"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="5.2.001" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="179" y="210">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="att1"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.2.001" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="447" y="210">
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.2.001" expanded="true" name="Tokenize (2)"/>
              <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
              <connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="apply_model" compatibility="5.2.001" expanded="true" height="76" name="Apply Model" width="90" x="581" y="210">
            <list key="application_parameters"/>
          </operator>
          <connect from_op="Generate Training Data" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Naive Bayes" to_port="training set"/>
          <connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents from Data (2)" to_port="word list"/>
          <connect from_op="Naive Bayes" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Generate Testing Data" from_port="output" to_op="Nominal to Text (2)" to_port="example set input"/>
          <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
          <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
          <connect from_op="Apply Model" from_port="model" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
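
For readers who do not use RapidMiner, the idea of the process above (build the word list on the training documents, then reuse it for the test documents) can be sketched in plain Python. This is only an illustration under stated assumptions: `tokenize`, `build_word_list` and `vectorize` are hypothetical helpers, not RapidMiner operators, and plain term counts stand in for whatever vector creation the real process uses.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a simplified stand-in for a tokenizer."""
    return text.lower().split()


def build_word_list(documents):
    """Collect the sorted vocabulary of the given (training) documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def vectorize(documents, word_list):
    """Count term occurrences against a fixed word list.

    Words from the list that are absent in a document get 0;
    words that are not in the list are dropped.
    """
    vectors = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        vectors.append([counts[word] for word in word_list])
    return vectors


# Made-up example documents (assumption, not from the thread).
train_docs = ["cheap pills buy now", "meeting agenda for monday"]
test_docs = ["buy cheap watches", "agenda for friday meeting"]

# The word list comes from the training documents only ...
word_list = build_word_list(train_docs)
# ... and is reused unchanged for the test documents.
train_vectors = vectorize(train_docs, word_list)
test_vectors = vectorize(test_docs, word_list)
```

Both sets of vectors end up with the same attributes in the same order, so a model trained on `train_vectors` sees the columns it expects when it is applied to `test_vectors`. Test-only words such as "watches" are simply ignored.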
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    (Marius was faster so I have thrown away most parts of my message about using the wordlist - Marius is perfectly right here...)

    So here are only two comments:

    "How missing or additional features are handled depends on the classification algorithm and its implementation in RapidMiner. In general, the behaviour is undefined, though in some cases you may get reasonable results."
    Exactly, and for this reason I would suggest the following as a golden rule: ALWAYS make sure that the attributes used for training and for model application are exactly the same. In the case of text mining, as Marius has pointed out, this can be done by also using the word list from the training process for the text processing of the application / testing data.

    "I found out that in the second case the classification accuracy is much lower (it went to 20% from 70%). Is that something I should have expected, is my logic wrong here?"
    Yes. The reason is: you have cheated. If you use both the training AND the test set for the word vector creation, you already put information about the distributions of the test set into the training. This, as happened here, frequently leads to overoptimistic estimates of the prediction accuracy (although related, don't confuse this type of cheating with overfitting).
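
A small plain-Python sketch can make visible what "information about the test set" means here. It is an illustration under assumptions, not a reproduction of the original experiment: the documents are made up, and `document_frequency` is a hypothetical helper that counts in how many documents each word occurs, which is the kind of statistic a word vector creation step may rely on.

```python
from collections import Counter


def document_frequency(documents):
    """Count, for each word, the number of documents it appears in."""
    frequencies = Counter()
    for doc in documents:
        frequencies.update(set(doc.lower().split()))
    return frequencies


# Made-up example documents (assumption, not from the thread).
train_docs = ["cheap pills buy now", "meeting agenda for monday"]
test_docs = ["buy cheap watches", "agenda for friday meeting"]

df_train = document_frequency(train_docs)               # fair: training data only
df_all = document_frequency(train_docs + test_docs)     # leaky: test data included

# Words that become attributes only because the test set was included.
leaked_words = sorted(set(df_all) - set(df_train))
# Training words whose statistics change once the test set is included.
shifted_words = sorted(w for w in df_train if df_all[w] != df_train[w])
```

In this toy data, `leaked_words` holds the test-only words and `shifted_words` holds the training words whose document frequency changed. Both are ways in which the test set influences the representation the model is trained on.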

    The strength of RapidMiner is that preprocessing is never done automatically during learning, so you can actually control the preprocessing and see its impact on the prediction accuracy. The downside is that the correct estimates delivered by RapidMiner (if the process setup is done correctly) are almost always worse than the cheated ones delivered by many other solutions. You can see this not only for text preprocessing but also for parameter optimization, attribute selection, attribute weighting, attribute construction...
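
One possible reading of "process setup done correctly", sketched in plain Python rather than as a RapidMiner process: when estimating accuracy with cross-validation, the word list is rebuilt inside every fold from that fold's training documents only. The helper names (`fold_indices`, `vectorize`, `cross_validation_folds`), the unshuffled split and the example documents are all assumptions made for illustration. The classifier itself is left out.

```python
from collections import Counter


def fold_indices(n_examples, n_folds):
    """Yield (train_idx, test_idx) pairs for a simple unshuffled k-fold split."""
    for fold in range(n_folds):
        test_idx = [i for i in range(n_examples) if i % n_folds == fold]
        train_idx = [i for i in range(n_examples) if i % n_folds != fold]
        yield train_idx, test_idx


def vectorize(doc, word_list):
    """Term counts of one document against a fixed word list."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in word_list]


def cross_validation_folds(documents, labels, n_folds):
    """Yield per-fold data where the word list is built from training docs only."""
    for train_idx, test_idx in fold_indices(len(documents), n_folds):
        train_docs = [documents[i] for i in train_idx]
        # Preprocessing happens inside the fold: no test document is seen here.
        word_list = sorted({t for d in train_docs for t in d.lower().split()})
        x_train = [vectorize(documents[i], word_list) for i in train_idx]
        x_test = [vectorize(documents[i], word_list) for i in test_idx]
        y_train = [labels[i] for i in train_idx]
        y_test = [labels[i] for i in test_idx]
        yield word_list, x_train, y_train, x_test, y_test


# Made-up example data (assumption, not from the thread).
docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches",
    "agenda for friday meeting",
]
labels = ["spam", "ham", "spam", "ham"]

folds = list(cross_validation_folds(docs, labels, n_folds=2))
```

The same pattern would apply to the other steps mentioned above (parameter optimization, attribute selection, attribute weighting): anything that is fitted to data belongs inside the fold.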

    I strongly believe that this fair and true evaluation is important not only in science but also for real-world applications. I don't like bad surprises and I also want to know if I can truly stop optimization since I am good enough (instead of just having found a more complex and therefore unspotted way of cheating...).

    Just my 2c,
    Ingo
  • monami555 Member Posts: 16 Contributor II
    Marius, your solution works perfectly, thanks :) Ingo, thank you for the clarification, it all makes much more sense now :)

    However, I would just like to make sure I understood it correctly. I was wondering whether I am now allowed to use Marius' solution, or whether it is still cheating. I think it should be OK: even though I use training data information during test data feature extraction, the model has been trained without knowing about the test data. Am I right?
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    yes, Marius' solution is perfect.

    "as even though I use training data information during test data feature extraction,"
    This is of course no problem - you always use information about the training data (in most cases: the generated model  ;) ) for model application.

    "...you put information about the distributions of the test set into the training already."
    The other way round, putting testing information into training, is the problem.

    Cheers,
    Ingo