[SOLVED] Apply IDF of training set in test

miguel · March 2012

Hi,

I am trying to use RM to solve a document classification problem. I use two separate Process Documents from Files operators, one for the test documents and one for the training documents. The problem is that each operator computes TF-IDF from its own document set. In text classification, the TF-IDF vectors for the test documents should be computed with the IDF from the training documents.

For instance, if we only want to classify one document (using the same structure), the TF-IDF for that document should be based on the term occurrences in the document and the IDF previously computed from the training collection. In the same example, if the IDF is based on the test document alone, all the features become 0, because every term appears in every document (there is only one) of the test collection.
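
The collapse to zero described above can be reproduced in a few lines of Python. This is only a sketch: it assumes the textbook weighting tf * log(N / df), which may differ from the exact formula and normalization RapidMiner applies, and the function name and tokens are made up for illustration.

```python
import math
from collections import Counter

def idf_table(docs):
    """Assumed textbook IDF, log(N / df), for every term in the collection."""
    n_docs = len(docs)
    # df: number of documents in which each term occurs at least once
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}

# A "test collection" that consists of a single document (hypothetical tokens).
single_test_doc = ["cheap", "offer", "cheap"]

idf = idf_table([single_test_doc])   # N = 1 and df = 1 for every term
tf = Counter(single_test_doc)
weights = {term: tf[term] * idf[term] for term in tf}

print(weights)  # every weight is 0.0 because log(1 / 1) = 0
```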

The only option I can think of is to store the IDF of the training terms and then multiply it by the TF of the test documents, but that sounds a bit like a hack. Is there an operator or a parameter I am missing?

Regards,

MariusHelf · March 2012

Hi miguel,

You probably want to connect the wor (word list) output of the Process Documents operator used for training to the wor (word list) input of the Process Documents operator used for testing.

Best,
Marius

miguel · March 2012

That will specify the terms used in the training set and filter the test terms in advance. In my case I do feature selection based on chi-squared afterwards, so it is not needed at this stage. However, a follow-up question: if we connect the training words, will the test set use them only as a set of words (for filtering), or will it also know which terms appeared in which (or at least in how many) documents?

In the experiments I am running at the moment, even when the word list is plugged in, it is only used as a filter, so the IDF is still computed from the test set. Good point, though.
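
The distinction raised here, between a word list that acts only as a vocabulary filter and one that also carries document counts, can be sketched as a small data structure. The `WordList` class, its fields, and the numbers are hypothetical and do not describe RapidMiner's internal word list object; the IDF formula log(N / df) is likewise an assumption.

```python
import math
from dataclasses import dataclass

@dataclass
class WordList:
    """Hypothetical stand-in: training vocabulary plus training statistics."""
    n_docs: int       # number of training documents
    doc_counts: dict  # term -> number of training documents containing it

    def filter(self, tokens):
        """Filter-only use: keep tokens that occur in the training vocabulary."""
        return [t for t in tokens if t in self.doc_counts]

    def idf(self, term):
        """Statistics use: IDF from the stored training counts."""
        return math.log(self.n_docs / self.doc_counts[term])

words = WordList(n_docs=10, doc_counts={"engine": 2, "the": 10})

kept = words.filter(["the", "engine", "zeppelin"])  # "zeppelin" is unknown
idf_values = {t: words.idf(t) for t in kept}        # needs the counts, not just the terms
```

A list that stored only the terms could support `filter` but not `idf`, which is exactly the difference the question is about.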

Thanks a lot for the rapid response,

MariusHelf · March 2012

Hi,

Why do you apply the feature selection on the test set and not on the training set?

The TF-IDF calculation on the test set takes the word list of the training set into account if you connect the wor (word list) ports. Consider this process, and compare the value of blu with wor connected and disconnected:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.002">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.002" expanded="true" name="Process">
    <process expanded="true" height="415" width="614">
      <operator activated="true" class="text:create_document" compatibility="5.2.001" expanded="true" height="60" name="Create Document" width="90" x="112" y="30">
        <parameter key="text" value="bla bla"/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="5.2.001" expanded="true" height="60" name="Create Document (2)" width="90" x="112" y="120">
        <parameter key="text" value="bla blu"/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="5.2.001" expanded="true" height="60" name="Create Document (3)" width="90" x="112" y="210">
        <parameter key="text" value="blu blo bla"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.2.001" expanded="true" height="130" name="Process Documents" width="90" x="313" y="30">
        <process expanded="true" height="472" width="923">
          <operator activated="true" class="text:tokenize" compatibility="5.2.001" expanded="true" height="60" name="Tokenize" width="90" x="397" y="90"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="5.2.001" expanded="true" height="60" name="Create Document (4)" width="90" x="313" y="345">
        <parameter key="text" value="bla blu ble"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.2.001" expanded="true" height="94" name="Process Documents (2)" width="90" x="514" y="300">
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.2.001" expanded="true" name="Tokenize (2)"/>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Create Document (2)" from_port="output" to_op="Process Documents" to_port="documents 2"/>
      <connect from_op="Create Document (3)" from_port="output" to_op="Process Documents" to_port="documents 3"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents" from_port="word list" to_op="Process Documents (2)" to_port="word list"/>
      <connect from_op="Create Document (4)" from_port="output" to_op="Process Documents (2)" to_port="documents 1"/>
      <connect from_op="Process Documents (2)" from_port="example set" to_port="result 3"/>
      <connect from_op="Process Documents (2)" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
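
For readers without RapidMiner at hand, the same four documents can be run through a rough Python sketch. It assumes the weighting (tf / document length) * log(N / df) and that terms missing from the training word list are dropped; the absolute values will not necessarily match RapidMiner's output, and only the contrast between the two settings matters. The function names are made up for illustration.

```python
import math
from collections import Counter

def document_counts(docs):
    """Number of documents in which each term occurs at least once."""
    return Counter(term for doc in docs for term in set(doc))

def tfidf(doc, doc_counts, n_docs):
    """Assumed weighting: relative term frequency times log(N / df).
    Terms without a document count (not in the word list) are dropped."""
    tf = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_counts[term])
        for term, count in tf.items()
        if term in doc_counts
    }

# Same texts as Create Document (1) to (4) in the process above.
train = [["bla", "bla"], ["bla", "blu"], ["blu", "blo", "bla"]]
test = ["bla", "blu", "ble"]

# "wor connected": document counts come from the training set.
connected = tfidf(test, document_counts(train), len(train))

# "wor disconnected": document counts come from the test document alone.
disconnected = tfidf(test, document_counts([test]), 1)

print("connected:   ", connected)
print("disconnected:", disconnected)
```

Under these assumptions blu gets a positive weight in the connected case (it occurs in two of the three training documents), bla gets 0 (it occurs in all three), and ble is dropped because it is not in the training word list. In the disconnected case every weight is 0.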

miguel · March 2012

I apply the feature selection to both sets, based on the chi-squared values of the training collection, as is usually done in text classification (TC). However, I see I can simplify this.
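
The setup described here, scoring terms by chi-squared on the training collection only and then applying the resulting selection to both sets, could look roughly like the sketch below. It uses the standard 2x2 chi-squared statistic on term presence versus class; the documents, labels, and cut-off of two terms are invented for illustration and are not the data from this thread.

```python
def chi_squared(docs, labels, term, positive):
    """Chi-squared statistic of the 2x2 table: term presence vs. class membership."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present = term in doc
        in_class = label == positive
        if present and in_class:
            a += 1          # term present, positive class
        elif present:
            b += 1          # term present, other class
        elif in_class:
            c += 1          # term absent, positive class
        else:
            d += 1          # term absent, other class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

# Hypothetical training collection (each document as a set of terms).
train_docs = [{"cheap", "offer"}, {"cheap", "pills"},
              {"meeting", "agenda"}, {"meeting", "offer"}]
train_labels = ["spam", "spam", "ham", "ham"]

# Score every training term, using the training labels only.
vocabulary = sorted(set().union(*train_docs))
scores = {t: chi_squared(train_docs, train_labels, t, "spam") for t in vocabulary}

# Keep the two best-scoring terms; the same selection is applied to both sets.
selected = sorted(vocabulary, key=lambda t: scores[t], reverse=True)[:2]

test_doc = {"cheap", "meeting", "lottery"}
train_features = [{t: int(t in doc) for t in selected} for doc in train_docs]
test_features = {t: int(t in test_doc) for t in selected}
```

The test document never influences `scores` or `selected`; it is only projected onto the terms chosen from the training data.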

As for the example, it clearly shows that the training IDF is taken into account when the word lists are connected. I tried the same experiment with my data a couple of days ago, but all the features had a value of zero. Clearly the mistake was somewhere else; I should have been more careful.

Thanks for the help
