Missing Attribute when applying model although Training exampleSet and real DataSet are the same

TennesseeTennessee Member Posts: 5 Newbie
edited November 2019 in Help
Hi,

I am new to rapidMiner and I am trying to classifly YouTube Comments on Innovation Products into Customer Requirement or not.

Both ExampleSets should be the same as I used the wordlist from the training data and applied it to the data I want to classify with the Process Documents from Data Operator. In the following picuture you can see a comparison of both DataSets.



I used RapidMiner Automodel to create an SVM Classification Process and then I stored the Model with this Process. I then used the following Process:

<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
  <context>
    <input/>
    <output/>
    <macros>
      <macro>
        <key>text</key>
        <value>Lets try the relly simple way. I like smart watches</value>
      </macro>
    </macros>
  </context>
  <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve PrepedYouTubeCommentData" width="90" x="45" y="187">
        <parameter key="repository_entry" value="../Data/PrepedYouTubeCommentData"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve SvmClassificationYComments" width="90" x="45" y="85">
        <parameter key="repository_entry" value="../Results/SvmClassificationYComments"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model" width="90" x="313" y="85">
        <list key="application_parameters"/>
        <parameter key="create_view" value="false"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="85">
        <parameter key="attribute_filter_type" value="numeric_value_filter"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="numeric_condition" value="&gt;0"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <connect from_op="Retrieve PrepedYouTubeCommentData" from_port="output" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Retrieve SvmClassificationYComments" from_port="output" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>


However I always get this error:



I used this tutorial  on youTube with the videoid VbNhvYQZ2v0 and the rapidMiner Academy TextMining and Machine Learning course to construct my Processes.

This Process shows my Preprocessing for my Training Data:

<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve" width="90" x="112" y="340">
        <parameter key="repository_entry" value="../Data/SampleDataYouTubeComments"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="9.3.001" expanded="true" height="82" name="Nominal to Text" width="90" x="45" y="187">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="comment"/>
        <parameter key="attributes" value="|comment|product_name|product_type"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="8.2.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="45" y="34">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="TF-IDF"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="1000"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="select_attributes_and_weights" value="false"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.2.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="English"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="8.2.000" expanded="true" height="68" name="Transform Cases" width="90" x="179" y="34">
            <parameter key="transform_to" value="lower case"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="8.2.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="313" y="34"/>
          <operator activated="true" class="text:stem_porter" compatibility="8.2.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="447" y="34"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="8.2.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="581" y="34">
            <parameter key="max_length" value="2"/>
          </operator>
          <operator activated="true" class="text:filter_by_length" compatibility="8.2.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="782" y="34">
            <parameter key="min_chars" value="2"/>
            <parameter key="max_chars" value="25"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
          <connect from_op="Stem (Porter)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="store" compatibility="9.3.001" expanded="true" height="68" name="Store Wordlist" width="90" x="782" y="85">
        <parameter key="repository_entry" value="../Results/WordlistForCR"/>
      </operator>
      <operator activated="true" class="numerical_to_polynominal" compatibility="9.3.001" expanded="true" height="82" name="Numerical to Polynominal" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Customer Requirement"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="numeric"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="real"/>
        <parameter key="block_type" value="value_series"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_series_end"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="map" compatibility="9.3.001" expanded="true" height="82" name="Map" width="90" x="313" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Customer Requirement"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <list key="value_mappings">
          <parameter key="1" value="true"/>
          <parameter key="0" value="false"/>
        </list>
        <parameter key="consider_regular_expressions" value="false"/>
        <parameter key="add_default_mapping" value="false"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.3.001" expanded="true" height="82" name="Set Role" width="90" x="447" y="34">
        <parameter key="attribute_name" value="Customer Requirement"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles">
          <parameter key="comment_id" value="id"/>
        </list>
      </operator>
      <operator activated="true" class="store" compatibility="9.3.001" expanded="true" height="68" name="Store" width="90" x="581" y="34">
        <parameter key="repository_entry" value="../Data/PrepedTrainingDataYouTubeComments"/>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Numerical to Polynominal" to_port="example set input"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_op="Store Wordlist" to_port="input"/>
      <connect from_op="Store Wordlist" from_port="through" to_port="result 2"/>
      <connect from_op="Numerical to Polynominal" from_port="example set output" to_op="Map" to_port="example set input"/>
      <connect from_op="Map" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>


And this Process shows my Preprocessing for my Data I want to classify:

<?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve WordlistForCR" width="90" x="45" y="187">
        <parameter key="repository_entry" value="../Results/WordlistForCR"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve DataYouTubeComments" width="90" x="112" y="289">
        <parameter key="repository_entry" value="../Data/DataYouTubeComments"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="9.3.001" expanded="true" height="82" name="Nominal to Text" width="90" x="246" y="289">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="comment"/>
        <parameter key="attributes" value="|comment|product_name|product_type"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="8.2.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="246" y="187">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="TF-IDF"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="select_attributes_and_weights" value="false"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.2.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="112" y="34">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="English"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="8.2.000" expanded="true" height="68" name="Transform Cases (2)" width="90" x="380" y="34">
            <parameter key="transform_to" value="lower case"/>
          </operator>
          <operator activated="true" class="text:stem_porter" compatibility="8.2.000" expanded="true" height="68" name="Stem (Porter) (2)" width="90" x="581" y="34"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="8.2.000" expanded="true" height="68" name="Generate n-Grams (Terms) (2)" width="90" x="715" y="34">
            <parameter key="max_length" value="2"/>
          </operator>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Stem (Porter) (2)" to_port="document"/>
          <connect from_op="Stem (Porter) (2)" from_port="document" to_op="Generate n-Grams (Terms) (2)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms) (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="generate_empty_attribute" compatibility="9.3.001" expanded="true" height="82" name="Generate Empty Attribute" width="90" x="380" y="187">
        <parameter key="name" value="Customer Requirement"/>
        <parameter key="value_type" value="polynominal"/>
      </operator>
      <operator activated="true" class="map" compatibility="9.3.001" expanded="true" height="82" name="Map" width="90" x="514" y="187">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Customer Requirement"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <list key="value_mappings">
          <parameter key="1" value="true"/>
          <parameter key="0" value="false"/>
        </list>
        <parameter key="consider_regular_expressions" value="false"/>
        <parameter key="add_default_mapping" value="false"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.3.001" expanded="true" height="82" name="Set Role" width="90" x="715" y="187">
        <parameter key="attribute_name" value="Customer Requirement"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles">
          <parameter key="comment_id" value="id"/>
        </list>
      </operator>
      <operator activated="true" class="store" compatibility="9.3.001" expanded="true" height="68" name="Store" width="90" x="849" y="187">
        <parameter key="repository_entry" value="../Data/PrepedYouTubeCommentData"/>
      </operator>
      <connect from_op="Retrieve WordlistForCR" from_port="output" to_op="Process Documents from Data" to_port="word list"/>
      <connect from_op="Retrieve DataYouTubeComments" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Generate Empty Attribute" to_port="example set input"/>
      <connect from_op="Generate Empty Attribute" from_port="example set output" to_op="Map" to_port="example set input"/>
      <connect from_op="Map" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Store" to_port="input"/>
      <connect from_op="Store" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

I also attached a 100 rows of my sample Data. I apologize if this problem has already been solved (Couldn't find anything useful for my situation) or if I made some simple mistake.

If someone knows how to correct english spelling, (I already tried the python script using textblob posted in the rapid Miner community. It changes words that are already correct for example "Big"  to "Fig") I would also be really grateful.

Thanks in advance,

Tennessee 





Tagged:

Best Answer

  • TennesseeTennessee Member Posts: 5 Newbie
    Solution Accepted
    Okay so I copied the svm model operator from the automodell process that was created into a cross validation and created a another model. In this model I can feed the data that I need to classify, without getting the error message. Now it works smoothly. But still 3 days wasted.

    Hence I recommend if you have problems with the model created by the automodler, copy the model into a cross validation. 

Answers

  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    hi @Tennessee ok I was able to look at this. Your Process Documents preprocessing is NOT the same in your training and testing, which of course it needs to be.





    Scott
  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Also be very careful with wordlists.  You really need to store the wordlist from your original model construction process and then make sure you use that same wordlist when applying the model in the future, otherwise differences in the text you are processing can lead to incompatible results.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • TennesseeTennessee Member Posts: 5 Newbie
    The reason the preprocessing steps are not the same is due the wordlist I saved while creating the training data to create the model. I used this wordlist as input for the operator 'process documents from data'  which allows me to leave out certain preprocessing steps as per rapid miner text mining tutorial on YouTube. Also if I hadn't used the wordlist and the same preprocessing steps for both training and real data I would have gotten different attributes in my prepped tables. Are you sure this is the problem? 
  • TennesseeTennessee Member Posts: 5 Newbie
    If you compare both my last two processes you can see that I store a wordlist in the first process and use it in the second one. Also the  first picture shows that I have the same amount of attributes in both examples. This would not be possible if I used the generate n gram operator in both preprocessing steps without a wordlist wouldn't it?

    Thanks in advance, 

    Tennessee 



  • TennesseeTennessee Member Posts: 5 Newbie
    I have also tried feeding the model with training data and I still get the same error message.

    I used rapidminer automodell to create this model and it can't even except its own training data. Something seems very wrong. I'll try manually creating the model. 
Sign In or Register to comment.