process failed when applying built model for test data

huaiyanggongzi Member Posts: 39 Contributor II
edited November 2018 in Help
I was trying to build several types of classifiers, including SVM, Naive Bayes, and Neural Network. Training finished successfully for all of these models. However, when I try to apply the built models for testing, some of them fail. Specifically, the trained SVM model can be applied to the test set as normal, but the process fails when applying the trained Naive Bayes model to the test data set. I launched RapidMiner as follows, which is the same for all models:
java -Xmx30g -jar "C:\Program Files\Rapid-I\RapidMiner5\lib\rapidminer.jar"
The error message was:

Oct 29, 2012 6:26:49 PM com.rapidminer.tools.WrapperLoggingHandler logWarning
WARNING: KernelDistribution: The given example set does not contain a regular attribute with name 'êδÜ_∞'. This might cause problems for some models depending on this particular attribute.
Oct 29, 2012 6:26:49 PM com.rapidminer.tools.WrapperLoggingHandler logWarning
WARNING: KernelDistribution: The given example set does not contain a regular attribute with name 'ê∞'. This might cause problems for some models depending on this particular attribute.
Oct 29, 2012 6:26:49 PM com.rapidminer.tools.WrapperLoggingHandler logWarning
WARNING: KernelDistribution: The given example set does not contain a regular attribute with name 'ê∞_δ'. This might cause problems for some models depending on this particular attribute.
Oct 29, 2012 6:26:49 PM com.rapidminer.tools.WrapperLoggingHandler logWarning
WARNING: KernelDistribution: The given example set does not contain a regular attribute with name 'ê∞_∞'. This might cause problems for some models depending on this particular attribute.
Oct 29, 2012 6:26:49 PM com.rapidminer.tools.WrapperLoggingHandler logWarning
WARNING: KernelDistribution: The given example set does not contain a regular attribute with name 'ê∞Ü'. This might cause problems for some models depending on this particular attribute.
Oct 29, 2012 6:26:49 PM com.rapidminer.tools.WrapperLoggingHandler logWarning
WARNING: KernelDistribution: The given example set does not contain a regular attribute with name 'ê∞Ü╡δ'. This might cause problems for some models depending on this particular attribute.
Oct 29, 2012 6:26:49 PM com.rapidminer.tools.WrapperLoggingHandler logWarning
WARNING: KernelDistribution: The given example set does not contain a regular attribute with name 'ê∞Ü╡δ_êδ'. This might cause problems for some models depending on this particular attribute.
Oct 29, 2012 6:26:49 PM com.rapidminer.tools.WrapperLoggingHandler logWarning
WARNING: KernelDistribution: The given example set does not contain a regular attribute with name 'êφ'. This might cause problems for some models depending on this particular attribute.
Oct 29, 2012 6:26:49 PM com.rapidminer.tools.WrapperLoggingHandler logWarning
WARNING: KernelDistribution: The given example set does not contain a regular attribute with name 'êφ_δ'. This might cause problems for some models depending on this particular attribute.
Oct 29, 2012 6:26:49 PM com.rapidminer.tools.WrapperLoggingHandler logWarning
WARNING: KernelDistribution: The given example set does not contain a regular attribute with name 'ê∩'. This might cause problems for some models depending on this particular attribute.
Oct 29, 2012 6:26:49 PM com.rapidminer.tools.WrapperLoggingHandler logWarning
WARNING: KernelDistribution: The given example set does not contain a regular attribute with name 'ê∩_£'. This might cause problems for some models depending on this particular attribute.
Oct 29, 2012 6:26:49 PM com.rapidminer.tools.WrapperLoggingHandler logWarning
WARNING: KernelDistribution: The given example set does not contain a regular attribute with name 'ê∩_£Σ'. This might cause problems for some models depending on this particular attribute.
Oct 29, 2012 6:26:49 PM com.rapidminer.gui.ProcessThread run
SEVERE: Process failed: Input example set does not have a label attribute
com.rapidminer.operator.UserError: Input example set does not have a label attribute
        at com.rapidminer.example.Tools.isLabelled(Tools.java:380)
        at com.rapidminer.operator.performance.PolynominalClassificationPerformanceEvaluator.checkCompatibility(PolynominalClassificationPerformanceEvaluator.java:103)
        at com.rapidminer.operator.performance.AbstractPerformanceEvaluator.doWork(AbstractPerformanceEvaluator.java:234)
        at com.rapidminer.operator.Operator.execute(Operator.java:833)
        at com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
        at com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:709)
        at com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:369)
        at com.rapidminer.operator.Operator.execute(Operator.java:833)
        at com.rapidminer.Process.run(Process.java:920)
        at com.rapidminer.Process.run(Process.java:843)
        at com.rapidminer.Process.run(Process.java:802)
        at com.rapidminer.Process.run(Process.java:797)
        at com.rapidminer.Process.run(Process.java:787)
        at com.rapidminer.gui.ProcessThread.run(ProcessThread.java:63)

Oct 29, 2012 6:26:49 PM com.rapidminer.gui.ProcessThread run
SEVERE: Here:          Process[1] (Process)
          subprocess 'Main Process'
            +- Retrieve[1] (Retrieve)
            +- Process Documents from Files (2)[1] (Process Documents from Files)
          subprocess 'Vector Creation'
            |    +- Tokenize (2)[0] (Tokenize)
            |    +- Transform Cases (2)[0] (Transform Cases)
            |    +- Filter Stopwords (English)[0] (Filter Stopwords (English))
            |    +- Generate n-Grams (Terms)[0] (Generate n-Grams (Terms))
            +- Retrieve (2)[1] (Retrieve)
            +- Apply Model[1] (Apply Model)
      ==>  +- Performance[1] (Performance (Classification))
            +- Select Attributes[0] (Select Attributes)
            +- Write CSV[0] (Write CSV)
The model application workflow is the same for every model; only the retrieved model differs. The workflow is:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="386" width="711">
      <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve" width="90" x="45" y="75">
        <parameter key="repository_entry" value="nb_Train_F_words"/>
      </operator>
      <operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" height="76" name="Process Documents from Files (2)" width="90" x="179" y="75">
        <list key="text_directories">
          <parameter key="Responsive" value="C:\Validation Sets\total responsive"/>
          <parameter key="NonResponsive" value="C:\Validation Sets\Not Resp"/>
        </list>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="vector_creation" value="Binary Term Occurrences"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="5"/>
        <parameter key="prune_above_absolute" value="5000000"/>
        <parameter key="prune_below_rank" value="5.0"/>
        <parameter key="prune_above_rank" value="5.0"/>
        <process expanded="true" height="362" width="674">
          <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize (2)" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.2.004" expanded="true" height="60" name="Transform Cases (2)" width="90" x="180" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="315" y="73"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.2.004" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="447" y="165"/>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="retrieve" compatibility="5.2.008" expanded="true" height="60" name="Retrieve (2)" width="90" x="179" y="300">
        <parameter key="repository_entry" value="nb_Train_F_model"/>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.2.008" expanded="true" height="76" name="Apply Model" width="90" x="313" y="300">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="performance_classification" compatibility="5.2.008" expanded="true" height="76" name="Performance" width="90" x="447" y="75">
        <list key="class_weights"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.2.008" expanded="true" height="76" name="Select Attributes" width="90" x="447" y="165">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|confidence(non_res)|confidence(res)|label|prediction(label)"/>
      </operator>
      <operator activated="true" class="write_csv" compatibility="5.2.008" expanded="true" height="76" name="Write CSV" width="90" x="581" y="210">
        <parameter key="csv_file" value="C:\Users\Desktop\rapidminerRepository\Project1\Total responsive - naivebayes\scorevalue_naiveBayesian.csv"/>
        <parameter key="column_separator" value=","/>
        <parameter key="quote_nominal_values" value="false"/>
        <parameter key="format_date_attributes" value="false"/>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Process Documents from Files (2)" to_port="word list"/>
      <connect from_op="Process Documents from Files (2)" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Retrieve (2)" from_port="output" to_op="Apply Model" to_port="model"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
      <connect from_op="Performance" from_port="performance" to_port="result 2"/>
      <connect from_op="Performance" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Write CSV" to_port="input"/>
      <connect from_op="Write CSV" from_port="through" to_port="result 1"/>

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    The process you have posted is not complete; the maximum post size was probably reached. However, the output says:
    SEVERE: Process failed: Input example set does not have a label attribute
    com.rapidminer.operator.UserError: Input example set does not have a label attribute
    As you can see, that happened in the Performance operator:
    Oct 29, 2012 6:26:49 PM com.rapidminer.gui.ProcessThread run
    SEVERE: Here:          Process[1] (Process)
              subprocess 'Main Process'
                +- Retrieve[1] (Retrieve)
                +- Process Documents from Files (2)[1] (Process Documents from Files)
              subprocess 'Vector Creation'
                |    +- Tokenize (2)[0] (Tokenize)
                |    +- Transform Cases (2)[0] (Transform Cases)
                |    +- Filter Stopwords (English)[0] (Filter Stopwords (English))
                |    +- Generate n-Grams (Terms)[0] (Generate n-Grams (Terms))
                +- Retrieve (2)[1] (Retrieve)
                +- Apply Model[1] (Apply Model)
          ==>  +- Performance[1] (Performance (Classification))
                +- Select Attributes[0] (Select Attributes)
                +- Write CSV[0] (Write CSV)
    So evidently your test data does not contain a label, but you are nevertheless trying to measure the performance. Since all your classifiers are supervised learners, performance can only be measured by comparing the prediction with the original label. So either make sure that your test data contains a label, or accept that you cannot estimate the performance on the test set.

    Best, Marius
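
To make the point about labels concrete, here is a minimal Python sketch (not RapidMiner code) of what a performance measure such as accuracy has to do. The function name `accuracy` and the sample data are assumptions made up for illustration; only the class names "Responsive" and "NonResponsive" come from the process above.

```python
from typing import Optional, Sequence


def accuracy(labels: Sequence[Optional[str]], predictions: Sequence[str]) -> float:
    """Fraction of examples whose prediction equals the true label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot score an empty example set")
    if any(label is None for label in labels):
        # Same situation as in the thread: without a label there is
        # nothing to compare the prediction against.
        raise ValueError("example set does not have a label for every example")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical data: four test documents and the model's predictions for them.
predictions = ["Responsive", "NonResponsive", "Responsive", "Responsive"]
labelled = ["Responsive", "NonResponsive", "NonResponsive", "Responsive"]
unlabelled = [None, None, None, None]

print(accuracy(labelled, predictions))  # 3 of 4 predictions match the label
```

Calling `accuracy(unlabelled, predictions)` raises an error, which is roughly the position the Performance operator is in when the example set carries predictions but no label.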
  • hectorbernal Member Posts: 1 Contributor I
    I have a similar problem to this one, so I am posting here. I have a training set and a test set. I use the operator "Process Documents from Data" to generate a bag of words for the training set and one for the test set. I then build a model using an algorithm (Naive Bayes, W-J48, k-NN, SVM or Neural Networks) and finally I test the model on the BoW generated from the test set.

    This gives me a bunch of warnings:
    Dec 11, 2014 2:55:06 PM WARNING: SimpleDistribution: The number of regular attributes of the given example set does not fit the number of attributes of the training example set, training: 423, application: 348
    Dec 11, 2014 2:55:06 PM WARNING: SimpleDistribution: The given example set does not contain a regular attribute with name 'age'. This might cause problems for some models depending on this particular attribute.
    Dec 11, 2014 2:55:06 PM WARNING: SimpleDistribution: The given example set does not contain a regular attribute with name 'answer'. This might cause problems for some models depending on this particular attribute.
    ...
    Dec 11, 2014 2:55:06 PM WARNING: SimpleDistribution: The given example set does not contain a regular attribute with name 'yeah_i'. This might cause problems for some models depending on this particular attribute.
    Dec 11, 2014 2:55:06 PM INFO: Saving results.
    Dec 11, 2014 2:55:06 PM INFO: Process //speciale/Training - Test process finished successfully after 1 s
    And using the decision tree I get these errors:
    Dec 11, 2014 4:55:31 PM SEVERE: W-J48: Exception occured while classifying example:null [class java.lang.ArrayIndexOutOfBoundsException]
    Dec 11, 2014 4:55:31 PM SEVERE: W-J48: Exception occured while classifying example:null [class java.lang.ArrayIndexOutOfBoundsException]
    Dec 11, 2014 4:55:31 PM SEVERE: W-J48: Exception occured while classifying example:null [class java.lang.ArrayIndexOutOfBoundsException]
    ...
    Can I somehow add the list of attributes generated from the training set to the one for the test set to avoid this problem?

    My process file:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.015">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
        <parameter key="parallelize_main_process" value="true"/>
        <process expanded="true">
          <operator activated="true" class="read_csv" compatibility="5.3.015" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
            <parameter key="csv_file" value="/home/hector/git/datamining/predator_project/predator_project/src/perl/csv_files/02w15_files/w15_TRAINING_no_fold_from70pc.csv"/>
            <parameter key="column_separators" value=","/>
            <list key="annotations"/>
            <parameter key="encoding" value="windows-1252"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="bully.true.binominal.label"/>
              <parameter key="1" value="senderID.true.polynominal.attribute"/>
              <parameter key="18" value="message.true.text.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.3.015" expanded="true" height="76" name="Set Role" width="90" x="112" y="75">
            <parameter key="attribute_name" value="bully"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles">
              <parameter key="senderID" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="5.3.015" expanded="true" height="76" name="Nominal to Text (2)" width="90" x="179" y="120">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="message"/>
          </operator>
          <operator activated="false" class="visualize_model_by_som" compatibility="5.3.015" expanded="true" height="94" name="Visualize Model by SOM" width="90" x="112" y="570"/>
          <operator activated="false" class="write_csv" compatibility="5.3.015" expanded="true" height="76" name="Write CSV (2)" width="90" x="45" y="165"/>
          <operator activated="false" class="write_csv" compatibility="5.3.015" expanded="true" height="76" name="Write CSV" width="90" x="45" y="480">
            <parameter key="csv_file" value="/home/hector/Dropbox/ITU/DataMining/result_ling_nn.csv"/>
            <parameter key="quote_nominal_values" value="false"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <operator activated="true" class="read_csv" compatibility="5.3.015" expanded="true" height="60" name="Read CSV (2)" width="90" x="45" y="255">
            <parameter key="csv_file" value="/home/hector/git/datamining/predator_project/predator_project/src/perl/csv_files/02w15_files/w15_TEST_no_fold_from30pc.csv"/>
            <parameter key="column_separators" value=","/>
            <list key="annotations"/>
            <parameter key="encoding" value="windows-1252"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="bully.true.binominal.label"/>
              <parameter key="1" value="senderID.true.polynominal.attribute"/>
              <parameter key="18" value="message.true.text.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.3.015" expanded="true" height="76" name="Set Role (2)" width="90" x="112" y="300">
            <parameter key="attribute_name" value="bully"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles">
              <parameter key="senderID" value="id"/>
            </list>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="5.3.015" expanded="true" height="76" name="Nominal to Text (3)" width="90" x="179" y="345">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="message"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data (4)" width="90" x="179" y="210">
            <parameter key="prune_method" value="percentual"/>
            <parameter key="prune_below_percent" value="1.0"/>
            <parameter key="prune_above_percent" value="99.0"/>
            <parameter key="prune_below_absolute" value="9"/>
            <parameter key="prune_above_absolute" value="999999"/>
            <list key="specify_weights">
              <parameter key="message" value="1.0"/>
            </list>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (4)" width="90" x="45" y="30"/>
              <operator activated="false" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (4)" width="90" x="180" y="30"/>
              <operator activated="false" class="wordnet:open_wordnet_dictionary" compatibility="5.2.000" expanded="true" height="60" name="Open WordNet Dictionary (4)" width="90" x="315" y="30">
                <parameter key="directory" value="C:\Program Files (x86)\WordNet\2.1\dict"/>
              </operator>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (4)" width="90" x="45" y="120"/>
              <operator activated="false" class="text:replace_tokens" compatibility="5.3.002" expanded="true" height="60" name="Replace Tokens (3)" width="90" x="180" y="120">
                <list key="replace_dictionary">
                  <parameter key="se" value="sex"/>
                  ...
                  <parameter key="seks" value="sex"/>
                </list>
              </operator>
              <operator activated="false" class="wordnet:find_hypernym_wordnet" compatibility="5.2.000" expanded="true" height="76" name="Find Hypernyms (4)" width="90" x="315" y="120"/>
              <operator activated="true" class="text:stem_snowball" compatibility="5.3.002" expanded="true" height="60" name="Stem (4)" width="90" x="45" y="210"/>
              <operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (4)" width="90" x="180" y="210">
                <parameter key="min_chars" value="1"/>
                <parameter key="max_chars" value="20"/>
              </operator>
              <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.002" expanded="true" height="60" name="Generate n-Grams (4)" width="90" x="323" y="120"/>
              <connect from_port="document" to_op="Tokenize (4)" to_port="document"/>
              <connect from_op="Tokenize (4)" from_port="document" to_op="Filter Stopwords (4)" to_port="document"/>
              <connect from_op="Filter Stopwords (4)" from_port="document" to_op="Stem (4)" to_port="document"/>
              <connect from_op="Stem (4)" from_port="document" to_op="Filter Tokens (4)" to_port="document"/>
              <connect from_op="Filter Tokens (4)" from_port="document" to_op="Generate n-Grams (4)" to_port="document"/>
              <connect from_op="Generate n-Grams (4)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="179" y="435">
            <parameter key="prune_method" value="percentual"/>
            <parameter key="prune_below_percent" value="1.0"/>
            <parameter key="prune_above_percent" value="99.0"/>
            <parameter key="prune_below_absolute" value="9"/>
            <parameter key="prune_above_absolute" value="999999"/>
            <list key="specify_weights">
              <parameter key="message" value="1.0"/>
            </list>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="45" y="30"/>
              <operator activated="false" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (2)" width="90" x="180" y="30"/>
              <operator activated="false" class="wordnet:open_wordnet_dictionary" compatibility="5.2.000" expanded="true" height="60" name="Open WordNet Dictionary (2)" width="90" x="315" y="30">
                <parameter key="directory" value="C:\Program Files (x86)\WordNet\2.1\dict"/>
              </operator>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="45" y="120"/>
              <operator activated="false" class="text:replace_tokens" compatibility="5.3.002" expanded="true" height="60" name="Replace Tokens (2)" width="90" x="180" y="120">
                <list key="replace_dictionary">
                  <parameter key="se" value="sex"/>
                 ...
                  <parameter key="seks" value="sex"/>
                </list>
              </operator>
              <operator activated="false" class="wordnet:find_hypernym_wordnet" compatibility="5.2.000" expanded="true" height="76" name="Find Hypernyms (2)" width="90" x="315" y="120"/>
              <operator activated="true" class="text:stem_snowball" compatibility="5.3.002" expanded="true" height="60" name="Stem (2)" width="90" x="45" y="210"/>
              <operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (2)" width="90" x="180" y="210">
                <parameter key="min_chars" value="1"/>
                <parameter key="max_chars" value="20"/>
              </operator>
              <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.002" expanded="true" height="60" name="Generate n-Grams (2)" width="90" x="323" y="120"/>
              <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
              <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
              <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
              <connect from_op="Stem (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
              <connect from_op="Filter Tokens (2)" from_port="document" to_op="Generate n-Grams (2)" to_port="document"/>
              <connect from_op="Generate n-Grams (2)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="false" class="naive_bayes" compatibility="5.3.015" expanded="true" height="76" name="Naive Bayes (3)" width="90" x="313" y="75"/>
          <operator activated="false" class="k_nn" compatibility="5.3.015" expanded="true" height="76" name="k-NN (2)" width="90" x="313" y="255"/>
          <operator activated="false" class="support_vector_machine" compatibility="5.3.015" expanded="true" height="112" name="SVM (3)" width="90" x="313" y="345"/>
          <operator activated="true" class="weka:W-J48" compatibility="5.3.001" expanded="true" height="76" name="W-J48 (2)" width="90" x="313" y="165"/>
          <operator activated="false" class="neural_net" compatibility="5.3.015" expanded="true" height="76" name="Neural Net" width="90" x="313" y="480">
            <list key="hidden_layers"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="5.3.015" expanded="true" height="76" name="Apply Model (2)" width="90" x="447" y="255">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="5.3.015" expanded="true" height="76" name="Performance (2)" width="90" x="581" y="255"/>
          <connect from_op="Read CSV" from_port="output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text (2)" to_port="example set input"/>
          <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (4)" to_port="example set"/>
          <connect from_op="Read CSV (2)" from_port="output" to_op="Set Role (2)" to_port="example set input"/>
          <connect from_op="Set Role (2)" from_port="example set output" to_op="Nominal to Text (3)" to_port="example set input"/>
          <connect from_op="Nominal to Text (3)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
          <connect from_op="Process Documents from Data (4)" from_port="example set" to_op="W-J48 (2)" to_port="training set"/>
          <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Process Documents from Data (2)" from_port="word list" to_port="result 2"/>
          <connect from_op="W-J48 (2)" from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
          <connect from_op="Performance (2)" from_port="performance" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    I hope you can help me!
    PS. I have deleted a few things from the process file in order to stay under 20,000 characters.
  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    The word list output from the Process Documents of the training data set needs to be passed to the Process Documents where the test data set is created.

    This ensures that the attributes used to build the model from the training data are the same when the model is applied to test data.

    regards,

    Andrew
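
The effect of passing the training word list along can be sketched in plain Python. This is an illustration of the idea only, not how RapidMiner implements it: the helper names `build_word_list` and `vectorize`, the whitespace tokenizer, and the sample documents are all assumptions.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_word_list(documents: List[str]) -> List[str]:
    """Collect the sorted set of tokens seen in the given documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def vectorize(document: str, word_list: List[str]) -> List[int]:
    """Binary term occurrences: one column per word_list entry, in that order."""
    tokens = set(tokenize(document))
    return [1 if word in tokens else 0 for word in word_list]


# Hypothetical documents.
train_docs = ["good movie", "bad movie", "good plot"]
test_docs = ["good acting", "bad plot"]

train_words = build_word_list(train_docs)  # columns the model is trained on
test_words = build_word_list(test_docs)    # columns if test data is processed alone

print(train_words == test_words)  # False: the two sets yield different attributes

# Reusing the training word list gives test vectors with the training columns.
test_vectors = [vectorize(doc, train_words) for doc in test_docs]
print(test_vectors)
```

Words that occur only in the test data (here "acting") are simply dropped, and training words missing from a test document become zeros, so every test vector has exactly the attributes the model was built on.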
  • Elisa0815 Member Posts: 10 Contributor II
    I know that this is an old topic, but I had the same problem when I wanted to use RapidMiner for a sentiment analysis.
    I could also solve the problem by connecting the words of the test set to the operator that preprocesses the training set. My problem now is that I don't understand WHY I need to do that.

    A classifier is in the end a mathematical function, consisting of numbers and operators. After being trained, it doesn't need any attributes of the training set anymore, right? After training, the parameters, like C, are set, so it only needs to read the unknown X of the test set and compute the result, which is the label.
    So why does it need the words of the testset?

    Can someone help me understand that?
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Correct, but the X of the test set needs to have the same semantics as the X of the training data, i.e. the table must contain the TF-IDF values for the same words as the training data. Hence the Process Documents operator must "know" which words have been used in training to be able to create a compatible word vector. And that is why you have to connect the word list port.

    Does that help?

    Regards,
    Marius
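
A small Python sketch may help show why "same semantics" matters even though the trained classifier is just a function. The weights, column names, and feature values below are invented for illustration and do not come from any model in this thread.

```python
from typing import List

# Hypothetical linear model: one learned weight per training column.
train_columns = ["bad", "good", "movie", "plot"]
weights = [-2.0, 2.0, 0.0, 0.5]


def score(x: List[float], w: List[float]) -> float:
    """Dot product; the model only sees positions, never the words behind them."""
    if len(x) != len(w):
        raise ValueError("feature vector and weight vector differ in length")
    return sum(xi * wi for xi, wi in zip(x, w))


# The document "good acting", encoded against the training columns:
# only the "good" column is set.
aligned = [0.0, 1.0, 0.0, 0.0]

# The same document encoded against columns derived from test data alone,
# assumed here to be ["acting", "bad", "good", "plot"].
misaligned = [1.0, 0.0, 1.0, 0.0]

print(score(aligned, weights))     # "good" meets the weight learned for "good"
print(score(misaligned, weights))  # "acting" meets the weight learned for "bad"
```

With aligned columns the document scores positive; with misaligned columns the same document scores negative, because each value is multiplied by a weight that was learned for a different word. If the two vectors also differ in length, as with the 423 versus 348 attributes reported in the warning above, the function cannot be evaluated at all.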
  • Elisa0815 Member Posts: 10 Contributor II
    Yes, it helps. Thank you very much :)