When do process problems affect process results?

awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
edited November 2018 in Help
Hello all,

I have the following process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
    <process expanded="true" height="682" width="759">
      <operator activated="true" class="generate_data" compatibility="5.0.8" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30"/>
      <operator activated="true" class="discretize_by_bins" compatibility="5.0.8" expanded="true" height="94" name="Discretize" width="90" x="179" y="30">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="label"/>
        <parameter key="attributes" value="label"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="nominal_to_binominal" compatibility="5.0.8" expanded="true" height="94" name="Nominal to Binominal" width="90" x="313" y="30">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="label"/>
        <parameter key="attributes" value="label"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="x_validation" compatibility="5.0.0" expanded="true" height="112" name="Validation" width="90" x="447" y="30">
        <description>A cross-validation evaluating a decision tree model.</description>
        <process expanded="true" height="654" width="466">
          <operator activated="true" class="logistic_regression" compatibility="5.0.8" expanded="true" height="94" name="Logistic Regression" width="90" x="188" y="30"/>
          <connect from_port="training" to_op="Logistic Regression" to_port="training set"/>
          <connect from_op="Logistic Regression" from_port="model" to_port="model"/>
          <portSpacing port="source_training" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true" height="654" width="466">
          <operator activated="true" class="apply_model" compatibility="5.0.0" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="5.0.0" expanded="true" height="76" name="Performance" width="90" x="179" y="30"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_averagable 1" spacing="0"/>
          <portSpacing port="sink_averagable 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Discretize" to_port="example set input"/>
      <connect from_op="Discretize" from_port="example set output" to_op="Nominal to Binominal" to_port="example set input"/>
      <connect from_op="Nominal to Binominal" from_port="example set output" to_op="Validation" to_port="training"/>
      <connect from_op="Validation" from_port="model" to_port="result 1"/>
      <connect from_op="Validation" from_port="training" to_port="result 2"/>
      <connect from_op="Validation" from_port="averagable 1" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

It has the error "Input example set must have special attribute 'label'."

I can easily get rid of this by changing the Discretize parameter "range name type" from long to short. I suppose this is a bug; I often encounter little issues like this that I have to work around. I don't mind too much, given the open source status of the product.
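
In the process XML, that workaround would amount to one extra parameter on the Discretize operator. The fragment below is only a sketch: the key "range_name_type" is how this parameter is spelled in process XML, while the value "short" is assumed from the name of the option in the GUI and may differ between versions.

```xml
<operator activated="true" class="discretize_by_bins" compatibility="5.0.8" expanded="true" height="94" name="Discretize" width="90" x="179" y="30">
  <parameter key="attribute_filter_type" value="subset"/>
  <parameter key="attribute" value="label"/>
  <parameter key="attributes" value="label"/>
  <parameter key="include_special_attributes" value="true"/>
  <!-- assumed value: "short" replaces the default long range names -->
  <parameter key="range_name_type" value="short"/>
</operator>
```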

I ran the process in both cases, with and without the error, and got the same performance vector, so on the face of it the error has no effect. My question, however, is whether this is always true. Which validation errors can I safely ignore to save myself workaround time?

regards

Andrew

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Andrew,
    this is not a bug, simply an unsolvable artifact arising from the problem itself. Please be aware that the errors shown in the window below are found during a dry run. That means the data isn't loaded at all; only the meta data is taken into account. So if all attribute names are known in the meta data and you enter the name of an attribute that is definitely NOT part of the example set, you get an error. But in some cases the existence of an attribute depends on the data itself, for example if you discretize and choose to have the range in the name of the attribute. During a dry run, without looking at the data, you cannot determine whether the attribute exists, hence an error is shown. But everything might work fine when you run the process and the actual data is processed.

    But we will add more text explaining that some errors are merely warnings.

    Greetings,
      Sebastian
  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello Sebastian,

    I can see that if the attribute name depends on the data, the meta data has no chance of knowing what is going on. In this case, though, the attribute is always called "label" regardless of the range name type in the Discretize operator. Changing this parameter changes the possible nominal values the attribute can take. These values depend on the data, but the error I am seeing is "Input example set must have special attribute 'label'." Does this mean that the meta data gets confused and can't determine that the attribute is a label if the possible nominal values cannot be determined?

    regards

    Andrew
  • haddock Member Posts: 849 Maven
    Greets Chaps,

    This probably is more of what I'd call a quirk - sort of funny when you think about it. If you disable the discretizing altogether that pesky red light is still there on the validation box, like this...

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
        <process expanded="true" height="682" width="759">
          <operator activated="true" class="generate_data" compatibility="5.0.8" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
            <parameter key="target_function" value="sinus"/>
          </operator>
          <operator activated="false" class="discretize_by_bins" compatibility="5.0.8" expanded="true" height="94" name="Discretize" width="90" x="179" y="165">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value="label"/>
            <parameter key="attributes" value="label"/>
            <parameter key="include_special_attributes" value="true"/>
            <parameter key="range_name_type" value="interval"/>
          </operator>
          <operator activated="false" class="nominal_to_binominal" compatibility="5.0.8" expanded="true" height="94" name="Nominal to Binominal" width="90" x="380" y="165">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value="label"/>
            <parameter key="attributes" value="label"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="x_validation" compatibility="5.0.0" expanded="true" height="112" name="Validation" width="90" x="581" y="30">
            <description>A cross-validation evaluating a decision tree model.</description>
            <process expanded="true" height="654" width="466">
              <operator activated="true" class="logistic_regression" compatibility="5.0.8" expanded="true" height="94" name="Logistic Regression" width="90" x="188" y="30"/>
              <connect from_port="training" to_op="Logistic Regression" to_port="training set"/>
              <connect from_op="Logistic Regression" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true" height="654" width="466">
              <operator activated="true" class="apply_model" compatibility="5.0.0" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance" compatibility="5.0.0" expanded="true" height="76" name="Performance" width="90" x="179" y="30"/>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="model" to_port="result 1"/>
          <connect from_op="Validation" from_port="training" to_port="result 2"/>
          <connect from_op="Validation" from_port="averagable 1" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
    How to turn it off? Simples! Change the target function to anything that ends with the word 'classification'!!

    See what I mean by quirky?
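
    As a sketch of that change, the Generate Data operator from the process above could be configured as follows. The value "sum classification" is an assumption: it stands in for any target function whose name ends in 'classification', so check the operator's drop-down for the exact names available in your version.

    ```xml
    <operator activated="true" class="generate_data" compatibility="5.0.8" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
      <!-- assumed value: any target function ending in "classification" should do -->
      <parameter key="target_function" value="sum classification"/>
    </operator>
    ```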

