operator cannot be executed (Duplicate attribute name: cluster)

dranammari · January 2012

Hi everybody,

I am running a RapidMiner process that uses k-means clustering to cluster a set of discussions. I want to extract the cluster centroids and save them to a CSV file for further programming in Java. Therefore, I have added two operators: Extract Cluster Prototypes and Write CSV. Now I am getting a Process Failed error. Here are the log messages:

Jan 17, 2012 2:34:08 PM SEVERE: Process failed: operator cannot be executed (Duplicate attribute name: cluster). Check the log messages...
Jan 17, 2012 2:34:08 PM SEVERE: Here: Process[1] (Process)
subprocess 'Main Process'
+- Read Database[1] (Read Database)
+- Rename[1] (Rename)
+- Set Role[1] (Set Role)
+- Data to Documents[1] (Data to Documents)
+- Process Documents[1] (Process Documents)
subprocess 'Vector Creation'
| +- Extract Content[9708] (Extract Content)
| +- Tokenize[9708] (Tokenize)
| +- Transform Cases[9708] (Transform Cases)
| +- Filter Stopwords (English)[9708] (Filter Stopwords (English))
| +- Filter Stopwords (Dictionary)[9708] (Filter Stopwords (Dictionary))
| +- Filter Tokens (by Length)[9708] (Filter Tokens (by Length))
| +- Generate n-Grams (Terms)[9708] (Generate n-Grams (Terms))
+- Clustering[1] (k-Means)
==> +- Extract Cluster Prototypes[1] (Extract Cluster Prototypes)
+- Write CSV[0] (Write CSV)
+- Select Attributes[0] (Select Attributes)
+- Write Database[0] (Write Database)

As you can see, the error says that an operator cannot be executed (Duplicate attribute name: cluster). In the log, the arrow points to the Extract Cluster Prototypes operator.

Can you please tell me what the problem might be and how to solve it? Is this a bug in the Extract Cluster Prototypes operator?
Without the Extract Cluster Prototypes operator, the process runs successfully and generates the clustering model.

Many thanks,
Ahmad

MariusHelf · January 2012

Hi Ahmad,

please post your process setup, so we can have a look at it. Just copy the contents of the XML tab on top of the process pane into your next post (use the #-button on top of the input field here in the forum).

Best,
Marius

dranammari · January 2012

Hi Marius,

It looks like I have discovered the reason for the error, which seems somewhat "weird" to me!

In the text analysis step (the Process Documents operator), one of the resulting tokens is 'cluster', which is the same name as the 'cluster' attribute that stores the cluster number of each document after the clustering process. How did I discover this? I added the word 'cluster' to the stop word dictionary I am using for the Filter Stopwords (Dictionary) operator, and the process ran successfully! Therefore, the Extract Cluster Prototypes operator fails to execute if the ExampleSet contains a token named 'cluster', because that is also the name of the attribute storing the cluster labels.

Is this a bug in the operator? Of course, I shouldn't have to treat the word 'cluster' as a stop word to solve this problem, as it is obviously not a stop word!
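
The collision described above can be illustrated with a toy model. This is only a sketch in plain Java, not RapidMiner's code: the `AttributeNames` class and its `centroidHeader` method are hypothetical names made up for illustration. It assumes that the centroid table gets one column per token plus one label column, and that duplicate column names are rejected, as the error message suggests.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model of an attribute set that refuses duplicate names.
// Hypothetical class for illustration; this is NOT RapidMiner's API.
class AttributeNames {
    private final Set<String> names = new LinkedHashSet<>();

    void add(String name) {
        // Set.add returns false when the name is already present.
        if (!names.add(name)) {
            throw new IllegalStateException("Duplicate attribute name: " + name);
        }
    }

    // Builds the header of a centroid table: one column per token,
    // followed by the column that holds the cluster label.
    static AttributeNames centroidHeader(List<String> tokens, String labelName) {
        AttributeNames header = new AttributeNames();
        for (String token : tokens) {
            header.add(token);
        }
        header.add(labelName);
        return header;
    }

    public static void main(String[] args) {
        try {
            // The token "cluster" collides with the label column "cluster".
            centroidHeader(Arrays.asList("forum", "cluster", "java"), "cluster");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints "Duplicate attribute name: cluster"
        }
    }
}
```

With the token 'cluster' filtered out (or renamed) beforehand, the same call succeeds, which matches the behaviour observed with the stop word workaround.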

Here is the process in XML:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
    <process expanded="true" height="415" width="681">
      <operator activated="true" class="read_database" compatibility="5.1.017" expanded="true" height="60" name="Read Database" width="90" x="45" y="30">
        <parameter key="connection" value="forums"/>
        <parameter key="query" value="SELECT `id`, `topic`, `detail`&#10;FROM `forum_question`"/>
        <enumeration key="parameters"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
      </operator>
      <operator activated="true" class="rename" compatibility="5.1.017" expanded="true" height="76" name="Rename" width="90" x="179" y="30">
        <parameter key="old_name" value="id"/>
        <parameter key="new_name" value="thread_id"/>
        <list key="rename_additional_attributes"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.1.017" expanded="true" height="76" name="Set Role" width="90" x="313" y="30">
        <parameter key="name" value="thread_id"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles">
          <parameter key="topic" value="regular"/>
          <parameter key="detail" value="regular"/>
        </list>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="5.1.004" expanded="true" height="60" name="Data to Documents" width="90" x="447" y="30">
        <list key="specify_weights"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.1.004" expanded="true" height="94" name="Process Documents" width="90" x="45" y="255">
        <parameter key="prune_method" value="percentual"/>
        <parameter key="prunde_below_percent" value="2.0"/>
        <parameter key="prune_above_percent" value="94.0"/>
        <parameter key="prune_below_absolute" value="1"/>
        <parameter key="prune_above_absolute" value="195"/>
        <parameter key="prune_below_rank" value="1.0"/>
        <parameter key="prune_above_rank" value="99.0"/>
        <process expanded="true" height="415" width="567">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="45" y="30"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="45" y="165">
            <parameter key="mode" value="specify characters"/>
            <parameter key="characters" value=" '.:,;?!()[]{}/\ '"/>
            <parameter key="expression" value="[\s]"/>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="300"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="246" y="30"/>
          <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (Dictionary)" width="90" x="246" y="165">
            <parameter key="file" value="C:\Users\admin2\Documents\NetBeansProjects\Forums\StopWords_Enhanced.txt"/>
          </operator>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="246" y="300">
            <parameter key="min_chars" value="3"/>
            <parameter key="max_chars" value="999"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="447" y="165">
            <parameter key="condition" value="contains match"/>
            <parameter key="regular_expression" value="[a-zA-Z-]+"/>
          </operator>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.1.004" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="447" y="300"/>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
          <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
          <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="k_means" compatibility="5.1.017" expanded="true" height="76" name="Clustering" width="90" x="179" y="165">
        <parameter key="k" value="12"/>
        <parameter key="max_runs" value="50"/>
        <parameter key="max_optimization_steps" value="500"/>
      </operator>
      <operator activated="true" class="extract_prototypes" compatibility="5.1.017" expanded="true" height="76" name="Extract Cluster Prototypes" width="90" x="380" y="165"/>
      <operator activated="true" class="write_csv" compatibility="5.1.017" expanded="true" height="60" name="Write CSV" width="90" x="514" y="165">
        <parameter key="csv_file" value="C:\Users\admin2\Documents\NetBeansProjects\Forums\cluster_centroids.csv"/>
        <parameter key="column_separator" value=","/>
        <parameter key="quote_nominal_values" value="false"/>
        <parameter key="format_date_attributes" value="false"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.1.017" expanded="true" height="76" name="Select Attributes" width="90" x="380" y="300">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|thread_id|cluster"/>
      </operator>
      <operator activated="true" class="write_database" compatibility="5.1.017" expanded="true" height="60" name="Write Database" width="90" x="581" y="300">
        <parameter key="connection" value="forums"/>
        <parameter key="table_name" value="forum_question_clusters"/>
        <parameter key="overwrite_mode" value="overwrite"/>
      </operator>
      <connect from_op="Read Database" from_port="output" to_op="Rename" to_port="example set input"/>
      <connect from_op="Rename" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
      <connect from_op="Clustering" from_port="clustered set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Extract Cluster Prototypes" from_port="example set" to_op="Write CSV" to_port="input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Write Database" to_port="input"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>

Many thanks,
Ahmad

MariusHelf · January 2012

Heya,

The Clustering operator always names its output attribute "cluster", which in your case is a bit sub-optimal. You could try renaming the attribute generated by Process Documents (if it exists) before applying the clustering, with a construction like this (not tested because I don't have your data, but it should work):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.015" expanded="true" name="Process">
    <process expanded="true" height="611" width="681">
      <operator activated="true" class="read_database" compatibility="5.1.015" expanded="true" height="60" name="Read Database" width="90" x="45" y="30">
        <parameter key="connection" value="forums"/>
        <parameter key="query" value="SELECT `id`, `topic`, `detail`&#10;FROM `forum_question`"/>
        <enumeration key="parameters"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
      </operator>
      <operator activated="true" class="rename" compatibility="5.1.015" expanded="true" height="76" name="Rename" width="90" x="179" y="30">
        <parameter key="old_name" value="id"/>
        <parameter key="new_name" value="thread_id"/>
        <list key="rename_additional_attributes"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.1.015" expanded="true" height="76" name="Set Role" width="90" x="313" y="30">
        <parameter key="name" value="thread_id"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles">
          <parameter key="topic" value="regular"/>
          <parameter key="detail" value="regular"/>
        </list>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="5.1.004" expanded="true" height="60" name="Data to Documents" width="90" x="447" y="30">
        <list key="specify_weights"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.1.004" expanded="true" height="94" name="Process Documents" width="90" x="45" y="165">
        <parameter key="prune_method" value="percentual"/>
        <parameter key="prunde_below_percent" value="2.0"/>
        <parameter key="prune_above_percent" value="94.0"/>
        <parameter key="prune_below_absolute" value="1"/>
        <parameter key="prune_above_absolute" value="195"/>
        <parameter key="prune_below_rank" value="1.0"/>
        <parameter key="prune_above_rank" value="99.0"/>
        <process expanded="true" height="415" width="567">
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.1.004" expanded="true" height="60" name="Extract Content" width="90" x="45" y="30"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="45" y="165">
            <parameter key="mode" value="specify characters"/>
            <parameter key="characters" value=" '.:,;?!()[]{}/\ '"/>
            <parameter key="expression" value="[\s]"/>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="300"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="246" y="30"/>
          <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (Dictionary)" width="90" x="246" y="165">
            <parameter key="file" value="C:\Users\admin2\Documents\NetBeansProjects\Forums\StopWords_Enhanced.txt"/>
          </operator>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="246" y="300">
            <parameter key="min_chars" value="3"/>
            <parameter key="max_chars" value="999"/>
          </operator>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="447" y="165">
            <parameter key="condition" value="contains match"/>
            <parameter key="regular_expression" value="[a-zA-Z-]+"/>
          </operator>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.1.004" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="447" y="300"/>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
          <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
          <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="handle_exception" compatibility="5.1.015" expanded="true" height="76" name="Handle Exception" width="90" x="112" y="300">
        <process expanded="true" height="633" width="346">
          <operator activated="true" class="rename" compatibility="5.1.015" expanded="true" height="76" name="Rename (2)" width="90" x="112" y="30">
            <parameter key="old_name" value="cluster"/>
            <parameter key="new_name" value="_cluster_"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <connect from_port="in 1" to_op="Rename (2)" to_port="example set input"/>
          <connect from_op="Rename (2)" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
        <process expanded="true" height="633" width="346">
          <connect from_port="in 1" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="k_means" compatibility="5.1.015" expanded="true" height="76" name="Clustering" width="90" x="246" y="300">
        <parameter key="k" value="12"/>
        <parameter key="max_runs" value="50"/>
        <parameter key="max_optimization_steps" value="500"/>
      </operator>
      <operator activated="true" class="extract_prototypes" compatibility="5.1.015" expanded="true" height="76" name="Extract Cluster Prototypes" width="90" x="447" y="165"/>
      <operator activated="true" class="write_csv" compatibility="5.1.015" expanded="true" height="60" name="Write CSV" width="90" x="581" y="165">
        <parameter key="csv_file" value="C:\Users\admin2\Documents\NetBeansProjects\Forums\cluster_centroids.csv"/>
        <parameter key="column_separator" value=","/>
        <parameter key="quote_nominal_values" value="false"/>
        <parameter key="format_date_attributes" value="false"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.1.015" expanded="true" height="76" name="Select Attributes" width="90" x="447" y="345">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|thread_id|cluster"/>
      </operator>
      <operator activated="true" class="write_database" compatibility="5.1.015" expanded="true" height="60" name="Write Database" width="90" x="581" y="345">
        <parameter key="connection" value="forums"/>
        <parameter key="table_name" value="forum_question_clusters"/>
        <parameter key="overwrite_mode" value="overwrite"/>
      </operator>
      <connect from_op="Read Database" from_port="output" to_op="Rename" to_port="example set input"/>
      <connect from_op="Rename" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Handle Exception" to_port="in 1"/>
      <connect from_op="Handle Exception" from_port="out 1" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
      <connect from_op="Clustering" from_port="clustered set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Extract Cluster Prototypes" from_port="example set" to_op="Write CSV" to_port="input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Write Database" to_port="input"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>
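
The Handle Exception block in the process above amounts to a "rename if present" step, as far as the XML shows: the first subprocess tries to rename 'cluster' to '_cluster_', and the second one passes the data through unchanged if that fails. A minimal Java sketch of the same logic on a plain list of column names follows. The `RenameIfPresent` class is a hypothetical helper for illustration only and does not use any RapidMiner API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of RapidMiner.
class RenameIfPresent {

    // Returns a copy of the column names in which oldName is replaced by newName.
    // If oldName does not occur, the copy is returned unchanged.
    static List<String> rename(List<String> columns, String oldName, String newName) {
        List<String> result = new ArrayList<>(columns);
        int index = result.indexOf(oldName);
        if (index >= 0) {
            result.set(index, newName);
        }
        return result;
    }

    public static void main(String[] args) {
        // Token column 'cluster' is present, so it gets renamed.
        System.out.println(rename(Arrays.asList("forum", "cluster", "java"), "cluster", "_cluster_"));
        // prints "[forum, _cluster_, java]"

        // No such column: the names are passed through as they are.
        System.out.println(rename(Arrays.asList("forum", "java"), "cluster", "_cluster_"));
        // prints "[forum, java]"
    }
}
```

After this step the name 'cluster' is free again for the label attribute that the Clustering operator adds.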
