RapidMiner

Clustering - weird salary values

Contributor II

Clustering - weird salary values

Hi guys,

 

I am new to Rapidminer, my first little project is clustering some of our customers. Attributes are like age, salary, job position and the amount of credit they have. In order to have a faster process, I have "translated" nominal data (like employement, channel) to numerical data.

 

I have run the clustering model, and except salary I got seemingly coherent data. 

 

Here is my centroid table:

Attributes cluster_0 cluster_1 cluster_2 cluster_3 cluster_4 cluster_5
Channel 1.0 0.0 1.0 1.0 3.0 0.0
Accomodation 1.0 2.0 2.0 2.0 3.0 1.0
Education 1.0 1.0 1.0 1.0 1.0 1.0
Employement 2.0 2.0 3.0 1.0 1.0 2.0
Credit (pcs) 4.0 1.0 2.0 1.0 2.0 2.0
Age 36.0 40.0 44.0 65.0 78.0 41.0
Salary 3417.0 4174.0 3100.0 79.0 226.0 8601.0
Credit (vol) 1181014.0 0.0 8690185.0 0.0 362750.0 3622658.0

 

What surprised me was the values in the Salary row - they should be much higher than that, the average salary in my data table for these customers are well above 188 k.

My question is: is there somethig I am missing or am I interpreteing the data wrong?

Thanks for the answers!

 

Tibor

6 REPLIES
Community Manager

Re: Clustering - weird salary values

Well yes, something sounds weird but without seeing your data and process it's hard to guess here.


Question: how did you convert your nominal values into numerical?  I'm assuming you're using K-means. If you select mixed measures you can use the Nominal and Numerical values without modifying them. 

Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Contributor II

Re: Clustering - weird salary values

Hi T-Bone,

 

First of all, thanks for the quick answer. Smiley Happy The conversion of nominal data to numerical was inspired by the fact that in my experience it took longer for Rapidminer to run the process, but next time I will try your suggestion.

 

The xml of my process is down below:

?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="701" width="955">
<operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="48" y="422">
<parameter key="repository_entry" value="Elemzés - RCC KHR adatok 201704 adatok"/>
</operator>
<operator activated="false" class="sample" expanded="true" height="76" name="Sample" width="90" x="179" y="480">
<parameter key="sample_size" value="21000"/>
</operator>
<operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="380" y="390">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="JOVEDELEM|LAKHELY_ERTEK|ISKOLAZOTTSAG_ERTEK|ELETKOR|ALKALMAZOTTI_ERTEK|KULSO_HITEL_OSSZESEN|KULSO_HITEL_DB|CSATORNA"/>
</operator>
<operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values" width="90" x="313" y="75">
<list key="columns"/>
</operator>
<operator activated="true" class="k_medoids" expanded="true" height="76" name="Clustering" width="90" x="581" y="255">
<parameter key="k" value="6"/>
<parameter key="max_runs" value="2"/>
<parameter key="max_optimization_steps" value="5"/>
</operator>
<operator activated="true" class="extract_prototypes" expanded="true" height="76" name="Extract Cluster Prototypes" width="90" x="514" y="75"/>
<connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
<connect from_op="Extract Cluster Prototypes" from_port="example set" to_port="result 1"/>
<connect from_op="Extract Cluster Prototypes" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>

 

And I have attached an xlsx file with sample data.

Thanks!

 

Tibor

Attachments

Contributor II

Re: Clustering - weird salary values

Hi T-Bone!

 

Thanks for the quick answer! Smiley Happy Here is the script of my process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" expanded="true" name="Process">
<process expanded="true" height="701" width="955">
<operator activated="false" class="sample" expanded="true" height="76" name="Sample" width="90" x="179" y="480">
<parameter key="sample_size" value="21000"/>
</operator>
<operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve (2)" width="90" x="45" y="390">
<parameter key="repository_entry" value="Data- RapidMiner"/>
</operator>
<operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values" width="90" x="313" y="75">
<list key="columns"/>
</operator>
<operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="380" y="390">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Channel|Salary|Accomodation|Education|Age|Employment|Credit vol|Credit (pcs)"/>
</operator>
<operator activated="true" class="k_medoids" expanded="true" height="76" name="Clustering" width="90" x="581" y="255">
<parameter key="k" value="6"/>
<parameter key="max_runs" value="2"/>
<parameter key="max_optimization_steps" value="5"/>
</operator>
<operator activated="true" class="extract_prototypes" expanded="true" height="76" name="Extract Cluster Prototypes" width="90" x="514" y="75"/>
<connect from_op="Retrieve (2)" from_port="output" to_op="Replace Missing Values" to_port="example set input"/>
<connect from_op="Replace Missing Values" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
<connect from_op="Extract Cluster Prototypes" from_port="example set" to_port="result 1"/>
<connect from_op="Extract Cluster Prototypes" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>

 

And I've attached a CSV file with my data.

Thanks!

 

Tibor

Attachments

Community Manager

Re: Clustering - weird salary values

Ah, you have a problem in your salary data. It comes in as Polynominal data format not Numeric. When I tried to parse the numbers, I see that there are errant "," in them. So I did a FE to filter out those errors and then a Parse #.

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="false" class="sample" compatibility="5.0.000" expanded="true" height="82" name="Sample" width="90" x="179" y="480">
        <parameter key="sample_size" value="21000"/>
        <list key="sample_size_per_class"/>
        <list key="sample_ratio_per_class"/>
        <list key="sample_probability_per_class"/>
      </operator>
      <operator activated="true" class="retrieve" compatibility="7.4.000" expanded="true" height="68" name="Retrieve Data - Rapidminer" width="90" x="112" y="34">
        <parameter key="repository_entry" value="../data/Data - Rapidminer"/>
      </operator>
      <operator activated="true" class="replace_missing_values" compatibility="5.0.000" expanded="true" height="103" name="Replace Missing Values" width="90" x="246" y="34">
        <list key="columns"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.0.000" expanded="true" height="82" name="Select Attributes" width="90" x="380" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Channel|Salary|Accomodation|Education|Age|Employment|Credit vol|Credit (pcs)"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.4.000" expanded="true" height="103" name="Filter Examples" width="90" x="514" y="34">
        <parameter key="invert_filter" value="true"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="Salary.contains.,"/>
        </list>
      </operator>
      <operator activated="true" class="parse_numbers" compatibility="7.4.000" expanded="true" height="82" name="Parse Numbers" width="90" x="648" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Salary"/>
      </operator>
      <operator activated="true" class="k_medoids" compatibility="5.0.000" expanded="true" height="82" name="Clustering" width="90" x="782" y="34">
        <parameter key="k" value="6"/>
        <parameter key="max_runs" value="2"/>
        <parameter key="max_optimization_steps" value="5"/>
      </operator>
      <operator activated="true" class="extract_prototypes" compatibility="5.0.000" expanded="true" height="82" name="Extract Cluster Prototypes" width="90" x="916" y="34"/>
      <connect from_op="Retrieve Data - Rapidminer" from_port="output" to_op="Replace Missing Values" to_port="example set input"/>
      <connect from_op="Replace Missing Values" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Parse Numbers" to_port="example set input"/>
      <connect from_op="Parse Numbers" from_port="example set output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
      <connect from_op="Extract Cluster Prototypes" from_port="example set" to_port="result 1"/>
      <connect from_op="Extract Cluster Prototypes" from_port="model" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Elite III

Re: Clustering - weird salary values

@tibor_sebok keep in mind that you probably want to add a "normalize" operator to this process as well.  The K-means algorithm family is sensitive to the absolute scale of the attributes, and with mixed measures in particular, you have multiple attributes with very different ranges of raw data.  Currently it will be dominated by the attributes with the highest absolute variance.

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
Highlighted
Community Manager

Re: Clustering - weird salary values

Yep @Telcontar120 is right, you might want to use a Normalize operator.  On seperate note, you might want to try a Sample operator. I just realized that crunching through your entire dataset is taking some time. 

Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott