Clustering - weird salary values

tibor_seboktibor_sebok Member Posts: 3 Contributor I
edited November 2018 in Help

Hi guys,

 

I am new to Rapidminer, my first little project is clustering some of our customers. Attributes are like age, salary, job position and the amount of credit they have. In order to have a faster process, I have "translated" nominal data (like employement, channel) to numerical data.

 

I have run the clustering model, and except salary I got seemingly coherent data. 

 

Here is my centroid table:

Attributes cluster_0 cluster_1 cluster_2 cluster_3 cluster_4 cluster_5
Channel 1.0 0.0 1.0 1.0 3.0 0.0
Accomodation 1.0 2.0 2.0 2.0 3.0 1.0
Education 1.0 1.0 1.0 1.0 1.0 1.0
Employement 2.0 2.0 3.0 1.0 1.0 2.0
Credit (pcs) 4.0 1.0 2.0 1.0 2.0 2.0
Age 36.0 40.0 44.0 65.0 78.0 41.0
Salary 3417.0 4174.0 3100.0 79.0 226.0 8601.0
Credit (vol) 1181014.0 0.0 8690185.0 0.0 362750.0 3622658.0

 

What surprised me was the values in the Salary row - they should be much higher than that, the average salary in my data table for these customers are well above 188 k.

My question is: is there somethig I am missing or am I interpreteing the data wrong?

Thanks for the answers!

 

Tibor

Answers

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Well yes, something sounds weird but without seeing your data and process it's hard to guess here.


    Question: how did you convert your nominal values into numerical?  I'm assuming you're using K-means. If you select mixed measures you can use the Nominal and Numerical values without modifying them. 

  • tibor_seboktibor_sebok Member Posts: 3 Contributor I

    Hi T-Bone,

     

    First of all, thanks for the quick answer. :) The conversion of nominal data to numerical was inspired by the fact that in my experience it took longer for Rapidminer to run the process, but next time I will try your suggestion.

     

    The xml of my process is down below:

    ?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="701" width="955">
    <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="48" y="422">
    <parameter key="repository_entry" value="Elemzés - RCC KHR adatok 201704 adatok"/>
    </operator>
    <operator activated="false" class="sample" expanded="true" height="76" name="Sample" width="90" x="179" y="480">
    <parameter key="sample_size" value="21000"/>
    </operator>
    <operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="380" y="390">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="JOVEDELEM|LAKHELY_ERTEK|ISKOLAZOTTSAG_ERTEK|ELETKOR|ALKALMAZOTTI_ERTEK|KULSO_HITEL_OSSZESEN|KULSO_HITEL_DB|CSATORNA"/>
    </operator>
    <operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values" width="90" x="313" y="75">
    <list key="columns"/>
    </operator>
    <operator activated="true" class="k_medoids" expanded="true" height="76" name="Clustering" width="90" x="581" y="255">
    <parameter key="k" value="6"/>
    <parameter key="max_runs" value="2"/>
    <parameter key="max_optimization_steps" value="5"/>
    </operator>
    <operator activated="true" class="extract_prototypes" expanded="true" height="76" name="Extract Cluster Prototypes" width="90" x="514" y="75"/>
    <connect from_op="Retrieve" from_port="output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
    <connect from_op="Replace Missing Values" from_port="example set output" to_op="Clustering" to_port="example set"/>
    <connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
    <connect from_op="Extract Cluster Prototypes" from_port="example set" to_port="result 1"/>
    <connect from_op="Extract Cluster Prototypes" from_port="model" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

     

    And I have attached an xlsx file with sample data.

    Thanks!

     

    Tibor

  • tibor_seboktibor_sebok Member Posts: 3 Contributor I

    Hi T-Bone!

     

    Thanks for the quick answer! :) Here is the script of my process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="701" width="955">
    <operator activated="false" class="sample" expanded="true" height="76" name="Sample" width="90" x="179" y="480">
    <parameter key="sample_size" value="21000"/>
    </operator>
    <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve (2)" width="90" x="45" y="390">
    <parameter key="repository_entry" value="Data- RapidMiner"/>
    </operator>
    <operator activated="true" class="replace_missing_values" expanded="true" height="94" name="Replace Missing Values" width="90" x="313" y="75">
    <list key="columns"/>
    </operator>
    <operator activated="true" class="select_attributes" expanded="true" height="76" name="Select Attributes" width="90" x="380" y="390">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Channel|Salary|Accomodation|Education|Age|Employment|Credit vol|Credit (pcs)"/>
    </operator>
    <operator activated="true" class="k_medoids" expanded="true" height="76" name="Clustering" width="90" x="581" y="255">
    <parameter key="k" value="6"/>
    <parameter key="max_runs" value="2"/>
    <parameter key="max_optimization_steps" value="5"/>
    </operator>
    <operator activated="true" class="extract_prototypes" expanded="true" height="76" name="Extract Cluster Prototypes" width="90" x="514" y="75"/>
    <connect from_op="Retrieve (2)" from_port="output" to_op="Replace Missing Values" to_port="example set input"/>
    <connect from_op="Replace Missing Values" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Clustering" to_port="example set"/>
    <connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
    <connect from_op="Extract Cluster Prototypes" from_port="example set" to_port="result 1"/>
    <connect from_op="Extract Cluster Prototypes" from_port="model" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

     

    And I've attached a CSV file with my data.

    Thanks!

     

    Tibor

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Ah, you have a problem in your salary data. It comes in as Polynominal data format not Numeric. When I tried to parse the numbers, I see that there are errant "," in them. So I did a FE to filter out those errors and then a Parse #.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="5.0.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="false" class="sample" compatibility="5.0.000" expanded="true" height="82" name="Sample" width="90" x="179" y="480">
    <parameter key="sample_size" value="21000"/>
    <list key="sample_size_per_class"/>
    <list key="sample_ratio_per_class"/>
    <list key="sample_probability_per_class"/>
    </operator>
    <operator activated="true" class="retrieve" compatibility="7.4.000" expanded="true" height="68" name="Retrieve Data - Rapidminer" width="90" x="112" y="34">
    <parameter key="repository_entry" value="../data/Data - Rapidminer"/>
    </operator>
    <operator activated="true" class="replace_missing_values" compatibility="5.0.000" expanded="true" height="103" name="Replace Missing Values" width="90" x="246" y="34">
    <list key="columns"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="5.0.000" expanded="true" height="82" name="Select Attributes" width="90" x="380" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Channel|Salary|Accomodation|Education|Age|Employment|Credit vol|Credit (pcs)"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="7.4.000" expanded="true" height="103" name="Filter Examples" width="90" x="514" y="34">
    <parameter key="invert_filter" value="true"/>
    <list key="filters_list">
    <parameter key="filters_entry_key" value="Salary.contains.,"/>
    </list>
    </operator>
    <operator activated="true" class="parse_numbers" compatibility="7.4.000" expanded="true" height="82" name="Parse Numbers" width="90" x="648" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Salary"/>
    </operator>
    <operator activated="true" class="k_medoids" compatibility="5.0.000" expanded="true" height="82" name="Clustering" width="90" x="782" y="34">
    <parameter key="k" value="6"/>
    <parameter key="max_runs" value="2"/>
    <parameter key="max_optimization_steps" value="5"/>
    </operator>
    <operator activated="true" class="extract_prototypes" compatibility="5.0.000" expanded="true" height="82" name="Extract Cluster Prototypes" width="90" x="916" y="34"/>
    <connect from_op="Retrieve Data - Rapidminer" from_port="output" to_op="Replace Missing Values" to_port="example set input"/>
    <connect from_op="Replace Missing Values" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_op="Parse Numbers" to_port="example set input"/>
    <connect from_op="Parse Numbers" from_port="example set output" to_op="Clustering" to_port="example set"/>
    <connect from_op="Clustering" from_port="cluster model" to_op="Extract Cluster Prototypes" to_port="model"/>
    <connect from_op="Extract Cluster Prototypes" from_port="example set" to_port="result 1"/>
    <connect from_op="Extract Cluster Prototypes" from_port="model" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>
  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    @tibor_sebok keep in mind that you probably want to add a "normalize" operator to this process as well.  The K-means algorithm family is sensitive to the absolute scale of the attributes, and with mixed measures in particular, you have multiple attributes with very different ranges of raw data.  Currently it will be dominated by the attributes with the highest absolute variance.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Yep @Telcontar120 is right, you might want to use a Normalize operator.  On seperate note, you might want to try a Sample operator. I just realized that crunching through your entire dataset is taking some time. 

Sign In or Register to comment.