How can I give some attributes more influence? Some attributes are more important than others

Selim · April 2019

I clustered my data with k-means in my process, and now I want some attributes to have more influence on the result. I tried Generate Attributes, but it does not seem to affect the attributes. What do I need to do? Which approach or operator would make this work?
I have shared my XML.
Thanks
-------------------

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">

</context>

</list>

</operator>

</operator>

</operator>

</operator>

</list>

</operator>

</operator>

</operator>

</operator>

</list>

</operator>

<parameter key="script" value="import pandas as pd
from operator import itemgetter
import numpy as np
import random
import sys
from scipy.spatial import distance
from sklearn.cluster import KMeans


C = %{cluster_number}

def k_means(X):
    kmeans = KMeans(n_clusters=C, random_state=0).fit(X)
    return kmeans.cluster_centers_


def samesizecluster( D ):
    """ in: point-to-cluster-centre distances D, Npt x C

    out: xtoc, X -> C, equal-size clusters

    """
    Npt, C = D.shape
    clustersize = (Npt + C - 1) // C
    xcd = list( np.ndenumerate(D) )  # ((0,0), d00), ((0,1), d01) ...
    xcd.sort( key=itemgetter(1) )
    xtoc = np.ones( Npt, int ) * -1
    nincluster = np.zeros( C, int )
    nall = 0
    for (x,c), d in xcd:
        if xtoc[x] < 0 and nincluster[c] < clustersize:
            xtoc[x] = c
            nincluster[c] += 1
            nall += 1
            if nall >= Npt: break
    return xtoc


def rm_main(data):
    data_2 = data.values

    centres = k_means(data_2)
    D = distance.cdist( data_2, centres )
    xtoc = samesizecluster( D )
    data['cluster'] = xtoc

    return data"/>

</operator>

</operator>

</operator>

</operator>

</process>

</operator>

</process>

yyhuang · April 2019

Hi @Selim,

Thanks for sharing your process with the integrated Python script.

I see that you are clustering your data with k-means. How about normalizing the attributes? For the important attributes, we can use the range transformation method to map the values into a wider range, e.g. [0, 3], while everything else is normalized into [0, 1].

Since k-means can be run with Euclidean distance, the normalization changes the computed distances: an attribute with a wider range contributes more to the distance and so has more influence on the clustering.
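To make the effect of the wider range concrete, here is a minimal pure-Python sketch. It is not the implementation of the Normalize operator; the helper `range_transform`, the helper `euclidean`, and the toy values are assumptions made up for illustration. It rescales one attribute into [0, 1] and the emphasised one into [0, 3], then compares the Euclidean distance between the same two rows under both scalings.

```python
import math

def range_transform(values, new_min=0.0, new_max=1.0):
    """Min-max rescale a list of numbers into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: map everything to new_min to avoid dividing by zero.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

def euclidean(p, q):
    """Plain Euclidean distance between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# Toy data (made up): two attributes on different raw scales.
a1 = [10.0, 20.0, 30.0, 40.0]
a3 = [1.0, 1.5, 3.0, 5.0]

a1_n = range_transform(a1, 0.0, 1.0)      # ordinary attribute -> [0, 1]
a3_equal = range_transform(a3, 0.0, 1.0)  # a3 treated like the others
a3_heavy = range_transform(a3, 0.0, 3.0)  # a3 emphasised -> [0, 3]

# Distance between the first and the last row under both scalings.
d_equal = euclidean((a1_n[0], a3_equal[0]), (a1_n[3], a3_equal[3]))
d_heavy = euclidean((a1_n[0], a3_heavy[0]), (a1_n[3], a3_heavy[3]))
print(d_equal, d_heavy)
```

In this toy case the a3 difference between the two rows grows from 1 to 3, so its share of the squared distance grows from 1 to 9, and the distance goes from sqrt(2) to sqrt(10).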

Image: https://us.v-cdn.net/6030995/uploads/editor/nw/o77omv6qhfch.png

Take the Iris data as an example. We put a much higher weight on a3 than on the other attributes a1, a2 and a4.
Here is the normalization process before clustering.

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process" origin="GENERATED_TUTORIAL">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve Iris" origin="GENERATED_TUTORIAL" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Iris"/>
        <description align="center" color="blue" colored="true" width="126">The Iris data set is retrieved from the Samples folder.&lt;br/&gt;The label Attribute remains in the ExampleSet for comparison the results of the Clustering. It is not used in the Clustering itself.</description>
      </operator>
      <operator activated="true" class="normalize" compatibility="9.2.001" expanded="true" height="103" name="Normalize" width="90" x="179" y="34">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value="a1|a2|a4"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="numeric"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="real"/>
        <parameter key="block_type" value="value_series"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_series_end"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="method" value="range transformation"/>
        <parameter key="min" value="0.0"/>
        <parameter key="max" value="1.0"/>
        <parameter key="allow_negative_values" value="false"/>
      </operator>
      <operator activated="true" class="normalize" compatibility="9.2.001" expanded="true" height="103" name="Normalize (2)" width="90" x="313" y="34">
        <parameter key="return_preprocessing_model" value="false"/>
        <parameter key="create_view" value="false"/>
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="a3"/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="numeric"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="real"/>
        <parameter key="block_type" value="value_series"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_series_end"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="method" value="range transformation"/>
        <parameter key="min" value="0.0"/>
        <parameter key="max" value="3.0"/>
        <parameter key="allow_negative_values" value="false"/>
      </operator>
      <operator activated="true" class="concurrency:k_means" compatibility="9.0.001" expanded="true" height="82" name="Clustering" origin="GENERATED_TUTORIAL" width="90" x="447" y="34">
        <parameter key="add_cluster_attribute" value="true"/>
        <parameter key="add_as_label" value="false"/>
        <parameter key="remove_unlabeled" value="false"/>
        <parameter key="k" value="3"/>
        <parameter key="max_runs" value="10"/>
        <parameter key="determine_good_start_values" value="false"/>
        <parameter key="measure_types" value="BregmanDivergences"/>
        <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
        <parameter key="nominal_measure" value="NominalDistance"/>
        <parameter key="numerical_measure" value="EuclideanDistance"/>
        <parameter key="divergence" value="SquaredEuclideanDistance"/>
        <parameter key="kernel_type" value="radial"/>
        <parameter key="kernel_gamma" value="1.0"/>
        <parameter key="kernel_sigma1" value="1.0"/>
        <parameter key="kernel_sigma2" value="0.0"/>
        <parameter key="kernel_sigma3" value="2.0"/>
        <parameter key="kernel_degree" value="3.0"/>
        <parameter key="kernel_shift" value="1.0"/>
        <parameter key="kernel_a" value="1.0"/>
        <parameter key="kernel_b" value="0.0"/>
        <parameter key="max_optimization_steps" value="100"/>
        <parameter key="use_local_random_seed" value="true"/>
        <parameter key="local_random_seed" value="1992"/>
        <description align="center" color="green" colored="true" width="126">The k-Means algorithm is used to determine three clusters on the Iris data set and assign each Example to one cluster.</description>
      </operator>
      <connect from_op="Retrieve Iris" from_port="output" to_op="Normalize" to_port="example set input"/>
      <connect from_op="Normalize" from_port="example set output" to_op="Normalize (2)" to_port="example set input"/>
      <connect from_op="Normalize (2)" from_port="example set output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="clustered set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <description align="left" color="yellow" colored="false" height="281" resized="true" width="788" x="24" y="371">Look into the results of the process:&lt;br&gt;ExampleSet (Rename):&lt;br&gt;- cluster_0 consist mainly of iris_virginica Examples (36) with only a few (3) iris_versicolor Examples&lt;br&gt;- cluster_1 consists completely of iris_setosa Examples (50). Also iris_setosa Example cannot be found in other clusters.&lt;br&gt;- cluster_2 consists most of iris_versicolor Examples (47) but with also some (14) iris_virginica Examples&lt;br&gt;&lt;br&gt;ExampleSet (Clustering):&lt;br&gt;- You can visualize the assignment of the Examples to the clusters by using the 'Scatter' Chart, plotting two of the Attributes a1,a2,a3,a4 on x-and y-axis and the cluster Attribute as Color Column&lt;br&gt;&lt;br&gt;Cluster Model (Clustering):&lt;br&gt;- The Cluster Model consist information which Example is assigned to which cluster&lt;br/&gt;- the size of the clusters can be visualized as a graph&lt;br/&gt;- the position of the centroids is listed</description>
    </process>
  </operator>
</process>

Best,

YY

Selim · April 2019

When I weight the attributes with Generate Attributes, it actually works. Can you check it? I think both approaches are suitable for solving this problem.

yyhuang · April 2019

Yes. Generate Attributes applies a factor that increases or reduces the impact of an attribute. It has a similar effect.
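As a rough illustration of why a factor has this effect, the sketch below multiplies one attribute by 3 and checks which of two fixed centroids a point is closest to. The function names (`apply_factors`, `nearest`), the factors, the point, and the centroids are all hypothetical and chosen only for the example. It also checks that multiplying an attribute by a factor f is the same as weighting its squared difference by f squared.

```python
import math

def apply_factors(point, factors):
    """Multiply each attribute by its factor (similar in spirit to a
    Generate Attributes expression such as a2 * 3; factors are made up)."""
    return tuple(v * f for v, f in zip(point, factors))

def nearest(point, centroids):
    """Index of the centroid closest to point in Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

# Toy point and centroids (hypothetical values on a [0, 1] scale).
point = (0.5, 0.4)
centroids = [(0.0, 0.5), (0.6, 0.0)]

# Unweighted: the first attribute decides, centroid 1 is closer.
before = nearest(point, centroids)

# Weight the second attribute by 3: now centroid 0 is closer.
factors = (1.0, 3.0)
after = nearest(apply_factors(point, factors),
                [apply_factors(c, factors) for c in centroids])

# Scaling by f is equivalent to weighting squared differences by f**2.
scaled = math.dist(apply_factors(point, factors),
                   apply_factors(centroids[0], factors))
weighted = math.sqrt(sum((f ** 2) * (x - y) ** 2
                         for f, x, y in zip(factors, point, centroids[0])))
print(before, after, scaled, weighted)
```

So a factor of 3 on an attribute counts its differences nine times as heavily in the squared distance, which is enough here to move the point from one centroid to the other.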

Selim · April 2019

@yyhuang When we apply weights in ordinary problems, the coefficients sum to 1. But when I apply something like that in this process, nothing changes. When I use 10 as the coefficient for "hacim", the "hacim" attribute does change. I have attached photos of the change. What do you think about this change?
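One general property of Euclidean distance may be relevant here, shown as a hedged sketch rather than a diagnosis of this particular process: for a fixed set of centroids, only the ratios between the factors matter, not whether they sum to 1. Equal coefficients such as 0.5 and 0.5 shrink every distance by the same amount and leave the nearest centroid unchanged, while a 1:10 ratio changes the assignment whether or not it is rescaled to sum to 1. All names and values below are made up for illustration.

```python
import math

def nearest(point, centroids, factors):
    """Nearest centroid after multiplying every attribute by its factor."""
    def scale(p):
        return [v * f for v, f in zip(p, factors)]
    sp = scale(point)
    return min(range(len(centroids)),
               key=lambda i: math.dist(sp, scale(centroids[i])))

# Toy points and fixed centroids (hypothetical values).
points = [(0.5, 0.4), (0.1, 0.9), (0.8, 0.2)]
centroids = [(0.0, 0.5), (0.6, 0.0)]

def assign(factors):
    """Assign every toy point to its nearest centroid under these factors."""
    return [nearest(p, centroids, factors) for p in points]

unweighted = assign((1.0, 1.0))
equal_sum_to_one = assign((0.5, 0.5))            # sums to 1, ratio still 1:1
emphasised = assign((1.0, 10.0))                 # second attribute x10
emphasised_sum_to_one = assign((1 / 11, 10 / 11))  # same 1:10 ratio, sums to 1

print(unweighted, equal_sum_to_one, emphasised, emphasised_sum_to_one)
```

In this toy example the first point switches centroid only when the ratio between the factors changes. Rescaling the factors so that they sum to 1 makes no difference on its own.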

Selim · April 2019

And when I apply your approach (Normalize, Normalize (2)), the values of the "ağırlık" attribute become 0.001. Why is that? I think there is a difference between these two ways of weighting, because the items are clustered differently. We need to work out which one is better. Do you have any suggestions?
This is my latest XML; please check the coefficients.
----------------------------------

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">

</context>

</list>

</operator>

</operator>

</operator>

</operator>

</list>

</operator>

</operator>

</operator>

</operator>

</list>

</operator>

<parameter key="script" value="import pandas as pd
from operator import itemgetter
import numpy as np
import random
import sys
from scipy.spatial import distance
from sklearn.cluster import KMeans


C = %{cluster_number}

def k_means(X):
    kmeans = KMeans(n_clusters=C, random_state=0).fit(X)
    return kmeans.cluster_centers_


def samesizecluster( D ):
    """ in: point-to-cluster-centre distances D, Npt x C

    out: xtoc, X -> C, equal-size clusters

    """
    Npt, C = D.shape
    clustersize = (Npt + C - 1) // C
    xcd = list( np.ndenumerate(D) )  # ((0,0), d00), ((0,1), d01) ...
    xcd.sort( key=itemgetter(1) )
    xtoc = np.ones( Npt, int ) * -1
    nincluster = np.zeros( C, int )
    nall = 0
    for (x,c), d in xcd:
        if xtoc[x] < 0 and nincluster[c] < clustersize:
            xtoc[x] = c
            nincluster[c] += 1
            nall += 1
            if nall >= Npt: break
    return xtoc


def rm_main(data):
    data_2 = data.values

    centres = k_means(data_2)
    D = distance.cdist( data_2, centres )
    xtoc = samesizecluster( D )
    data['cluster'] = xtoc

    return data"/>

</operator>

</operator>

</operator>

</operator>

</process>

</operator>

</process>


Altair RapidMiner Community
