PCA (Kernel) RM vs Python: Different results

lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
edited December 2018 in Product Feedback - Resolved

Hi,

 

Sorry in advance if I made a mistake, but I found significant differences between RapidMiner and Python in the calculation of kpc_i by PCA (Kernel).

1. But first, why does PCA (Kernel), unlike the "classic" PCA operator, not offer:

 - in the parameters, the dimensionality reduction parameter?

 - in the results, the eigenvector and eigenvalue tables (with standard deviation, proportion of variance, etc.)?

How can this operator be used in practice?
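
On the Python side, the kind of table the classic PCA operator reports can be derived from the eigenvalues alone. The following is a minimal sketch, assuming the eigenvalues are non-negative and proportional to the variance carried by each component; `variance_summary` and `components_for_threshold` are hypothetical helper names, not part of RapidMiner or scikit-learn.

```python
import numpy as np


def variance_summary(eigenvalues):
    """Return (sqrt of eigenvalues, proportion of variance, cumulative proportion).

    Assumption: the eigenvalues are non-negative and proportional to the
    variance of each component. The square roots are true standard deviations
    only if the eigenvalues are the variances themselves.
    """
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # largest first
    proportion = ev / ev.sum()
    return np.sqrt(ev), proportion, np.cumsum(proportion)


def components_for_threshold(eigenvalues, threshold=0.95):
    """Smallest number of leading components whose cumulative proportion reaches threshold."""
    _, _, cumulative = variance_summary(eigenvalues)
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(cumulative))  # guard against floating-point round-off


if __name__ == "__main__":
    scale, proportion, cumulative = variance_summary([4.0, 3.0, 2.0, 1.0])
    print(scale, proportion, cumulative)
    print(components_for_threshold([4.0, 3.0, 2.0, 1.0], threshold=0.85))
```

With eigenvalues 4, 3, 2, 1 the cumulative proportions are 0.4, 0.7, 0.9 and 1.0, so a threshold of 0.85 keeps three components.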

 

2. As mentioned above, the kpc_i values differ by several orders of magnitude (I used kernel = "polynomial" and degree = "3" for the calculation):

RM : kpc_i ~10e12 / Python : kpc_i ~10e5

After some research, it seems that kpc_i = eigenvectors x sqrt(eigenvalues). Perhaps RM does not take the sqrt into account.
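
The relation kpc_i = eigenvectors x sqrt(eigenvalues) can be checked numerically outside either tool. Below is a minimal NumPy sketch, assuming a polynomial kernel of the form (gamma * <x, y> + coef0)^degree with gamma = 1 / n_features and coef0 = 1 (assumed values, not read from RapidMiner), and using random toy data rather than the cereals dataset. The function names are hypothetical.

```python
import numpy as np


def centered_poly_kernel(X, degree=3, gamma=None, coef0=1.0):
    """Centered polynomial kernel matrix (parameter defaults are assumptions)."""
    X = np.asarray(X, dtype=float)
    if gamma is None:
        gamma = 1.0 / X.shape[1]
    K = (gamma * (X @ X.T) + coef0) ** degree
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    # Center the kernel in feature space: K - 1K - K1 + 1K1
    return K - one @ K - K @ one + one @ K @ one


def kernel_pca_scores(X, n_components, **kernel_params):
    """Return (scores, eigenvalues, eigenvectors, centered kernel) for the training data."""
    Kc = centered_poly_kernel(X, **kernel_params)
    eigenvalues, eigenvectors = np.linalg.eigh(Kc)  # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1][:n_components]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    scores = eigenvectors * np.sqrt(eigenvalues)  # kpc_i = eigenvector_i * sqrt(eigenvalue_i)
    return scores, eigenvalues, eigenvectors, Kc


rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))  # toy data: 10 examples, 3 attributes
scores, eigenvalues, eigenvectors, Kc = kernel_pca_scores(X, n_components=3)

# Projecting the training data onto the normalized components gives the same scores.
projected = Kc @ (eigenvectors / np.sqrt(eigenvalues))
print(np.allclose(projected, scores))

# Scaling by the eigenvalues instead inflates each column by sqrt(eigenvalue).
unscaled = eigenvectors * eigenvalues
print(np.sqrt(eigenvalues))
```

If a tool multiplied by the eigenvalues rather than their square roots, each column would be off by a factor of sqrt(eigenvalue), which can amount to several orders of magnitude when the eigenvalues are large. Whether that is what happens in the operator is only a hypothesis here.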

You can find the process below, and the dataset in the attached file:

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="8.0.001" expanded="true" height="68" name="Read Excel" width="90" x="112" y="34">
<parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\12_Feature_12.2_cereals-PCA.xlsx"/>
<parameter key="imported_cell_range" value="A1:P78"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="name.true.polynominal.attribute"/>
<parameter key="1" value="mfr.true.polynominal.attribute"/>
<parameter key="2" value="type.true.polynominal.attribute"/>
<parameter key="3" value="calories.true.integer.attribute"/>
<parameter key="4" value="protein.true.integer.attribute"/>
<parameter key="5" value="fat.true.integer.attribute"/>
<parameter key="6" value="sodium.true.integer.attribute"/>
<parameter key="7" value="fiber.true.numeric.attribute"/>
<parameter key="8" value="carbo.true.numeric.attribute"/>
<parameter key="9" value="sugars.true.integer.attribute"/>
<parameter key="10" value="potass.true.integer.attribute"/>
<parameter key="11" value="vitamins.true.integer.attribute"/>
<parameter key="12" value="shelf.true.integer.attribute"/>
<parameter key="13" value="weight.true.numeric.attribute"/>
<parameter key="14" value="cups.true.numeric.attribute"/>
<parameter key="15" value="rating.true.real.attribute"/>
</list>
</operator>
<operator activated="true" class="set_role" compatibility="8.0.001" expanded="true" height="82" name="Set Role" width="90" x="246" y="34">
<parameter key="attribute_name" value="name"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles">
<parameter key="mfr" value="id"/>
<parameter key="type" value="id"/>
</list>
</operator>
<operator activated="true" class="select_attributes" compatibility="8.0.001" expanded="true" height="82" name="Select Attributes" width="90" x="380" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="name|mfr|type"/>
<parameter key="invert_selection" value="true"/>
</operator>
<operator activated="true" class="principal_component_analysis_kernel" compatibility="8.0.001" expanded="true" height="103" name="PCA (Kernel)" width="90" x="514" y="34">
<parameter key="kernel_type" value="polynomial"/>
</operator>
<operator activated="true" class="read_excel" compatibility="8.0.001" expanded="true" height="68" name="Read Excel (2)" width="90" x="112" y="391">
<parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Predictive_Analytics_and_Data_Mining\Dec 15 2014\12_Feature_12.2_cereals-PCA.xlsx"/>
<parameter key="imported_cell_range" value="A1:P78"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="name.true.polynominal.attribute"/>
<parameter key="1" value="mfr.true.polynominal.attribute"/>
<parameter key="2" value="type.true.polynominal.attribute"/>
<parameter key="3" value="calories.true.integer.attribute"/>
<parameter key="4" value="protein.true.integer.attribute"/>
<parameter key="5" value="fat.true.integer.attribute"/>
<parameter key="6" value="sodium.true.integer.attribute"/>
<parameter key="7" value="fiber.true.numeric.attribute"/>
<parameter key="8" value="carbo.true.numeric.attribute"/>
<parameter key="9" value="sugars.true.integer.attribute"/>
<parameter key="10" value="potass.true.integer.attribute"/>
<parameter key="11" value="vitamins.true.integer.attribute"/>
<parameter key="12" value="shelf.true.integer.attribute"/>
<parameter key="13" value="weight.true.numeric.attribute"/>
<parameter key="14" value="cups.true.numeric.attribute"/>
<parameter key="15" value="rating.true.real.attribute"/>
</list>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="124" name="PCA Kernel Python" width="90" x="313" y="391">
<parameter key="script" value="import pandas as pd&#10;from sklearn.decomposition import KernelPCA&#10;&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10; &#10; X = data.iloc[:,3:]&#10; attribute = list(X)&#10; kpca = KernelPCA(n_components = 13, kernel = 'poly',degree = 3)&#10; &#10; #Calculation of kpca_i&#10; k_PCA = kpca.fit_transform(X)&#10;&#10; #Calculation of eigenvalues&#10; eigen_values = kpca.lambdas_&#10; &#10; #Calculation of eigenvectors&#10; eigen_vectors = kpca.alphas_&#10;&#10; #Writing of results in datatables&#10; K_PCA = pd.DataFrame(data = k_PCA, columns = ['kpc_1','kpc_2','kpc_3','kpc_4','kpc_5','kpc_6','kpc_7','kpc_8','kpc_9','kpc_10','kpc_11','kpc_12','kpc_13'])&#10; &#10; components = pd.DataFrame(data = ['PC 1','PC 2','PC 3','PC 4','PC 5','PC 6','PC 7','PC 8','PC 9','PC 10','PC 11','PC 12','PC 13'],columns = ['Components'])&#10; eigenvalues = pd.DataFrame(data = eigen_values, columns = ['Eigenvalues'])&#10; components = components.join(eigenvalues)&#10;&#10; attributes = pd.DataFrame(data = attribute,columns = ['Attribute'])&#10; eigenvectors = pd.DataFrame(data = eigen_vectors, columns = ['PC 1','PC 2','PC 3','PC 4','PC 5','PC 6','PC 7','PC 8','PC 9','PC 10','PC 11','PC 12','PC 13'])&#10; attributes = attributes.join(eigenvectors) &#10;&#10; # connect 2 output ports to see the results&#10; return K_PCA,components,attributes"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="PCA (Kernel)" to_port="example set input"/>
<connect from_op="PCA (Kernel)" from_port="example set output" to_port="result 1"/>
<connect from_op="PCA (Kernel)" from_port="preprocessing model" to_port="result 5"/>
<connect from_op="Read Excel (2)" from_port="output" to_op="PCA Kernel Python" to_port="input 1"/>
<connect from_op="PCA Kernel Python" from_port="output 1" to_port="result 2"/>
<connect from_op="PCA Kernel Python" from_port="output 2" to_port="result 3"/>
<connect from_op="PCA Kernel Python" from_port="output 3" to_port="result 4"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
</process>
</operator>
</process>

Can you enlighten me on these points?

 

Thank you,

 

Best regards,

 

Lionel

 


Fixed and Released · Last Updated

9.5.0 DC-378

Comments

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    Hi Lionel,

     

    I checked your results and noticed that with the Kernel PCA operator the number of principal components is 77 (equal to the number of examples)! I also tried the tutorial process on Kernel PCA, and it goes from 5 attributes to 200 PCs (again, 200 examples). Furthermore, all PCs have the same variance (I calculated it with the Covariance Matrix operator). This is surely incorrect.
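
    Both observations can be reproduced on toy data with plain NumPy. The sketch below assumes an RBF kernel with gamma = 1 and random data (neither is taken from the tutorial process). A centered n x n kernel matrix yields at most n - 1 components with a non-zero eigenvalue, so the component count follows the number of examples unless it is truncated. In addition, unit-norm eigenvectors that have not been scaled by sqrt(eigenvalue) all have variance 1/n. That is one way identical variances can arise; whether the operator actually behaves this way is not confirmed here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))  # toy data: 8 examples, 2 attributes
n = X.shape[0]

# RBF kernel with gamma = 1 (an assumption made for this illustration)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists)

# Center the kernel matrix: H K H with H = I - (1/n) * ones
H = np.eye(n) - 1.0 / n
Kc = H @ K @ H

eigenvalues, eigenvectors = np.linalg.eigh(Kc)
eigenvalues, eigenvectors = eigenvalues[::-1], eigenvectors[:, ::-1]  # largest first

# Keep only components whose eigenvalue is numerically non-zero
keep = eigenvalues > 1e-10 * eigenvalues[0]
n_kept = int(keep.sum())
print("examples:", n, "attributes:", X.shape[1], "components kept:", n_kept)

raw = eigenvectors[:, keep]                # unit-norm eigenvectors, no scaling
scaled = raw * np.sqrt(eigenvalues[keep])  # scores scaled by sqrt(eigenvalue)

var_raw = raw.var(axis=0)        # identical for every component (1/n)
var_scaled = scaled.var(axis=0)  # eigenvalue / n, decreasing
print(var_raw)
print(var_scaled)
```

    In this sketch the number of components exceeds the number of attributes, which is expected for kernel PCA; restricting the output to the first few components is then a separate truncation step.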

     

    It pains me to say this, but I would use the Python script for your task.

     

    Best,

    Sebastian

     

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi Sebastian,

     

    Thank you for your feedback and your analysis.

     

    Best regards,

     

    Lionel.

     

    NB: I suppose there will be a fix in an upcoming release of RapidMiner?

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Moving to Product Feedback.

     

    Scott

     

     

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    I have already forwarded the problem to development. Can you confirm my observations on your end?

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
  • Gottfried Member Posts: 17 Maven

    I noticed the same issue. The result of PCA (Kernel) always has a number of principal components equal to the number of records in the example set. This is certainly a bug. Please let me know when this gets fixed.
