# "PCA as with SPSS"

Hi there,

I am completely new to RapidMiner and quite new to statistics in general. Up to now I have only worked a little with SPSS.

Now in RapidMiner I wanted to repeat the things I've already done with SPSS, e.g. a Principal Component Analysis.

But I have no idea which operators and parameters are necessary to create the kind of output I am used to when working with SPSS:

[Screenshot of the SPSS output; image not available]

There I had a number of variables (Exx and Oxx) and set a fixed number (2) of components. The SPSS output is a table showing the factor loadings for all of my original variables.

After that I would usually evaluate the loadings, group the variables according to their factor loadings, and drop variables whose loadings are too low.

It seems that RapidMiner's PCA operator is doing something similar, but I have no clue which variables the PCs are generated from or how the PCs are computed.

I hope my explanation isn't too confusing. Maybe RapidMiner simply doesn't offer PCA this way.

Is there anybody out there who can help me?

Best regards,

ron

## Answers

RapidMiner focuses on automatic data processing, so there is no user interface optimized for applying just a single PCA and then looking at the results to decide manually which attributes/variables to keep.

But of course it's still possible. Let's go through this step by step. In the following process I added a Generate Data operator simply to produce some data that we can apply the PCA to. Then I added a Principal Component Analysis operator. Its outputs are the data projected onto the two resulting components and the model itself. The model contains the principal components with the factors as shown in SPSS. To see the results, you need to execute the process.

You can now take a look at the factors and use a Select Attributes operator to choose manually which attributes you want to keep.

But as I said, with RapidMiner we prefer doing it automatically. So we can use the "Weight by Component Model" operator to transform one of the components into a weighting vector. Then we can use the "Select by Weight" operator to keep only those attributes of the original data set that fulfill a given condition, for example only the top k attributes. I hope that helps.
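To make the weighting and selection steps a bit more concrete, here is a minimal Python sketch of the idea of turning one component into attribute weights and keeping the top k attributes. It is not RapidMiner's implementation: the loadings are made-up numbers, the function names are hypothetical, and scaling the absolute loadings so that the largest equals 1 is only an assumption about how such a weighting could be normalized.

```python
def weights_from_component(loadings):
    """Turn one component's loadings into attribute weights in [0, 1].

    Assumption: weight = absolute loading, scaled so the largest weight is 1.
    """
    largest = max(abs(value) for value in loadings.values())
    return {name: abs(value) / largest for name, value in loadings.items()}


def select_top_k(weights, k):
    """Return the names of the k attributes with the highest weight."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:k]


# Hypothetical loadings of five attributes on the first component.
pc1_loadings = {
    "att1": 0.62,
    "att2": -0.55,
    "att3": 0.08,
    "att4": 0.41,
    "att5": -0.12,
}

weights = weights_from_component(pc1_loadings)
kept = select_top_k(weights, k=3)
print(kept)  # ['att1', 'att2', 'att4']
```

The sign of a loading is ignored here, because a strongly negative loading contributes to the component just as much as a strongly positive one.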

Greetings,

Sebastian

Thank you very much for your hints. I really appreciate your help. Indeed, working with RapidMiner seems quite different from the SPSS way.

But I still don't understand which of the original attributes the new factors consist of and how the new factors are computed.

In your first XML example the dimensionality is reduced by variance with a threshold of 0.95. In the results I can look at the PCA model. In the "Eigenvalues" view I can see five factors (PC 1 to PC 5) and their proportional and cumulative variance. The "Eigenvectors" view shows the five original attributes and their PC 1 to PC 5 factor loadings. That's plausible so far.

But looking at the PCA example set results, suddenly there are just 4 attributes (named pc_1 to pc_4). I wonder how these four attributes are generated.

I suppose I still don't understand which arithmetic steps RapidMiner performs.

Maybe you have some more hints for me.

Well, once every principal component beyond the 95% of variance you wanted to keep has been dropped, only the first 4 remain.

The model keeps all of them anyway.

Do you mean which matrix operations are performed in the background? Do you really want to know that? They are just some standard calculations, and I doubt SPSS would show them to you.
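For anyone who does want to see the arithmetic, the following is a sketch of textbook covariance-based PCA in Python with NumPy. It is not taken from RapidMiner's source: the data are randomly generated, the function name `pca` is just for illustration, and the rule used for the 0.95 threshold (keep the smallest number of components whose cumulative variance reaches it) is an assumption.

```python
import numpy as np


def pca(data, variance_threshold=0.95):
    """Covariance-based PCA.

    Returns eigenvalues, eigenvectors, cumulative variance and the scores
    (the new pc_1 ... pc_k attributes).
    """
    centered = data - data.mean(axis=0)        # subtract each attribute's mean
    cov = np.cov(centered, rowvar=False)       # attribute-by-attribute covariance
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1]      # re-sort so PC 1 has the largest variance
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]      # column j holds the weights of PC j+1

    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # Assumption: keep the smallest number of components whose cumulative
    # variance reaches the threshold.
    n_keep = int(np.searchsorted(cumulative, variance_threshold)) + 1
    n_keep = min(n_keep, len(eigenvalues))

    scores = centered @ eigenvectors[:, :n_keep]  # project onto the kept components
    return eigenvalues, eigenvectors, cumulative, scores


# Made-up example set: 200 examples, 5 attributes with different spreads.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.5, 1.0, 0.5])

eigenvalues, eigenvectors, cumulative, scores = pca(data)
print("cumulative variance:", np.round(cumulative, 3))
print("components kept:", scores.shape[1])

# pc_1 of the first example, written out as a plain weighted sum.
first_row = data[0] - data.mean(axis=0)
pc1_by_hand = np.sum(first_row * eigenvectors[:, 0])
print(np.isclose(scores[0, 0], pc1_by_hand))
```

So each pc_j value is a weighted sum of all mean-centered original attributes, with the j-th eigenvector as the weights; no original attribute is dropped, only trailing components are. As far as I know, SPSS by default works on the correlation matrix and scales the loadings by the square root of the eigenvalue, so its table need not match the raw eigenvectors number for number.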

Greetings,

Sebastian

Just a thought...