"PCA vs PrincipalComponentGenerator?"

Legacy User Member Posts: 0 Newbie
edited May 2019 in Help
Hi,

From what I could see, experiments ExampleSource-PrincipalComponentsGenerator (1) and ExampleSource-PCA-ModelApplier (2) generate the same output data sets if the input set contains a label attribute. If the input does not have a label, experiment (1) crashes at runtime, even though it passes validation. In addition, experiment (2) outputs the PCA model and has more controls (number of PCs).

If the PCA operator is clearly superior to the PrincipalComponentsGenerator, why do you keep the PrincipalComponentsGenerator? Or does it have any advantages I missed?

Victor

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    You are right. They deliver the same output. There are basically two reasons for keeping the PrincipalComponentsGenerator:

    1. backwards compatibility
    2. only one operator instead of two in cases where you are interested in the PCA only (without the model)

    It is, however, very likely that this operator will be marked as deprecated and removed in a future release.

    Cheers,
    Ingo
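To make the trade-off between the two routes concrete, here is a minimal Python sketch (not RapidMiner code). It assumes a pure-Python power iteration that estimates only the first principal component; the names `fit_pca`, `apply_pca`, and `generate_components` are hypothetical and do not correspond to any RapidMiner API. The two-step route returns a reusable model, while the one-step route just returns the transformed data.

```python
import math

def fit_pca(rows, iterations=200):
    """Estimate a 'model' (column means, first principal component).

    Illustrative only: uses power iteration on the sample covariance
    matrix, which finds just the leading component and assumes the
    start vector is not orthogonal to it.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0 / math.sqrt(d)] * d  # normalized start vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:  # no variance at all; keep the current vector
            break
        vec = [v / norm for v in nxt]
    return means, vec

def apply_pca(model, rows):
    """Project rows onto the component stored in the model."""
    means, vec = model
    return [sum((r[j] - means[j]) * vec[j] for j in range(len(vec)))
            for r in rows]

def generate_components(rows):
    """One-step variant: fit and apply at once, discarding the model."""
    return apply_pca(fit_pca(rows), rows)

rows = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
model = fit_pca(rows)                          # two-step route keeps the model
two_step = apply_pca(model, rows)
one_step = generate_components(rows)           # one-step route does not
new_scores = apply_pca(model, [[1.5, 3.0]])    # the model can score unseen rows
```

On the same input both routes give identical scores; the difference is only that the kept model can later be applied to data it was not fitted on.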
  • Stefan_E Member Posts: 53 Maven
    Hi,

    ... there seems to be another reason: Performance!

    I have a data set with 20 attributes, 5094 examples.
    1. PrincipalComponentsGenerator returns in a matter of a couple of seconds.
    2. PCA has taken 2900 s so far and is still running at 100% CPU load

    When I put a sampling operator in front of PCA and sample for 70%, I get a result in ~10s - still slower than PrincipalComponentsGenerator, but at least tolerable.

    The dataset is such that PC-1 explains 99.97% of the variance - I don't know whether that has any impact.

    Kind regards,
    Stefan
  • Stefan_E Member Posts: 53 Maven
    Hmm... my dataset contained a line with missing values.
    Not very elegant of PCA, of course, to just go off to nirvana with such an input, but if I delete that line, it works.

    Kind regards,
    Stefan
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    We will increase the elegance of PCA by throwing an error in the next version. :)

    Thanks for the hint,
      Sebastian
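The two options discussed above (deleting the offending line, or failing with an error instead of running indefinitely) can be sketched in Python as follows. This is only an illustration under the assumption that missing values appear as `None` or NaN; the helper names `find_missing`, `require_complete`, and `drop_incomplete` are hypothetical and say nothing about how RapidMiner implements its check.

```python
import math

def find_missing(rows):
    """Return the indices of rows that contain None or NaN."""
    bad = []
    for i, row in enumerate(rows):
        if any(v is None or (isinstance(v, float) and math.isnan(v))
               for v in row):
            bad.append(i)
    return bad

def require_complete(rows):
    """Fail fast: raise instead of passing incomplete data to PCA."""
    bad = find_missing(rows)
    if bad:
        raise ValueError(
            f"{len(bad)} row(s) contain missing values, first at index {bad[0]}"
        )
    return rows

def drop_incomplete(rows):
    """Workaround: remove rows with missing values before running PCA."""
    bad = set(find_missing(rows))
    return [r for i, r in enumerate(rows) if i not in bad]

data = [[1.0, 2.0], [float("nan"), 3.0], [4.0, None]]
clean = drop_incomplete(data)   # keeps only the first row
```

Calling `require_complete(data)` on the sample above raises a `ValueError`, whereas `require_complete(clean)` returns the rows unchanged.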