
"PCA vs PrincipalComponentsGenerator?"

Legacy User Member Posts: 0 Newbie
edited May 2019 in Help

From what I could see, the experiments ExampleSource-PrincipalComponentsGenerator (1) and ExampleSource-PCA-ModelApplier (2) generate the same output data sets if the input set contains a label attribute. If the input does not have a label, experiment (1) crashes at runtime, even though it passes validation. In addition, experiment (2) outputs the PCA model and offers more controls, such as the number of PCs.

If the PCA operator is clearly superior to the PrincipalComponentsGenerator, why do you keep the PrincipalComponentsGenerator? Or does it have any advantages I missed?



  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

    You are right, both deliver the same output. There are basically two reasons for keeping the PrincipalComponentsGenerator:

    1. backwards compatibility
    2. only one operator instead of two in cases where you are interested in the PCA only (without the model)

    It is, however, very likely that this operator will be marked as deprecated and removed in a future release.
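The difference between the two approaches can be sketched outside RapidMiner. The following is a minimal NumPy illustration, not RapidMiner's implementation: the function names `fit_pca`, `apply_pca`, and `pca_one_step` are hypothetical stand-ins for the PCA operator, the ModelApplier, and the PrincipalComponentsGenerator respectively. It shows why both routes give the same transformed data, while only the two-step route leaves you with a reusable model.

```python
import numpy as np

def fit_pca(X, n_components):
    """Learn a PCA 'model' (column means and principal axes) from X."""
    mean = X.mean(axis=0)
    # SVD of the centered data; the rows of vt are the principal axes,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(X, model):
    """Project X onto the principal axes stored in the model."""
    mean, components = model
    return (X - mean) @ components.T

def pca_one_step(X, n_components):
    """Fit and transform in one call; the model is discarded."""
    return apply_pca(X, fit_pca(X, n_components))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # synthetic data, for illustration only

model = fit_pca(X, 2)              # two-step route: keep the model ...
two_step = apply_pca(X, model)     # ... and apply it
one_step = pca_one_step(X, 2)      # one-step route: no model returned

# The kept model can also be applied to data it was not fitted on.
X_new = rng.normal(size=(10, 5))
new_scores = apply_pca(X_new, model)
```

In this sketch the one-step function is simply a convenience wrapper, which mirrors the "one operator instead of two" argument above.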

  • Stefan_E Member Posts: 53 Maven

    ... there seems to be another reason: Performance!

    I have a data set with 20 attributes, 5094 examples.
    1. PrincipalComponentsGenerator returns within a couple of seconds.
    2. PCA has taken 2900 s so far and is still running at 100% CPU load.

    When I put a sampling operator in front of PCA and sample for 70%, I get a result in ~10s - still slower than PrincipalComponentsGenerator, but at least tolerable.

    The dataset is such that PC-1 explains 99.97% of the variance; I don't know whether that has any impact.

    Kind regards, Stefan
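For readers unfamiliar with the "PC-1 explains 99.97% of the variance" figure, here is a minimal NumPy sketch of how such a share can be computed. It is an illustration under assumptions, not what RapidMiner does internally: the helper name `explained_variance_ratio` is hypothetical, and the data is synthetic, built so that a single direction dominates.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of the total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)
    # Singular values of the centered data, in decreasing order.
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2                   # proportional to the variance per component
    return var / var.sum()

rng = np.random.default_rng(0)
# Synthetic data: 500 examples, 20 attributes that all follow one strong
# common factor plus a little independent noise.
factor = rng.normal(size=(500, 1)) * 100.0
X = factor @ np.ones((1, 20)) + rng.normal(size=(500, 20))

ratios = explained_variance_ratio(X)
print(f"PC-1 share of variance: {ratios[0]:.4%}")
```

A first component this dominant means the remaining components carry almost no variance; whether that affects runtime depends on the implementation.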
  • Stefan_E Member Posts: 53 Maven
    Hmm... my dataset contained a line with missing values.
    Not very elegant of PCA, of course, to just go off to nirvana on such an input, but if I delete that line, it works.

    Kind regards, Stefan
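The workaround described above (remove the incomplete line, then run PCA) can be sketched in NumPy. This is an assumption-laden illustration rather than RapidMiner code: `drop_incomplete_rows` and `pca_scores` are hypothetical helpers, and missing values are assumed to be encoded as NaN. The PCA helper fails fast on missing values instead of attempting the decomposition.

```python
import numpy as np

def drop_incomplete_rows(X):
    """Return X without rows containing NaN, plus the number of rows dropped."""
    complete = ~np.isnan(X).any(axis=1)
    return X[complete], int((~complete).sum())

def pca_scores(X, n_components):
    """Project X onto its first principal axes; reject missing values up front."""
    if np.isnan(X).any():
        raise ValueError("input contains missing values; remove or impute them first")
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

# Tiny made-up example with one incomplete line.
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.1],
              [4.0, 7.9]])

clean, dropped = drop_incomplete_rows(X)
scores = pca_scores(clean, 1)
print(f"dropped {dropped} line(s), kept {clean.shape[0]}")
```

Dropping rows is the simplest option; imputing the missing values is an alternative when losing examples is not acceptable.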
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    We will increase the elegance of PCA by having it throw an error in the next version. :)

    Thanks for the hint,