Extracting SVD vectors for LSA

jacobcybulski Member, University Professor Posts: 391 Unicorn
edited December 2018 in Help

I am trying to perform latent semantic analysis (LSA) of text using SVD. I can see the SVD vectors in the resulting SVD model and I can play with them; however, there seems to be no way of extracting them (and no way of getting to the eigenvalues either). I know there was a similar post 5 years ago and the recommended solution was to use R instead. Lots of things must have changed since.

Today I can see three possibilities:

  1. Somehow split the model into its two matrices and access them as examples - at this stage, I cannot see such an operator;
  2. Get to the weights of SVD components and extract them one by one - Weight by Component Model looked promising but I could not achieve the required result;
  3. Save the model and read it back as XML - seemed like an idea but the XML seems very complex and I cannot find the required bits to easily read them in as vectors or examples.

Any ideas on how to do this in pure RapidMiner, apart from the R solution, which at this stage seems the simplest?

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hi @jacobcybulski - I don't know if this helps at all but have you looked at @mschmitz's implementation of LDA for topic analysis? It's in the Operator Toolbox extension.

     

    Scott

     

  • jacobcybulski Member, University Professor Posts: 391 Unicorn

    Thank you Scott, indeed this is helpful - I have missed this version of LDA but was using an alternative Rmx Corpus Linguistics (Kobra) plugin. The Operator Toolbox LDA seems to work very well. It would still be useful to be able to extract various model components from SVD (and PCA as well). As I teach text analytics with RM, uses of SVD and PCA are beyond simple dimensionality reduction, so getting our hands on singular values, principal components and eigenvectors would be great. Perhaps I need to peer into the SVD and PCA code to see how they write these matrices into the model.

     

    Jacob

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist

    Hi @jacobcybulski,

     

    have a look at the Converters extension. It has an operator called PCA to ExampleSet, which can get you the eigenvector table.

     

    What exactly would you need for SVD? Could you please show a screenshot? I will sit down and write either a quick operator for Converters or a Groovy script to overcome this.

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • jacobcybulski Member, University Professor Posts: 391 Unicorn

    Sounds like a great offer - thanks!

     

    The PCA to ExampleSet operator is great, and the same thing is exactly what's needed for SVD: an SVD to ExampleSet operator that takes the SVD model and extracts the SVD eigenvalues and SVD vectors in almost the same way. It would produce an example set with one SVD component per row, with the attributes Component, Singular Value, Proportion of Singular Values, Cumulative Singular Values and Cumulative Proportion of Singular Values, followed by the SVD vector elements (as is the case with PCA). Note that the two cumulative attributes can be derived from Singular Value alone, but the PCA output includes its cumulative attributes as well, so I have listed them here too.
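For illustration only, the table layout described above can be sketched outside RapidMiner with NumPy. This is not the Converters operator or any RapidMiner API; the function name `svd_summary`, the toy matrix, and the choice to compute the proportion as each singular value divided by the plain sum of singular values are all assumptions made for this sketch.

```python
import numpy as np

def svd_summary(X):
    """Build one summary row per SVD component (hypothetical helper).

    Assumption: "proportion" means singular value / sum of singular values.
    A different normalisation (e.g. squared values) may be what a given tool reports.
    """
    # full_matrices=False gives the compact SVD: X = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    proportion = s / s.sum()
    cumulative = np.cumsum(s)
    cumulative_proportion = np.cumsum(proportion)

    rows = []
    for i in range(len(s)):
        rows.append({
            "Component": f"SVD {i + 1}",
            "Singular Value": float(s[i]),
            "Proportion of Singular Values": float(proportion[i]),
            "Cumulative Singular Values": float(cumulative[i]),
            "Cumulative Proportion of Singular Values": float(cumulative_proportion[i]),
            # Row i of Vt is the i-th right singular vector (one element per attribute)
            "Vector": Vt[i].tolist(),
        })
    return rows

# Toy document-term matrix (rows: documents, columns: terms), made up for the example
X = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0, 0.0],
              [0.0, 3.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 2.0]])

rows = svd_summary(X)
for row in rows:
    print(row["Component"], round(row["Singular Value"], 4),
          round(row["Cumulative Proportion of Singular Values"], 4))
```

Each dictionary corresponds to one row of the proposed example set, with the vector elements following the summary attributes.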

     

    Once we are able to recover the SVD vectors, these can be used to "approximate" topics.
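As a rough sketch of what "approximating" topics from the recovered vectors could look like: for each SVD vector, list the terms with the largest absolute loadings. Again this is plain NumPy, not RapidMiner; the helper name `top_terms`, the vocabulary and the toy matrix are invented for the example, and ranking by absolute loading is just one possible heuristic.

```python
import numpy as np

def top_terms(Vt, vocabulary, n_terms=3):
    """For each right singular vector, return the n_terms terms with the
    largest absolute loadings (hypothetical helper, one simple heuristic)."""
    topics = []
    for vector in Vt:
        # Sort term indices by descending absolute loading
        order = np.argsort(-np.abs(vector))[:n_terms]
        topics.append([(vocabulary[j], float(vector[j])) for j in order])
    return topics

# Toy vocabulary and document-term matrix with two obvious themes
vocabulary = ["cat", "dog", "stock", "bond"]
X = np.array([[3.0, 2.0, 0.0, 0.0],
              [2.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0, 2.0]])

_, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the two strongest components as "topics"
topics = top_terms(Vt[:2], vocabulary, n_terms=2)
for i, terms in enumerate(topics, start=1):
    print(f"SVD {i}:", [term for term, _ in terms])
```

The sign of a singular vector is arbitrary, so only the magnitude of the loadings is used for ranking here.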

     

    Thanks -- Jacob

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist

    Hi @jacobcybulski,

    to be super sure I don't code the wrong thing, you want to have this table: [image: SVD2.png]

     

     

    and not the SVD Vectors? Or one operator with two outputs with exactly these two tables which are visible in the preprocessing model?

     

    Best,

    Martin

  • jacobcybulski Member, University Professor Posts: 391 Unicorn

    In my previous message, I was describing an output most similar to that obtained from the PCA Result to ExampleSet (thinking that this would be just a dump of the previous code). However, one operator (say SVD Result to ExampleSet) with the two tables from the pre-processing model (Eigenvalues and SVD Vectors) would definitely be the cleanest.

    Thanks -- Jacob

  • jacobcybulski Member, University Professor Posts: 391 Unicorn

    Just one little comment on these two tables. At the moment the SVD vectors are called "SVD Vector N" in one table and "SVD N" (in rows) in the other. For consistency, and to make merging the two tables easy (after transposing the SVD vectors, if needed), could both tables use the same naming convention, say "SVD N"?

    Jacob

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist

    Hi @jacobcybulski,

    I've drafted a version which is in internal review. I hope I can publish it to the Marketplace next week.

    Did you have a chance to check the version I shared earlier?

     

    Best,

    Martin

  • jacobcybulski Member, University Professor Posts: 391 Unicorn

    Have I missed something? I may be confused! I have used the version from before our discussion, which included the PCA extractor but no SVD extractor. Was there some other one around? Was it on GitHub?

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist

    Hi,

     

    I've sent you an email to the address you use for this account. Did it not come through?

     

    Best,

    Martin
