# Extracting SVD vectors for LSA

jacobcybulski
Member, University Professor Posts: **391** Unicorn

I am trying to perform latent semantic analysis (LSA) of text using SVD. I can see the SVD vectors in the resulting SVD model and I can play with them; however, there seems to be no way of extracting them (and no way of getting to the eigenvalues either). I know there was a similar post 5 years ago and the recommended solution was to use R instead. Lots of things must have changed since.

Today I can see three possibilities:

- Somehow split the model into its two matrices and access them as example sets. At this stage, I cannot see such an operator.
- Get to the weights of the SVD components and extract them one by one. Weight by Component Model looked promising, but I could not achieve the required result.
- Save the model and read it back as XML. This seemed like an idea, but the XML is very complex and I cannot find the required bits to easily read them in as vectors or examples.

Any ideas on how to do this in pure RapidMiner, apart from the R solution, which at this stage seems the simplest?

## Answers

2,959 Community Manager

Hi @jacobcybulski - I don't know if this helps at all, but have you looked at @mschmitz's implementation of LDA for topic analysis? It's in the Operator Toolbox extension.

Scott

391 Unicorn

Thank you Scott, this is indeed helpful. I had missed this version of LDA and was using an alternative, the Rmx Corpus Linguistics (Kobra) plugin. The Operator Toolbox LDA seems to work very well. It would still be useful to be able to extract the various model components from SVD (and PCA as well). As I teach text analytics with RM, our uses of SVD and PCA go beyond simple dimensionality reduction, so getting our hands on singular values, principal components and eigenvectors would be great. Perhaps I need to peer into the SVD and PCA code to see how they write these matrices into the model.

Jacob

3,517 RM Data Scientist

Hi @jacobcybulski,

Have a look at the Converters extension. It has an operator called PCA to ExampleSet which can get you the eigenvector table.

What exactly would you need for SVD? Could you please show a screenshot? I will sit down and write either a quick operator for Converters or a Groovy script to overcome this.

Best,

Martin

Dortmund, Germany

391 Unicorn

Sounds like a great offer - thanks!

The PCA to ExampleSet is great, and this is exactly what's needed for SVD: an SVD to ExampleSet operator which would take the SVD model and extract the SVD eigenvalues and SVD vectors in almost the same way. That is, SVD to ExampleSet would produce an example set with one SVD component per row, and the attributes would include Component, Singular Value, Proportion of Singular Values, Cumulative Singular Values and Cumulative Proportion of Singular Values, followed by the SVD vector elements as attributes (as is the case with PCA). Note that Cumulative Singular Values and Cumulative Proportion of Singular Values can be derived from the Singular Value, but the cumulative attributes are also included for PCA, so it makes sense to keep them here.
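To make the requested layout concrete, here is a minimal sketch in Python with NumPy (not RapidMiner code, and not how any existing operator works). The function name `svd_summary` and the toy matrix are made up for illustration, and it assumes the proportions are computed from the singular values themselves rather than from their squares:

```python
import numpy as np

def svd_summary(matrix):
    """Return (rows, vectors): one summary row per SVD component,
    plus the right singular vectors (one vector per row)."""
    # Thin SVD: X = U @ diag(s) @ Vt, with s sorted in descending order.
    _, s, vt = np.linalg.svd(np.asarray(matrix, dtype=float), full_matrices=False)
    cumulative = np.cumsum(s)
    total = s.sum()
    rows = []
    for i in range(len(s)):
        rows.append({
            "Component": f"SVD {i + 1}",
            "Singular Value": float(s[i]),
            # Assumption: proportion of the singular values, not of their squares.
            "Proportion of Singular Values": float(s[i] / total),
            "Cumulative Singular Values": float(cumulative[i]),
            "Cumulative Proportion of Singular Values": float(cumulative[i] / total),
        })
    return rows, vt

# Toy document-term matrix (3 documents x 4 terms), invented for illustration.
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 3, 1]])

rows, vectors = svd_summary(X)
for row, vec in zip(rows, vectors):
    # Summary columns first, then the SVD vector elements, as with PCA.
    print(row["Component"], round(row["Singular Value"], 3), np.round(vec, 3))
```

Each printed line corresponds to one row of the proposed example set: the component name, its singular value, and then the elements of its SVD vector.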

Once we are able to recover the SVD vectors, these can be used to "approximate" topics.
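As a rough illustration of what "approximating" topics could look like, the sketch below (Python with NumPy, outside RapidMiner) lists the terms with the largest absolute loadings in each SVD vector. The vocabulary, the matrix and the helper `top_terms` are all hypothetical, and ranking by absolute loading is just one simple heuristic:

```python
import numpy as np

# Hypothetical vocabulary and toy document-term matrix (3 documents x 4 terms).
terms = ["cat", "dog", "stock", "market"]
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 3, 1]], dtype=float)

# Rows of vt are the right singular vectors, i.e. loadings over terms.
_, _, vt = np.linalg.svd(X, full_matrices=False)

def top_terms(vector, vocabulary, k=2):
    """Return the k terms with the largest absolute loading in one SVD vector."""
    # The sign of a singular vector is arbitrary, so rank by magnitude.
    order = np.argsort(-np.abs(vector))[:k]
    return [vocabulary[i] for i in order]

topics = [top_terms(v, terms) for v in vt]
for n, topic in enumerate(topics, start=1):
    print(f"SVD {n}:", ", ".join(topic))
```

With this toy data the first component is dominated by the finance terms and the remaining ones by the animal terms, which is the kind of grouping one would read off as a topic.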

Thanks -- Jacob

3,517 RM Data Scientist

Hi @jacobcybulski,

To be sure I don't code the wrong thing: you want to have this table:

and not the SVD vectors? Or one operator with two outputs, with exactly the two tables which are visible in the preprocessing model?

Best,

Martin

Dortmund, Germany

391 Unicorn

In my previous message, I was describing an output most similar to that obtained from the PCA Result to ExampleSet (thinking that this would be just a dump of the previous code). However, one operator (say, SVD Result to ExampleSet) with the two tables from the preprocessing model (Eigenvalues and SVD Vectors) would definitely be the cleanest.

Thanks -- Jacob

391 Unicorn

Just one little comment on these two tables. At the moment, the SVD vectors are called "SVD Vector N" in one table and "SVD N" (in rows) in the other. For consistency, and to make it easy to merge the two tables (after transposing the SVD vectors, if needed), could both tables use the same naming convention, say "SVD N"?
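The merge I have in mind can be sketched in plain Python as follows. The table layouts, the values and the helper `normalise` are assumptions made up for illustration; they are not taken from the actual model output:

```python
import re

# Hypothetical "Eigenvalues" table: one row per component, named "SVD N".
eigen_rows = [
    {"Component": "SVD 1", "Singular Value": 3.16},
    {"Component": "SVD 2", "Singular Value": 3.00},
]

# Hypothetical "SVD Vectors" table, stored by column: each column is named
# "SVD Vector N" and holds one loading per original attribute.
vector_columns = {
    "SVD Vector 1": {"cat": 0.00, "stock": 0.95},
    "SVD Vector 2": {"cat": 0.71, "stock": 0.00},
}

def normalise(name):
    """Map 'SVD Vector N' to 'SVD N'; leave any other name unchanged."""
    return re.sub(r"^SVD Vector (\d+)$", r"SVD \1", name)

# Transpose: each vector column becomes one row keyed by the shared name.
by_component = {normalise(col): loadings for col, loadings in vector_columns.items()}

# Join the two tables on the component name.
merged = [{**row, **by_component[row["Component"]]} for row in eigen_rows]
for row in merged:
    print(row)
```

With a shared "SVD N" convention the renaming step would disappear and the join could be done directly on the component name.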

Jacob

3,517 RM Data Scientist

Hi @jacobcybulski,

I've drafted a version which is in internal review. I hope I can publish it to the Marketplace next week.

Did you have a chance to check the version I shared earlier?

Best,

Martin

Dortmund, Germany

391 Unicorn

Have I missed something? I may be confused! I have been using the version from before our discussion, which included the PCA extractor but no SVD extractor. Was there some other one around? Was it on GitHub?

3,517 RM Data Scientist

Hi,

I've sent an email to the address you use for this account. Did it not come through?

Best,

Martin

Dortmund, Germany