Answers
As usual, you cannot say how things will work before you have tested them. X-Validation is mostly applied to get a performance estimate for classification or regression tasks. Clusterings might be evaluated using cluster measures. You can optimize the number of dimensions by iterating over this parameter and testing every value. There are operators for this, all starting with "Optimize Parameters".
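Outside RapidMiner, the same idea (sweep the number of dimensions, score each setting with k-fold cross-validation, keep the best) can be sketched in plain Python. The `evaluate` callback here is hypothetical, standing in for training and testing whatever model you use:

```python
def kfold_indices(n, k):
    # Split indices 0..n-1 into k contiguous folds whose sizes differ by at most 1.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds

def sweep_dimensions(dims, evaluate, n, k=10):
    # evaluate(train_idx, test_idx, d) -> score is a hypothetical callback
    # that reduces the data to d dimensions, trains, and returns test accuracy.
    best_d, best_score = None, float("-inf")
    for d in dims:
        folds = kfold_indices(n, k)
        scores = []
        for i, test_idx in enumerate(folds):
            # Train on all folds except fold i (the held-out test fold).
            train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
            scores.append(evaluate(train_idx, test_idx, d))
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_d, best_score = d, avg
    return best_d, best_score
```

This is exactly what the Optimize Parameters operators automate when wrapped around an X-Validation.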
But anyway, I doubt that SVD will work very well on text datasets, simply because computing the Singular Value Decomposition of the huge matrices that frequently occur in text mining might take far too long.
Greetings,
Sebastian
There seems to be a lot of literature about the use of SVD with text, but indeed the time might be prohibitive. Is there a way in RM to get the singular values themselves? (I have read that one can plot their squares and see where they level off to determine the best number of dimensions.)
Thanks!
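For reference, the scree heuristic mentioned above (plot the squared singular values and look for where they level off) is easy to reproduce outside RapidMiner with NumPy. Here the cutoff is picked automatically as the smallest k covering a chosen fraction of the total squared singular values; the 95% threshold and the toy matrix are illustrative, not from the thread:

```python
import numpy as np

# Toy stand-in for a term-document matrix; in practice this would be
# the (much larger) word-vector ExampleSet.
A = np.array([[4.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 0.1],
              [0.0, 0.0, 0.0]])

s = np.linalg.svd(A, compute_uv=False)       # singular values, descending
energy = s ** 2                              # squared singular values ("scree" heights)
cum = np.cumsum(energy) / energy.sum()       # cumulative explained fraction
k = int(np.searchsorted(cum, 0.95) + 1)      # smallest k covering 95% of the energy
```

A useful sanity check: the squared singular values always sum to the squared Frobenius norm of the matrix, so `energy.sum()` equals `(A ** 2).sum()`.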
I'm not sure about this, but aren't they displayed in the model of the Singular Value Decomposition?
Greetings,
Sebastian
As I see now, all this information is discarded in RapidMiner and hence currently cannot be shown in the visualization of the preprocessing model. If you take a look at the result of the PCA, it is indeed possible to use these values for display. I guess I could add that relatively easily, but this week and next there won't be time for it; I'm very busy with customer projects which, by the way, are about text mining. So I'm curious about the runtime of SVD with many features and the amount of memory it needs. As far as I can see, it needs the complete matrix, so it will crash with my roughly 40,000 word attributes. Have you had any experience with that? What about the classification performance: is it worth implementing a special SVD for sparse matrices?
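To put the memory concern in numbers, a rough back-of-the-envelope sketch (the corpus size and the non-zeros per document are assumed for illustration, not taken from the thread):

```python
# Hypothetical corpus: 100,000 documents x 40,000 word attributes.
docs, terms = 100_000, 40_000

# Dense storage: one 8-byte double per cell of the full matrix.
dense_bytes = docs * terms * 8

# Sparse (COO-style) storage: assume ~100 distinct terms per document,
# each non-zero costing an 8-byte value plus two 4-byte indices.
nnz_per_doc = 100
sparse_bytes = docs * nnz_per_doc * (8 + 4 + 4)

dense_gib = dense_bytes / 2**30    # roughly 30 GiB
sparse_gib = sparse_bytes / 2**30  # well under 1 GiB
```

Under these assumptions the dense matrix needs about 200 times the memory of the sparse one, which is why a dense SVD implementation crashes long before a sparse pipeline would.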
Greetings,
Sebastian
Regarding your question, "is it worth implementing a special SVD for sparse matrices?":
I think absolutely yes, it is worth it for sure. A couple of days ago I tested the SVD operator on my text dataset with 23,000 features on a relatively high-performance machine. The algorithm took 10 hours to finish!
As far as I know, the LSA approach tackles this problem: it works with a low-rank approximation of the term-document matrix, which is much smaller than the original matrix. So I kindly suggest that the RM team try to embed an LSA operator in RM.
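As a sketch of why a sparse implementation pays off: truncated SVD (the core of LSA) only ever needs matrix-vector products, and on a sparse term-document matrix each product costs time proportional to the number of non-zero entries rather than rows times columns. A minimal pure-Python power iteration for the largest singular value, with the matrix stored as a `{(row, col): value}` dict, illustrates the idea (the dict layout is just for illustration; real implementations use optimized sparse formats):

```python
import math
import random

def matvec(A, x, n_rows):
    # y = A @ x, touching only the non-zero entries of the sparse dict A.
    y = [0.0] * n_rows
    for (i, j), v in A.items():
        y[i] += v * x[j]
    return y

def rmatvec(A, x, n_cols):
    # y = A.T @ x, again O(number of non-zeros).
    y = [0.0] * n_cols
    for (i, j), v in A.items():
        y[j] += v * x[i]
    return y

def top_singular_value(A, n_rows, n_cols, iters=200, seed=0):
    # Power iteration on A.T @ A; converges to the top right singular vector.
    rng = random.Random(seed)
    v = [rng.random() for _ in range(n_cols)]
    for _ in range(iters):
        w = rmatvec(A, matvec(A, v, n_rows), n_cols)   # (A.T A) v
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = matvec(A, v, n_rows)
    return math.sqrt(sum(x * x for x in Av))           # sigma_1 = ||A v||
```

Extending this to the top k singular triplets (with deflation or a block method) is what dedicated sparse/truncated SVD routines do, and it is why LSA on a 23,000-feature dataset need not take 10 hours.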
cheers.