Data out selected by PCA
Hello,
I have configured the process shown here, with the calculation done via TF-IDF. If I start the process, will it actually give me the output as Data_out_selected_by_pca?
I visualized the whole thing again briefly (see screen)
BR,
Flixport
(Screen)
Best Answers
varunm1 Member Posts: 1,207 Unicorn
Hello @Flixport
Did you drag and drop that data file into "inp"? If so, the process will use it as its input.
You can see that in the XML code, which shows the repository location the process reads its data from.
Regards,
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing

varunm1 Member Posts: 1,207 Unicorn
Sorry, I am a bit confused. Can you post your XML code here?
Regards,
Varun

varunm1 Member Posts: 1,207 Unicorn
I checked your process and found that the data set you are trying to apply PCA to has no numerical attributes. The PCA operator works only on numerical attributes, which is why it throws the error "zero columns found for correlation matrix".
@kayman, @yyhuang or @lionelderkrikor might be able to help with your question quoted below.
"I wanted to extract the most important words from the datasets, is there a different approach?"
Sorry, I am not an expert in text mining.
Regards,
Varun

lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi all,
There is a sample process in the "Community Samples" that selects the "most important words":
Hope this helps,
Regards,
Lionel

yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
No, you will not have nominal attributes after vectorization.
Can you take a look at the process above? My feature engineering subprocess works fine for PCA.

MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,529 RM Data Scientist
Hi @Flixport,
By definition, PCA can only work on numerical attributes. That's just part of the algorithm. If you need to use PCA, you have to find a way to convert the strings into numericals, e.g. via TF-IDF or Nominal to Numerical.
Best,
Martin
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
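To illustrate what "converting strings into numericals via TF-IDF" means, here is a minimal sketch in plain Python. It is not how RapidMiner implements vectorization: the function name `tf_idf`, the whitespace tokenization, the plain `tf * log(N / df)` weighting, and the two example texts are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Turn raw text documents into numeric TF-IDF rows (one column per word)."""
    # Naive tokenization (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        rows.append([
            (counts[word] / total) * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, rows


docs = ["oil price rises", "oil supply falls"]  # made-up example texts
vocabulary, rows = tf_idf(docs)
for word, value in zip(vocabulary, rows[0]):
    print(f"{word}: {value:.3f}")
```

Every column of the result is numeric, so an algorithm such as PCA has something to work with. A word that occurs in every document (here "oil") gets a weight of zero under this particular weighting.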
Answers
No, I did not insert anything via drag & drop. Is that necessary? I ask because I did not insert anything for Chi Square. My input data is already available as a CSV file, and the attribute values are also numeric, so unfortunately I cannot understand this.
BR
I suggest you add an operator for text vectorization. Otherwise the text data is not vectorized into TF-IDF vectors.
I used the Reuters data reut2-000 and added vectorization before feature selection. After text vectorization, we have almost 1000 attributes for the keywords; with weight by PCA and feature selection, we kept 50 attributes.
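The idea of weighting attributes by PCA and keeping only the top ones can be sketched roughly as follows. This is a plain-Python illustration under stated assumptions, not the RapidMiner operator: the function name `top_k_by_first_component` is hypothetical, only the first principal component is used, it is found by power iteration, and the tiny matrix is made up. It needs at least two rows of numeric data.

```python
import math


def top_k_by_first_component(matrix, names, k, iterations=200):
    """Rank numeric columns by their absolute loading on the first principal component."""
    n_rows, n_cols = len(matrix), len(names)
    # Mean-center every column.
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    centered = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]
    # Sample covariance matrix (n_cols x n_cols).
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n_rows - 1) for b in range(n_cols)]
        for a in range(n_cols)
    ]
    # Power iteration: repeated multiplication by cov converges to the dominant
    # eigenvector, unless the start vector happens to be orthogonal to it.
    vec = [float(j + 1) for j in range(n_cols)]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(n_cols)) for a in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # all columns are constant, nothing to rank
            break
        vec = [x / norm for x in nxt]
    weights = {name: abs(value) for name, value in zip(names, vec)}
    ranked = sorted(names, key=lambda name: weights[name], reverse=True)
    return ranked[:k], weights


# Made-up numeric data: three keyword columns, four documents.
names = ["the", "oil", "price"]
matrix = [
    [0.0, 1.0, 5.0],
    [0.0, 1.1, 1.0],
    [0.0, 0.9, 9.0],
    [0.0, 1.0, 3.0],
]
selected, weights = top_k_by_first_component(matrix, names, k=2)
print(selected)
```

A constant column such as "the" contributes no variance and ends up with weight zero, so it is the first to be dropped. With real data, the same ranking step would reduce roughly 1000 keyword attributes to the 50 with the highest weights.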
I have a sample process for text classification using 20k+ Reuters news; PM me if you need it.
Here is the process fixed for feature selection. Enjoy!