Topic Modeling for PDF files
I want to read several PDF files (business reports) and analyze them. So far I have been using the Read Document operator, because I haven't found a better one yet.
I want to run topic modeling on the files to identify the relevant topics. Pre-processing is done with the operators Tokenize, Transform Cases, Filter Stopwords, Filter Tokens by Length and Stem. For the topic modeling itself I have found two operators: Extract Topics from Documents (LDA) and Extract Topics from Data (LDA). Unfortunately, neither of them works properly for me.
Extract Topics from Documents (LDA) needs a collection as input, and I don't know how to create one.
Extract Topics from Data (LDA) needs a text attribute, and again I don't know how to get it.
Accordingly, I have these two questions:
1) Is there an operator I can use to read in multiple PDF files?
2) What is the best operator for Topic Modeling and how do I implement it?
I have created the process below. It runs, but I only get null values as results. Does anyone have a tip for me?
Many thanks for the help.
MartinLiebig (Administrator, Moderator, Employee)
Hi,
the texts are probably empty for some reason?
BR,
Martin
- Head of Data Science Services at RapidMiner -
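One way to check the "empty texts" hint outside of RapidMiner is a small Python sketch like the one below. It is not the RapidMiner process itself: it only mimics the Tokenize, Transform Cases, Filter Stopwords and Filter Tokens by Length steps (stemming is left out) and lists the documents that have no tokens left. The stopword list, the length limits, the function names and the file names are all assumptions made up for illustration.

```python
import re

# Hypothetical, very short stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "of", "in", "a", "to", "is", "for"}


def preprocess(text, min_len=3, max_len=25):
    """Lowercase, split on non-letters, drop stopwords and tokens outside the length limits."""
    tokens = re.split(r"[^A-Za-z]+", text.lower())
    return [
        t for t in tokens
        if t and t not in STOPWORDS and min_len <= len(t) <= max_len
    ]


def report_empty(documents):
    """Return the names of documents with no tokens left after preprocessing."""
    return [name for name, text in documents.items() if not preprocess(text)]


# Hypothetical extracted texts. A scanned PDF without a text layer
# may come back as an empty or whitespace-only string.
docs = {
    "report_2019.pdf": "Revenue in the retail segment increased strongly.",
    "report_2020.pdf": "",
    "report_2021.pdf": "   \n\n  ",
}

empty = report_empty(docs)
print(empty)
```

If every document ends up in this list, the topic model has nothing to work with, which would be consistent with getting only null values as results.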