Transforming output from Process Docs to create a word list/document
We have a challenge to create word/tag clouds from a database system...
Easy, I thought: create a table with the first column being the Document ID, a second column for the word, and a third column for the count of that word in the document (we probably wouldn't use the third column, but just in case). That way we could build a word cloud very quickly no matter which subset of documents the user selects.
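To show why that layout appeals to me, here is a minimal sketch in plain Python of how the word cloud weights would fall out of such a table. The document IDs, words, and the `cloud_weights` helper are all made up for illustration; in practice this would more likely be a query against the database table.

```python
from collections import Counter

# Made-up rows in the proposed layout: (document ID, word, count in that document).
word_table = [
    (101, "invoice", 3),
    (101, "payment", 1),
    (102, "invoice", 2),
    (103, "refund", 4),
]

def cloud_weights(table, selected_ids):
    """Sum the per-document counts of each word over the selected documents."""
    totals = Counter()
    for doc_id, word, count in table:
        if doc_id in selected_ids:
            totals[word] += count
    return totals

# The user picks documents 101 and 102; document 103 is ignored.
weights = cloud_weights(word_table, {101, 102})
print(weights.most_common())
```

Because each row is one document/word pair, any subset of documents only needs a filter and a sum.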
So I set up the process in RapidMiner: it reads the records from the database, keeping only the Document ID and the full-text field, and passes them through the Process Documents operator (tokenise, transform case, filter stop words, filter tokens, stem)... Job done...
Unfortunately no... and here is my problem.
The data that comes out of the Process Documents operator has the Document ID as the first column, but every word that is found then becomes the name of one of the remaining columns, so I get one column per word rather than one row per document/word pair. I have looked at Transpose and Pivot, but neither of these does what I need...
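To make the reshaping concrete, this is roughly the transformation I am after, sketched in plain Python rather than in RapidMiner. The sample rows, the `doc_id` column name, and the `wide_to_long` helper are made up for illustration, and I am assuming the cell values are per-document word counts.

```python
# Hypothetical wide output: one row per document, one column per word.
wide_rows = [
    {"doc_id": 1, "cloud": 2, "word": 1, "tag": 0},
    {"doc_id": 2, "cloud": 0, "word": 3, "tag": 1},
]

def wide_to_long(rows, id_column="doc_id"):
    """Turn one-column-per-word rows into (doc_id, word, count) triples."""
    long_rows = []
    for row in rows:
        doc_id = row[id_column]
        for word, count in row.items():
            # Skip the ID column itself and words that do not occur in this document.
            if word == id_column or count == 0:
                continue
            long_rows.append((doc_id, word, count))
    return long_rows

long_rows = wide_to_long(wide_rows)
for triple in long_rows:
    print(triple)
```

Dropping the zero cells matters here, since most words will not appear in most documents and the long table would otherwise be mostly empty rows.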
We did think about saving the output as CSV and then reshaping it outside RapidMiner, but that would make it a manual process rather than something I can automate to run hourly on new records.
Any thoughts or ideas will be most appreciated.