WordList (Process Documents from Data): word count

mlubicz Member, University Professor Posts: 17 University Professor
Using the Process Documents from Data operator, we get as the word list a table with the list of words and their Total Occurrences and Document Occurrences.
However, in the sample process "Applying a Model to categorize Documents" (under RM Academy) we also get additional columns for classes/categories; in that process there are 2 columns, named unknown and food/beverage/hospitality.
When you use WordList to Data, the columns are labelled inclass (unknown) etc.
I get all zero values in both columns, no matter which vector creation method I use (I use Term Occurrences). What should be changed to get the words counted for both classes?
Thank you.


    Knut-RM Administrator, Employee, Member, University Professor Posts: 113 Administrator
    Hi @mlubicz,
    I have been running the process you were referring to (assuming this is the one) and I haven't been able to reproduce the issue. Can you share your process or send a screenshot? See details on how to do this here: https://community.rapidminer.com/discussion/37047

    Did you watch the related video? https://academy.rapidminer.com/learn/video/applying-a-model-to-categorize-documents
    Thanks, Knut
    Knut-RM Administrator, Employee, Member, University Professor Posts: 113 Administrator
    Hi @mlubicz
    I finally found the time to look into it. The "0" values are caused by the "Extract content" operator in "Process Documents from Data". Go into the Parameters of that operator and untick the first entry, called "extract content". If you do that and run the process again, you will see that the columns get populated and show the total occurrences for each of the two classes ("unknown" and "food/beverage..."). That output could be used, for example, to generate a custom pruning mask to reduce the data of the class which is not of interest, but I guess there are also other creative options.
    You are now probably wondering why the extract content operator is causing the empty values and my answer is: I don't know. But without having more details I'd say it feels like a bug to me so I will send this to our developers. Hope this helps!
    Cheers, Knut
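
To make concrete what the per-class columns should hold once they are populated, here is a minimal Python sketch. It is not RapidMiner code and does not reproduce the operator's internals: the tokenization (lowercase, split on whitespace), the sample documents, and the helper names `build_wordlist` and `pruning_mask` are all assumptions made for illustration. Only the column label `inclass (...)` follows the naming mentioned in the question. The second function shows one possible reading of the "custom pruning mask" idea, keeping words that occur more often in the class of interest than in the other class.

```python
from collections import Counter, defaultdict


def build_wordlist(docs):
    """Build a word-list table from (text, label) pairs.

    Returns a dict mapping each word to a row with its total occurrences,
    document occurrences, and one 'inclass (<label>)' count per class.
    """
    total = Counter()               # occurrences over all documents
    doc_occ = Counter()             # number of documents containing the word
    in_class = defaultdict(Counter)  # occurrences per class label

    for text, label in docs:
        tokens = text.lower().split()  # assumed, very simple tokenization
        total.update(tokens)
        doc_occ.update(set(tokens))    # count each word once per document
        in_class[label].update(tokens)

    classes = sorted(in_class)
    return {
        word: {
            "total": total[word],
            "documents": doc_occ[word],
            **{f"inclass ({c})": in_class[c][word] for c in classes},
        }
        for word in total
    }


def pruning_mask(wordlist, keep_class, other_class):
    """Hypothetical mask: words more frequent in keep_class than in other_class."""
    keep_col = f"inclass ({keep_class})"
    other_col = f"inclass ({other_class})"
    return {w for w, row in wordlist.items() if row[keep_col] > row[other_col]}


# Made-up sample documents, not taken from the Academy process.
sample_docs = [
    ("fresh coffee and fresh bread", "food/beverage/hospitality"),
    ("quarterly report and budget", "unknown"),
    ("coffee budget", "unknown"),
]

wordlist = build_wordlist(sample_docs)
for word, row in sorted(wordlist.items()):
    print(word, row)
print(sorted(pruning_mask(wordlist, "food/beverage/hospitality", "unknown")))
```

If every `inclass (...)` value in such a table is zero while the totals are not, the label is not reaching the counting step, which is consistent with the behaviour described above.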