trending words

theiman Member Posts: 2 Contributor I

I have a number of documents in a directory. Each file is a quarterly report. I would like to count the occurrences of certain words in each document and track them over time. I have done the standard preprocessing (filtering stop words, stemming, transforming case, etc.) but am unsure where to go from there. Does anybody have any ideas how I can do this? Thank you!!




  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi Tom,

    Well, there are a lot of options. Let's assume that you can identify the date of each file (e.g. from the filename, or by other means such as a specific keyword). In that case, you could use the operator "Process Documents from Files" with the parameter "Vector Creation" set to "Term Occurrences". This delivers one example per file (quarter), with some type of ID defining the time frame that row represents. The other columns contain the number of times each term occurred in the report for that period.
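
    For readers who want to see the shape of that table outside RapidMiner, here is a minimal sketch in plain Python. It is not the operator's implementation, only an illustration of "one row per quarter, one count per term". The function names, the quarter labels and the sample texts are all made up for the example, and stemming is left out.

```python
import re
from collections import Counter


def term_occurrences(text, stop_words=frozenset()):
    """Count how often each lowercase token occurs in one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


def occurrence_table(documents, stop_words=frozenset()):
    """Map each document id (here: a quarter label) to its term counts."""
    return {
        doc_id: term_occurrences(text, stop_words)
        for doc_id, text in documents.items()
    }


# Hypothetical sample data; in practice the texts would be read from the files.
reports = {
    "2011-Q1": "Revenue grew. Risk was low and revenue was stable.",
    "2011-Q2": "Risk increased; revenue fell while risk stayed high.",
}
table = occurrence_table(reports, stop_words={"and", "was", "while"})
```

    Each entry of `table` plays the role of one example (row), keyed by the time frame ID, and a term that does not occur in a quarter simply counts as 0.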

    The next step would probably be to transform the time frame ID into a real date. From there you could do several things. For example, you could identify the most common keywords and aggregate their occurrences over time. This would result in one time series per selected term, which could be plotted, for example, as a report in RapidAnalytics. Or you could simply identify the top terms in each quarter and present those, for example as tag clouds or in similar ways, also with RapidAnalytics. Or...
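
    As a rough illustration of those options, again in plain Python rather than RapidMiner, the sketch below turns a quarter label into a date, builds a series per term, and picks the top terms overall and per quarter. The label format "YYYY-Qn", the function names and the sample counts are assumptions made for the example.

```python
from collections import Counter
from datetime import date


def quarter_start(label):
    """Turn a label such as '2011-Q2' into the first day of that quarter."""
    year, quarter = label.split("-Q")
    return date(int(year), 3 * (int(quarter) - 1) + 1, 1)


def term_series(table, terms):
    """For each term, list (date, count) pairs ordered by date."""
    ordered = sorted(table, key=quarter_start)
    return {t: [(quarter_start(q), table[q][t]) for q in ordered] for t in terms}


def most_common_terms(table, n):
    """Return the n terms with the highest total count over all quarters."""
    total = Counter()
    for counts in table.values():
        total.update(counts)
    return [term for term, _ in total.most_common(n)]


def top_terms_per_quarter(table, n):
    """Return the n most frequent (term, count) pairs for every quarter."""
    return {q: counts.most_common(n) for q, counts in table.items()}


# Hypothetical term counts, one Counter per quarter.
table = {
    "2011-Q2": Counter({"risk": 2, "revenue": 1}),
    "2011-Q1": Counter({"revenue": 2, "risk": 1, "growth": 1}),
}
series = term_series(table, most_common_terms(table, 2))
```

    The `series` values are ready to be handed to any plotting tool, and the output of `top_terms_per_quarter` is the kind of input a tag cloud would need.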

    It's actually a bit hard to give more concrete suggestions without knowing the goal you want to achieve with such a presentation, but maybe this helps.

  • theiman Member Posts: 2 Contributor I
    Hi Ingo,

    Thank you!!

