Scan index files of books for important terms

Legacy User Member Posts: 0 Newbie
Hi there!

I'm not sure if this is the right forum to post this problem, but I hope you guys can help me.

The scenario is: We have a lot of index files in RTF format, like the glossaries at the end of an academic book.
We want to analyze which words and expressions occur the most and as such are the most important in this field of study.

I know that it is easy with RapidMiner to count all tokens in these files, but often the expressions are a combination of two or even more words, which you can only detect if you look at the text layout, like:

user 154-167
    behaviour 178-190
    goal 32-38

You get what I mean? I'm not sure whether this problem is solvable with RapidMiner, and in particular not HOW. Can you give me some advice, either on RapidMiner or on another tool that can help me with that?

Thank you very much!


  • Legacy User Member Posts: 0 Newbie
    No idea for this? Anyone?
    Thought this should be possible with RapidMiner...  :-\
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

    It is, but does this help you? We have done something very similar, and it involved a heavy load of information extraction from the structured files, which can be a real pain when a lot of the information sits in the layout. So if you want me to actually show you an out-of-the-box process doing this: I have a price tag stuck to my back somewhere  ;)

    Seriously, this might turn out to be a hard task, depending on the set of files you are analyzing and how different they are. You can actually learn those dependencies (we had a master's thesis about that at my former department), but this can quickly become a multi-month project. So if you are interested (we certainly are), please contact Rapid-I directly.
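
    For the simplest case only, where the layout is nothing more than indentation as in the example from the question, a rough sketch outside RapidMiner could look like the following. It is plain Python, not a RapidMiner process, and it rests on assumptions: the RTF files have already been converted to plain text by some other tool, sub-entries are indented with spaces or tabs under their head term, and page references sit at the end of the line. The names ENTRY, expressions and count_expressions are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape: optional indentation, a term, then optional page
# references such as "154-167" or "3, 7-9" at the end of the line.
ENTRY = re.compile(
    r"^(?P<indent>\s*)(?P<term>\S.*?)"
    r"(?:[\s,]+(?P<pages>\d+(?:[-–]\d+)?(?:,\s*\d+(?:[-–]\d+)?)*))?\s*$"
)


def expressions(lines):
    """Yield one full expression per index line, prefixing each
    sub-entry with the terms of the entries it is indented under."""
    stack = []  # (indent width, term) for the current chain of parents
    for line in lines:
        match = ENTRY.match(line.expandtabs(4))
        if not match:
            continue  # blank line
        indent = len(match["indent"])
        # Drop entries that are at the same depth or deeper than this one.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        stack.append((indent, match["term"]))
        yield " ".join(term for _, term in stack)


def count_expressions(indexes):
    """Count how often each expression occurs across several indexes,
    where each index is an iterable of text lines."""
    counts = Counter()
    for lines in indexes:
        counts.update(expressions(lines))
    return counts


sample = [
    "user 154-167",
    "    behaviour 178-190",
    "    goal 32-38",
]
print(list(expressions(sample)))
# ['user', 'user behaviour', 'user goal']
```

    Passing one list of lines per book to count_expressions(...) and calling most_common() on the result gives a ranking of expressions. Real indexes with wrapped lines, "see also" references, or indentation stored as RTF paragraph formatting rather than as whitespace would break these assumptions, which is where the harder extraction work starts.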

    Sorry for not having better news,