Text extraction of key themes/words from a series of PDF files
I'm new to this and struggling a little bit.
I just wanted some easy (explicit) steps to help me achieve what I want to do, which is:
I have a series of reports, mostly PDFs.
- I want to extract key themes or words that recur throughout the reports, for example 'serious accident' or 'safety'
What I have done so far is to put all these files into a new repository. I have tried to use operators to read through the files, tokenise them, etc., but I'm getting lost in translation, so to speak.
- I'm not sure whether I have to convert the PDFs into Word files first, if that makes it easier to get them into RapidMiner, but that seems to defeat the whole purpose.
- I then want a document or table of these commonly occurring words so I can see how often each one is used. Later I can also use the same output to check which words are used least.
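To make the goal concrete, here is a rough Python sketch of the kind of output I'm after. It is not a RapidMiner process, only an illustration. It assumes the text has already been extracted from the PDFs into plain strings, and the names `count_terms`, `tokenize` and `STOP_WORDS` (plus the tiny stop-word list itself) are made up for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "was", "is", "on", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_terms(documents):
    """Count single words and two-word phrases across all documents.

    `documents` maps a report name to its already-extracted plain text.
    """
    counts = Counter()
    for text in documents.values():
        tokens = tokenize(text)
        # Single words, skipping stop words.
        counts.update(t for t in tokens if t not in STOP_WORDS)
        # Adjacent word pairs such as "serious accident".
        counts.update(
            f"{a} {b}"
            for a, b in zip(tokens, tokens[1:])
            if a not in STOP_WORDS and b not in STOP_WORDS
        )
    return counts


if __name__ == "__main__":
    # Made-up sample text standing in for the extracted report contents.
    reports = {
        "report1": "A serious accident was reported. Safety was reviewed.",
        "report2": "Safety training followed the serious accident.",
    }
    counts = count_terms(reports)
    ranked = counts.most_common()  # sorted from most to least frequent
    print("Most used:")
    for term, n in ranked[:5]:
        print(f"{term}\t{n}")
    print("Least used:")
    for term, n in ranked[-5:]:
        print(f"{term}\t{n}")
```

The same ranked list gives both ends of the table: the top rows are the recurring themes, and the bottom rows are the least used words.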
I would really appreciate any help or pointing me in the direction of videos that explicitly look at this.
Thanks so much!