Text extraction of key themes/words from series of pdf files

pimlico35 · November 2021

Hi Folks,

I'm new to this and struggling a little bit.

I just wanted some easy (explicit) steps to help me achieve what I want to do, which is:

I have a series of reports, mostly PDFs:

- I want to extract key themes or words that recur throughout the reports, for example 'serious accident' or 'safety'

What I have done so far is to put all these files into a new repository. I have tried to use operators to read through the files, tokenise them, etc., but I'm getting lost in translation, so to speak.

- I'm not sure whether I have to convert the PDFs into Word files first, if that makes it easier to get them into RapidMiner, but that seems to defeat the whole purpose...

- I then want a document or table of these commonly occurring words so I can see how often they are used. Later, I can also use the same output to check the least used words.

I would really appreciate any help or pointing me in the direction of videos that explicitly look at this.

Thanks so much!
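
For anyone comparing against a scripted approach, the counting step described above can be roughly sketched in plain Python. This is only an illustration, not a RapidMiner process. It assumes the text has already been extracted from each PDF (that step is not shown here), and the `reports` list, the `STOP_WORDS` set, and the `tokenize` and `term_frequencies` helpers are made-up names for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "was", "is", "on", "for"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequencies(documents, ngram=1):
    """Count n-grams across all documents, skipping any that contain a stop word."""
    counts = Counter()
    for text in documents:
        tokens = tokenize(text)
        for i in range(len(tokens) - ngram + 1):
            gram = tokens[i:i + ngram]
            if any(tok in STOP_WORDS for tok in gram):
                continue
            counts[" ".join(gram)] += 1
    return counts


# Placeholder strings standing in for text already extracted from each PDF.
reports = [
    "A serious accident was reported. Safety procedures were reviewed.",
    "Safety training reduced the risk of a serious accident.",
]

words = term_frequencies(reports, ngram=1)    # single words, e.g. 'safety'
phrases = term_frequencies(reports, ngram=2)  # two-word phrases, e.g. 'serious accident'

# Print a small frequency table, most common terms first.
for term, count in words.most_common(5):
    print(f"{term}\t{count}")
```

The same `Counter` also gives the least used words: `words.most_common()` returns every term sorted from most to least frequent, so the end of that list holds the rarest ones.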


Altair RapidMiner Community
