Extract data from pdf files and perform text analysis

Studentul_86 · December 2020

Hello,

I'm a new user of RapidMiner, using the free educational version for an academic paper I'm working on.
The problem is that, so far, I have not found any way to extract data from PDF files for text analysis in RapidMiner.
Can somebody advise me on a process, or on how I can extract text from multiple PDF files at once in RapidMiner, so that I can reach my goal of counting words?

Also, regarding sentiment analysis of texts, can somebody give me hints on free solutions in RapidMiner for performing it?

Thank you.
Best regards,

Valentin.

MartinLiebig · December 2020

Hi,

The Read Document operator has an option to read PDFs. You want to combine this with a Loop Files operator.
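
As a rough analogy outside RapidMiner, the Loop Files plus Read Document pattern amounts to visiting every PDF in a folder and applying a reader to each one. Below is a minimal Python sketch of that idea. The names `read_all` and `reader` are hypothetical, and the actual PDF text extraction is deliberately left as a pluggable function rather than tied to any particular library.

```python
from pathlib import Path
from typing import Callable, Dict


def read_all(
    folder: str,
    reader: Callable[[Path], str],
    pattern: str = "*.pdf",
) -> Dict[str, str]:
    """Apply `reader` to every file in `folder` whose name matches `pattern`.

    `reader` is a placeholder (an assumption of this sketch) for whatever
    extracts the text of one file, i.e. the role Read Document plays
    inside Loop Files.
    """
    texts = {}
    # Sorting only makes the iteration order reproducible.
    for path in sorted(Path(folder).glob(pattern)):
        texts[path.name] = reader(path)
    return texts
```

The result maps each file name to its extracted text, which is roughly the collection of documents that a later word-counting step would consume.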

Best,

Martin

lionelderkrikor · December 2020

Hi Vali,

I'm not sure what you are looking for, so I propose two options based on Martin's idea:

- Process 1 (in the attached file): Read Document inside a Loop Files operator, then a Process Documents operator.
- Process 2 (in the attached file): Read Document inside a Loop Files operator, then a Combine Documents operator, then a Process Documents operator.

Tell us if one of these processes answers your request. If not, can you elaborate on what you want to achieve?

Regards,

Lionel

lionelderkrikor · December 2020

Hi Vali,

It seems that your Loop Files operator is not set up correctly.
Please import the second process (Loop_read_pdfs_documents.rmp) that I shared in my previous post and, in the parameters of the Loop Files operator, set the path where your PDF files are stored.

Regards,

Lionel

Studentul_86 · December 2020

Just to add that the Read Document operator I see allows only csv, xls, url, spss, stata, sparse, arff, xrff, Dbase, c4.5, dasyLab, xml, or access format files. Or perhaps I just do not know where to find the PDF option?

Studentul_86 · December 2020

Hello Martin,

Thank you for the advice. I've tried it, but unfortunately it looks to me as if Loop Files is meant for cases where I want to concatenate multiple Read Document operators. What is the solution for importing about 300 PDF files at once through a Process Documents operator?

Thank you for your support.

Best regards,

Vali.

Studentul_86 · December 2020

Hello Lionel,

I've managed to create the set of documents loaded into my RapidMiner process.
However, I am now facing a strange situation: none of the PDF files loaded into my process lead to a word list.

In the attached files you can see the process I've designed, a really simple one. Yet the results show nothing: no words, no list of analyzed documents. What did I do wrong? This process was tested with only 5 PDF files.

What I need to do with those roughly 300 PDF files is simple. I want to:
- create a list of the words and their counts in the files;
- get the length of each file (number of words);
- get the correlation between words, for some specific terms;
- get a set of graphical associations for those specific terms;
etc.
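
To make the first three goals concrete, here is a minimal sketch in plain Python, independent of RapidMiner. It assumes the text has already been extracted from the PDFs; the `docs` dictionary, its file names, and its sentences are invented stand-ins, and `tokenize` and `term_correlation` are hypothetical helper names. Correlation is computed here as the Pearson correlation of two terms' per-document counts, which is one reasonable reading of "correlation between words" and not necessarily the measure a RapidMiner operator would use.

```python
import math
import re
from collections import Counter

# Invented stand-ins for text already extracted from the PDF files.
docs = {
    "a.pdf": "Risk and return: higher risk, higher return.",
    "b.pdf": "Return on equity and market risk.",
    "c.pdf": "Market structure and equity markets.",
}


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as words."""
    return re.findall(r"[a-z]+", text.lower())


tokens = {name: tokenize(text) for name, text in docs.items()}

# Word counts per document and over the whole collection.
per_doc_counts = {name: Counter(toks) for name, toks in tokens.items()}
total_counts = sum(per_doc_counts.values(), Counter())

# Document length, measured in words.
doc_lengths = {name: len(toks) for name, toks in tokens.items()}


def term_correlation(term_a, term_b):
    """Pearson correlation of two terms' counts across the documents."""
    xs = [per_doc_counts[name][term_a] for name in docs]
    ys = [per_doc_counts[name][term_b] for name in docs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant series; this sketch reports 0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


print(total_counts.most_common(3))
print(doc_lengths)
print(term_correlation("risk", "return"))
```

The graphical associations are not covered by this sketch; the pairwise correlations could serve as the input to whatever plotting step is chosen.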

Unfortunately I'm stuck at the very beginning of the process. I need your advice, or that of anybody else in this community.

Thank you,

Best regards,
Vali.

kayman · December 2020

It seems you are using the wrong output port, if I look at the main process blp. The data is at out 1 (notice the color), while you have connected out 2.

Studentul_86 · December 2020

Hello @kayman,

I've made the change you recommended. Now it tells me that the Loop Files operator is not working properly because there are not enough iterations. What does that mean? Please help me with what I still have to change. Attached you can find the error I'm talking about.

Thank you,
Vali.

kayman · December 2020

That's quite hard to tell from screenshots: your pruning might be too tough, you may lose the content somewhere else, etc.

Your current flow works on a single PDF at a time, whereas you most likely need all of them combined to get some decent tfidf results.

Just try to ensure you get something in the first place. Loop through the PDFs, combine them, and see if you get results. Start with a few, combine these using the Combine Documents operator, and check whether you get any content at all.

Then use tfidf on that one, tuning the prune on the go.
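
To illustrate why overly strict pruning can leave an empty word list, here is a minimal Python sketch of TF-IDF over a combined set of documents with pruning by document frequency. Everything in it is an assumption made for illustration: the function name `tfidf`, the `min_df` and `max_df` parameters, the sample texts, and the textbook formula (term frequency times log of inverse document frequency), which is not necessarily the exact formula or pruning scheme RapidMiner applies.

```python
import math
import re
from collections import Counter


def tfidf(docs, min_df=1, max_df=None):
    """Return (vocabulary, per-document TF-IDF weights).

    Terms appearing in fewer than `min_df` or more than `max_df`
    documents are pruned before the weights are computed.
    """
    tokens = {
        name: re.findall(r"[a-z]+", text.lower())
        for name, text in docs.items()
    }
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))

    if max_df is None:
        max_df = n_docs
    vocab = sorted(t for t, d in df.items() if min_df <= d <= max_df)

    weights = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty documents
        weights[name] = {
            t: (counts[t] / total) * math.log(n_docs / df[t])
            for t in vocab
            if counts[t]
        }
    return vocab, weights


# Invented sample texts standing in for the combined PDF content.
sample = {
    "a.pdf": "risk and return",
    "b.pdf": "market risk",
    "c.pdf": "market structure",
}

vocab_all, _ = tfidf(sample)               # no pruning
vocab_mild, _ = tfidf(sample, min_df=2)    # keep terms found in 2+ documents
vocab_strict, _ = tfidf(sample, min_df=3)  # too strict: no term survives
print(vocab_all, vocab_mild, vocab_strict)
```

With only a handful of documents, even a modest threshold removes most terms, so it helps to start without pruning and tighten it step by step.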
