Extract data from pdf files and perform text analysis

Studentul_86 Member Posts: 11 Learner I
Hello,

I'm a new user of RapidMiner, using the free educational version for an academic paper I'm working on.
The problem is that I have not yet found a way to extract text from PDF files for text analysis in RapidMiner.
Can somebody advise me on a process for extracting text from multiple PDF files at once in RapidMiner, so that I can reach my goal of counting words?

Also, regarding sentiment analysis of texts, can somebody give me hints on free solutions for performing it in RapidMiner?

Thank you.
Best regards,

Valentin.

Best Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,509 RM Data Scientist
    Solution Accepted
    Hi,
    The Read Document operator has an option to read PDFs. You want to combine it with a Loop Files operator.

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Solution Accepted
    Hi Vali,

    I'm not sure what you are looking for, so I propose two options based on Martin's idea:

     - Process 1 (in attached file): Read Document inside a Loop Files operator, then a Process Documents operator.
     - Process 2 (in attached file): Read Document inside a Loop Files operator, then a Combine Documents operator, then a Process Documents operator.

    Tell us if one of these processes answers your request. If not, can you elaborate on what you want to achieve?

    Regards,

    Lionel
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Solution Accepted
    Hi Vali,

    It seems that your Loop Files operator is not set up correctly.
    Please import the second process (Loop_read_pdfs_documents.rmp) I shared in my previous post and, in the parameters of the Loop Files operator, set the path to the folder where your PDF files are stored.

    Regards,

    Lionel
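
As a quick check outside RapidMiner, it can help to confirm that the folder given to Loop Files really contains the PDF files you expect. The Python sketch below is only an illustration and not part of any RapidMiner process; the function name `list_pdfs` and the example folder `"."` are assumptions made for this example.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path."""
    root = Path(folder)
    if not root.is_dir():
        raise NotADirectoryError(f"not a folder: {root}")
    # Compare the extension case-insensitively so that "report.PDF" also matches.
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Hypothetical usage: replace "." with the folder that holds your PDF files.
pdfs = list_pdfs(".")
print(f"{len(pdfs)} PDF file(s) found")
for pdf in pdfs:
    print(pdf.name)
```

If this prints zero files for the folder you configured, the path or the file extension is a likely place to look first.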

Answers

  • Studentul_86 Member Posts: 11 Learner I
    Just to add that the Read Document operator I see allows only csv, xls, url, spss, stata, sparse, arff, xrff, Dbase, c4.5, dasyLab, xml, or access format files. Or perhaps I do not know where to find the PDF option?
  • Studentul_86 Member Posts: 11 Learner I
    Hello Martin,

    Thank you for the advice. I've tried, but unfortunately it looks to me as if Loop Files is meant for concatenating multiple Read Document operators. How can I import about 300 PDF files at once through a Process Documents operator?

    Thank you for your support.

    Best regards,

    Vali.
  • Studentul_86 Member Posts: 11 Learner I
    Hello Lionel,

    I've managed to create the set of documents loaded into my RapidMiner process.
    However, I now face a strange situation. None of the PDF files loaded into my process lead to a word list. In the attached files you can see the process I've designed, a really simple one. Yet the results show nothing: no words, no list of analyzed documents. What did I do wrong? This process was tested with only 5 PDF files.

    What I need to do with those roughly 300 PDF files is simple. I want to:
    - create a list of the words and their counts across the files;
    - get the length of each file (number of words);
    - get the correlation between words, for some specific terms;
    - get a set of graphical associations for those specific terms;
    etc.

    Unfortunately, I'm stuck at the very beginning of the process. I need your advice, or that of anybody else in this community.

    Thank you,

    Best regards,
    Vali.
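
To make the first three goals in the list above concrete, here is a minimal Python sketch of the same computations: word counts, document lengths, and the correlation between two terms. It is not a RapidMiner process. It assumes the text has already been extracted from the PDFs; the file names, the sample sentences, and the helper names `tokenize`, `word_counts`, and `pearson` are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(docs):
    """Return one Counter of word frequencies per document name."""
    return {name: Counter(tokenize(text)) for name, text in docs.items()}


def pearson(xs, ys):
    """Pearson correlation of two equally long lists (NaN if one is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical stand-ins for text already extracted from three PDF files.
docs = {
    "a.pdf": "Risk and return. Risk drives return.",
    "b.pdf": "Return without risk is rare.",
    "c.pdf": "Markets move.",
}

counts = word_counts(docs)

# Goal 1: word list with counts over all files.
total = sum(counts.values(), Counter())

# Goal 2: length of each file in words.
lengths = {name: sum(c.values()) for name, c in counts.items()}

# Goal 3: correlation between two chosen terms, using per-document counts.
names = sorted(docs)
risk = [counts[name]["risk"] for name in names]
ret = [counts[name]["return"] for name in names]
corr = pearson(risk, ret)

print(total.most_common(3))
print(lengths)
print(corr)
```

Correlating raw per-document counts is only one possible reading of "correlation between words"; a correlation computed on TF-IDF weights would be another reasonable choice.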
  • kayman Member Posts: 662 Unicorn
    It seems you are using the wrong output port, judging by the main process blp. The data is at out 1 (notice the color), while you connected out 2.
  • Studentul_86 Member Posts: 11 Learner I
    Hello @kayman,

    I've made the change you recommended. Now it tells me that the Loop Files operator is not working properly because there are not enough iterations. What does that mean? Please help me with what I still have to change. Attached you can find the error I'm talking about.

    Thank you,
    Vali.
  • kayman Member Posts: 662 Unicorn
    That's quite hard to judge from screenshots: your pruning might be too tough, you may lose the content somewhere else, etc.

    Your current flow works on a single PDF at a time, whereas you most likely need all of them combined to get decent TF-IDF results.

    Just try to make sure you get something in the first place. Start with a few PDFs, loop through them, combine them using the Combine Documents operator, and see if you get content.

    Then use TF-IDF on that, tuning the pruning as you go.
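
The effect of pruning on a small set of documents can be sketched in a few lines of Python. This is an illustration under stated assumptions, not RapidMiner's implementation: it uses one common TF-IDF variant (relative term frequency times the natural log of inverse document frequency), prunes by absolute document counts, and the helper names `prune_vocabulary` and `tfidf` and the sample texts are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def prune_vocabulary(doc_tokens, min_docs, max_docs):
    """Keep terms that occur in at least min_docs and at most max_docs documents.

    Returns a dict mapping each surviving term to its document frequency.
    """
    doc_freq = Counter()
    for tokens in doc_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    return {t: n for t, n in doc_freq.items() if min_docs <= n <= max_docs}


def tfidf(doc_tokens, doc_freq):
    """Return one {term: weight} dict per document for the surviving terms."""
    n_docs = len(doc_tokens)
    vectors = []
    for tokens in doc_tokens:
        tf = Counter(t for t in tokens if t in doc_freq)
        total = sum(tf.values())
        vectors.append({
            t: (c / total) * math.log(n_docs / doc_freq[t])
            for t, c in tf.items()
        })
    return vectors


# Hypothetical stand-ins for text extracted from three PDF files.
texts = ["risk and return", "return on equity", "credit risk"]
doc_tokens = [tokenize(t) for t in texts]

# Loose pruning keeps every term; strict pruning demands a term in all documents.
loose = prune_vocabulary(doc_tokens, min_docs=1, max_docs=3)
strict = prune_vocabulary(doc_tokens, min_docs=3, max_docs=3)

vectors = tfidf(doc_tokens, loose)

print(sorted(loose))
print(strict)
print(vectors[0])
```

With only three short documents, the strict setting leaves an empty vocabulary, which is the kind of result an overly tough pruning threshold can produce on a handful of test files.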