Compare 2 pdf texts

c_sabine Member Posts: 8 Contributor I
edited December 2018 in Help

Hello, 

I'm trying to create a process that compares two PDFs that are subtly different.

I process my documents (tokenize, filter stopwords, generate n-grams...) from two different files, merge the results into one common example set with the "Append" operator, and then use the "Remove Duplicates" operator to see the differences between the PDFs. Please find my process attached. I have two questions:

1) Is it possible to convert my resulting example set into a wordlist, so that the table is organized by row rather than by column?

2) Something seems to be wrong: words that occur in both files appear in the output, whereas it should show only the words that occur in one document and are absent from the other, and vice versa.

 

Thanks !

 

Sabine

Answers

  • c_sabine Member Posts: 8 Contributor I

    Please find attached a screenshot of my process. The second picture shows what is inside the two "Process Documents from Files" operators.

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    When you generate the original wordlist from each PDF, you can use the "WordList to Data" operator to create example sets of the words and their counts. You could then add a source field (with Generate Attributes or via a macro) for each PDF, and then merge/join those two datasets. That should let you see easily which words are common to both files and which ones are unique to one or the other.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
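
For readers who want to see the logic of the suggestion above outside RapidMiner, here is a minimal Python sketch of the same idea: build a word-count table per document, join the two tables on the word, and label each word with where it occurs. It is not a RapidMiner process. The function names (`word_counts`, `compare_wordlists`), the source labels `pdf_a` and `pdf_b`, and the two sample strings are all hypothetical stand-ins, and the tokenization is a plain whitespace split rather than the thread's tokenize/stopword/n-gram chain.

```python
from collections import Counter


def word_counts(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())


def compare_wordlists(counts_a, counts_b, name_a="pdf_a", name_b="pdf_b"):
    """Outer-join two word-count tables and label where each word occurs."""
    rows = []
    for word in sorted(set(counts_a) | set(counts_b)):
        in_a, in_b = word in counts_a, word in counts_b
        if in_a and in_b:
            source = "both"
        elif in_a:
            source = name_a
        else:
            source = name_b
        # One row per word: (word, count in A, count in B, source label)
        rows.append((word, counts_a.get(word, 0), counts_b.get(word, 0), source))
    return rows


# Hypothetical stand-ins for the text extracted from the two PDFs.
text_a = "the contract starts in january and ends in june"
text_b = "the contract starts in march and ends in june"

rows = compare_wordlists(word_counts(text_a), word_counts(text_b))

# Keep only the words that are unique to one document.
for word, n_a, n_b, source in rows:
    if source != "both":
        print(word, n_a, n_b, source)
```

Because the join is keyed on the word itself, a word that occurs in both documents ends up in a single row labeled `both`, even when its counts differ, so filtering on the source label leaves only the words unique to one file.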