PDF Table extraction into data

robin Member Posts: 100 Guru
edited November 2019 in Help

Extract PDF Tables seems to be a relatively new extension, and I do not see much discussion of it.

I have multiple PDF documents from which I need to extract the data contained in the tables. The operator outputs an IOObject collection. Because some tables are nested inside other tables, the example sets in the collection do not share a uniform structure, so I cannot use the Append operator.

I am also stumped as to how to convert this collection into data so that I can use the other available operators to clean it.

What is the best practice for using this operator?

@sgenzer do I get a free shirt?

Best Answer

  • robin Member Posts: 100 Guru
    Solution Accepted
    The new PDF operator solves this issue.

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,517 RM Data Scientist

    Dear Robin,

     

    Please have a look at either the Select operator, which picks a single example set out of a collection, or the Loop Collection operator, which loops over the whole collection. You can also use the latter to store each element of your collection under a different name.

     

    Cheers,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • ey Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 21 RM Research

    Hi @robin

    Good to hear that you are using the PDF Table Extraction extension. This operator delivers data tables by using edge detection and extraction algorithms that try to locate the regions on a page where a data table "might" exist. They have been tested to work best on clearly defined tabular structures, meaning tables whose structure is not too complicated (no nested tables in headers and rows) and documents that are not filled with rich graphics. The internal calibration mechanism can produce a lot of tables (some or many of them may be noisy), but at least one of these is a good representation of a data table in your PDF. The operator was never meant for overly complex table structures, but if you can share more about what your tables look like, I could add them as test cases for future releases.

     

    As @mschmitz suggested, you can use the Select or Loop Collection operators to pull the right tables out of the collection and then "adjust" them. For example, Rename by Example Values can come in handy for creating a header from an example, and regular expressions in the Replace operators can help with cleansing. Finally, since this extension is developed as part of a public project, I would appreciate it if you could share, in general terms, what kind of data you are finding in your PDF documents. This feedback helps us understand PDF data targets better, since PDF is the most widely used format for storing data tables after HTML documents.
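
    For readers who want to see the "adjust" step spelled out, here is a minimal sketch in plain Python rather than RapidMiner operators. It only illustrates the logic of promoting one example to a header and cleaning cells with regular expressions. The function names (promote_header, clean_cell, tidy_table) and the sample rows are made up for illustration and are not part of the extension.

```python
import re


def promote_header(rows):
    """Turn the first row into column names and return the rest as dicts."""
    header, *body = rows
    names = []
    for i, cell in enumerate(header):
        # Normalise the header text: trim, lower-case, whitespace -> underscore.
        name = re.sub(r"\s+", "_", cell.strip()).lower()
        names.append(name or f"col_{i}")  # fall back for empty header cells
    return [dict(zip(names, row)) for row in body]


def clean_cell(value):
    """Collapse stray whitespace and drop thousands separators in numbers."""
    value = re.sub(r"\s+", " ", value).strip()
    if re.fullmatch(r"-?\d{1,3}(,\d{3})+(\.\d+)?", value):
        value = value.replace(",", "")
    return value


def tidy_table(rows):
    """Promote the header row, then clean every cell of the remaining rows."""
    return [
        {name: clean_cell(cell) for name, cell in row.items()}
        for row in promote_header(rows)
    ]


# Hypothetical raw table, as a noisy extraction might deliver it.
raw = [
    [" Item ", "Unit  Price"],
    ["Widget", " 1,234.50 "],
    ["Gadget", "12.00"],
]
print(tidy_table(raw))
```

    The cleaning rules here are deliberately narrow. Real PDFs will likely need patterns tailored to the data they contain.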

     

    Best Regards,

    Edwin

     

    PS: There is a blog post on this extension, in case you have not seen it already: https://community.rapidminer.com/t5/Community-Blog/PDF-Table-Extraction-Extension-Released/ba-p/37490

  • robin Member Posts: 100 Guru

    Hi Edwin

    I know the blog very well; I have also used the process included with it in an attempt to read the PDF data into SQL.

    My problem to date has been that the data is in multiple example sets when extracted from the PDF. For some reason I am unable to select the correct example sets and marry them together.

     

    The Select operator only lets me select a single example set, which is not what I require. I need to select every second example set and join them together. The Loop Collection operator returns the same results as the actual PDF extraction; maybe I am missing something that I should be placing inside this operator.
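
    The selection described above (keep every second example set, then join the kept sets even though their columns differ) can be sketched in plain Python. This is an illustration of the logic only, not the extension's API or a RapidMiner process. It assumes each example set can be represented as a list of dicts, and the names every_second and append_with_union are hypothetical.

```python
def every_second(collection, start=0):
    """Keep every second table, beginning at index `start` (0 or 1)."""
    return collection[start::2]


def append_with_union(tables, fill=None):
    """Stack tables with differing columns by taking the union of columns.

    Cells that a table does not provide are filled with `fill`.
    """
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)  # preserve first-seen column order
    combined = [
        {name: row.get(name, fill) for name in columns}
        for table in tables
        for row in table
    ]
    return columns, combined


# Hypothetical collection: useful tables alternate with noisy ones.
collection = [
    [{"id": 1, "name": "a"}],
    [{"junk": "x"}],
    [{"id": 2, "amount": 5}],
    [{"junk": "y"}],
]
columns, combined = append_with_union(every_second(collection))
print(columns)
print(combined)
```

    Taking the union of columns is one way around the "non-uniform output" problem that blocks a plain append. Whether missing cells should stay empty or be filled with a default depends on what the downstream cleaning expects.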

     

    I am happy to share the detail of what is contained in the PDF extraction process but won't be able to do this on a public forum. Where can I make contact with you?

     

     
