[SOLVED] enhance accuracy of PDF classifier

TheBenTheBen Member Posts: 11 Contributor II
edited November 2018 in Help
1. Describe what you are doing
-> I want to classify PDF documents into multiple categories based on their text content.

2. If you are working with data, give a detailed description of your data (number of examples and attributes, attribute types, label type etc.).
-> There is a database of already classified PDF documents: hundreds of scanned PDFs with OCR text. There are about 150 categories (customers A, B, C, ...; topics such as printer, monitor, laptop, etc.).
The PDF content is extracted from the file. Then the text is cut off after 2000 characters, tokenized, transformed to lower case, stopwords are removed, and the tokens are filtered by length. I still end up with about 4000 attributes...
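As a rough illustration (outside of RapidMiner), the preprocessing chain described above can be sketched in plain Python; the stopword list, length bounds, and tokenization rule here are made up for the example:

```python
# Sketch of: truncate to 2000 characters, tokenize, lowercase,
# remove stopwords, filter tokens by length.
# STOPWORDS and the length bounds are illustrative, not RapidMiner's defaults.
import re

STOPWORDS = {"the", "a", "an", "and", "or", "is", "of", "to", "in"}  # tiny example list

def preprocess(text, max_chars=2000, min_len=3, max_len=25):
    text = text[:max_chars]                      # cut off after 2000 characters
    tokens = re.findall(r"[A-Za-z]+", text)      # naive tokenization on letters
    tokens = [t.lower() for t in tokens]         # transform to lower case
    tokens = [t for t in tokens if t not in STOPWORDS]          # remove stopwords
    return [t for t in tokens if min_len <= len(t) <= max_len]  # filter by length

print(preprocess("The printer is printing an ERROR page to the Monitor"))
# → ['printer', 'printing', 'error', 'page', 'monitor']
```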

3. Describe which results or actions you are expecting.
-> The process should reach an accuracy of 98% or above when classifying the PDF files.
-> Training should not take several seconds per PDF document.

4. Describe which results you actually get.
-> I get an accuracy on the training examples of about 70-80% if I use SVM with default values (my knowledge of SVM parameters is limited).
-> I get an accuracy of about 85% if I use Naive Bayes, and it learns very fast.
-> I have a performance issue if I use models other than SVM or Naive Bayes. Training a neural net model with 4000 attributes takes 1 second per PDF file... (Core i7).

-- How do I deal with so many tokens/attributes from PDF file content?
-- How do I adjust the parameters of the SVM to improve accuracy on tokenized text?
-- How do I use a neural net model in this case (adjust learning rates etc.)?


    TheBenTheBen Member Posts: 11 Contributor II
OK, here is what I have found:

- the "Process Documents" operator has a prune method. You can use it to reduce the number of attributes based on their statistical occurrence.
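The idea behind percentual pruning can be sketched in plain Python: keep only tokens whose document frequency falls between a lower and an upper percentage. This mirrors the concept, not the operator itself; the function name and example data are made up:

```python
# Hypothetical sketch of percentual pruning: drop tokens that occur in fewer
# than `below` percent or more than `above` percent of all documents.
def prune_vocabulary(tokenized_docs, below=1.0, above=50.0):
    n = len(tokenized_docs)
    df = {}
    for doc in tokenized_docs:
        for tok in set(doc):                 # document frequency: count each token once per doc
            df[tok] = df.get(tok, 0) + 1
    lo, hi = n * below / 100.0, n * above / 100.0
    return {t for t, c in df.items() if lo <= c <= hi}

docs = [["printer", "error"], ["printer", "laptop"],
        ["monitor", "invoice"], ["printer", "monitor"]]
# "printer" appears in 3 of 4 docs (75%) and is pruned as too frequent:
print(sorted(prune_vocabulary(docs, below=25.0, above=60.0)))
# → ['error', 'invoice', 'laptop', 'monitor']
```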

    - the "Replace" operator can be used to cut the text after a given number of characters. Use a regular expression:
    ->replace what: (?s)(.{number of characters}).*
    ->replace by: $1
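The same regex works outside RapidMiner as well; here it is applied in Python, with 20 standing in for the character limit just for the demonstration:

```python
# Truncate a string to its first 20 characters using the regex from above.
# (?s) makes "." also match newlines; group 1 captures the kept prefix.
import re

text = "This is a long document body that should be cut."
cut = re.sub(r"(?s)(.{20}).*", r"\1", text)
print(cut)  # → "This is a long docum"
```

Note that if the text is shorter than the limit, the pattern does not match and the text is left unchanged, which is exactly what you want here.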

    - filter tokens by length
    - filter stopwords
    - filter verbs (because in most cases they carry no relevant information for classification). Solution 1: use the "Filter Tokens (by POS Tags)" operator, or solution 2: use "Filter Stopwords (Dictionary)" with a text file that contains all the verbs you want to filter.
    - stemming (but this can result in loss of crucial information)
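The stemming caveat in the last bullet can be shown with a deliberately naive suffix-stripping sketch (real stemmers like Porter are more careful, but the risk is the same): distinct words collapse to the same stem, which can discard information the classifier needed.

```python
# Naive suffix stripper, for illustration only: "printing" and "printed"
# collapse to the same stem "print", so their distinction is lost.
def naive_stem(token):
    for suffix in ("ing", "ed", "er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

print([naive_stem(t) for t in ["printers", "printing", "printed", "monitor"]])
# → ['printer', 'print', 'print', 'monitor']
```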
    Nils_WoehlerNils_Woehler Member Posts: 463 Maven

    To filter verbs you can use the "Filter Tokens (by POS Tags)" operator.

    TheBenTheBen Member Posts: 11 Contributor II
    Here is my preferred solution so far:

    Finally, I use SVM as the model, but Naive Bayes is also quite good (up to 98% accuracy) if there are many examples for each category.
    Make sure you filter superfluous tokens. To do this you can use "Filter Stopwords (Dictionary)": create a text file yourself that contains one useless token per line. List as many tokens as you can identify as useless to improve both accuracy and speed. It is VERY important for classification accuracy to filter tokens and keep only the most important ones. I ended up using the pruning method "percentual" with the values "1.0" and "50.0" in addition to the other filters mentioned above. Also check that all values are set correctly in each ExampleSet, i.e. check for "?" (missing values), especially in the label column!
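    The final sanity check (no "?" in the label column) can be sketched like this outside RapidMiner; the data layout is invented for the example, the point is simply to drop examples with a missing label before training:

```python
# Drop examples whose label is missing ("?" in RapidMiner's notation)
# so the learner never sees unlabeled rows. Field names are illustrative.
examples = [
    {"text": "printer error page",  "label": "printer"},
    {"text": "screen stays black",  "label": "?"},        # missing label
    {"text": "battery drains fast", "label": "laptop"},
]

clean = [ex for ex in examples if ex["label"] != "?"]
print(len(clean))  # → 2 (the unlabeled example is removed)
```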

    future work:
    My second step is to use the ImageMining extensions to get a second prediction.
    -> alternative 1: http://splab.cz/en/download/software/immi-rapidminer-extenison
    -> alternative 2: http://madm.dfki.de/rapidminer/imagemining
    I plan to combine both predictions to get more confidence (and to improve the detection of documents that are, for example, just handwritten notes). But the problem right now is that there is no operator (to my knowledge) that extracts the (first) picture embedded in a PDF file.