[SOLVED] enhance accuracy of PDF classifier
-> I want to classify PDF documents into multiple categories by their text content.
2. If you are working with data, give a detailed description of your data (number of examples and attributes, attribute types, label type etc.).
-> There is a database of hundreds of scanned PDF documents (with OCR) that are already classified. There are 150 categories (customer A, B, C, ...; topics such as printer, monitor, laptop, etc.).
The PDF content is extracted from the file. Then the text is cut off after 2000 characters, tokenized, transformed to lower case, stopwords are removed, and the tokens are filtered by length. I still get 4000 attributes...
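For readers who want to see the steps outside RapidMiner, here is a minimal plain-Python sketch of the preprocessing described above. It is not the actual RapidMiner process. Only the 2000-character cutoff comes from this post; the token length bounds, the tiny stopword list, and the function names `preprocess` and `build_vocabulary` are assumptions made up for illustration.

```python
import re

# Only MAX_CHARS comes from the post; the other values are assumed.
MAX_CHARS = 2000
MIN_TOKEN_LEN = 3   # hypothetical lower bound for the length filter
MAX_TOKEN_LEN = 25  # hypothetical upper bound for the length filter
# Tiny illustrative stopword list, not a real stopword dictionary.
STOPWORDS = {"the", "and", "for", "with", "this", "that", "are", "was"}


def preprocess(text):
    """Return the filtered token list for one document's extracted text."""
    truncated = text[:MAX_CHARS]                         # cut off after 2000 characters
    tokens = re.findall(r"[^\W\d_]+", truncated)         # tokenize: runs of letters only
    tokens = [t.lower() for t in tokens]                 # transform to lower case
    tokens = [t for t in tokens if t not in STOPWORDS]   # remove stopwords
    # filter tokens by length
    return [t for t in tokens if MIN_TOKEN_LEN <= len(t) <= MAX_TOKEN_LEN]


def build_vocabulary(documents):
    """Collect the distinct tokens over all documents; each one becomes an attribute."""
    vocab = set()
    for doc in documents:
        vocab.update(preprocess(doc))
    return sorted(vocab)
```

Calling `len(build_vocabulary(texts))` on the extracted texts gives the attribute count that this kind of pipeline produces, which makes it easy to see how the cutoff and length bounds change that number.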
3. Describe which results or actions you are expecting.
-> The program should reach an accuracy of 98% or above when classifying the PDF files.
-> The program shouldn't take seconds to learn each PDF document.
4. Describe which results you actually get.
-> I get an accuracy of about 70-80% on the training examples if I use SVM with standard values. (My knowledge of SVM parameters is limited.)
-> I get an accuracy of about 85% if I use Naive Bayes, and it learns very fast.
-> I have a performance issue if I use models other than SVM or Naive Bayes. Training a neural net model with 4000 attributes takes 1 second per PDF file... (Core i7).
-- How can I deal with so many tokens/attributes from the PDF file content?
-- How should I adjust the SVM parameters to get higher accuracy on tokenized text?
-- How can I use a neural net model in this case? (Adjust learning rates, etc.?)