[SOLVED] Read PDF with images for Text Mining
I'm new to RapidMiner and I need to build a process that counts the words in a folder containing thousands of files of different types, including PDF.
I built a first process for HTML files only, which works as I want, but I have a problem with some PDF files.
When a PDF contains at least one image it cannot be read, whereas other PDF files cause no problem.
I work on Windows 7 64-bit with RapidMiner 5.3.0 64-bit, all extensions installed, and 64-bit Java.
Currently I use only the "Read Document" operator. When I run the process I get the following pop-up message:
The setup does not seem to contain any obvious errors, but you should check the log messages or activate the debug mode in the settings dialog in order to get more information about this problem.
Can someone help me mine the text in a PDF file that contains images?
The log is as follows:
Feb 27, 2013 2:12:24 PM INFO: No filename given for result file, using stdout for logging results!
Feb 27, 2013 2:12:24 PM INFO: Process starts
Feb 27, 2013 2:12:24 PM INFO: Loading initial data.
Feb 27, 2013 2:12:24 PM SEVERE: Process failed: operator cannot be executed. Check the log messages...
Feb 27, 2013 2:12:24 PM SEVERE: Here: Process (Process)
subprocess 'Main Process'
==> +- Read Document (Read Document)
Feb 27, 2013 2:12:24 PM SEVERE: java.lang.NullPointerException
Thanks in advance for your replies.