RapidMiner

Reading Microsoft word documents (word count)

Contributor

Hi,
  I did some searching on this topic and found almost nothing on reading DOC and DOCX documents with the 'Read Document' operator. Is this possible without first converting the Word documents to a supported format (e.g. CSV, PDF, RTF, HTML)? I have thousands of Word documents, so I would like to read them without pre-processing.

Regards,
Serge
3 REPLIES
Moderator

Re: Reading Microsoft word documents (word count)

Hi,

I'm afraid that is currently not possible.

Regards,
Marco
_________________________________________________________
Team Lead Software Engineering | RapidMiner GmbH
Regular Contributor

Re: Reading Microsoft word documents (word count)

Hi

I have the same problem.
Currently I use a bash script to convert DOC and DOCX files, but I would like to avoid this pre-processing step.
Please let me know if you find something that can help.

Regards
Johan
Re: Reading Microsoft word documents (word count)

Unfortunately, RapidMiner cannot handle Word documents natively. You have to use a command-line tool to extract the text, e.g. antiword: http://www-stud.rbi.informatik.uni-frankfurt.de/~markus/antiword/

You can run the program from your RapidMiner process with the Execute Program operator.
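
As a rough illustration of the kind of program you could call that way, here is a minimal Python sketch that turns a folder of Word files into plain-text files and counts the words in each. It is only a sketch under some assumptions: antiword is installed and on the PATH and is used for the older .doc files, while .docx files are read directly as zip archives containing word/document.xml. The function names (docx_to_text, doc_to_text, count_words, convert_folder) are made up for this example and are not part of RapidMiner or antiword.

```python
import subprocess
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Extract paragraph text from a .docx file using only the standard library."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each w:t element holds one run of text within the paragraph.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def doc_to_text(path):
    """Extract text from a legacy .doc file. Assumption: antiword is on the PATH."""
    result = subprocess.run(
        ["antiword", str(path)], capture_output=True, text=True, check=True
    )
    return result.stdout


def count_words(text):
    """Count whitespace-separated tokens."""
    return len(text.split())


def convert_folder(source_dir, target_dir):
    """Write one .txt file per Word document and return {file name: word count}."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    counts = {}
    for path in sorted(Path(source_dir).iterdir()):
        suffix = path.suffix.lower()
        if suffix == ".docx":
            text = docx_to_text(path)
        elif suffix == ".doc":
            text = doc_to_text(path)
        else:
            continue  # skip anything that is not a Word document
        # Keep the original extension in the output name to avoid collisions.
        (target / (path.name + ".txt")).write_text(text, encoding="utf-8")
        counts[path.name] = count_words(text)
    return counts
```

Calling convert_folder("word_docs", "plain_text") (both folder names are placeholders) would leave one text file per document in the target folder, which RapidMiner can then read as ordinary text. The .docx reader here only collects paragraph text and ignores headers, footers and footnotes, so treat the counts as approximate.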

Best regards,
Marius