"Reading Microsoft Word documents (word count)"

SergeMerz Member Posts: 1 Contributor I
edited June 2019 in Help
Hi,
  I did some searching on this topic and found almost nothing on reading DOC and DOCX documents with the 'Read Document' step. Is this possible without converting the MS Word documents to a supported format (e.g. CSV, PDF, RTF, HTML)? I have thousands of Word documents, so I would like to read them without pre-processing.

Regards,
Serge

Answers

  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    I'm afraid that is currently not possible.

    Regards,
    Marco
  • johan_CG Member Posts: 19 Contributor II
    Hi

    I have the same problem.
    Currently I use a bash script to convert DOC and DOCX files, but I would like to avoid this pre-processing step.
    Please let me know if you find something that can help.

    Regards
    Johan
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Unfortunately, RapidMiner cannot handle Word documents natively. You have to use a command-line tool to extract the text, e.g. antiword: http://www-stud.rbi.informatik.uni-frankfurt.de/~markus/antiword/

    You can run the program from your RapidMiner process with the Execute Program operator.

    Best regards,
    Marius
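
The pre-processing route described in the answers above could be sketched roughly as follows in Python. This is only an illustration under a few assumptions, not something taken from RapidMiner itself: antiword is assumed to be installed and on the PATH, and it only handles the older binary DOC format, so DOCX files are read here directly from the zip archive (`word/document.xml`) with the standard library. The helper names `docx_to_text`, `doc_to_text`, and `convert_folder` are hypothetical, and the DOCX reader only collects body text (no headers, footers, or footnotes).

```python
import subprocess
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a DOCX file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join all text runs (w:t elements) found inside this paragraph
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def doc_to_text(path):
    """Return the text of a legacy DOC file via antiword (assumed to be on PATH)."""
    result = subprocess.run(
        ["antiword", str(path)], capture_output=True, text=True, check=True
    )
    return result.stdout


def convert_folder(src_dir, out_dir):
    """Write one .txt file per .doc/.docx file found directly in src_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(src_dir).iterdir()):
        suffix = path.suffix.lower()
        if suffix == ".docx":
            text = docx_to_text(path)
        elif suffix == ".doc":
            text = doc_to_text(path)
        else:
            continue  # skip anything that is not a Word document
        target = out / (path.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

A script like this could be run once over the whole folder, or started from a RapidMiner process with the Execute Program operator as suggested above, so that the process only ever sees the resulting plain-text files.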