
"Reading Microsoft Word documents (word count)"

SergeMerz Member Posts: 1 Learner III
edited June 2019 in Help
Hi,
  I searched for this topic and found almost nothing on reading DOC and DOCX documents with the 'Read Document' step. Is this possible without first converting each MS Word document to a supported format (e.g. CSV, PDF, RTF, HTML)? I have thousands of Word documents, so I would like to read them without pre-processing.

Regards,
Serge

Answers

  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,996 RM Engineering
    Hi,

    I'm afraid that is currently not possible.

    Regards,
    Marco
  • johan_CG Member Posts: 19 Contributor II
    Hi

    I have the same problem.
    Currently I use a bash script to convert DOC and DOCX files, but I would like to avoid this pre-processing step.
    Please let me know if you find something that can help.

    Regards
    Johan
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Unfortunately, RapidMiner cannot handle Word documents natively. You have to use a command-line tool to extract the text, e.g. antiword: http://www-stud.rbi.informatik.uni-frankfurt.de/~markus/antiword/

    You can run the program from your RapidMiner process with the Execute Program operator.

    Best regards,
    Marius
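
For anyone landing on this thread, the pre-processing step discussed above could be sketched roughly as follows. This is only an illustration, not an official RapidMiner approach: the function names (`docx_to_text`, `doc_to_text`, `convert_folder`) and the output layout are made up for the example. It assumes that antiword is installed and on the PATH for legacy .doc files, and it relies on the fact that a .docx file is a ZIP archive whose main text lives in `word/document.xml`. Headers, footers and footnotes are ignored, and text boxes may not be handled cleanly.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def doc_to_text(path):
    """Return the text of a legacy .doc file by calling antiword (assumed installed)."""
    if shutil.which("antiword") is None:
        raise FileNotFoundError("antiword was not found on PATH")
    result = subprocess.run(
        ["antiword", str(path)], capture_output=True, text=True, check=True
    )
    return result.stdout


def convert_folder(src_dir, out_dir):
    """Write one .txt file per Word document and return {file name: word count}."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    counts = {}
    for doc in sorted(Path(src_dir).iterdir()):
        suffix = doc.suffix.lower()
        if suffix == ".docx":
            text = docx_to_text(doc)
        elif suffix == ".doc":
            text = doc_to_text(doc)
        else:
            continue  # skip anything that is not a Word document
        # Keep the original extension in the name so a.doc and a.docx do not collide.
        (out / (doc.name + ".txt")).write_text(text, encoding="utf-8")
        counts[doc.name] = len(text.split())
    return counts
```

Calling `convert_folder("word_docs", "txt_out")` (both folder names are placeholders) would leave plain-text files that a text-reading step can pick up. The word count here is a simple whitespace split, so it will not necessarily match the number Word itself reports.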