"Loading Adobe/Word into Rapidminer"

ben_buhl Member Posts: 3 Contributor I
edited June 2019 in Help

Hi All,

I want to load some Adobe PDF documents into RapidMiner so I can calculate word frequencies. I can already do this with Excel sheets, but I can't seem to load the PDFs. Please let me know which operators I need to load either PDF or Word documents into RapidMiner to calculate word frequencies.

Thanks.

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    You can load PDF, TXT, HTML, and XML files only. DOCX is not supported. 

  • ben_buhl Member Posts: 3 Contributor I

    Thank you, that is helpful.  Can you tell me what operators I will need to make this work?

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Sure. If the files are in a directory, use the Process Documents from Files operator. This operator is part of the Text Processing extension, which is available on the RapidMiner Marketplace.

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

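    Actually, reading DOCX is supported as well. Please see this sample process. It opens the .docx file, loops over the zip entries in its word directory that match document.xml, and reads that entry as an XML document.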

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="open_file" compatibility="7.4.000" expanded="true" height="68" name="Open File" width="90" x="246" y="187">
            <parameter key="filename" value="C:\Users\think\Documents\MyWordDoc.docx"/>
          </operator>
          <operator activated="true" class="loop_zipfile_entries" compatibility="7.4.000" expanded="true" height="82" name="Read Word Document" width="90" x="581" y="187">
            <parameter key="internal_directory" value="word"/>
            <parameter key="filter" value="document\.xml"/>
            <process expanded="true">
              <operator activated="true" class="text:read_document" compatibility="7.4.001" expanded="true" height="68" name="Read Document" width="90" x="179" y="238">
                <parameter key="content_type" value="xml"/>
              </operator>
              <connect from_port="file object" to_op="Read Document" to_port="file"/>
              <connect from_op="Read Document" from_port="output" to_port="out 1"/>
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Open File" from_port="file" to_op="Read Word Document" to_port="file"/>
          <connect from_op="Read Word Document" from_port="out 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Huh, will you look at that. You taught me a new trick @JEdward! Thanks!

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    Modern MS Office documents (DOCX, PPTX, XLSX) are actually just zip archives of XML files.
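
    To see the zip structure outside RapidMiner, here is a minimal Python sketch of the same idea: open the .docx as a zip archive, read word/document.xml, and count words. It is not part of the sample process. The helper names docx_word_counts and make_sample_docx are hypothetical, the simple regex tokenizer is an assumption, and the in-memory sample contains only the one entry the reader needs, so it is not a complete Word file.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace used by paragraphs (w:p) and text runs (w:t)
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_word_counts(docx_file):
    """Count words in a .docx given a path or a binary file-like object."""
    with zipfile.ZipFile(docx_file) as archive:
        # Same entry the sample process selects: directory "word", file "document.xml"
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    words = []
    # Work paragraph by paragraph so words from different paragraphs do not fuse;
    # runs inside one paragraph are joined directly because Word may split a word
    # across several runs.
    for paragraph in root.iter(W_NS + "p"):
        text = "".join(node.text or "" for node in paragraph.iter(W_NS + "t"))
        # Assumed tokenizer: lowercase letters, digits, and apostrophes
        words.extend(re.findall(r"[a-z0-9']+", text.lower()))
    return Counter(words)


def make_sample_docx():
    """Build a tiny stand-in for a .docx in memory (demo data only)."""
    body = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        "<w:body>"
        "<w:p><w:r><w:t>Rapid miner reads</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>zip files and reads XML</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(docx_word_counts(make_sample_docx()).most_common(3))
```

    With a real file you would pass its path instead, for example docx_word_counts(r"C:\Users\think\Documents\MyWordDoc.docx").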

    This also works with PPTX documents, but you need to change the loop over the zip file entries (the Read Word Document operator in my example) so that it loops through each slide, because PowerPoint stores every slide in a separate XML document.
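
    As a rough illustration of the per-slide layout, the Python sketch below walks the slide entries of a .pptx and collects the text of each one. It assumes the usual ppt/slides/slideN.xml entry names and the DrawingML a:t text elements. The names pptx_slide_texts and make_sample_pptx are hypothetical, and the in-memory sample is not a complete PowerPoint file.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used by paragraphs (a:p) and text runs (a:t) on a slide
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
# Assumed entry naming: one XML file per slide under ppt/slides/
SLIDE_ENTRY = re.compile(r"ppt/slides/slide(\d+)\.xml")


def pptx_slide_texts(pptx_file):
    """Return {slide number: text} for a .pptx path or binary file-like object."""
    texts = {}
    with zipfile.ZipFile(pptx_file) as archive:
        for name in archive.namelist():
            match = SLIDE_ENTRY.fullmatch(name)
            if match is None:
                continue  # skip relationship files, layouts, media, and so on
            root = ET.fromstring(archive.read(name))
            paragraphs = []
            for paragraph in root.iter(A_NS + "p"):
                # Join runs directly; a word may be split across several runs
                paragraphs.append(
                    "".join(node.text or "" for node in paragraph.iter(A_NS + "t"))
                )
            texts[int(match.group(1))] = " ".join(p for p in paragraphs if p)
    # Sort by slide number, since zip entry order is not guaranteed
    return dict(sorted(texts.items()))


def make_sample_pptx():
    """Build a tiny stand-in for a .pptx in memory (demo data only)."""
    template = (
        '<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" '
        'xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
        "<p:cSld><p:spTree><p:sp><p:txBody>"
        "<a:p><a:r><a:t>{text}</a:t></a:r></a:p>"
        "</p:txBody></p:sp></p:spTree></p:cSld></p:sld>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("ppt/slides/slide2.xml", template.format(text="Second slide"))
        archive.writestr("ppt/slides/slide1.xml", template.format(text="First slide"))
        archive.writestr("ppt/slides/_rels/slide1.xml.rels", "<Relationships/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for number, text in pptx_slide_texts(make_sample_pptx()).items():
        print(number, text)
```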

    Have fun!
