How to process multiple MS Word files in RapidMiner?

kevinace · November 2020

Dear All
I want to process multiple MS Word files.

If I use 'Process Documents from Files' as per the tutorial, the file content looks corrupted. For example, for a file named helloworld.docx that contains only two words, "hello world", RapidMiner produces a chunk of unrelated words as output.
I understand I can use 'Read Office File' to read the exact content of an MS Word document; however, this operator can only read one file at a time.
How do I combine these two operators, or are there additional tools I could use? Neither 'Read Office File -> Process Documents from Files -> rest' nor 'Process Documents from Files -> Read Office File -> rest' seems logical.

My goal is to load a batch of MS Word files for readability analysis, using indexes such as SMOG and FOG to check the readability of a large volume of content, so I can gather more data samples for a university research paper.

Thanks a lot!

MartinLiebig · November 2020

Hi,

Loop Files + Read Office are the two operators you need to combine.

Best,

Martin

kevinace · November 2020

Dear Martin

How do I set up the parameters of the 'Loop Files' operator to load multiple MS Word files into RapidMiner?
The setup I tried is 'Loop Files' - 'Read Office File' - rest.
Loop Files:
Directory: C:/Users/user/Downloads/t1
Filter type: glob
Filter by glob: .*doc
Enable parallel execution

If the glob filter is .*doc, I get: "not enough iterations: the minimum number of iterations must not be smaller than 1."
If the glob filter is *.doc, I get an error of type "input is missing": the previous operator, Loop Files, did not produce any output.
There are 3 files in the t1 folder: two .doc files and one .docx file.

I also searched Google for how to use Loop Files; however, the parameter settings shown in the 2018 YouTube videos no longer seem valid for the current version.
Looking forward to your replies.

With thanks!

Kevin

MartinLiebig · December 2020

Hi,

Don't use glob but regex; that should do the trick.

Best,

Martin

kevinace · December 2020

Dear Martin

I tried what we discussed. What is still missing?
Please see the attached screenshot.
The Read Office File operator is left at its defaults, with 'detect file type' enabled.

(There are only 2 .doc files in the t1 folder.)

Image: https://us.v-cdn.net/6030995/uploads/editor/nb/law0dz3zhstw.jpg

MartinLiebig · December 2020

Hi,

You want to put the Read Office File operator inside Loop Files. Attached is an example.

Best,

Martin

<?xml version="1.0" encoding="UTF-8"?><process version="9.8.000">
<context>
    <input/>
    <output/>
    <macros/>
</context>
<operator activated="true" class="process" compatibility="9.8.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="concurrency:loop_files" compatibility="9.8.000" expanded="true" height="82" name="Loop Files" width="90" x="514" y="34">
        <parameter key="filter_type" value="regex"/>
        <parameter key="filter_by_regex" value=".*docx"/>
        <parameter key="recursive" value="false"/>
        <parameter key="enable_macros" value="false"/>
        <parameter key="macro_for_file_name" value="file_name"/>
        <parameter key="macro_for_file_type" value="file_type"/>
        <parameter key="macro_for_folder_name" value="folder_name"/>
        <parameter key="reuse_results" value="false"/>
        <parameter key="enable_parallel_execution" value="true"/>
        <process expanded="true">
          <operator activated="true" class="operator_toolbox:read_word_files" compatibility="2.8.000-SNAPSHOT" expanded="true" height="68" name="Read Office File" width="90" x="246" y="34">
            <parameter key="detect_file_type" value="true"/>
            <parameter key="file_extension" value="docx"/>
          </operator>
          <connect from_port="file object" to_op="Read Office File" to_port="file"/>
          <connect from_op="Read Office File" from_port="doc" to_port="output 1"/>
          <portSpacing port="source_file object" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">Add directory here</description>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="9.3.001" expanded="true" height="82" name="Documents to Data" width="90" x="715" y="34">
        <parameter key="add_meta_information" value="true"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="use_processed_text" value="false"/>
      </operator>
      <connect from_op="Loop Files" from_port="output 1" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
</operator>
</process>
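
For a folder that mixes .doc and .docx files, like the t1 folder described earlier in the thread, the filter in the attached process (.*docx) would pick up only the .docx files. A minimal sketch of the Loop Files parameters for that case follows. The "directory" key and the pattern .*\.docx? are assumptions of this sketch and are not part of Martin's attachment; the directory value is the path from Kevin's post, and the pattern is intended to match file names ending in .doc or .docx.

```xml
<!-- Sketch only: parameters to set on the existing Loop Files operator. -->
<!-- "directory" is assumed to be the parameter key behind the Directory field. -->
<parameter key="directory" value="C:/Users/user/Downloads/t1"/>
<parameter key="filter_type" value="regex"/>
<!-- Assumed pattern: matches file names ending in .doc or .docx. -->
<parameter key="filter_by_regex" value=".*\.docx?"/>
```

With Read Office File placed inside the loop and 'detect file type' left on, as in the process above, each matched file should reach the reader through the "file object" port.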
