How to process multiple MS Word files in RapidMiner?

kevinace Member Posts: 6 Newbie
Dear all,
I want to process multiple MS Word files.

If I use 'Process Documents from Files' as per the tutorial, the file content looks corrupted. For example, for a file named helloworld.docx that contains only two words, 'hello world', RapidMiner produces a chunk of unrelated words as output.
I understand I can use 'Read Office File' to read the exact content of MS Word documents; however, this operator handles only one file at a time.
How do I combine these two operators, or are there additional tools I could use? Neither 'Read Office File -> Process Documents from Files -> res' nor 'Process Documents from Files -> Read Office File -> res' seems logical.

My goal is to load a batch of MS Word files for readability analysis, using indexes such as SMOG and FOG to check the readability of a large volume of content, so I can gather more data samples for a university research paper.

Thanks a lot!
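
For reference, the two indexes named above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a RapidMiner feature: the function names are hypothetical, the syllable counter is a rough vowel-group heuristic, and every word with three or more estimated syllables is treated as "complex" (the published Gunning fog rules exclude some word classes, which this sketch ignores).

```python
import math
import re

def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Assumption: a trailing "e" (as in "like") is silent, except in "-le" endings.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def text_stats(text):
    """Return (sentence count, word count, count of words with 3+ syllables)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return len(sentences), len(words), polysyllables

def smog_index(text):
    """SMOG grade: 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291."""
    n_sent, _, n_poly = text_stats(text)
    if n_sent == 0:
        return 0.0
    return 1.0430 * math.sqrt(n_poly * 30 / n_sent) + 3.1291

def gunning_fog(text):
    """Gunning fog: 0.4 * (words per sentence + 100 * complex words / words)."""
    n_sent, n_words, n_poly = text_stats(text)
    if n_sent == 0 or n_words == 0:
        return 0.0
    return 0.4 * (n_words / n_sent + 100 * n_poly / n_words)
```

Each function takes the plain text of one document, so it could be applied per file once the Word content has been extracted.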

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,282 RM Data Scientist
    Hi,
    Loop Files + Read Office are the two operators you need to combine.

    Best,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • kevinace Member Posts: 6 Newbie
    Dear Martin

    How do I set up the parameters of the 'Loop Files' operator to load multiple MS Word files into RapidMiner?
    My setup is 'Loop Files' -> 'Read Office File' -> rest.
    Loop Files:
    Directory: C:/Users/user/Downloads/t1
    Filter type: Glob
    Filter by glob: .*doc
    Enable parallel execution

    If the glob filter is .*doc, I get: "Not enough iterations: the minimum number of iterations must not be smaller than 1."
    If the glob filter is *.doc, I get: "Input is missing: the previous operator Loop Files did not produce any output."
    There are 3 files in the t1 folder: two .doc files and one .docx file.

    I also searched Google for how to use Loop Files; however, the parameter settings shown in YouTube videos from 2018 no longer seem valid in the current version.
    Looking forward to your replies.

    With thanks!

    Kevin

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,282 RM Data Scientist
    Hi,
    Don't use glob, use regex instead; that should do the trick :)

    Best,
    Martin
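
To illustrate why the pattern and the filter type have to agree, here is a small Python sketch of general glob versus regex semantics. The file names are hypothetical stand-ins for the three files in the t1 folder, and the sketch assumes a regex has to match the whole file name; RapidMiner's own matching may differ in detail.

```python
import fnmatch
import re

# Hypothetical names standing in for two .doc files and one .docx file.
files = ["report1.doc", "report2.doc", "notes.docx"]

def glob_matches(pattern, names):
    """Names accepted by a shell-style glob pattern."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]

def regex_matches(pattern, names):
    """Names accepted when the regex must match the whole file name."""
    return [n for n in names if re.fullmatch(pattern, n)]

# As a glob, ".*doc" means: a literal dot, then anything, then "doc".
glob_dot_star = glob_matches(".*doc", files)

# As a glob, "*.doc" means: anything, then the literal suffix ".doc".
glob_star_dot = glob_matches("*.doc", files)

# As a regex, ".*doc" means: any characters, then "doc" at the end.
regex_loose = regex_matches(r".*doc", files)

# A regex that accepts both extensions: escaped dot, "doc", optional "x".
regex_both = regex_matches(r".*\.docx?", files)

print(glob_dot_star, glob_star_dot, regex_loose, regex_both, sep="\n")
```

Under these assumptions the glob `.*doc` selects no files at all, which is consistent with the "not enough iterations" message, while a regex such as `.*\.docx?` would pick up both .doc and .docx files.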
  • kevinace Member Posts: 6 Newbie
    Dear Martin

    I tried what we discussed; what is still missing?
    Please see the attached screenshot, thanks.
    The 'Read Office File' parameters are left at their defaults, with 'detect file type' enabled.

    (There are only 2 .doc files in the t1 folder.)
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,282 RM Data Scientist
    Hi,
    You need to put the Read Office File operator inside Loop Files. Attached is an example.

    Best,
    Martin

    <?xml version="1.0" encoding="UTF-8"?><process version="9.8.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.8.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:loop_files" compatibility="9.8.000" expanded="true" height="82" name="Loop Files" width="90" x="514" y="34">
            <parameter key="filter_type" value="regex"/>
            <parameter key="filter_by_regex" value=".*docx"/>
            <parameter key="recursive" value="false"/>
            <parameter key="enable_macros" value="false"/>
            <parameter key="macro_for_file_name" value="file_name"/>
            <parameter key="macro_for_file_type" value="file_type"/>
            <parameter key="macro_for_folder_name" value="folder_name"/>
            <parameter key="reuse_results" value="false"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="operator_toolbox:read_word_files" compatibility="2.8.000-SNAPSHOT" expanded="true" height="68" name="Read Office File" width="90" x="246" y="34">
                <parameter key="detect_file_type" value="true"/>
                <parameter key="file_extension" value="docx"/>
              </operator>
              <connect from_port="file object" to_op="Read Office File" to_port="file"/>
              <connect from_op="Read Office File" from_port="doc" to_port="output 1"/>
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
            <description align="center" color="transparent" colored="false" width="126">Add directory here</description>
          </operator>
          <operator activated="true" class="text:documents_to_data" compatibility="9.3.001" expanded="true" height="82" name="Documents to Data" width="90" x="715" y="34">
            <parameter key="add_meta_information" value="true"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
            <parameter key="use_processed_text" value="false"/>
          </operator>
          <connect from_op="Loop Files" from_port="output 1" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
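
For readers who want to see the same "read inside the loop" idea outside RapidMiner, the following is a rough plain-Python analogue, not what the operators do internally. It assumes only that a .docx file is a ZIP archive whose main text lives in word/document.xml; the function names are hypothetical, and legacy binary .doc files are not handled.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def read_docx_text(path):
    """Return the text of one .docx file, one paragraph per line."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join all text runs that belong to this paragraph.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)

def load_documents(directory, pattern=r".*\.docx"):
    """Loop over a folder and read every file whose name matches the regex."""
    docs = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and re.fullmatch(pattern, path.name):
            # The read happens inside the loop, once per matching file.
            docs[path.name] = read_docx_text(path)
    return docs
```

Calling `load_documents` with a folder path returns a dictionary from file name to extracted text, roughly the role the Loop Files output plays before Documents to Data in the process above.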



