Hello everyone! I have just recently started working with Rapidminer and would like to use loops to

ramy Member Posts: 6 Newbie
Hello everyone! I have just recently started working with RapidMiner and would like to use loops to preprocess a folder of text files and write the processed files to another folder. However, the Write Document operator gives me this error message: "Expected Document but received IOObjectCollection". I would be very grateful if someone could take a look at it!

Kind regards,
Ramy

Best Answer

  • Options
    MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Solution Accepted
    Hi,
    you received not one Document but a collection of documents. If you want to write each of them to a separate file, you need to use a Loop Collection and write the documents individually inside it.

    If you want to write one big document, you can use Combine Documents to merge them into one and then use Write Document.
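
    As a rough illustration of the first option, the fragment below sketches a Loop Collection that hands each Document to a Write Document operator. The operator classes, port names, and parameter keys are taken from the process XML posted later in this thread; the output path is a hypothetical placeholder, and the fragment is meant to show the wiring rather than a complete process.

    ```xml
    <!-- Sketch only: classes, ports and parameter keys follow the process XML posted in this thread -->
    <operator activated="true" class="loop_collection" expanded="true" name="Loop Collection">
      <parameter key="unfold" value="true"/>
      <process expanded="true">
        <operator activated="true" class="text:write_document" expanded="true" name="Write Document">
          <!-- Hypothetical placeholder path: a fixed path is overwritten on every iteration -->
          <parameter key="file" value="C:\path\to\output\document.txt"/>
          <parameter key="overwrite" value="true"/>
        </operator>
        <!-- "single" delivers one Document per iteration instead of the whole IOObjectCollection -->
        <connect from_port="single" to_op="Write Document" to_port="document"/>
        <connect from_op="Write Document" from_port="document" to_port="output 1"/>
      </process>
    </operator>
    ```

    Because the file name here is constant, each iteration would replace the previous file; the name has to vary per iteration for all documents to survive.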

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany

Answers

  • Options
    ramy Member Posts: 6 Newbie
    Hi, thank you so much!

    I have tried to implement this. At least now I don't get an error message anymore. I also get the results displayed correctly. So your hint has already helped me a lot!

    However, not all 24 files in the source folder are written to the destination folder in the preprocessed version, but only two: A text file with the revised text and the original's file name (which is exactly what I wanted to have) and another file called "1", but also containing the desired preprocessing result of another original file.

    I would be very happy if you could help me to get all files in the original folder preprocessed and written to the destination folder with the desired filename! :)

    Best regards,
    Ramy

    P.S.: Sorry I didn't include my code right away, I had accidentally posted a draft message ... :(

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.6.003" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="concurrency:loop_files" compatibility="7.6.003" expanded="true" height="82" name="Loop Files" width="90" x="313" y="34">
            <parameter key="directory" value="C:\Users\r\P4_XML_TCP\test"/>
            <parameter key="filter_by_glob" value="*.txt"/>
            <parameter key="enable_macros" value="true"/>
            <process expanded="true">
              <operator activated="true" class="text:read_document" compatibility="7.5.000" expanded="true" height="68" name="Read Document" width="90" x="112" y="136"/>
              <operator activated="true" class="multiply" compatibility="7.6.003" expanded="true" height="103" name="Multiply" width="90" x="246" y="136"/>
              <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="380" y="238"/>
              <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="514" y="238"/>
              <operator activated="true" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data" width="90" x="380" y="34">
                <parameter key="text_attribute" value="file_name"/>
                <parameter key="label_attribute" value="file_name"/>
              </operator>
              <operator activated="true" class="extract_macro" compatibility="7.6.003" expanded="true" height="68" name="Extract Macro" width="90" x="514" y="34">
                <parameter key="macro" value="file_name"/>
                <list key="additional_macros"/>
              </operator>
              <connect from_port="file object" to_op="Read Document" to_port="file"/>
              <connect from_op="Read Document" from_port="output" to_op="Multiply" to_port="input"/>
              <connect from_op="Multiply" from_port="output 1" to_op="Documents to Data" to_port="documents 1"/>
              <connect from_op="Multiply" from_port="output 2" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_port="output 1"/>
              <connect from_op="Documents to Data" from_port="example set" to_op="Extract Macro" to_port="example set"/>
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="loop_collection" compatibility="7.6.003" expanded="true" height="103" name="Loop Collection" width="90" x="447" y="34">
            <parameter key="unfold" value="true"/>
            <process expanded="true">
              <operator activated="true" class="text:write_document" compatibility="7.5.000" expanded="true" height="82" name="Write Document (2)" width="90" x="179" y="85">
                <parameter key="file" value="C:\Users\r\P4_XML_TCP\testoutput\%{file_name}"/>
              </operator>
              <connect from_port="single" to_op="Write Document (2)" to_port="document"/>
              <connect from_op="Write Document (2)" from_port="document" to_port="output 1"/>
              <connect from_op="Write Document (2)" from_port="file" to_port="output 2"/>
              <portSpacing port="source_single" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
              <portSpacing port="sink_output 3" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Loop Files" from_port="output 1" to_op="Loop Collection" to_port="collection"/>
          <connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

  • Options
    ramy Member Posts: 6 Newbie
    <?xml version="1.0" encoding="UTF-8"?>
  • Options
    ramy Member Posts: 6 Newbie
    Okay, for some reason the XML code is not displayed correctly, I hope it's ok if I add it here as a file!


    Best,
    Ramy
  • Options
    ramy Member Posts: 6 Newbie
    <?xml version="1.0" encoding="UTF-8"?><process version="9.10.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.4.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:loop_files" compatibility="8.2.000" expanded="true" height="82" name="Loop Files" width="90" x="112" y="34">
            <parameter key="directory" value="C:\Users\r\P4_XML_TCP\test"/>
            <parameter key="filter_type" value="glob"/>
            <parameter key="filter_by_glob" value="*.txt"/>
            <parameter key="recursive" value="false"/>
            <parameter key="enable_macros" value="true"/>
            <parameter key="macro_for_file_name" value="file_name"/>
            <parameter key="macro_for_file_type" value="file_type"/>
            <parameter key="macro_for_folder_name" value="folder_name"/>
            <parameter key="reuse_results" value="false"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="text:read_document" compatibility="9.3.001" expanded="true" height="68" name="Read Document" width="90" x="246" y="34">
                <parameter key="extract_text_only" value="true"/>
                <parameter key="use_file_extension_as_type" value="true"/>
                <parameter key="content_type" value="txt"/>
                <parameter key="encoding" value="SYSTEM"/>
              </operator>
              <connect from_port="file object" to_op="Read Document" to_port="file"/>
              <connect from_op="Read Document" from_port="output" to_port="output 1"/>
              <portSpacing port="source_file object" spacing="0"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="loop_collection" compatibility="9.10.000" expanded="true" height="82" name="Loop Collection (2)" width="90" x="313" y="34">
            <parameter key="set_iteration_macro" value="false"/>
            <parameter key="macro_name" value="iteration"/>
            <parameter key="macro_start_value" value="1"/>
            <parameter key="unfold" value="true"/>
            <process expanded="true">
              <operator activated="true" class="text:cut_document" compatibility="8.2.000" expanded="true" height="68" name="Cut Document" width="90" x="246" y="34">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries"/>
                <parameter key="attribute_type" value="Nominal"/>
                <list key="regular_expression_queries">
                  <parameter key="sentence" value="(.+)"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
                <parameter key="ignore_CDATA" value="true"/>
                <parameter key="assume_html" value="true"/>
                <list key="index_queries"/>
                <list key="jsonpath_queries"/>
                <process expanded="true">
                  <operator activated="true" class="text:tokenize" compatibility="9.3.001" expanded="true" height="68" name="Tokenize" width="90" x="246" y="34">
                    <parameter key="mode" value="non letters"/>
                    <parameter key="characters" value=".:"/>
                    <parameter key="language" value="English"/>
                    <parameter key="max_token_length" value="3"/>
                  </operator>
                  <operator activated="true" class="text:transform_cases" compatibility="9.3.001" expanded="true" height="68" name="Transform Cases" width="90" x="380" y="34">
                    <parameter key="transform_to" value="lower case"/>
                  </operator>
                  <connect from_port="segment" to_op="Tokenize" to_port="document"/>
                  <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
                  <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_segment" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <connect from_port="single" to_op="Cut Document" to_port="document"/>
              <connect from_op="Cut Document" from_port="documents" to_port="output 1"/>
              <portSpacing port="source_single" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="loop_collection" compatibility="9.10.000" expanded="true" height="103" name="Loop Collection" width="90" x="514" y="34">
            <parameter key="set_iteration_macro" value="false"/>
            <parameter key="macro_name" value="iteration"/>
            <parameter key="macro_start_value" value="1"/>
            <parameter key="unfold" value="true"/>
            <process expanded="true">
              <operator activated="true" class="multiply" compatibility="9.10.000" expanded="true" height="103" name="Multiply" width="90" x="112" y="85"/>
              <operator activated="true" class="text:write_document" compatibility="9.3.001" expanded="true" height="82" name="Write Document (2)" width="90" x="313" y="187">
                <parameter key="file" value="C:\Users\r\P4_XML_TCP\testoutput\%{file_name}prep.%{file_type}"/>
                <parameter key="overwrite" value="true"/>
                <parameter key="encoding" value="SYSTEM"/>
              </operator>
              <operator activated="true" class="text:documents_to_data" compatibility="9.3.001" expanded="true" height="82" name="Documents to Data" width="90" x="313" y="34">
                <parameter key="text_attribute" value="file_name"/>
                <parameter key="label_attribute" value="file_name"/>
                <parameter key="add_meta_information" value="true"/>
                <parameter key="datamanagement" value="double_sparse_array"/>
                <parameter key="data_management" value="auto"/>
                <parameter key="use_processed_text" value="false"/>
              </operator>
              <operator activated="true" class="extract_macro" compatibility="9.10.000" expanded="true" height="68" name="Extract Macro" width="90" x="447" y="34">
                <parameter key="macro" value="file_name"/>
                <parameter key="macro_type" value="number_of_examples"/>
                <parameter key="statistics" value="average"/>
                <parameter key="attribute_name" value=""/>
                <list key="additional_macros"/>
              </operator>
              <connect from_port="single" to_op="Multiply" to_port="input"/>
              <connect from_op="Multiply" from_port="output 1" to_op="Documents to Data" to_port="documents 1"/>
              <connect from_op="Multiply" from_port="output 2" to_op="Write Document (2)" to_port="document"/>
              <connect from_op="Documents to Data" from_port="example set" to_op="Extract Macro" to_port="example set"/>
              <portSpacing port="source_single" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
              <portSpacing port="sink_output 3" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Loop Files" from_port="output 1" to_op="Loop Collection (2)" to_port="collection"/>
          <connect from_op="Loop Collection (2)" from_port="output 1" to_op="Loop Collection" to_port="collection"/>
          <connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
          <connect from_op="Loop Collection" from_port="output 2" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>


  • Options
    ramy Member Posts: 6 Newbie
    Okay, issue solved :) I just had to add the iteration macro to the file name in the Write Document operator. Thanks for helping! :)
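
    For readers with the same problem, here is a minimal sketch of what that fix might look like; the exact change was not posted, so this is an assumption. It enables the iteration macro on the Loop Collection (the `set_iteration_macro`, `macro_name`, and `macro_start_value` keys appear in the process above) and uses `%{iteration}` in the Write Document file name. The `prep_` naming scheme is hypothetical. A likely reason for the stray file named "1" is that Extract Macro, with `macro_type` set to `number_of_examples`, overwrites the `file_name` macro with the example count.

    ```xml
    <!-- Sketch only: one possible form of the fix, not the poster's exact process -->
    <operator activated="true" class="loop_collection" expanded="true" name="Loop Collection">
      <!-- Expose the loop counter as %{iteration}, starting at 1 -->
      <parameter key="set_iteration_macro" value="true"/>
      <parameter key="macro_name" value="iteration"/>
      <parameter key="macro_start_value" value="1"/>
      <parameter key="unfold" value="true"/>
      <process expanded="true">
        <operator activated="true" class="text:write_document" expanded="true" name="Write Document (2)">
          <!-- Assumed naming scheme: the counter makes each output file name unique -->
          <parameter key="file" value="C:\Users\r\P4_XML_TCP\testoutput\prep_%{iteration}.txt"/>
          <parameter key="overwrite" value="true"/>
        </operator>
        <connect from_port="single" to_op="Write Document (2)" to_port="document"/>
        <connect from_op="Write Document (2)" from_port="document" to_port="output 1"/>
      </process>
    </operator>
    ```

    With the counter in the file name, each document in the collection is written to its own file instead of overwriting the previous one.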
  • Options
    MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    ahhh, well done @ramy!