Cut the document into equal pieces

In777 Member Posts: 29 Contributor II
edited November 2018 in Help

Hello,

How can I cut documents into equal parts (e.g. 10 parts) and save them as separate documents? I cannot use regex (and so the Cut Document operator), since I have several documents in different formats. I tried the Window Document operator, but I do not quite understand how to use it. Thank you in advance for any help!

Best Answer

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,453 RM Data Scientist
    Solution Accepted

    Hi In777,

    The Window Document operator should do the job. You might need to extract the number of tokens for every document. For an example, please have a look at the attached process.

     

    Best,

    Martin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.3.000">
    <context>
    <input/>
    <output/>
    <macros>
    <macro>
    <key>window_length</key>
    <value>5</value>
    </macro>
    </macros>
    </context>
    <operator activated="true" class="process" compatibility="7.3.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="text:create_document" compatibility="7.2.001" expanded="true" height="68" name="Create Document" width="90" x="112" y="34">
    <parameter key="text" value="Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."/>
    </operator>
    <operator activated="true" class="text:tokenize" compatibility="7.2.001" expanded="true" height="68" name="Tokenize" width="90" x="313" y="34"/>
    <operator activated="false" class="text:process_documents" compatibility="7.2.001" expanded="true" height="82" name="Process Documents" width="90" x="380" y="238">
    <process expanded="true">
    <operator activated="false" class="text:window_document" compatibility="7.2.001" expanded="true" height="68" name="Window Document" width="90" x="112" y="34">
    <parameter key="step_size" value="10"/>
    <process expanded="true">
    <portSpacing port="source_segment" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    </process>
    </operator>
    <connect from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:aggregate_token_length" compatibility="7.2.001" expanded="true" height="68" name="Aggregate Token Length" width="90" x="447" y="34">
    <parameter key="aggregation" value="count"/>
    </operator>
    <operator activated="true" breakpoints="after" class="subprocess" compatibility="7.3.000" expanded="true" height="103" name="Subprocess" width="90" x="581" y="34">
    <process expanded="true">
    <operator activated="true" class="multiply" compatibility="7.3.000" expanded="true" height="103" name="Multiply" width="90" x="45" y="34"/>
    <operator activated="true" class="text:documents_to_data" compatibility="7.2.001" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="85">
    <parameter key="text_attribute" value="text"/>
    </operator>
    <operator activated="true" class="extract_macro" compatibility="7.3.000" expanded="true" height="68" name="Extract Macro" width="90" x="313" y="187">
    <parameter key="macro" value="number_of_tokens"/>
    <parameter key="macro_type" value="data_value"/>
    <parameter key="attribute_name" value="token_length"/>
    <parameter key="example_index" value="1"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="generate_macro" compatibility="7.3.000" expanded="true" height="82" name="Generate Macro" width="90" x="447" y="187">
    <list key="function_descriptions">
    <parameter key="window_length" value="replace(str(eval(%{number_of_tokens})/10),&quot;.0&quot;,&quot;&quot;)"/>
    </list>
    </operator>
    <connect from_port="in 1" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_port="out 1"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Documents to Data" to_port="documents 1"/>
    <connect from_op="Documents to Data" from_port="example set" to_op="Extract Macro" to_port="example set"/>
    <connect from_op="Extract Macro" from_port="example set" to_op="Generate Macro" to_port="through 1"/>
    <connect from_op="Generate Macro" from_port="through 1" to_port="out 2"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    <portSpacing port="sink_out 3" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:window_document" compatibility="7.2.001" expanded="true" height="68" name="Window Document (2)" width="90" x="782" y="34">
    <parameter key="window_length" value="%{window_length}"/>
    <parameter key="step_size" value="%{window_length}"/>
    <process expanded="true">
    <connect from_port="segment" to_port="document 1"/>
    <portSpacing port="source_segment" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Create Document" from_port="output" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Aggregate Token Length" to_port="document"/>
    <connect from_op="Aggregate Token Length" from_port="document" to_op="Subprocess" to_port="in 1"/>
    <connect from_op="Subprocess" from_port="out 1" to_op="Window Document (2)" to_port="document"/>
    <connect from_op="Window Document (2)" from_port="documents" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
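
The idea behind the process above (count the tokens, derive a window length from the count, then cut the document into consecutive windows whose step size equals the window length) can be sketched outside RapidMiner. The Python below is only an illustration under stated assumptions: whitespace splitting stands in for the Tokenize operator, `split_into_windows` is a hypothetical helper name, and integer division is used to keep the window length whole, which the process itself does not do.

```python
# Illustrative sketch, not RapidMiner code.
# Assumption: whitespace splitting stands in for the Tokenize operator.

def split_into_windows(text, parts=10):
    tokens = text.split()
    # The process divides the token count by 10 without rounding; integer
    # division is an assumption made here to keep the length a whole number.
    window_length = max(1, len(tokens) // parts)
    # The step equals the window length, so windows do not overlap.
    return [
        " ".join(tokens[i:i + window_length])
        for i in range(0, len(tokens), window_length)
    ]

sample = " ".join(f"w{i}" for i in range(100))
windows = split_into_windows(sample)
print(len(windows))
```

When the token count is not a multiple of the number of parts, this sketch emits one extra, shorter window at the end. How the Window Document operator treats such a remainder is not covered in this thread.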
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany

Answers

  • In777 Member Posts: 29 Contributor II

    Thank you for the process. It works nicely!

  • In777 Member Posts: 29 Contributor II

    I also have a follow-up question about the process. If I put this process inside a Loop Files operator, how can I write all parts as separate .txt documents to one directory?

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,453 RM Data Scientist

    Hi,

     

    The trick is to use the Loop Collection operator to loop over the separate windows. A handy approach is to use the macro %{a}, which holds the execution count, as a postfix for the file name when storing each window.

     

    ~Martin
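
The pattern described here, looping over the parts and using a per-iteration counter as a file name postfix, can be sketched in Python as follows. This is an illustration, not RapidMiner code: `write_parts`, the base name `report`, and the temporary output directory are hypothetical, and the counter is assumed to start at 1.

```python
import tempfile
from pathlib import Path

def write_parts(parts, out_dir, base_name):
    """Write each part to <base_name>_<n>.txt, with n starting at 1."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # enumerate plays the role of the per-iteration counter (%{a} in the thread).
    for n, part in enumerate(parts, start=1):
        path = out_dir / f"{base_name}_{n}.txt"
        path.write_text(part, encoding="utf-8")
        written.append(path)
    return written

demo_dir = tempfile.mkdtemp()  # hypothetical target directory
paths = write_parts(["first part", "second part", "third part"], demo_dir, "report")
print([p.name for p in paths])
```

Because the counter comes from the loop itself, no separate macro has to be initialised or incremented by hand.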

  • In777 Member Posts: 29 Contributor II

    Hi Martin,

     

    I used Loop Collection and the macro as you suggested, but my process still does not work. I cannot figure out why. Could you help?

    <?xml version="1.0" encoding="UTF-8"?><process version="7.3.000">
    <context>
    <input/>
    <output/>
    <macros>
    <macro>
    <key>window_length</key>
    <value>5</value>
    </macro>
    </macros>
    </context>
    <operator activated="true" class="process" compatibility="7.3.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="loop_files" compatibility="7.3.000" expanded="true" height="82" name="Loop Files" width="90" x="313" y="85">
    <parameter key="directory" value="D:\Reports"/>
    <process expanded="true">
    <operator activated="true" class="text:read_document" compatibility="7.3.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="85">
    <parameter key="file" value="D:\Reports_txt\A2A_2008_SR.txt"/>
    </operator>
    <operator activated="true" class="text:tokenize" compatibility="7.3.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="85"/>
    <operator activated="true" class="text:aggregate_token_length" compatibility="7.3.000" expanded="true" height="68" name="Aggregate Token Length" width="90" x="313" y="85">
    <parameter key="aggregation" value="count"/>
    </operator>
    <operator activated="true" class="subprocess" compatibility="7.3.000" expanded="true" height="103" name="Subprocess" width="90" x="447" y="85">
    <process expanded="true">
    <operator activated="true" class="multiply" compatibility="7.3.000" expanded="true" height="103" name="Multiply" width="90" x="45" y="34"/>
    <operator activated="true" class="text:documents_to_data" compatibility="7.3.000" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="85">
    <parameter key="text_attribute" value="text"/>
    </operator>
    <operator activated="true" class="extract_macro" compatibility="7.3.000" expanded="true" height="68" name="Extract Macro" width="90" x="313" y="187">
    <parameter key="macro" value="number_of_tokens"/>
    <parameter key="macro_type" value="data_value"/>
    <parameter key="attribute_name" value="token_length"/>
    <parameter key="example_index" value="1"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="generate_macro" compatibility="7.3.000" expanded="true" height="82" name="Generate Macro" width="90" x="447" y="187">
    <list key="function_descriptions">
    <parameter key="window_length" value="replace(str(eval(%{number_of_tokens})/10),&quot;.0&quot;,&quot;&quot;)"/>
    </list>
    </operator>
    <connect from_port="in 1" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_port="out 1"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Documents to Data" to_port="documents 1"/>
    <connect from_op="Documents to Data" from_port="example set" to_op="Extract Macro" to_port="example set"/>
    <connect from_op="Extract Macro" from_port="example set" to_op="Generate Macro" to_port="through 1"/>
    <connect from_op="Generate Macro" from_port="through 1" to_port="out 2"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    <portSpacing port="sink_out 3" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:window_document" compatibility="7.3.000" expanded="true" height="68" name="Window Document (2)" width="90" x="514" y="238">
    <parameter key="window_length" value="%{window_length}"/>
    <parameter key="step_size" value="%{window_length}"/>
    <process expanded="true">
    <connect from_port="segment" to_port="document 1"/>
    <portSpacing port="source_segment" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="loop_collection" compatibility="7.3.000" expanded="true" height="82" name="Loop Collection" width="90" x="648" y="238">
    <process expanded="true">
    <operator activated="true" class="set_macro" compatibility="7.3.000" expanded="true" height="82" name="Set Macro (2)" width="90" x="313" y="34">
    <parameter key="macro" value="filenumber"/>
    <parameter key="value" value="0"/>
    </operator>
    <operator activated="true" class="generate_macro" compatibility="6.0.002" expanded="true" height="82" name="Generate Macro (2)" width="90" x="447" y="34">
    <list key="function_descriptions">
    <parameter key="filenumber" value="%{filenumber}+1"/>
    </list>
    </operator>
    <operator activated="true" class="text:write_document" compatibility="7.3.000" expanded="true" height="82" name="Write Document" width="90" x="581" y="34">
    <parameter key="file" value="D:\Reports_parts_txt\%{file_name}_%{filenumber}.txt"/>
    </operator>
    <connect from_port="single" to_op="Set Macro (2)" to_port="through 1"/>
    <connect from_op="Set Macro (2)" from_port="through 1" to_op="Generate Macro (2)" to_port="through 1"/>
    <connect from_op="Generate Macro (2)" from_port="through 1" to_op="Write Document" to_port="document"/>
    <connect from_op="Write Document" from_port="document" to_port="output 1"/>
    <portSpacing port="source_single" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <connect from_port="file object" to_op="Read Document" to_port="file"/>
    <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Aggregate Token Length" to_port="document"/>
    <connect from_op="Aggregate Token Length" from_port="document" to_op="Subprocess" to_port="in 1"/>
    <connect from_op="Subprocess" from_port="out 1" to_op="Window Document (2)" to_port="document"/>
    <connect from_op="Window Document (2)" from_port="documents" to_op="Loop Collection" to_port="collection"/>
    <connect from_op="Loop Collection" from_port="output 1" to_port="out 1"/>
    <portSpacing port="source_file object" spacing="0"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Loop Files" from_port="out 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
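
One detail in the process above may be worth checking: Set Macro (2) sits inside Loop Collection, so filenumber appears to be set back to 0 on every iteration before Generate Macro (2) adds 1. If that reading is right, every part would receive the same suffix and later parts would overwrite earlier ones. The Python sketch below shows the general pattern only; it is not RapidMiner code, and the function names are hypothetical.

```python
def suffixes_reset_inside(parts):
    names = []
    for _ in parts:
        filenumber = 0   # reset inside the loop, like Set Macro (2)
        filenumber += 1  # incremented afterwards, like Generate Macro (2)
        names.append(f"part_{filenumber}.txt")
    return names

def suffixes_reset_outside(parts):
    names = []
    filenumber = 0       # initialised once, before the loop starts
    for _ in parts:
        filenumber += 1
        names.append(f"part_{filenumber}.txt")
    return names

parts = ["a", "b", "c"]
print(suffixes_reset_inside(parts))
print(suffixes_reset_outside(parts))
```

The first function yields the same name for every part, while the second yields a distinct name per part.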
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,453 RM Data Scientist

    Hi In777,

    I think the attached process should work. I have not tested it, though.

     

    ~Martin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.2.002">
    <context>
    <input/>
    <output/>
    <macros>
    <macro>
    <key>window_length</key>
    <value>5</value>
    </macro>
    </macros>
    </context>
    <operator activated="true" class="process" compatibility="7.2.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="loop_files" compatibility="7.2.002" expanded="true" height="82" name="Loop Files" width="90" x="112" y="34">
    <parameter key="directory" value="D:\Reports"/>
    <process expanded="true">
    <operator activated="true" class="text:read_document" compatibility="7.2.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="85">
    <parameter key="file" value="D:\Reports_txt\A2A_2008_SR.txt"/>
    </operator>
    <operator activated="true" class="text:tokenize" compatibility="7.2.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="85"/>
    <operator activated="true" class="text:aggregate_token_length" compatibility="7.2.000" expanded="true" height="68" name="Aggregate Token Length" width="90" x="313" y="85">
    <parameter key="aggregation" value="count"/>
    </operator>
    <operator activated="true" class="subprocess" compatibility="7.2.002" expanded="true" height="103" name="Subprocess" width="90" x="447" y="85">
    <process expanded="true">
    <operator activated="true" class="multiply" compatibility="7.2.002" expanded="true" height="103" name="Multiply" width="90" x="45" y="34"/>
    <operator activated="true" class="text:documents_to_data" compatibility="7.2.000" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="85">
    <parameter key="text_attribute" value="text"/>
    </operator>
    <operator activated="true" class="extract_macro" compatibility="7.2.002" expanded="true" height="68" name="Extract Macro" width="90" x="313" y="187">
    <parameter key="macro" value="number_of_tokens"/>
    <parameter key="macro_type" value="data_value"/>
    <parameter key="attribute_name" value="token_length"/>
    <parameter key="example_index" value="1"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="generate_macro" compatibility="7.2.002" expanded="true" height="82" name="Generate Macro" width="90" x="447" y="187">
    <list key="function_descriptions">
    <parameter key="window_length" value="replace(str(eval(%{number_of_tokens})/10),&quot;.0&quot;,&quot;&quot;)"/>
    </list>
    </operator>
    <connect from_port="in 1" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_port="out 1"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Documents to Data" to_port="documents 1"/>
    <connect from_op="Documents to Data" from_port="example set" to_op="Extract Macro" to_port="example set"/>
    <connect from_op="Extract Macro" from_port="example set" to_op="Generate Macro" to_port="through 1"/>
    <connect from_op="Generate Macro" from_port="through 1" to_port="out 2"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    <portSpacing port="sink_out 3" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:window_document" compatibility="7.2.000" expanded="true" height="68" name="Window Document (2)" width="90" x="514" y="238">
    <parameter key="window_length" value="%{window_length}"/>
    <parameter key="step_size" value="%{window_length}"/>
    <process expanded="true">
    <connect from_port="segment" to_port="document 1"/>
    <portSpacing port="source_segment" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="loop_collection" compatibility="7.2.002" expanded="true" height="82" name="Loop Collection" width="90" x="648" y="238">
    <process expanded="true">
    <operator activated="false" class="set_macro" compatibility="7.2.002" expanded="true" height="68" name="Set Macro (2)" width="90" x="313" y="34">
    <parameter key="macro" value="filenumber"/>
    <parameter key="value" value="0"/>
    </operator>
    <operator activated="false" class="generate_macro" compatibility="6.0.002" expanded="true" height="68" name="Generate Macro (2)" width="90" x="447" y="34">
    <list key="function_descriptions">
    <parameter key="filenumber" value="%{filenumber}+1"/>
    </list>
    </operator>
    <operator activated="true" class="text:write_document" compatibility="7.2.000" expanded="true" height="82" name="Write Document" width="90" x="581" y="34">
    <parameter key="file" value="D:\Reports_parts_txt\%{file_name}_%{a}.txt"/>
    </operator>
    <connect from_port="single" to_op="Write Document" to_port="document"/>
    <connect from_op="Write Document" from_port="document" to_port="output 1"/>
    <portSpacing port="source_single" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <connect from_port="file object" to_op="Read Document" to_port="file"/>
    <connect from_op="Read Document" from_port="output" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Aggregate Token Length" to_port="document"/>
    <connect from_op="Aggregate Token Length" from_port="document" to_op="Subprocess" to_port="in 1"/>
    <connect from_op="Subprocess" from_port="out 1" to_op="Window Document (2)" to_port="document"/>
    <connect from_op="Window Document (2)" from_port="documents" to_op="Loop Collection" to_port="collection"/>
    <connect from_op="Loop Collection" from_port="output 1" to_port="out 1"/>
    <portSpacing port="source_file object" spacing="0"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Loop Files" from_port="out 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
  • In777 Member Posts: 29 Contributor II

    As always, thank you for the quick help. Unfortunately, the process does not work correctly. At the "Extract Macro" operator I get the error message "The attribute "token length" is missing in the input example set". In addition, at the "Window Document" operator I get the error message "Supply value for step size". Do you have any ideas how to fix it?

     

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,453 RM Data Scientist

    Hi,

     

    This sounds odd, because the attached process, which is pretty similar except for the Loop Files operator, works well. Are you sure that your loop contains only parsable files?

     

    ~Martin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
    <context>
    <input/>
    <output/>
    <macros>
    <macro>
    <key>window_length</key>
    <value>5</value>
    </macro>
    </macros>
    </context>
    <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="false" class="loop_files" compatibility="7.3.001" expanded="true" height="68" name="Loop Files" width="90" x="45" y="391">
    <parameter key="directory" value="D:\Reports"/>
    <process expanded="true">
    <operator activated="false" class="text:read_document" compatibility="7.3.000" expanded="true" height="68" name="Read Document" width="90" x="246" y="340">
    <parameter key="file" value="D:\Reports_txt\A2A_2008_SR.txt"/>
    </operator>
    <portSpacing port="source_file object" spacing="0"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:create_document" compatibility="7.3.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="85">
    <parameter key="text" value="Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."/>
    </operator>
    <operator activated="true" class="text:write_document" compatibility="7.3.000" expanded="true" height="82" name="Write Document (2)" width="90" x="45" y="238">
    <parameter key="file" value="C:\Users\Martin\LoremIpsum.txt"/>
    </operator>
    <operator activated="false" class="loop_collection" compatibility="7.3.001" expanded="true" height="82" name="Loop Collection" width="90" x="782" y="85">
    <process expanded="true">
    <operator activated="false" class="set_macro" compatibility="7.3.001" expanded="true" height="68" name="Set Macro (2)" width="90" x="313" y="34">
    <parameter key="macro" value="filenumber"/>
    <parameter key="value" value="0"/>
    </operator>
    <operator activated="false" class="generate_macro" compatibility="6.0.002" expanded="true" height="68" name="Generate Macro (2)" width="90" x="447" y="34">
    <list key="function_descriptions">
    <parameter key="filenumber" value="%{filenumber}+1"/>
    </list>
    </operator>
    <operator activated="true" class="text:write_document" compatibility="7.3.000" expanded="true" height="82" name="Write Document" width="90" x="581" y="34">
    <parameter key="file" value="D:\Reports_parts_txt\%{file_name}_%{a}.txt"/>
    </operator>
    <connect from_port="single" to_op="Write Document" to_port="document"/>
    <connect from_op="Write Document" from_port="document" to_port="output 1"/>
    <portSpacing port="source_single" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" breakpoints="before" class="text:read_document" compatibility="7.3.000" expanded="true" height="68" name="Read Document (2)" width="90" x="179" y="238">
    <parameter key="file" value="C:\Users\Martin\LoremIpsum.txt"/>
    </operator>
    <operator activated="true" class="text:tokenize" compatibility="7.3.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="85"/>
    <operator activated="true" class="text:aggregate_token_length" compatibility="7.3.000" expanded="true" height="68" name="Aggregate Token Length" width="90" x="313" y="85">
    <parameter key="aggregation" value="count"/>
    </operator>
    <operator activated="true" class="subprocess" compatibility="7.3.001" expanded="true" height="103" name="Subprocess" width="90" x="447" y="85">
    <process expanded="true">
    <operator activated="true" class="multiply" compatibility="7.3.001" expanded="true" height="103" name="Multiply" width="90" x="45" y="34"/>
    <operator activated="true" class="text:documents_to_data" compatibility="7.3.000" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="85">
    <parameter key="text_attribute" value="text"/>
    </operator>
    <operator activated="true" class="extract_macro" compatibility="7.3.001" expanded="true" height="68" name="Extract Macro" width="90" x="313" y="187">
    <parameter key="macro" value="number_of_tokens"/>
    <parameter key="macro_type" value="data_value"/>
    <parameter key="attribute_name" value="token_length"/>
    <parameter key="example_index" value="1"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="generate_macro" compatibility="7.3.001" expanded="true" height="82" name="Generate Macro" width="90" x="447" y="187">
    <list key="function_descriptions">
    <parameter key="window_length" value="replace(str(eval(%{number_of_tokens})/10),&quot;.0&quot;,&quot;&quot;)"/>
    </list>
    </operator>
    <connect from_port="in 1" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_port="out 1"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Documents to Data" to_port="documents 1"/>
    <connect from_op="Documents to Data" from_port="example set" to_op="Extract Macro" to_port="example set"/>
    <connect from_op="Extract Macro" from_port="example set" to_op="Generate Macro" to_port="through 1"/>
    <connect from_op="Generate Macro" from_port="through 1" to_port="out 2"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    <portSpacing port="sink_out 3" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:window_document" compatibility="7.3.000" expanded="true" height="68" name="Window Document (2)" width="90" x="648" y="85">
    <parameter key="window_length" value="%{window_length}"/>
    <parameter key="step_size" value="%{window_length}"/>
    <process expanded="true">
    <connect from_port="segment" to_port="document 1"/>
    <portSpacing port="source_segment" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Create Document" from_port="output" to_op="Write Document (2)" to_port="document"/>
    <connect from_op="Read Document (2)" from_port="output" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Aggregate Token Length" to_port="document"/>
    <connect from_op="Aggregate Token Length" from_port="document" to_op="Subprocess" to_port="in 1"/>
    <connect from_op="Subprocess" from_port="out 1" to_op="Window Document (2)" to_port="document"/>
    <connect from_op="Window Document (2)" from_port="documents" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
  • In777 Member Posts: 29 Contributor II

    Hi Martin,

     

    I still cannot fix the problem with cutting the documents. I tried to create a txt file with a simple text ("My name is Mary") using the last process you posted here. I got the same error as before. However, if I use LoremIpsum.txt, the process works fine. I cannot understand why. Do you have a solution?
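
A possible explanation, offered as an assumption rather than a confirmed diagnosis: the Generate Macro expression in the posted processes divides the token count by 10 and only strips ".0" from the result. The Lorem ipsum text has 100 whitespace-separated words, so the division gives a whole number, while "My name is Mary" has only 4. The Python sketch below imitates that expression; Python's `str` and `replace` stand in for the RapidMiner functions, which may format numbers differently, and `window_length_rounded` is a hypothetical alternative, not something taken from the thread.

```python
import math

# The sample text from the posted process, repeated twice as in the XML.
lorem_half = (
    "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy "
    "eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam "
    "voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet "
    "clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."
)
lorem = lorem_half + " " + lorem_half

def window_length_macro(number_of_tokens, parts=10):
    # Imitates replace(str(eval(n)/10), ".0", "") from the process.
    return str(number_of_tokens / parts).replace(".0", "")

def window_length_rounded(number_of_tokens, parts=10):
    # Hypothetical alternative: round up so the length is a whole number >= 1.
    return max(1, math.ceil(number_of_tokens / parts))

print(len(lorem.split()))
print(window_length_macro(len(lorem.split())))
print(window_length_macro(len("My name is Mary".split())))
print(window_length_rounded(len("My name is Mary".split())))
```

If RapidMiner formats the result in a similar way, a document whose token count is not a multiple of 10 would yield a non-integer window length. That could relate to the error reported at the Window Document operator, though it would not by itself account for the missing attribute reported at Extract Macro.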
