Cut a document into equal pieces
Hello,
How can I cut my documents into equal parts (e.g. 10 parts) and save them as separate documents? I cannot use regex (and therefore not the Cut Document operator), since I have several documents of different formats. I tried the window operator, but I do not quite understand how to use it. Thank you in advance for any help!
Best Answer
MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,529 RM Data Scientist
Hi ln777,
the Window Document operator should do the job. You might need to extract the number of tokens for every document first. For an example, please have a look at the attached process.
Best,
Martin
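The core idea of the attached process, sketched here in Python for clarity (the function name and tokenizer are illustrative, not part of RapidMiner): tokenize the document, compute the window length as the token count divided by the desired number of parts, then slice the token stream into contiguous windows.

```python
# Illustrative sketch (not the RapidMiner process itself): split a
# document's tokens into a fixed number of contiguous, roughly equal parts.
def split_into_parts(text, n_parts=10):
    tokens = text.split()                     # simple whitespace tokenizer
    window = max(len(tokens) // n_parts, 1)   # mirrors number_of_tokens / 10
    # step through the token list in window-sized strides; if the token
    # count is not divisible by n_parts, the remainder forms a shorter tail
    return [" ".join(tokens[i:i + window])
            for i in range(0, len(tokens), window)]

parts = split_into_parts("one two three four five six seven eight nine ten",
                         n_parts=5)
```

In the process below, the same computation is done with Extract Macro (token count) and Generate Macro (window length), which then parameterize the Window Document operator.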
<?xml version="1.0" encoding="UTF-8"?><process version="7.3.000">
<context>
<input/>
<output/>
<macros>
<macro>
<key>window_length</key>
<value>5</value>
</macro>
</macros>
</context>
<operator activated="true" class="process" compatibility="7.3.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:create_document" compatibility="7.2.001" expanded="true" height="68" name="Create Document" width="90" x="112" y="34">
<parameter key="text" value="Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="7.2.001" expanded="true" height="68" name="Tokenize" width="90" x="313" y="34"/>
<operator activated="false" class="text:process_documents" compatibility="7.2.001" expanded="true" height="82" name="Process Documents" width="90" x="380" y="238">
<process expanded="true">
<operator activated="false" class="text:window_document" compatibility="7.2.001" expanded="true" height="68" name="Window Document" width="90" x="112" y="34">
<parameter key="step_size" value="10"/>
<process expanded="true">
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
</process>
</operator>
<connect from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:aggregate_token_length" compatibility="7.2.001" expanded="true" height="68" name="Aggregate Token Length" width="90" x="447" y="34">
<parameter key="aggregation" value="count"/>
</operator>
<operator activated="true" breakpoints="after" class="subprocess" compatibility="7.3.000" expanded="true" height="103" name="Subprocess" width="90" x="581" y="34">
<process expanded="true">
<operator activated="true" class="multiply" compatibility="7.3.000" expanded="true" height="103" name="Multiply" width="90" x="45" y="34"/>
<operator activated="true" class="text:documents_to_data" compatibility="7.2.001" expanded="true" height="82" name="Documents to Data" width="90" x="179" y="85">
<parameter key="text_attribute" value="text"/>
</operator>
<operator activated="true" class="extract_macro" compatibility="7.3.000" expanded="true" height="68" name="Extract Macro" width="90" x="313" y="187">
<parameter key="macro" value="number_of_tokens"/>
<parameter key="macro_type" value="data_value"/>
<parameter key="attribute_name" value="token_length"/>
<parameter key="example_index" value="1"/>
<list key="additional_macros"/>
</operator>
<operator activated="true" class="generate_macro" compatibility="7.3.000" expanded="true" height="82" name="Generate Macro" width="90" x="447" y="187">
<list key="function_descriptions">
<parameter key="window_length" value="replace(str(eval(%{number_of_tokens})/10),&quot;.0&quot;,&quot;&quot;)"/>
</list>
</operator>
<connect from_port="in 1" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_port="out 1"/>
<connect from_op="Multiply" from_port="output 2" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Documents to Data" from_port="example set" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Extract Macro" from_port="example set" to_op="Generate Macro" to_port="through 1"/>
<connect from_op="Generate Macro" from_port="through 1" to_port="out 2"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
<portSpacing port="sink_out 3" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:window_document" compatibility="7.2.001" expanded="true" height="68" name="Window Document (2)" width="90" x="782" y="34">
<parameter key="window_length" value="%{window_length}"/>
<parameter key="step_size" value="%{window_length}"/>
<process expanded="true">
<connect from_port="segment" to_port="document 1"/>
<portSpacing port="source_segment" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Aggregate Token Length" to_port="document"/>
<connect from_op="Aggregate Token Length" from_port="document" to_op="Subprocess" to_port="in 1"/>
<connect from_op="Subprocess" from_port="out 1" to_op="Window Document (2)" to_port="document"/>
<connect from_op="Window Document (2)" from_port="documents" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
Answers
Thank you for the process. It works nicely!
I also have a follow-up question to the process. If I put this process into Loop Files, how can I write all parts as separate .txt documents to one directory?
Hi,
the trick is the Loop Collection operator, which loops over the separate windows. A handy approach is to use the macro %{a} with the execution count as a postfix when storing each part.
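The loop-and-store step can be sketched as follows in Python (the function and file-naming scheme are illustrative; in RapidMiner this corresponds to Loop Collection plus Write Document with the execution-count macro in the file name):

```python
# Illustrative sketch: write each window to its own .txt file, using the
# loop counter as a postfix (the %{execution_count} / %{a} macro analogue).
from pathlib import Path
import tempfile

def write_parts(parts, out_dir, basename="part"):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for count, text in enumerate(parts, start=1):  # execution-count analogue
        (out / f"{basename}_{count}.txt").write_text(text, encoding="utf-8")

# usage: write two windows into a temporary directory
tmp = Path(tempfile.mkdtemp())
write_parts(["first window", "second window"], tmp)
written = sorted(p.name for p in tmp.glob("*.txt"))
```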
~Martin
Hi Martin,
I used loop collection and macro as you suggested, but still my process does not work. I cannot figure out why. Could you help?
Hi ln777,
I think the attached process should work. I have not tested it, though.
~Martin
As always, thank you for the quick help. Unfortunately, the process does not work correctly. With the "Extract Macro" operator I get the error message "The attribute "token length" is missing in the input example set." In addition, with the "Window Document" operator I get the error message "Supply a value for step size". Do you have any ideas how to fix this?
Hi,
this sounds odd, because the attached process, which is pretty similar except for the Loop Files part, works well. Are you sure that your loop only contains parsable files?
~Martin
Hi Martin,
I still cannot fix the problem with cutting the documents. I tried to create a .txt file with a simple text ("My name is Mary") using the last process you posted here. I got the same error as before. However, if I use LoremIpsum.txt, the process works fine. I cannot understand why. Do you have a solution?