"writing a collection of documents"

mohammadreza Member Posts: 23 Contributor II
edited June 2019 in Help
Hi all,

I read an XML file in my process and convert it to a collection of documents in memory. Now I need to write each document as a separate file. Is there any way to do that? (I can think of using "Write Document" in a loop, but I can't figure out the right way to do that.)

Best

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,505 RM Data Scientist
    Hi mohammedreza,

    What about using either Documents to Data or Combine Documents first?
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • mohammadreza Member Posts: 23 Contributor II
    Hi Martin,

    The data is already combined in one big XML file, so I am trying to break it down into several files and write them. The only remaining part is writing the document collection (which is in memory) to the hard drive. Here is my process so far:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.013">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="read_xml" compatibility="5.3.013" expanded="true" height="60" name="Read XML" width="90" x="45" y="30">
           <parameter key="file" value="C:\home\ebrahimi\Anomaly\seg10Train.xml"/>
           <parameter key="xpath_for_examples" value="conversations/conversation"/>
           <enumeration key="xpaths_for_attributes">
             <parameter key="xpath_for_attribute" value="@id"/>
             <parameter key="xpath_for_attribute" value="message/text"/>
             <parameter key="xpath_for_attribute" value="message/author"/>
           </enumeration>
           <list key="namespaces"/>
           <list key="annotations"/>
           <list key="data_set_meta_data_information"/>
         </operator>
         <operator activated="true" class="text:data_to_documents" compatibility="5.3.002" expanded="true" height="60" name="Data to Documents" width="90" x="179" y="30">
           <parameter key="select_attributes_and_weights" value="true"/>
           <list key="specify_weights">
             <parameter key="tex" value="1.0"/>
           </list>
         </operator>
         <operator activated="true" class="text:filter_documents_by_content" compatibility="5.3.002" expanded="true" height="76" name="Filter Documents (by Content)" width="90" x="313" y="30">
           <parameter key="condition" value="contains match"/>
           <parameter key="regular_expression" value="&lt;/text&gt;&lt;text&gt;"/>
         </operator>
         <operator activated="true" class="loop_collection" compatibility="5.3.013" expanded="true" height="76" name="Loop Collection" width="90" x="447" y="30">
           <process expanded="true">
             <operator activated="true" class="text:write_document" compatibility="5.3.002" expanded="true" height="76" name="Write Document" width="90" x="112" y="30"/>
             <connect from_port="single" to_op="Write Document" to_port="document"/>
             <connect from_op="Write Document" from_port="document" to_port="output 1"/>
             <portSpacing port="source_single" spacing="0"/>
             <portSpacing port="sink_output 1" spacing="0"/>
             <portSpacing port="sink_output 2" spacing="0"/>
           </process>
         </operator>
         <connect from_op="Read XML" from_port="output" to_op="Data to Documents" to_port="example set"/>
         <connect from_op="Data to Documents" from_port="documents" to_op="Filter Documents (by Content)" to_port="documents 1"/>
         <connect from_op="Filter Documents (by Content)" from_port="documents" to_op="Loop Collection" to_port="collection"/>
         <connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
    Thanks in advance
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    This is a very simple example. The trick is to give the Write Document operator a filename and to set that filename using macros.

    It is simple because it just uses the loop operator's iteration number as the filename. I would recommend using either Extract Macro or Extract Macro from Annotation to get the name you'd like each file saved under.
    You might want to try the ID, the author, or a combination of the two.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="read_xml" compatibility="6.0.003" expanded="true" height="60" name="Read XML" width="90" x="45" y="30">
            <parameter key="file" value="C:\home\ebrahimi\Anomaly\seg10Train.xml"/>
            <parameter key="xpath_for_examples" value="conversations/conversation"/>
            <enumeration key="xpaths_for_attributes">
              <parameter key="xpath_for_attribute" value="@id"/>
              <parameter key="xpath_for_attribute" value="message/text"/>
              <parameter key="xpath_for_attribute" value="message/author"/>
            </enumeration>
            <list key="namespaces"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <operator activated="true" class="text:data_to_documents" compatibility="6.4.001" expanded="true" height="60" name="Data to Documents" width="90" x="179" y="30">
            <parameter key="select_attributes_and_weights" value="true"/>
            <list key="specify_weights">
              <parameter key="tex" value="1.0"/>
            </list>
          </operator>
          <operator activated="true" class="text:filter_documents_by_content" compatibility="6.4.001" expanded="true" height="76" name="Filter Documents (by Content)" width="90" x="313" y="30">
            <parameter key="condition" value="contains match"/>
            <parameter key="regular_expression" value="&lt;/text&gt;&lt;text&gt;"/>
          </operator>
          <operator activated="true" class="loop_collection" compatibility="6.4.000" expanded="true" height="76" name="Loop Collection" width="90" x="447" y="30">
            <parameter key="set_iteration_macro" value="true"/>
            <process expanded="true">
              <operator activated="true" class="text:write_document" compatibility="6.4.001" expanded="true" height="76" name="Write Document" width="90" x="179" y="30">
                <parameter key="file" value="C:\home\ebrahimi\Anomaly\Output\%{iteration}.txt"/>
              </operator>
              <connect from_port="single" to_op="Write Document" to_port="document"/>
              <connect from_op="Write Document" from_port="document" to_port="output 1"/>
              <portSpacing port="source_single" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read XML" from_port="output" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Data to Documents" from_port="documents" to_op="Filter Documents (by Content)" to_port="documents 1"/>
          <connect from_op="Filter Documents (by Content)" from_port="documents" to_op="Loop Collection" to_port="collection"/>
          <connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • mohammadreza Member Posts: 23 Contributor II
    Hi Edward,

    Thanks. As you correctly mentioned, I need to save each file under its own name (id). Following your explanation (using Extract Macro), I came up with the process below, but I do not know what to choose for the "example index" parameter of the "Extract Macro" operator so that the macro holds the "id" of each file.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.013">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="read_xml" compatibility="5.3.013" expanded="true" height="60" name="Read XML" width="90" x="45" y="30">
           <parameter key="file" value="C:\home\ebrahimi\Anomaly\seg10Train.xml"/>
           <parameter key="xpath_for_examples" value="conversations/conversation"/>
           <enumeration key="xpaths_for_attributes">
             <parameter key="xpath_for_attribute" value="@id"/>
             <parameter key="xpath_for_attribute" value="message/text"/>
             <parameter key="xpath_for_attribute" value="message/author"/>
           </enumeration>
           <list key="namespaces"/>
           <list key="annotations"/>
           <list key="data_set_meta_data_information"/>
         </operator>
         <operator activated="true" class="text:data_to_documents" compatibility="5.3.002" expanded="true" height="60" name="Data to Documents" width="90" x="179" y="30">
           <parameter key="select_attributes_and_weights" value="true"/>
           <list key="specify_weights">
             <parameter key="tex" value="1.0"/>
           </list>
         </operator>
         <operator activated="true" class="text:filter_documents_by_content" compatibility="5.3.002" expanded="true" height="76" name="Filter Documents (by Content)" width="90" x="313" y="30">
           <parameter key="condition" value="contains match"/>
           <parameter key="regular_expression" value="&lt;/text&gt;&lt;text&gt;"/>
         </operator>
         <operator activated="true" class="extract_macro" compatibility="5.3.013" expanded="true" height="60" name="Extract Macro" width="90" x="447" y="30">
           <parameter key="macro_type" value="data_value"/>
           <list key="additional_macros"/>
         </operator>
         <operator activated="true" class="loop_collection" compatibility="5.3.013" expanded="true" height="76" name="Loop Collection" width="90" x="648" y="30">
           <parameter key="set_iteration_macro" value="true"/>
           <process expanded="true">
             <operator activated="true" class="text:write_document" compatibility="5.3.002" expanded="true" height="76" name="Write Document" width="90" x="179" y="30">
               <parameter key="file" value="C:\home\ebrahimi\Anomaly\Output\%{mine}.txt"/>
             </operator>
             <connect from_port="single" to_op="Write Document" to_port="document"/>
             <connect from_op="Write Document" from_port="document" to_port="output 1"/>
             <portSpacing port="source_single" spacing="0"/>
             <portSpacing port="sink_output 1" spacing="0"/>
             <portSpacing port="sink_output 2" spacing="0"/>
           </process>
         </operator>
         <connect from_op="Read XML" from_port="output" to_op="Data to Documents" to_port="example set"/>
         <connect from_op="Data to Documents" from_port="documents" to_op="Filter Documents (by Content)" to_port="documents 1"/>
         <connect from_op="Filter Documents (by Content)" from_port="documents" to_op="Extract Macro" to_port="example set"/>
         <connect from_op="Extract Macro" from_port="example set" to_op="Loop Collection" to_port="collection"/>
         <connect from_op="Loop Collection" from_port="output 1" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,505 RM Data Scientist
    Hi,

    Extract Macro can only be applied to example sets, not to document collections. So you might wrap everything in one big Loop Examples and extract the macro inside it, before converting the current example to a document.

    Best
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
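
    A minimal sketch of the Loop Examples idea, to be placed directly after Read XML in place of the outer Data to Documents / Loop Collection chain (the Filter Documents (by Content) step is left out for brevity). Several things here are assumptions rather than anything confirmed in this thread: the attribute holding the conversation id is taken to be called "id" (check the name Read XML actually generates), the operator classes and port names for Loop Examples, Filter Example Range and Combine Documents are written from memory, and the compatibility versions are simply copied from the processes above. The macro name "mine" matches the %{mine} already used in the Write Document file parameter.

    ```xml
    <operator activated="true" class="loop_examples" compatibility="5.3.013" expanded="true" name="Loop Examples">
      <!-- The current row number is published in this macro on every iteration -->
      <parameter key="iteration_macro" value="example"/>
      <process expanded="true">
        <!-- Read the id of the current row into %{mine}; "id" is an assumed attribute name -->
        <operator activated="true" class="extract_macro" compatibility="5.3.013" expanded="true" name="Extract Macro">
          <parameter key="macro" value="mine"/>
          <parameter key="macro_type" value="data_value"/>
          <parameter key="attribute_name" value="id"/>
          <parameter key="example_index" value="%{example}"/>
          <list key="additional_macros"/>
        </operator>
        <!-- Keep only the current row, so one document is produced per iteration -->
        <operator activated="true" class="filter_example_range" compatibility="5.3.013" expanded="true" name="Filter Example Range">
          <parameter key="first_example" value="%{example}"/>
          <parameter key="last_example" value="%{example}"/>
        </operator>
        <!-- Same settings as the Data to Documents operator in the processes above -->
        <operator activated="true" class="text:data_to_documents" compatibility="5.3.002" expanded="true" name="Data to Documents">
          <parameter key="select_attributes_and_weights" value="true"/>
          <list key="specify_weights">
            <parameter key="tex" value="1.0"/>
          </list>
        </operator>
        <!-- Turn the one-element collection into a single document for Write Document -->
        <operator activated="true" class="text:combine_documents" compatibility="5.3.002" expanded="true" name="Combine Documents"/>
        <operator activated="true" class="text:write_document" compatibility="5.3.002" expanded="true" name="Write Document">
          <parameter key="file" value="C:\home\ebrahimi\Anomaly\Output\%{mine}.txt"/>
        </operator>
        <connect from_port="example set" to_op="Extract Macro" to_port="example set"/>
        <connect from_op="Extract Macro" from_port="example set" to_op="Filter Example Range" to_port="example set input"/>
        <connect from_op="Filter Example Range" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
        <!-- Hand the unfiltered example set back to the loop for the next iteration -->
        <connect from_op="Filter Example Range" from_port="original" to_port="example set"/>
        <connect from_op="Data to Documents" from_port="documents" to_op="Combine Documents" to_port="documents 1"/>
        <connect from_op="Combine Documents" from_port="document" to_op="Write Document" to_port="document"/>
        <connect from_op="Write Document" from_port="document" to_port="output 1"/>
      </process>
    </operator>
    ```

    The point for the "example index" question is the value %{example}: inside Loop Examples it resolves to the number of the row currently being processed, so Extract Macro picks up a different id on each pass and Write Document gets a different filename each time. If the import into RapidMiner complains about a port name, rebuilding the same chain by hand in the GUI should be straightforward.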
  • In777In777 Member Posts: 29 Contributor II