"Any tips on optimizing the Read XML operator?"

JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
edited June 2019 in Help
Hi,

I have a rather lengthy process that at one point reads an XML file using the Read XML operator, and I've found that this is a bottleneck in execution speed.
The XML file has only 1,000 records in total, each with 20 attributes, of which the operator extracts just 4. Yet it takes around 1 minute 30 seconds to run each time. (That doesn't sound like much, but the process is going to loop over several hundred of these files.)

Are there any tips on speeding up execution time of this operator? 
Would it help if I turned off Parse Numbers or Read not matching values as missings, or changed data management from double_array to a different value?

Answers

  • fras Member Posts: 93 Contributor II
    This is a known issue with the current implementation of the Read XML operator.
    One possible workaround is to split the XML into several pieces and read each piece separately.
    The following process is only an example and will not fit your needs as is:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.0.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="subprocess" compatibility="6.0.003" expanded="true" height="76" name="read XML" width="90" x="45" y="30">
            <process expanded="true">
              <operator activated="true" class="text:read_document" compatibility="5.3.002" expanded="true" height="60" name="Read Document (3)" width="90" x="45" y="30">
                <parameter key="file" value="%{path}%{file}"/>
                <parameter key="extract_text_only" value="false"/>
                <parameter key="encoding" value="UTF-8"/>
              </operator>
              <operator activated="true" class="text:replace_tokens" compatibility="5.3.002" expanded="true" height="60" name="Replace Tokens (3)" width="90" x="180" y="30">
                <list key="replace_dictionary">
                  <parameter key="(&lt;Family )" value="&lt;&lt;&gt;&gt;$1"/>
                </list>
              </operator>
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (3)" width="90" x="313" y="30">
                <parameter key="mode" value="regular expression"/>
                <parameter key="expression" value="&lt;&lt;&gt;&gt;"/>
              </operator>
              <operator activated="true" class="text:window_document" compatibility="5.3.002" expanded="true" height="60" name="Window Document (3)" width="90" x="450" y="30">
                <parameter key="window_length" value="1"/>
                <process expanded="true">
                  <connect from_port="segment" to_port="document 1"/>
                  <portSpacing port="source_segment" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="loop_collection" compatibility="6.0.003" expanded="true" height="76" name="Loop Collection (3)" width="90" x="585" y="30">
                <parameter key="set_iteration_macro" value="true"/>
                <process expanded="true">
                  <operator activated="true" class="branch" compatibility="6.0.003" expanded="true" height="76" name="Branch (4)" width="90" x="313" y="30">
                    <parameter key="condition_type" value="expression"/>
                    <parameter key="condition_value" value="%{iteration}!=1"/>
                    <process expanded="true">
                      <operator activated="true" class="text:write_document" compatibility="5.3.002" expanded="true" height="76" name="Write Document (3)" width="90" x="45" y="30">
                        <parameter key="encoding" value="UTF-8"/>
                      </operator>
                      <operator activated="true" class="handle_exception" compatibility="6.0.003" expanded="true" height="94" name="Handle Exception (3)" width="90" x="185" y="39">
                        <process expanded="true">
                          <operator activated="true" class="read_xml" compatibility="6.0.003" expanded="true" height="60" name="Read XML" width="90" x="112" y="30">
                            <enumeration key="xpaths_for_attributes"/>
                            <list key="namespaces"/>
                            <list key="annotations"/>
                            <list key="data_set_meta_data_information"/>
                          </operator>
                          <connect from_port="in 1" to_op="Read XML" to_port="file"/>
                          <connect from_op="Read XML" from_port="output" to_port="out 1"/>
                          <portSpacing port="source_in 1" spacing="0"/>
                          <portSpacing port="source_in 2" spacing="0"/>
                          <portSpacing port="sink_out 1" spacing="0"/>
                          <portSpacing port="sink_out 2" spacing="0"/>
                          <portSpacing port="sink_out 3" spacing="0"/>
                        </process>
                        <process expanded="true">
                          <operator activated="true" class="print_to_console" compatibility="6.0.003" expanded="true" height="76" name="Print to Console (5)" width="90" x="124" y="39">
                            <parameter key="log_value" value="XML Error in element %{iteration}"/>
                          </operator>
                          <connect from_port="in 1" to_op="Print to Console (5)" to_port="through 1"/>
                          <connect from_op="Print to Console (5)" from_port="through 1" to_port="out 2"/>
                          <portSpacing port="source_in 1" spacing="0"/>
                          <portSpacing port="source_in 2" spacing="0"/>
                          <portSpacing port="sink_out 1" spacing="0"/>
                          <portSpacing port="sink_out 2" spacing="0"/>
                          <portSpacing port="sink_out 3" spacing="0"/>
                        </process>
                      </operator>
                      <connect from_port="condition" to_op="Write Document (3)" to_port="document"/>
                      <connect from_op="Write Document (3)" from_port="file" to_op="Handle Exception (3)" to_port="in 1"/>
                      <connect from_op="Handle Exception (3)" from_port="out 1" to_port="input 1"/>
                      <portSpacing port="source_condition" spacing="0"/>
                      <portSpacing port="source_input 1" spacing="0"/>
                      <portSpacing port="sink_input 1" spacing="0"/>
                      <portSpacing port="sink_input 2" spacing="0"/>
                    </process>
                    <process expanded="true">
                      <portSpacing port="source_condition" spacing="0"/>
                      <portSpacing port="source_input 1" spacing="0"/>
                      <portSpacing port="sink_input 1" spacing="0"/>
                      <portSpacing port="sink_input 2" spacing="0"/>
                    </process>
                  </operator>
                  <connect from_port="single" to_op="Branch (4)" to_port="condition"/>
                  <connect from_op="Branch (4)" from_port="input 1" to_port="output 1"/>
                  <portSpacing port="source_single" spacing="0"/>
                  <portSpacing port="sink_output 1" spacing="0"/>
                  <portSpacing port="sink_output 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="append" compatibility="6.0.003" expanded="true" height="76" name="Append" width="90" x="581" y="165"/>
              <connect from_op="Read Document (3)" from_port="output" to_op="Replace Tokens (3)" to_port="document"/>
              <connect from_op="Replace Tokens (3)" from_port="document" to_op="Tokenize (3)" to_port="document"/>
              <connect from_op="Tokenize (3)" from_port="document" to_op="Window Document (3)" to_port="document"/>
              <connect from_op="Window Document (3)" from_port="documents" to_op="Loop Collection (3)" to_port="collection"/>
              <connect from_op="Loop Collection (3)" from_port="output 1" to_op="Append" to_port="example set 1"/>
              <connect from_op="Append" from_port="merged set" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="read XML" from_port="out 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
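
For readers who find the exported process hard to follow, the same idea can be sketched outside RapidMiner. The Python below is only an illustration, not RapidMiner code: it marks every `<Family ` start tag with the `<<>>` delimiter, splits on it, drops the leading piece (the header before the first record), and parses each remaining piece on its own, noting any piece that fails to parse. The sample data, the `id` and `Name` fields, the function names, and the step that trims each piece at `</Family>` are all assumptions made for the example; the thread does not show the real file layout.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample; the real file layout is not shown in the thread.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<Families>
<Family id="1"><Name>Smith</Name></Family>
<Family id="2"><Name>Jones</Name></Family>
<Family id="3"><Name>Broken</Family>
</Families>"""

MARKER = "<<>>"  # same delimiter the example process inserts


def split_records(text, tag="Family"):
    """Return one text chunk per '<tag ' start tag found in text."""
    # Put the marker in front of every start tag (mirrors Replace Tokens).
    marked = re.sub(r"(<%s )" % re.escape(tag), MARKER + r"\1", text)
    # Split on the marker (mirrors Tokenize) and skip the first piece,
    # which holds the XML declaration and the root start tag.
    chunks = marked.split(MARKER)[1:]
    close = "</%s>" % tag
    trimmed = []
    for chunk in chunks:
        # Assumption: cut each chunk after its last closing record tag so
        # that a trailing root end tag does not make it ill-formed.
        end = chunk.rfind(close)
        trimmed.append(chunk[:end + len(close)] if end != -1 else chunk)
    return trimmed


def parse_records(text, tag="Family"):
    """Parse each chunk separately; collect rows and failed chunk numbers."""
    rows, errors = [], []
    for index, chunk in enumerate(split_records(text, tag), start=1):
        try:
            element = ET.fromstring(chunk)
        except ET.ParseError:
            # Mirrors Handle Exception: note the bad record and carry on.
            errors.append(index)
            continue
        rows.append({"id": element.get("id"), "name": element.findtext("Name")})
    return rows, errors


if __name__ == "__main__":
    rows, errors = parse_records(SAMPLE)
    print(rows)
    print("failed chunks:", errors)
```

As in the process, the pattern includes the trailing space after the tag name, so it only matches record start tags that carry attributes. Whether splitting actually shortens the run time for a given file is something to measure; the thread only offers it as a possible workaround.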

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Thanks Freo, that's a great help.

    :D