Split a single xml file into several docs or example set

mohammadreza · February 2015

Hi. I am new to the RapidMiner text plugin.

I have an XML file consisting of <document> elements. Each document tag contains one document as follows:

<documents>
    <document>
        <id> 1 </id>
        <text>...............</text>
    </document>
    <document>
        <id> 2 </id>
        <text>...............</text>
    </document>
    ...
</documents>

I think I have to split them first and extract documents to be able to construct the word vector. Is there any way to do that?

MartinLiebig · February 2015

Is there any reason not to use Read XML and convert the example set to documents afterwards?

mohammadreza · February 2015

Thanks Martin,

I think the Read XML operator is the wise option, but I need to do some text classification after that. That's why I wanted to work with documents through the text plugin. Assuming I use Read XML as you suggest, is there any way to work with the text plugin? I mean, how should I connect the output of Read XML to an operator like "Process Document", or any other operator that lets me do the tokenization and stemming and build the word vector?

Thanks

fras · February 2015

Hi, try this as a starting point:


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="6.1.000" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="75">
        <parameter key="text" value="&lt;documents&gt;&#10;    &lt;document&gt;&#10;        &lt;id&gt; 1 &lt;/id&gt;&#10;        &lt;text&gt; content_A &lt;/text&gt;&#10;    &lt;/document&gt;&#10;    &lt;document&gt;&#10;        &lt;id&gt; 2 &lt;/id&gt;&#10;        &lt;text&gt; content_B &lt;/text&gt;&#10;    &lt;/document&gt;&#10;    ...&#10;&lt;/documents&gt;"/>
        <parameter key="add label" value="true"/>
        <parameter key="label_value" value="SOURCE01"/>
      </operator>
      <operator activated="true" class="text:cut_document" compatibility="6.1.000" expanded="true" height="60" name="Cut Document (10)" width="90" x="112" y="165">
        <parameter key="query_type" value="Regular Region"/>
        <list key="string_machting_queries">
          <parameter key="empty" value="&lt;Family.&lt;/Family&gt;"/>
        </list>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries">
          <parameter key="empty" value="&lt;document.&lt;/document&gt;"/>
        </list>
        <list key="xpath_queries"/>
        <list key="namespaces"/>
        <list key="index_queries"/>
        <list key="jsonpath_queries"/>
        <process expanded="true">
          <connect from_port="segment" to_port="document 1"/>
          <portSpacing port="source_segment" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="loop_collection" compatibility="6.2.000" expanded="true" height="76" name="Loop Collection (2)" width="90" x="246" y="75">
        <parameter key="set_iteration_macro" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:documents_to_data" compatibility="6.1.000" expanded="true" height="76" name="Documents to Data" width="90" x="112" y="75">
            <parameter key="text_attribute" value="text"/>
          </operator>
          <connect from_port="single" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_port="output 1"/>
          <portSpacing port="source_single" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="append" compatibility="6.2.000" expanded="true" height="76" name="Append (2)" width="90" x="380" y="75"/>
      <operator activated="true" class="text:process_document_from_data" compatibility="6.1.000" expanded="true" height="76" name="Process Documents from Data (2)" width="90" x="514" y="75">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="keep_text" value="true"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="6.1.000" expanded="true" height="60" name="Tokenize" width="90" x="179" y="30"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document (2)" from_port="output" to_op="Cut Document (10)" to_port="document"/>
      <connect from_op="Cut Document (10)" from_port="documents" to_op="Loop Collection (2)" to_port="collection"/>
      <connect from_op="Loop Collection (2)" from_port="output 1" to_op="Append (2)" to_port="example set 1"/>
      <connect from_op="Append (2)" from_port="merged set" to_op="Process Documents from Data (2)" to_port="example set"/>
      <connect from_op="Process Documents from Data (2)" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
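
For readers without RapidMiner at hand, the core idea of the process above (cut the input into one segment per <document>...</document> region, then treat each segment as its own document) can be sketched in plain Python. This is only a rough analogue under stated assumptions, not a description of how the Cut Document operator works internally: the helper name `cut_documents` is made up for illustration, the sample is the text from the Create Document operator without its "..." line, and <document> elements are assumed not to be nested.

```python
import re

# Sample input copied from the Create Document operator above (minus the "..." line).
SAMPLE = """<documents>
    <document>
        <id> 1 </id>
        <text> content_A </text>
    </document>
    <document>
        <id> 2 </id>
        <text> content_B </text>
    </document>
</documents>"""

# Non-greedy match from an opening <document> tag to the next closing tag.
# The closing ">" in the opening tag keeps the pattern from matching the
# <documents> root. Assumption: <document> elements are never nested.
DOCUMENT_REGION = re.compile(r"<document>.*?</document>", re.DOTALL)


def cut_documents(xml_text):
    """Return one string per <document>...</document> region (hypothetical helper)."""
    return DOCUMENT_REGION.findall(xml_text)


segments = cut_documents(SAMPLE)
for segment in segments:
    print(segment)
    print("-----")
```

A regular expression is used here only because it mirrors the "Regular Region" query type; for anything beyond a flat, well-behaved file a real XML parser is the safer choice.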

mohammadreza · February 2015

Thank you indeed, Fras. I will try your solution and let you know the results ASAP. I think your solution will be more efficient if I can adapt it, because I designed the RM process with the Read XML operator (as Martin suggested) and ran out of memory even with 32 GB of RAM. My XML file is only about 160 MB, but the deserialization in Read XML takes a lot of RAM. So I want to try your approach and will let you know whether it can handle my 160 MB XML file. Thanks again.

mohammadreza · February 2015

Hi Fras. I am trying your solution for reading my 160 MB XML file. I got stuck dealing with the following XML structure, which has more than one <text> node in each document.

<documents>
    <document>
        <id> 1 </id>
        <message>
            <author>..........</author>
            <text>...............</text>
        </message>
        <message>
            <text>...............</text>
            <text>...............</text>
        </message>
    </document>
    <document>
        <id> 2 </id>
        <message>
            <author>..........</author>
            <text>...............</text>
        </message>
    </document>
    ...
</documents>

In the previous solution (Martin's) I used the Read XML operator and set the "XPath for attribute" property to extract all of the <text> nodes for each document. In the new solution, as you explained, the "Cut Document" operator nicely separates each document, which is then passed through the "Loop Collection" operator. This is where I need to extract all of the <text> nodes in the document (e.g. via XPath) and convert them into one attribute for my example set. But I cannot get all of the <text> nodes for each document. Is there any way to do this?

Thanks in advance.

MartinLiebig · February 2015

Hi,

It looks to me like an XPath can solve this.
Have you tried the import wizard?

Sadly I have no time to try it myself, but I guess it works.

best
Martin
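
To make the XPath idea concrete outside RapidMiner, here is a minimal Python sketch that collects every <text> node of each document, at any nesting depth, into one value per document. It only illustrates the path expression `.//text`; it is not tested against the Read XML or Cut Document operators. The function name `texts_per_document` is hypothetical, and the sample content ("alice", "first", ...) is placeholder data standing in for the elided text in the question.

```python
import xml.etree.ElementTree as ET

# Placeholder data: same shape as the structure in the question,
# with invented values where the original shows dots.
SAMPLE = """<documents>
    <document>
        <id> 1 </id>
        <message>
            <author>alice</author>
            <text>first</text>
        </message>
        <message>
            <text>second</text>
            <text>third</text>
        </message>
    </document>
    <document>
        <id> 2 </id>
        <message>
            <author>bob</author>
            <text>fourth</text>
        </message>
    </document>
</documents>"""


def texts_per_document(xml_text):
    """Return one row per <document> with all of its <text> values joined (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for doc in root.findall("document"):
        doc_id = doc.findtext("id", default="").strip()
        # ".//text" selects every <text> descendant, however deeply nested.
        parts = [(node.text or "").strip() for node in doc.findall(".//text")]
        rows.append({"id": doc_id, "text": " ".join(parts)})
    return rows


rows = texts_per_document(SAMPLE)
for row in rows:
    print(row)
```

Joining with a single space is an arbitrary choice made for this sketch; any separator that the later tokenization step treats as a word boundary would do.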

mohammadreza · February 2015

Thanks for the answer, Martin. XPath does solve this problem in the "Read XML" operator, but Read XML cannot handle a 160 MB file. So I am playing around with Fras' solution, and I need to use XPath in that one. Any ideas, please?

MartinLiebig · February 2015

The file size should be no problem for Read XML.
The wizard might get slow, because it caches the file at some point, but it still works.

mohammadreza · February 2015

Hi Martin. That's interesting about Read XML. But I used it on my 160 MB of XML data and waited 2 days and 4 hours (52 hours in total) on a system with 32 GB of memory. After 52 hours the process was still busy in Read XML, so I stopped it, thinking that something was wrong. Do you think I should have waited longer, or is something wrong with big files? As an experiment, I split the file into several pieces and got results after 9 hours. In neither case did I use the import wizard, so I am sure that my XPath expressions are correct. This experiment might be helpful for others. Please let me know what you think about it.

xmlguy · February 2015

Why not use a tool designed for splitting XML? Over on Stack Overflow, an answer to the following question lists some tools:
http://stackoverflow.com/questions/700213/xml-split-of-a-large-file/7823719#7823719
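
In the same spirit, a large file can also be split with a few lines of streaming code. The sketch below uses only the Python standard library and is not one of the tools from the linked answer. It reads the input incrementally with `iterparse`, so the whole tree is never held in memory, and yields smaller XML strings that each wrap a fixed number of <document> elements. The function name `iter_chunks`, the chunk size, and the tiny sample input are assumptions made for illustration; <document> elements are assumed not to be nested.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the real file; in practice pass a file path or an open binary file.
SAMPLE = b"""<documents>
    <document><id> 1 </id><text>a</text></document>
    <document><id> 2 </id><text>b</text></document>
    <document><id> 3 </id><text>c</text></document>
</documents>"""


def iter_chunks(source, docs_per_chunk):
    """Yield XML strings, each wrapping up to docs_per_chunk <document> elements (hypothetical helper)."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    batch = []
    for event, elem in context:
        # "end" fires once an element and all of its children are fully parsed.
        if event != "end" or elem.tag != "document":
            continue
        batch.append(ET.tostring(elem, encoding="unicode").strip())
        root.clear()  # drop already-processed documents to keep memory low
        if len(batch) == docs_per_chunk:
            yield "<documents>" + "".join(batch) + "</documents>"
            batch = []
    if batch:
        yield "<documents>" + "".join(batch) + "</documents>"


chunks = list(iter_chunks(io.BytesIO(SAMPLE), docs_per_chunk=2))
for index, chunk in enumerate(chunks):
    print(index, chunk)
```

Each yielded string could be written to its own file and then read separately, which is the manual splitting described earlier in the thread done in one pass.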
