
Extract data from XML files

LeiLei Member Posts: 12 Learner I
I have many XML files. They share a similar structure but differ in some details.

The XML structure looks like this:

<article>
   <art-front>
       <titlegrp>
           <title>Integrated phytoremediation</title>
       </titlegrp>
       <abstract>
           <p>Phytoremediation is green rehabilitation technology .</p>
       </abstract>
   </art-front>
   <art-body>
       <section>
           <title>One thing</title>
            <p>the main technologies 1...</p>
            <p>the main technologies 2...</p>
       </section>
        <section>
           <title>Others</title>
           <subsect1>
                <p>the main technologies 3...</p>
                <p>the main technologies 4...</p>
                <p>the main technologies 5...</p>
           </subsect1>
       </section> 
   </art-body>
   <art-back>
       <biblist title="References">
            <citauth>
                 <fname>H.</fname>
                 <surname>Ali</surname>
            </citauth>
        </biblist>
   </art-back>
</article>

The files differ only between <art-body> and </art-body>. Some files have four <section> elements, some have five, and so on, and the number of <p> elements inside a <section> can also vary. In addition, some files have no <subsect1> elements at all, only multiple <section> elements.

I want to extract the contents of <art-front> and <art-body>, but not the contents of <art-back>.

I know that the Read XML operator can extract content from an XML file, and that the Read Document operator can do it as well. But because my XML files are not exactly the same, I don't know how to handle them. Is there a way to do this?

Thanks

Best Answer

    BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Solution Accepted
    Hi!

    In cases like this, I usually build the process with multiple Read XML operators.

    One operator extracts the information that is common to all files, e.g. from the constant header. Another extracts the variable information, such as the repeating entries. I can then join the results, e.g. on the file name or some other common attribute.
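The same "extract twice, then join" idea can be sketched outside RapidMiner. The Python below is only an illustration, not the RapidMiner process itself: the file names `a.xml` and `b.xml`, the second article's content, and the helper names `extract_header` and `extract_paragraphs` are all hypothetical. One function reads the constant header (one row per file), the other reads the repeating paragraphs (many rows per file), and the two results are joined on the file name.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory stand-ins for two files whose <art-body> differs.
DOCS = {
    "a.xml": """<article>
  <art-front>
    <titlegrp><title>Integrated phytoremediation</title></titlegrp>
    <abstract><p>Phytoremediation is green rehabilitation technology.</p></abstract>
  </art-front>
  <art-body>
    <section><title>One thing</title>
      <p>the main technologies 1...</p><p>the main technologies 2...</p>
    </section>
    <section><title>Others</title>
      <subsect1><p>the main technologies 3...</p></subsect1>
    </section>
  </art-body>
  <art-back><biblist title="References"/></art-back>
</article>""",
    "b.xml": """<article>
  <art-front>
    <titlegrp><title>Second article</title></titlegrp>
    <abstract><p>Another abstract.</p></abstract>
  </art-front>
  <art-body>
    <section><title>Only sections</title><p>text A</p></section>
    <section><title>More</title><p>text B</p><p>text C</p></section>
  </art-body>
  <art-back>
    <biblist title="References"><citauth><surname>Ali</surname></citauth></biblist>
  </art-back>
</article>""",
}


def extract_header(name, root):
    """Constant part: one row per file, taken from <art-front>."""
    return {
        "file": name,
        "title": root.findtext("art-front/titlegrp/title"),
        "abstract": root.findtext("art-front/abstract/p"),
    }


def extract_paragraphs(name, root):
    """Variable part: one row per <p> anywhere below <art-body>."""
    return [{"file": name, "text": p.text} for p in root.findall("./art-body//p")]


headers = {}
paragraphs = []
for name, xml_text in DOCS.items():
    root = ET.fromstring(xml_text)
    headers[name] = extract_header(name, root)
    paragraphs.extend(extract_paragraphs(name, root))

# Join on the shared "file" key: every paragraph row gets its file's header.
joined = [{**headers[row["file"]], **row} for row in paragraphs]

for row in joined:
    print(row["file"], "|", row["title"], "|", row["text"])
```

Because both extractions only look under `<art-front>` and `<art-body>`, nothing from `<art-back>` reaches the joined result.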

    In each Read XML operator, use the most specific XPath that selects what you need, and then work out which join type suits the task best.
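To make "most specific XPath" concrete, here is a small sketch of how path specificity changes what is matched on the sample from the question. It uses Python's `xml.etree.ElementTree`, which supports only a limited XPath subset, so the expressions may need adapting for Read XML; the variable names are illustrative. A path with a single `/` matches only direct children, while `//` also reaches paragraphs nested inside `<subsect1>`.

```python
import xml.etree.ElementTree as ET

# The sample from the question, with the closing tags corrected.
SAMPLE = """<article>
  <art-front>
    <titlegrp><title>Integrated phytoremediation</title></titlegrp>
    <abstract><p>Phytoremediation is green rehabilitation technology.</p></abstract>
  </art-front>
  <art-body>
    <section>
      <title>One thing</title>
      <p>the main technologies 1...</p>
      <p>the main technologies 2...</p>
    </section>
    <section>
      <title>Others</title>
      <subsect1>
        <p>the main technologies 3...</p>
        <p>the main technologies 4...</p>
        <p>the main technologies 5...</p>
      </subsect1>
    </section>
  </art-body>
  <art-back>
    <biblist title="References">
      <citauth><fname>H.</fname><surname>Ali</surname></citauth>
    </biblist>
  </art-back>
</article>"""

root = ET.fromstring(SAMPLE)

# Only <p> elements that are direct children of a <section>.
direct = [p.text for p in root.findall("./art-body/section/p")]

# <p> elements at any depth below a <section>, so <subsect1> is covered too.
any_depth = [p.text for p in root.findall("./art-body/section//p")]

# Group paragraphs by section title, however deeply they are nested.
by_section = {
    section.findtext("title"): [p.text for p in section.findall(".//p")]
    for section in root.findall("./art-body/section")
}

print(len(direct), "direct paragraphs;", len(any_depth), "at any depth")
for title, texts in by_section.items():
    print(title, "->", texts)
```

Anchoring every path at `art-body` (or `art-front`) is also what keeps `<art-back>` out of the result: no extra filtering step is needed to drop the references.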

    Regards,
    Balázs