Parsing out plain text from the Reuters RCV1 corpus - XPath, XML

andkandk Member Posts: 21  Maven
edited November 2018 in Help
I have a question about reading out node content with XPath from several XML files. I am fully aware that there are masses of resources on the internet on this topic, and please believe me, it really drives me crazy. I want to read out information from the files of the Reuters RCV1 experimental corpus. All files in this corpus share the same structure; I post it here as an example:

<?xml version="1.0" encoding="iso-8859-1" ?>
<newsitem itemid="1000000" id="root" date="xxx" xml:lang="en">
<title>title title title</title>
<headline>headline headline headline</headline>
<byline>Jack Daniels</byline>
<dateline>Blabla</dateline>
<text>
<p> Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 </p>
<p> Paragraph 2 Paragraph 2 Paragraph 2 Paragraph 2 Paragraph 2 </p>
<p> Paragraph 3 Paragraph 3 Paragraph 3 Paragraph 3 Paragraph 3 </p>
<p> Paragraph 4 Paragraph 4 Paragraph 4 Paragraph 4 Paragraph 4 </p>
</text>
<copyright>(c) Reuters Limited 1996</copyright>
<metadata>
<codes class="bip:countries:1.0">
  <code code="MEX">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-02-20"/>
  </code>
</codes>
<codes class="bip:topics:1.0">
  <code code="xxx">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="1996-08-20"/>
  </code>
  <code code="xxx">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
  </code>
  <code code="xxx">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
  </code>
  <code code="xxx">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
  </code>
  <code code="xxx">
    <editdetail attribution="Reuters BIP Coding Group" action="confirmed" date="xxx"/>
  </code>
</codes>
<dc element="dc.publisher" value="Reuters Holdings Plc"/>
<dc element="dc.date.published" value="xxx"/>
<dc element="dc.source" value="Reuters"/>
<dc element="dc.creator.location" value="xxx"/>
<dc element="dc.creator.location.country.name" value="xxx"/>
<dc element="dc.source" value="Reuters"/>
</metadata>
</newsitem>

The final goal of my task is to transfer these several thousand files into a table or a CSV, respectively. I am doing this by addressing the different node contents via their XPath address. This is absolutely no problem for all fields but one: the content of <text></text>. With //newsitem/text/p/node() it always delivers only the first paragraph. What I am looking for, however, is to extract the plain text of all paragraphs. This means the CSV files should look approximately like this:

title, headline, date, text, location
titleblabla, headlineblabla, xxx, paragraph 1 paragraph 2 paragraph 3, anywhere
othertitleblabla, otherheadlineblabla, otherdatexxx, other paragraph 1 paragraph 2 paragraph 3, nowhere

The respective paragraphs should thus be collapsed into one field. With the query /newsitem/text I get the whole text body, however with all the tags included, which is annoying with so many files.
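To make the goal concrete, here is the collapsing I am after, sketched outside RapidMiner with Python's standard library (the miniature document below is made up to mirror the corpus structure, not a real RCV1 file):

```python
import xml.etree.ElementTree as ET

# Made-up miniature of one RCV1 news item (real files carry more metadata).
SAMPLE = """<newsitem itemid="1000000" date="1996-08-20">
<title>title title title</title>
<headline>headline headline headline</headline>
<text>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<p>Paragraph 3</p>
</text>
</newsitem>"""

root = ET.fromstring(SAMPLE)

# ./text/p selects every paragraph, not just the first; itertext() drops
# the tags, and " ".join collapses the paragraphs into one CSV field.
body = " ".join(" ".join(p.itertext()).strip() for p in root.findall("./text/p"))

# One CSV row: attributes and plain text side by side.
row = [root.findtext("title"), root.findtext("headline"), root.get("date"), body]
print(row)
```

The point is that the paragraph nodes form a set that has to be joined explicitly; asking for a single node only ever yields the first member.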

Please, could somebody be so nice as to show me how to achieve the described goal by addressing it with XPath? The problem is also that I have to parse out other information at the same time, so the plain text and the attributes should end up in the same row of the table.

Thank you very much,

a desperate xml/xpath newbie

Answers

  • andkandk Member Posts: 21  Maven
    -push- Sorry, this is against all forum ethics, I know, but I can't believe that nobody here can help me. I am still stuck with the problem. I can read in the file, but I still have problems with the /newsitem/text XPath because it always also outputs the tags, and when I use /newsitem/text/p/node() it returns only the first paragraph, although my XML help tools like XMLSpy and AquaPath show that this would be the right address to read out just the plain text.

    Two add-on questions:

    1) I am sure I am not the only one here working with the Reuters Corpus Volume 1 & 2: do you know of a way to read in the whole corpus more efficiently, e.g. into a database?

    2) Although it is just text, if also 800k files, RapidMiner has enormous memory problems parsing the files on a 4 GB RAM machine. Is that normal? Because of that I had to manually split the files into six parts of a little more than 600 MB each. Parsing one of these splits takes around 70 minutes. I suspect that the memory runs full and the system slows down. Isn't there a way to tell RapidMiner to serialize the whole process, i.e. to read in one file, extract the information I need via XPath, and write the result into a CSV before moving on? Other software does this and seems to be much more efficient, although unable to achieve what I can do with RapidMiner.
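The serialized flow I have in mind could be sketched like this in plain Python (a sketch only, not RapidMiner: the rcv1/ directory layout, the output file name, and the chosen fields are assumptions):

```python
import csv
import glob
import xml.etree.ElementTree as ET

def extract_row(path):
    """Extract one CSV row (title, headline, date, collapsed text) from one file."""
    # ET.parse honours the encoding declared in each file (iso-8859-1 for RCV1).
    root = ET.parse(path).getroot()
    body = " ".join(
        " ".join(p.itertext()).strip() for p in root.findall("./text/p")
    )
    return [root.findtext("title"), root.findtext("headline"), root.get("date"), body]

with open("rcv1.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["title", "headline", "date", "text"])
    # One file at a time: each parsed tree is freed before the next file is
    # read, so memory use stays flat no matter how many files there are.
    for path in sorted(glob.glob("rcv1/**/*.xml", recursive=True)):
        writer.writerow(extract_row(path))
```

Because rows are appended as soon as each file is parsed, nothing but a single document tree is ever held in memory at once.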

    Any help would be very, very much appreciated!

    best regards,

    andk
  • JEdwardJEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 563   Unicorn
    I can't help with your other 2 questions, but this seems to work for me on your test XML.
    Is it what you're after?

    JEdward.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.011">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
        <process expanded="true" height="280" width="212">
          <operator activated="true" class="read_xml" compatibility="5.1.011" expanded="true" height="60" name="Read XML" width="90" x="45" y="165">
            <parameter key="file" value="test.xml"/>
            <parameter key="xpath_for_examples" value="/newsitem/text"/>
            <enumeration key="xpaths_for_attributes">
              <parameter key="xpath_for_attribute" value="/newsitem/text"/>
            </enumeration>
            <parameter key="use_namespaces" value="false"/>
            <list key="namespaces"/>
            <parameter key="use_default_namespace" value="false"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="text().true.text.attribute"/>
            </list>
          </operator>
          <connect from_op="Read XML" from_port="output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
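For readers comparing approaches: the xpath_for_examples of /newsitem/text above selects the whole text body, and the text() mapping then flattens it. In plain code, I assume the equivalent of flattening that node without its tags looks roughly like this (a sketch with Python's standard library, not the operator's actual implementation):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<newsitem><text><p>Paragraph 1</p><p>Paragraph 2</p></text></newsitem>"
)
# itertext() walks all descendant text nodes of <text>, dropping the tags,
# and the join collapses them into one whitespace-normalised string.
flat = " ".join(
    chunk.strip() for chunk in doc.find("text").itertext() if chunk.strip()
)
print(flat)
```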
  • andkandk Member Posts: 21  Maven
    @JEdward

    Thank you for your post. But it is really strange, because the "Read XML" operator seems not to exist anymore after I updated RapidMiner this morning. Your operator just appears as a dummy. Is this unique to my machine or my setup? Nevertheless, I tried the Read XML operator before, and it doesn't work for me, as I want to read in 800k XML files instead of just one. Therefore I used the Process Documents from Files operator, and then the Generate Extract operator to read out the paths I need into a CSV file (with the Write CSV operator). As I said, all parts but one work like a charm. The problem is the text part, as it is split up into paragraphs, and it seems this is a problem for RapidMiner's XPath parser.
  • JEdwardJEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 563   Unicorn
    Are you able to post the XML of that part of your process? 
    I'm using what sounds like the same method on some other XML documents. 
    If I manage to get a few moments today I'll try to have a quick look at what differences there are between my XML documents & process and yours.   

    Thanks,
    JEdward.