
[SOLVED] xpath

amypu Member Posts: 7 Contributor II
edited November 2018 in Help
Below is an example XML snippet.

<p>
Thisisgood
</p>
<p>
Thisisbad
</p>
<p>
This
<br>
is
<br>
acceptable
</p>
<p>
Thisisfine
</p>

I want the result:
Thisisgood
Thisisbad
Thisisacceptable
Thisisfine

I use the XPath //p/text() in Google Docs (=importXML). Ultimately, I will use //h:p/text() in RapidMiner (with the Extract Information operator). This results in:
Thisisgood
Thisisbad
This           is           acceptable (appearing in different cells)
Thisisfine

What XPath would give me the result I need? Thank you.
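
A note on why the third paragraph splits: text() selects each text node separately, and the <br> elements break that paragraph's text into three nodes. The Python sketch below is only an illustration, not RapidMiner or importXML behavior. It assumes <br> is written as <br/> and adds a hypothetical <root> wrapper so the standard-library XML parser accepts the snippet.

```python
import xml.etree.ElementTree as ET

# Assumption: <br/> instead of <br>, plus a <root> wrapper, so this is well-formed XML.
SAMPLE = """<root>
<p>
Thisisgood
</p>
<p>
This
<br/>
is
<br/>
acceptable
</p>
</root>"""

root = ET.fromstring(SAMPLE)
multi = root.findall("p")[1]  # the paragraph that contains <br/> elements

# Text before the first child element: comparable to the first text() node.
first_text_node = multi.text.strip()

# Every text node inside the paragraph, in document order.
all_text_nodes = [t.strip() for t in multi.itertext()]

# Joining the nodes gives the text of the whole <p> element.
joined = "".join(all_text_nodes)

print(first_text_node)
print(all_text_nodes)
print(joined)
```

The point of the sketch is that the wanted value belongs to the <p> element as a whole, not to any single text() node.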

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Well, what result do you need? :)

    Best regards,
    Marius
  • amypu Member Posts: 7 Contributor II
    I would like to have the following result:

    Thisisgood
    Thisisbad
    Thisisacceptable
    Thisisfine

    I DO NOT want:

    This          is          acceptable (appearing in different cells)

    Thanks.


  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    This is the community forum. For guaranteed response times, please consider getting a support contract. During the holidays our main focus is not on free support :)

    However, let's focus on your issue: which versions of RapidMiner and the Text and Web extensions are you using? I can't reproduce the behavior of text ending up in different cells with Extract Information. In the latest versions, Extract Information delivers only the first result node; for //h:p/text() that would be "This" in the "This is acceptable" case, which is surely not what you want either. So in your case the procedure would be to cut the document into its p tags and then extract the content of each p node with Extract Content. Optionally, you can then use Replace to remove the spaces.

    Please see the process below for details.

    Best regards,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.015">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document" width="90" x="112" y="30">
            <parameter key="text" value="&lt;p&gt; &#10;Thisisgood&#10;&lt;/p&gt;&#10;&lt;p&gt; &#10;Thisisbad&#10;&lt;/p&gt;&#10;&lt;p&gt; &#10;This&#10;&lt;br&gt;&#10;is&#10;&lt;br&gt;&#10;acceptable&#10;&lt;/p&gt;&#10;&lt;p&gt; &#10;Thisisfine&#10;&lt;/p&gt;&#10;"/>
          </operator>
          <operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document" width="90" x="246" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="p" value="//h:p"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true">
              <operator activated="true" class="web:extract_html_text_content" compatibility="5.3.001" expanded="true" height="60" name="Extract Content" width="90" x="179" y="30">
                <parameter key="minimum_text_block_length" value="1"/>
              </operator>
              <operator activated="false" class="text:extract_information" compatibility="5.3.002" expanded="true" height="60" name="Extract Information" width="90" x="313" y="120">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="result" value=" //h:p/text()"/>
                </list>
                <list key="namespaces"/>
                <list key="index_queries"/>
              </operator>
              <connect from_port="segment" to_op="Extract Content" to_port="document"/>
              <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:documents_to_data" compatibility="5.3.002" expanded="true" height="76" name="Documents to Data" width="90" x="380" y="30">
            <parameter key="text_attribute" value="text"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
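
For readers without RapidMiner at hand, the same cut-then-extract idea can be sketched in plain Python. This is a rough analogue under stated assumptions, not the operators' actual implementation: the class ParagraphText and the function paragraph_texts are hypothetical names, and the standard-library HTMLParser is used because it tolerates the unclosed <br> tags.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collects the text of each <p> element, skipping nested tags such as <br>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # one finished string per <p>
        self._parts = None    # text pieces of the <p> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts = []  # "cut": start a new segment

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)  # "extract": keep text, ignore markup

    def handle_endtag(self, tag):
        if tag == "p" and self._parts is not None:
            # Remove all whitespace, like the optional Replace step.
            self.paragraphs.append("".join("".join(self._parts).split()))
            self._parts = None


def paragraph_texts(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Same text as the Create Document operator in the process above.
SAMPLE = (
    "<p>\nThisisgood\n</p>\n<p>\nThisisbad\n</p>\n"
    "<p>\nThis\n<br>\nis\n<br>\nacceptable\n</p>\n<p>\nThisisfine\n</p>\n"
)

for line in paragraph_texts(SAMPLE):
    print(line)
```

Each <p> yields one string, so the third paragraph comes out as Thisisacceptable instead of three separate pieces.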