Regular expression for HTML content extraction

mike075i · April 2018

Hi guys, I have an HTML page and want to extract all the content of the <p> tag that follows a specific <h2> tag.

I am using the Extract Information component with Regular Expression as the query type. I have tried to extract the content of the <h2> tag (regex: <h2>(.+?)</h2>), which gives me the right result, the text Specific 1 (the HTML snippet is listed below).

But when I try to extract the <p>blabla...</p> content after this specific <h2> tag using the regex <h2>Specific 1</h2><p>(.+?)</p>, it doesn't work.

...

<h2>Specific 1</h2>

<p>blablabla...</p>

...

Can someone tell me why, and what the right regex is to get the <p> content?

Thank you
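
One likely reason the second pattern fails, assuming the tags in the real page are separated by line breaks as in the snippet above: the pattern requires <p> to follow </h2> immediately, and by default . does not match a newline. A minimal Python sketch of the effect (Python's re module stands in here for the regex engine used by RapidMiner, which may differ in details):

```python
import re

# Snippet from the post: the two tags are separated by a blank line.
html = "<h2>Specific 1</h2>\n\n<p>blablabla...</p>"

# Original pattern: expects <p> directly after </h2>, so it finds nothing.
strict = re.search(r"<h2>Specific 1</h2><p>(.+?)</p>", html)

# Allowing optional whitespace between the tags, and letting "." span
# line breaks via (?s), makes the capture group match.
tolerant = re.search(r"(?s)<h2>Specific 1</h2>\s*<p>(.+?)</p>", html)

print(strict)             # None
print(tolerant.group(1))  # blablabla...
```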

mike075i · April 2018

Hello, I have solved the problem myself. The only issue was that I had to add the h: prefix before the HTML tags in the XPath query. The solution is related to this post: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/XPath-with-quot-Cut-Document-quot-or-quot-Extract-Information/td-p/45582.
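
The effect of the h: prefix can be reproduced outside RapidMiner. The sketch below assumes that the page is parsed into the XHTML namespace and that h is the prefix bound to it. Under that assumption, a query such as //h2[...] finds nothing, while //h:h2[...] does. Python's xml.etree.ElementTree is used only for illustration. It has no following-sibling axis, so the last step is emulated by walking the parent's children.

```python
import xml.etree.ElementTree as ET

# Assumption: elements live in the XHTML namespace and "h" is bound to it.
XHTML_NS = "http://www.w3.org/1999/xhtml"
NS = {"h": XHTML_NS}

doc = ET.fromstring(
    f'<html xmlns="{XHTML_NS}"><body>'
    '<h2 id="infocollect">Information we collect</h2>'
    "<p>We collect information.</p>"
    "<p>Second paragraph.</p>"
    "</body></html>"
)

# Without a prefix the query looks for h2 in no namespace and finds nothing.
unprefixed = doc.findall(".//h2[@id='infocollect']")

# With the prefix the heading is found.
prefixed = doc.findall(".//h:h2[@id='infocollect']", NS)

# Emulate following-sibling::h:p[1]: first <p> after the heading
# among the children of the same parent.
body = doc.find("h:body", NS)
children = list(body)
start = children.index(prefixed[0]) + 1
first_p = next(c for c in children[start:] if c.tag == f"{{{XHTML_NS}}}p")

print(len(unprefixed), len(prefixed), first_p.text)
```

In an engine with full XPath 1.0 support, the equivalent query would presumably be //h:h2[@id='infocollect']/following-sibling::h:p[1].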

Telcontar120 · April 2018

Can you post your html file? The expression you've given seems like it should work but it is hard to tell or test without a data sample.

mike075i · April 2018

This was only an example. I have attached the whole HTML document, which contains Google's policies in different languages (for simplicity I have attached the English one). Because of the upload restrictions on file extensions, I changed it from .html to .txt. Below is the <p>...</p> part that I want to extract after the <h2> tag:

<h2 id="infocollect">Information we collect</h2>
<p>We collect information to provide better services to all of our users – from figuring out basic stuff like which language you speak, to more complex things like which <a class="highlight" href="../../../../policies/privacy/example/ads-youll-find-most-useful.html" id="ads-youll-find-most-useful">ads you’ll find most useful</a>, <a class="highlight" href="../../../../policies/privacy/example/the-people-who-matter-most.html" id="the-people-who-matter-most">the people who matter most to you online</a>, or which YouTube videos you might like.
</p>

kayman · April 2018

Not sure if you will be able to manage this with regex; XPath might be a better candidate for your needs.
But if there is only one match in your HTML, this may work:

(?s)^.*?<h2 id="infocollect".*?<\/h2>\s*<p>(.*?)<\/p>.*$

(read as: start at the beginning of the file, do not stop at line breaks, until you find the first h2 with id="infocollect"; next, take the content of the following p tag and store it; then ignore everything again until the end of the page.)

So replacing with $1 gives just the p tag content.
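
The pattern above can be tried out in Python as a rough check. This is only a sketch: the sample page is made up, and Python writes the first capture group as \1 in the replacement where the post uses $1.

```python
import re

# Made-up page with three sections; only the middle paragraph is wanted.
page = """<html><body>
<h2 id="intro">Intro</h2>
<p>Not this one.</p>
<h2 id="infocollect">Information we collect</h2>
<p>We collect information to provide better services.
</p>
<h2 id="other">Other</h2>
<p>Nor this one.</p>
</body></html>"""

# Pattern from the post: (?s) lets "." match line breaks, the lazy .*?
# stops at the first heading with id="infocollect", and the trailing .*$
# swallows the rest of the page.
pattern = r'(?s)^.*?<h2 id="infocollect".*?<\/h2>\s*<p>(.*?)<\/p>.*$'

# Replace the whole page with the captured paragraph content.
extracted = re.sub(pattern, r"\1", page).strip()
print(extracted)
```

Because the pattern consumes the whole page in one match, it returns only the first matching section, which is why it relies on there being a single match.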

mike075i · April 2018

Thank you, but I still have the same issue: all the content in the attribute is shown as ?. You are right that XPath is the better choice, but I don't have much time to learn XPath now. In addition, every time I execute the process using the regex, I get this error message (example for the Danish language):

Telcontar120 · April 2018

@sgenzer are you able to read this text file? I can open it in Notepad++ and it looks fine and says it is encoded UTF-8, but when I try to read it in RapidMiner, it comes back with unreadable characters (both using System encoding as well as UTF-8). I feel like there was another thread with this problem recently, but now I can't find it. Is this another known bug? Or is there some other encoding setting that I am missing somewhere? Thanks!
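
Question marks and unreadable characters are typical symptoms of an encoding mismatch, although the thread does not confirm that this is the cause here. A small Python sketch of both symptoms, using a Greek phrase of the kind found on the policy pages:

```python
# Greek sample text, stored as UTF-8 bytes the way a web page would be.
text = "Συλλέγουμε στοιχεία"
raw = text.encode("utf-8")

# Decoding with the matching encoding restores the text.
ok = raw.decode("utf-8")

# Decoding the same bytes with a single-byte encoding gives unreadable text.
mojibake = raw.decode("latin-1")

# Forcing the text into an encoding that lacks Greek letters turns every
# such letter into "?".
question_marks = text.encode("ascii", errors="replace").decode("ascii")

print(ok)
print(question_marks)
```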

sgenzer · April 2018

Hi @Telcontar120, yes, I can read this file fine. However, I cannot see the </p> tag in that text file, so I wrote the regex to include a small snippet of the next piece.

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000-BETA2">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.000-BETA2" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:read_document" compatibility="7.5.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="85">
        <parameter key="file" value="/Users/genzerconsulting/Desktop/45.txt"/>
      </operator>
      <operator activated="true" class="text:replace_tokens" compatibility="7.5.000" expanded="true" height="68" name="Replace Tokens" width="90" x="179" y="85">
        <list key="replace_dictionary">
          <parameter key="\n" value=" "/>
        </list>
      </operator>
      <operator activated="true" class="text:keep_document_parts" compatibility="7.5.000" expanded="true" height="68" name="Keep Document Parts" width="90" x="313" y="85">
        <parameter key="extraction_regex" value="[&lt;]h2 id[=][&quot;]infocollect.*[&lt;]p[&gt;]We collect"/>
      </operator>
      <connect from_op="Read Document" from_port="output" to_op="Replace Tokens" to_port="document"/>
      <connect from_op="Replace Tokens" from_port="document" to_op="Keep Document Parts" to_port="document"/>
      <connect from_op="Keep Document Parts" from_port="document" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Scott
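
The two text steps of the process above can be approximated in Python. This is a sketch under assumptions: the sample text is shortened, and it is assumed that Replace Tokens substitutes every line break with a space and that Keep Document Parts keeps the region matched by the expression.

```python
import re

# Shortened sample: heading, line break, then the start of the paragraph.
text = (
    '<h2 id="infocollect">Information we collect</h2>\n'
    "<p>We collect information to provide better services"
)

# Replace Tokens step: turn each line break into a space, so that ".*"
# can run across what used to be separate lines.
flat = text.replace("\n", " ")

# Keep Document Parts step: same expression as in the process XML.
pattern = r'[<]h2 id[=]["]infocollect.*[<]p[>]We collect'
kept = re.search(pattern, flat).group(0)
print(kept)
```

The single-character classes such as [<] and ["] match the same characters as the bare < and ", so they change nothing about what the expression matches.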

mike075i · April 2018

I believed that XPath was something like a new programming language, which is why I wrote that I did not have much time to learn it. But that is not the case: it has an easy syntax for finding the right elements in the DOM structure. XPath is by far the better solution, but I had no experience with it before. In addition, ChroPath for Chrome is an awesome extension for checking the right path. Thank you.

mike075i · April 2018

I have tried to extract the <p> content after the <h2 class="H8KnQb" id="infocollect">Τα στοιχεία που συλλέγουμε</h2> tag using the XPath query //h2[@id='infocollect']/following-sibling::p[1] in the Extract Information component, but the problem remains in the output. As you can see in the screenshot below, the content is extracted correctly when the same XPath query is used in ChroPath.

In addition, I have added the Extract Content operator to strip the HTML tags and get only the text, which starts with Συλλέγουμε στοιχεία, για να παρέχουμε καλύτερες υπηρεσίες σε όλους τους χρήστες μας. Here is my XML code; maybe you can help me fix this problem:

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="112" y="187">
        <parameter key="url" value="https://www.google.gr/intl/el/policies/privacy/archive/"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".+privacy/archive.+"/>
        </list>
      </operator>
      <operator activated="true" class="concurrency:loop_values" compatibility="8.1.001" expanded="true" height="82" name="Loop Values" width="90" x="380" y="187">
        <parameter key="attribute" value="Link"/>
        <parameter key="iteration_macro" value="link"/>
        <process expanded="true">
          <operator activated="true" class="web:get_webpage" compatibility="7.3.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="34">
            <parameter key="url" value="%{link}"/>
            <list key="query_parameters"/>
            <list key="request_properties"/>
            <parameter key="override_encoding" value="true"/>
          </operator>
          <connect from_op="Get Page" from_port="output" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="715" y="34">
        <process expanded="true">
          <operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information (3)" width="90" x="246" y="85">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries">
              <parameter key="att" value="\.&lt;h2 id=&quot;infocollect&quot;&gt;Τα στοιχεία που συλλέγουμε&lt;/h2&gt;\.."/>
            </list>
            <list key="regular_expression_queries">
              <parameter key="att" value="&lt;h2 id=&quot;infocollect&quot;&gt;(.+?)&lt;/h2&gt;"/>
            </list>
            <list key="regular_region_queries">
              <parameter key="att" value="\.&lt;h2 id=&quot;infocollect&quot;&gt;Τα στοιχεία που συλλέγουμε&lt;/h2&gt;\..\.&lt;/p&gt;\."/>
            </list>
            <list key="xpath_queries">
              <parameter key="att" value="//h2[@id='infocollect']/following-sibling::p[1]"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="514" y="85"/>
          <connect from_port="document" to_op="Extract Information (3)" to_port="document"/>
          <connect from_op="Extract Information (3)" from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Crawl Web" from_port="example set" to_op="Loop Values" to_port="input 1"/>
      <connect from_op="Loop Values" from_port="output 1" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

I am using RapidMiner Studio version 8.1.001 on the Win64 platform.
