Regular expression for HTML content extraction

mike075i Member Posts: 11 Contributor II
edited December 2018 in Help

Hi guys, I have an HTML page and want to extract all the content of the <p> tag that follows a specific <h2> tag.

I am using the Extract Information operator with Regular Expression as the query type. Extracting the content of the <h2> tag with the regex <h2>(.+?)</h2> gives me the right result, Specific 1 (the HTML snippet is listed below).

But when I try to extract the <p>blabla...</p> content after this specific <h2> tag using the regex <h2>Specific 1</h2><p>(.+?)</p>, it doesn't work.

...

<h2>Specific 1</h2>

<p>blablabla...</p>

...

 

Can someone tell me why, and what the right regex is to get the <p> content?

 

Thank you


Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,172   Unicorn

    Can you post your HTML file? The expression you've given looks like it should work, but it is hard to tell or test without a data sample.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • mike075i Member Posts: 11 Contributor II

    This was only an example. I have attached the whole HTML document, which contains Google's policies (they exist in different languages; for simplicity I have attached the English one). Because of the upload restrictions on file extensions, I changed the extension from .html to .txt. Below is the <p>...</p> part that I want to extract after the <h2> tag:

     

     

    <h2 id="infocollect">Information we collect</h2>
    <p>We collect information to provide better services to all of our users – from figuring out basic stuff like which language you speak, to more complex things like which <a class="highlight" href="../../../../policies/privacy/example/ads-youll-find-most-useful.html" id="ads-youll-find-most-useful">ads you’ll find most useful</a>, <a class="highlight" href="../../../../policies/privacy/example/the-people-who-matter-most.html" id="the-people-who-matter-most">the people who matter most to you online</a>, or which YouTube videos you might like.
    </p>

     

    45.txt 48.9K
  • kayman Member Posts: 340   Unicorn

    Not sure you will be able to manage this with regex; XPath might be a better candidate for your needs.
    But if there is only one match in your HTML, this may work:

     

    (?s)^.*?<h2 id="infocollect".*?<\/h2>\s*<p>(.*?)<\/p>.*$

     

     

    (read as: start at the beginning of the file, do not stop at line breaks, until you find the first h2 with id="infocollect"; next, take the content of the following p tag and store it; then ignore everything else up to the end of the page.)

     

    So replacing with $1 gives just the p tag content.
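For anyone who wants to check the pattern outside RapidMiner, here is a minimal sketch in Python (my choice of language, not part of the process). The `html` string is a shortened stand-in for the attached file, and `extract_first_p` is a hypothetical helper name. It also shows one likely reason a pattern without `\s*` and `(?s)` misses: there is a line break between `</h2>` and `<p>`, and `.` does not match line breaks by default.

```python
import re

# Shortened stand-in for the attached page (assumption: the real file has the
# same <h2 id="infocollect">...</h2> followed by a <p> on the next line).
html = (
    '<h1>Privacy Policy</h1>\n'
    '<h2 id="infocollect">Information we collect</h2>\n'
    '<p>We collect information to provide better services\n'
    'to all of our users.\n'
    '</p>\n'
    '<h2 id="other">Other</h2>\n'
    '<p>Something else.</p>\n'
)

# Same pattern as above; (?s) lets "." match line breaks as well.
# Escaping the slash (<\/h2>) is optional in Python, so it is left out here.
PATTERN = re.compile(r'(?s)^.*?<h2 id="infocollect".*?</h2>\s*<p>(.*?)</p>.*$')

# A pattern without \s* and without (?s), in the style of the first attempt.
NAIVE = re.compile(r'<h2 id="infocollect">Information we collect</h2><p>(.+?)</p>')


def extract_first_p(text):
    """Return the content of the first <p> after the infocollect <h2>, or None."""
    match = PATTERN.match(text)
    return match.group(1).strip() if match else None


print(extract_first_p(html))   # paragraph text, spanning two lines
print(NAIVE.search(html))      # None: the line break after </h2> blocks the match
```

Taking capture group 1 here plays the same role as replacing with $1 in the process.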

     

     

     

     

  • mike075i Member Posts: 11 Contributor II

    Thank you, but the issue remains: all the content in the attribute is shown as ?. You are right that XPath is the better choice, but I don't have much time to learn XPath now :(. In addition, every time I run the process with the regex I get this error message (example for the Danish language):

     

    error.JPG

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,172   Unicorn

    @sgenzer are you able to read this text file? I can open it in Notepad++ and it looks fine and says it is encoded UTF-8, but when I try to read it in RapidMiner, it comes back with unreadable characters (both using System encoding as well as UTF-8).  I feel like there was another thread with this problem recently, but now I can't find it.  Is this another known bug?  Or is there some other encoding setting that I am missing somewhere?  Thanks!

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,351  Community Manager

    Hi @Telcontar120, yes, I can read this file fine. However, I cannot find the </p> tag in that text file, so I wrote the regex to include a small snippet of the next piece.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000-BETA2">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000-BETA2" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="text:read_document" compatibility="7.5.000" expanded="true" height="68" name="Read Document" width="90" x="45" y="85">
    <parameter key="file" value="/Users/genzerconsulting/Desktop/45.txt"/>
    </operator>
    <operator activated="true" class="text:replace_tokens" compatibility="7.5.000" expanded="true" height="68" name="Replace Tokens" width="90" x="179" y="85">
    <list key="replace_dictionary">
    <parameter key="\n" value=" "/>
    </list>
    </operator>
    <operator activated="true" class="text:keep_document_parts" compatibility="7.5.000" expanded="true" height="68" name="Keep Document Parts" width="90" x="313" y="85">
    <parameter key="extraction_regex" value="[&lt;]h2 id[=][&quot;]infocollect.*[&lt;]p[&gt;]We collect"/>
    </operator>
    <connect from_op="Read Document" from_port="output" to_op="Replace Tokens" to_port="document"/>
    <connect from_op="Replace Tokens" from_port="document" to_op="Keep Document Parts" to_port="document"/>
    <connect from_op="Keep Document Parts" from_port="document" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
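The same two steps can be tried outside RapidMiner. This is only a rough sketch in Python, assuming a shortened stand-in for 45.txt (the `text` string and the `keep_part` name are made up for illustration, and the RapidMiner operators may differ in details): flatten the line breaks to spaces as Replace Tokens does, then apply the same regex that Keep Document Parts uses.

```python
import re

# Shortened stand-in for 45.txt (assumption: like the real file, the <p> starts
# on the line after the <h2> and has no closing </p>).
text = (
    '<h2 id="infocollect">Information we collect</h2>\n'
    '<p>We collect information to provide better services to all of our users.\n'
    '<h2 id="infouse">How we use it</h2>\n'
)

# Same expression as the extraction_regex parameter, with XML entities decoded.
REGEX = re.compile(r'[<]h2 id[=]["]infocollect.*[<]p[>]We collect')


def keep_part(document):
    """Replace line breaks with spaces, then return the matching part or None."""
    flattened = document.replace("\n", " ")
    match = REGEX.search(flattened)
    return match.group(0) if match else None


print(keep_part(text))       # heading plus the start of the paragraph
print(REGEX.search(text))    # None: without flattening, "." stops at the line break
```

Note that the match covers the heading up to the words "We collect", not the whole paragraph body, because the pattern ends on a snippet of the next piece instead of on a closing tag.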

    Scott

     

  • mike075i Member Posts: 11 Contributor II

    I believed that XPath was something like a new programming language, which is why I wrote that I didn't have much time to learn it. It is not, and it has an easy syntax for finding the right elements in the DOM structure. XPath is by far the better solution, but I had no experience with it before. In addition, ChroPath for Chrome is an awesome extension for checking that a path is right. Thank you.

  • mike075i Member Posts: 11 Contributor II

    I have tried to extract the <p> content after the <h2 class="H8KnQb" id="infocollect">Τα στοιχεία που συλλέγουμε</h2> tag using the XPath query //h2[@id='infocollect']/following-sibling::p[1] in the Extract Information operator, but the problem remains in the output. As you can see in the screenshot below, the same XPath query extracts the content correctly in ChroPath.

    xpath.JPG
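To make clear what that query is expected to select, here is a small sketch in Python. It is not RapidMiner code: the `page` string is a well-formed, shortened stand-in for the real page, and `first_following_p` is a hypothetical helper that walks the siblings by hand, because the standard-library ElementTree parser does not support the following-sibling axis.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the page (assumption: real pages are HTML and may
# need an HTML-aware parser; ElementTree only reads well-formed XML).
page = (
    '<div>'
    '<h2 class="H8KnQb" id="infocollect">Τα στοιχεία που συλλέγουμε</h2>'
    '<p>Συλλέγουμε στοιχεία, για να παρέχουμε καλύτερες υπηρεσίες.</p>'
    '<p>Second paragraph.</p>'
    '</div>'
)


def first_following_p(root, heading_id):
    """Emulate //h2[@id=heading_id]/following-sibling::p[1] by hand."""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag == "h2" and child.get("id") == heading_id:
                # Siblings that come after the heading, in document order.
                for sibling in children[index + 1:]:
                    if sibling.tag == "p":
                        return sibling
    return None


root = ET.fromstring(page)
paragraph = first_following_p(root, "infocollect")
print("".join(paragraph.itertext()))  # text of the first <p> after the heading
```

Only the first <p> after the heading is returned; the second paragraph is ignored, which is what the [1] in the query asks for.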

    I have also added the Extract Content operator to strip the HTML tags and get only the text, which starts with Συλλέγουμε στοιχεία, για να παρέχουμε καλύτερες υπηρεσίες σε όλους τους χρήστες μας. Here is my XML code; maybe you can help me fix this problem:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="112" y="187">
    <parameter key="url" value="https://www.google.gr/intl/el/policies/privacy/archive/"/>
    <list key="crawling_rules">
    <parameter key="store_with_matching_url" value=".+privacy/archive.+"/>
    </list>
    </operator>
    <operator activated="true" class="concurrency:loop_values" compatibility="8.1.001" expanded="true" height="82" name="Loop Values" width="90" x="380" y="187">
    <parameter key="attribute" value="Link"/>
    <parameter key="iteration_macro" value="link"/>
    <process expanded="true">
    <operator activated="true" class="web:get_webpage" compatibility="7.3.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="34">
    <parameter key="url" value="%{link}"/>
    <list key="query_parameters"/>
    <list key="request_properties"/>
    <parameter key="override_encoding" value="true"/>
    </operator>
    <connect from_op="Get Page" from_port="output" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="715" y="34">
    <process expanded="true">
    <operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information (3)" width="90" x="246" y="85">
    <parameter key="query_type" value="XPath"/>
    <list key="string_machting_queries">
    <parameter key="att" value="\.&lt;h2 id=&quot;infocollect&quot;&gt;Τα στοιχεία που συλλέγουμε&lt;/h2&gt;\.."/>
    </list>
    <list key="regular_expression_queries">
    <parameter key="att" value="&lt;h2 id=&quot;infocollect&quot;&gt;(.+?)&lt;/h2&gt;"/>
    </list>
    <list key="regular_region_queries">
    <parameter key="att" value="\.&lt;h2 id=&quot;infocollect&quot;&gt;Τα στοιχεία που συλλέγουμε&lt;/h2&gt;\..\.&lt;/p&gt;\."/>
    </list>
    <list key="xpath_queries">
    <parameter key="att" value="//h2[@id='infocollect']/following-sibling::p[1]"/>
    </list>
    <list key="namespaces"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries"/>
    </operator>
    <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="514" y="85"/>
    <connect from_port="document" to_op="Extract Information (3)" to_port="document"/>
    <connect from_op="Extract Information (3)" from_port="document" to_op="Extract Content" to_port="document"/>
    <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Crawl Web" from_port="example set" to_op="Loop Values" to_port="input 1"/>
    <connect from_op="Loop Values" from_port="output 1" to_op="Process Documents" to_port="documents 1"/>
    <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

     

    I am using RapidMiner Studio version 8.1.001 Win64 platform

      

     
