HTML Tag Removal using Regular Expression/Replace Tokens

J_Hering Member Posts: 3 Contributor I
edited November 2018 in Help
Hello friends,

I am faced with a huge txt file containing a very large number of HTML tags. I want to remove all HTML tags with a regular expression using "Replace Tokens" in RapidMiner, so that only the plain text remains.
Since my file is so big (a U.S. Securities and Exchange Commission annual report text file), I cannot even identify all the HTML tags within it.

Due to the complex tagging, e.g. <Tag> <<Tag>Tag> TEXT to extract <Tag> <<Tag>Tag>, and because I do not "see" all the tags, it is hard for me to find the right regex.

I realised that every text part basically starts after a > (end of a tag) and ends before a < (start of a new tag).
Is there a regular expression that gives me only the >Text< parts, since I want to extract only the text?

Thanks for your help!






Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hey,
    a few comments:

    1. Have you tried the Unescape HTML or Extract Content operators from the Web Mining extension?
    2. Have you considered using the Extract Content operator from Aylien? They have a free API with 1,000 calls per day.
    3. When I crawled Wikipedia some time ago, I used something like the attached process. I don't remember exactly what the regexes do; it was back in RM 6.2...

    I hope this helps,
    Martin

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.5.002">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="6.5.002" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" breakpoints="after" class="web:process_web" compatibility="6.5.000" expanded="true" height="60" name="Process Documents from Web" width="90" x="45" y="120">
           <parameter key="url" value="https://fr.wikipedia.org/"/>
           <list key="crawling_rules">
             <parameter key="store_with_matching_url" value=".*"/>
             <parameter key="follow_link_with_matching_url" value=".*"/>
           </list>
           <parameter key="add_pages_as_attribute" value="true"/>
           <parameter key="max_pages" value="100"/>
           <parameter key="domain" value="server"/>
           <parameter key="delay" value="100"/>
           <parameter key="max_threads" value="6"/>
           <process expanded="true">
             <connect from_port="document" to_port="document 1"/>
             <portSpacing port="source_document" spacing="0"/>
             <portSpacing port="sink_document 1" spacing="0"/>
             <portSpacing port="sink_document 2" spacing="0"/>
           </process>
         </operator>
         <operator activated="true" class="replace" compatibility="6.5.002" expanded="true" height="76" name="Replace" width="90" x="246" y="120">
           <parameter key="replace_what" value="(?s)&lt;script.*?&lt;/script&gt;"/>
         </operator>
         <operator activated="true" class="replace" compatibility="6.5.002" expanded="true" height="76" name="Replace (2)" width="90" x="380" y="120">
           <parameter key="replace_what" value="(?s)&lt;style.*?&lt;/style&gt;"/>
         </operator>
         <operator activated="true" class="replace" compatibility="6.5.002" expanded="true" height="76" name="Replace (3)" width="90" x="514" y="120">
           <parameter key="replace_what" value="(?s)&lt;.*?&gt;"/>
         </operator>
         <operator activated="true" class="generate_attributes" compatibility="6.4.000" expanded="true" height="76" name="Generate Attributes" width="90" x="648" y="120">
           <list key="function_descriptions">
             <parameter key="language" value="&quot;de&quot;"/>
           </list>
         </operator>
         <connect from_op="Process Documents from Web" from_port="example set" to_op="Replace" to_port="example set input"/>
         <connect from_op="Replace" from_port="example set output" to_op="Replace (2)" to_port="example set input"/>
         <connect from_op="Replace (2)" from_port="example set output" to_op="Replace (3)" to_port="example set input"/>
         <connect from_op="Replace (3)" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
         <connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="108"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    The first two Replace operators in Martin's process remove content that is enclosed in HTML tags but that you don't want in your text extract: <script>...</script> for JavaScript code and <style>...</style> for CSS. The last operator, Replace (3), then removes all the remaining HTML tags.

    Ta da!  :)
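
For anyone who wants to try the three patterns outside RapidMiner, here is a minimal Python sketch of the same order of operations. The patterns are copied from the Replace operators in the process above. The function name `strip_html` and the sample string are made up for illustration, and the sketch assumes each match is replaced by an empty string, since the process sets no replacement value.

```python
import re

# Patterns copied from the three Replace operators in the process above.
SCRIPT_BLOCK = re.compile(r"(?s)<script.*?</script>")  # JavaScript blocks, content included
STYLE_BLOCK = re.compile(r"(?s)<style.*?</style>")     # CSS blocks, content included
ANY_TAG = re.compile(r"(?s)<.*?>")                     # every remaining tag


def strip_html(text: str) -> str:
    """Hypothetical helper: apply the three patterns in the same order as the process."""
    for pattern in (SCRIPT_BLOCK, STYLE_BLOCK, ANY_TAG):
        # Assumption: each match is replaced by the empty string.
        text = pattern.sub("", text)
    return text


# Made-up sample input, not taken from a real filing.
sample = (
    "<html><style>p {color: red}</style>"
    "<p>Annual <b>Report</b></p>"
    "<script>var x = 1;</script></html>"
)
print(strip_html(sample))
```

The order matters: if the generic tag pattern ran first, the script and style tags would be gone but their contents would stay behind as text. Note that the patterns are case sensitive as written, so a tag spelled <SCRIPT> would not be caught by the first step.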
  • ahootanha Member Posts: 69 Contributor I

    Hello,
    how can an internet link like
    https://t.co/ghtyd
    be deleted from the text?
    Does anyone know the regular expression?

  • kayman Member Posts: 662 Unicorn

    There are a couple of options when you want to use regex, but you probably need to do it in several steps to be on the safe side.

    If your structure is indeed like your example (<<Tag>Tag>), one way is to remove the 'correct' tags first by using this regex:

     

    <\/?\w[^<>].*?>

     

    Read it a bit like: 'select anything starting with a <, optionally followed by the closing slash, then a word character (\w, i.e. a letter, digit or underscore), then one character that is neither < nor >, and then as little as possible up to the first >'.

     

    This will change <Tag> <<Tag>Tag> TEXT to extract <Tag> <<Tag>Tag> into <Tag> TEXT to extract  <Tag>, and if you run the same regex again you will only keep your text.
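
A small Python sketch of the two passes, using the regex exactly as posted. The helper names `remove_tags_once` and `remove_tags_until_stable` are hypothetical; the second one simply repeats the pass until nothing changes, which is an assumption about how one might handle deeper nesting rather than something from the post.

```python
import re

# Regex exactly as posted above.
TAG = re.compile(r"<\/?\w[^<>].*?>")


def remove_tags_once(text: str) -> str:
    """Hypothetical helper: one pass, removes only the well-formed tags."""
    return TAG.sub("", text)


def remove_tags_until_stable(text: str) -> str:
    """Hypothetical helper: repeat the pass until the text stops changing."""
    while True:
        cleaned = remove_tags_once(text)
        if cleaned == text:
            return cleaned
        text = cleaned


sample = "<Tag> <<Tag>Tag> TEXT to extract <Tag> <<Tag>Tag>"

first_pass = remove_tags_once(sample)
print(repr(first_pass))   # the inner tags are gone, the outer ones are now well formed

second_pass = remove_tags_once(first_pass)
print(repr(second_pass))  # only the text and surrounding whitespace remain

print(remove_tags_until_stable(sample).strip())
```

The leftover whitespace is not touched by the regex, so a final trim is usually wanted.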

     

    Now, typically tags should have a closing indicator (</...) but these are missing in your example, so the regex also works for

     

    <Tag> <<Tag>Tag> TEXT to extract </Tag> </</Tag>Tag> or any combination

     

    Anyway, be careful when using regex: if the text contains actual < and > characters used as less-than / greater-than signs rather than as HTML tags, you may remove more than needed. All in all, though, it should allow you to get started. (And kick the guy who created this bad HTML...)
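
To make the caveat concrete, here is a short Python sketch with made-up sample strings. It uses the regex as posted and, for comparison, a hypothetical variant with `[^<>]*` that is not from this thread. The variant is included only because the posted pattern needs at least two characters after the <, so one-letter tags such as <p> are left untouched.

```python
import re

# Regex as posted in this thread.
POSTED = re.compile(r"<\/?\w[^<>].*?>")
# Hypothetical variant (an assumption, not from the thread): zero or more
# non-bracket characters after the first word character.
VARIANT = re.compile(r"<\/?\w[^<>]*>")

# Made-up examples.
comparison = "Revenue <prior year but costs >budget"  # < and > used as comparison signs
spaced = "if a < b and c > d"                          # same, but with spaces after <
short_tags = "<p>text</p>"                             # one-letter tag names

# Text between the comparison signs is swallowed as if it were a tag.
print(repr(POSTED.sub("", comparison)))
# A space directly after < prevents a match, so this string is left alone.
print(repr(POSTED.sub("", spaced)))
# The posted pattern does not match one-letter tags; the variant does.
print(repr(POSTED.sub("", short_tags)))
print(repr(VARIANT.sub("", short_tags)))
```

Neither pattern can tell a comparison sign from a tag delimiter, so it is worth checking a sample of the output against the original file before trusting the result.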
