"Intelligent Text Extraction"

limegreenman900 Member Posts: 26 Contributor II
edited June 2019 in Help

Hi everyone,


This is a very basic question. I am trying to extract text from various locally stored HTML files. The part of the text that I want to extract has a similar, but not 100% identical, structure in each document. Is it possible to define a start text and an end text (i.e. 2-3 words that are always at the beginning or end) AND define some "keywords" that must appear between them, so that RapidMiner knows it is extracting the correct text? The problem I am encountering at the moment with "Cut Document", and therein the Regular Region Parameter, is that my start text CAN occur a few times before the part of the text that I actually want.



<td style=" width:52.50%; text-align:left; " class="ta_10"><span class="ta_11">This is an example Text </span></td>


<td style=" width:100.00%; text-align:left; " class="ta_30"><span class="ta_31">This is an example Text</span></td>


<td style=" width:100.00%; text-align:left; " class="ta_10"><span class="ta_11">Keyword </span></td>


<td style=" width:100.00%; text-align:left; " class="ta_10"><ix:abc contextRef="Hypercube_cfwd_Set1" name="ns:UniqueEndTag" format="ixt2:date" xmlns:ix="http://www.xbrl.org">UniqueEndTag</ix:nonNumeric></td>


So what I need is the second "This is an example Text" as the starting point, and all the HTML text down to "UniqueEndTag". If I use "Cut Document", I cannot write a regex that distinguishes between the first and second occurrence of my start text, because the beginning of each HTML string can be completely different. I do have some unique words that could identify the region I want to extract (in my example, "Keyword"). I was playing with the Information Extraction plugin, since I could do some annotation there, but I couldn't figure out how to apply it to my purpose.
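To make the requirement concrete outside RapidMiner, here is a minimal Python sketch of the selection rule described above: take an end marker, walk backwards through the preceding start markers, and accept the nearest one whose span contains all required keywords. The function name `extract_region` and its parameters are hypothetical, and the sketch treats the HTML as a plain string, which is an assumption about how the files are loaded.

```python
def extract_region(text, start, end, keywords):
    """Return the span that ends at an end marker and begins at the nearest
    preceding start marker for which every keyword lies inside the span.
    Returns None if no such span exists. (Hypothetical helper.)"""
    end_pos = text.find(end)
    while end_pos != -1:
        stop = end_pos + len(end)
        # Try the start markers before this end marker, nearest first.
        start_pos = text.rfind(start, 0, end_pos)
        while start_pos != -1:
            candidate = text[start_pos:stop]
            if all(k in candidate for k in keywords):
                return candidate
            start_pos = text.rfind(start, 0, start_pos)
        # No start marker qualified; move on to the next end marker.
        end_pos = text.find(end, stop)
    return None
```

Because it relies only on `str.find` and `str.rfind`, this avoids lookaround-heavy regular expressions. Whether it is fast enough for a large document collection would still need to be measured.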


Is there something like an "Intelligent Text Extraction" operator in RapidMiner? Any other suggestions are welcome!


  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,520 RM Data Scientist



    This seems tricky. My approach would be either a (tricky) regex or something like HTML to XML followed by Process XSLT.
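One possible shape for the "tricky" regex, sketched in Python rather than as a RapidMiner operator setting: the pattern matches the start text, then any characters that do not begin another start text, then the keyword, then the end text. The helper name `build_pattern` is hypothetical, and the lookahead syntax is assumed to carry over to the Java regex flavour used by RapidMiner, which I have not verified for this case.

```python
import re

def build_pattern(start, end, keyword):
    """Compile a pattern for start ... keyword ... end in which the stretch
    after the start text never crosses a further start text.
    (Hypothetical helper for illustration.)"""
    s, e, k = (re.escape(x) for x in (start, end, keyword))
    # Lazily consume characters, refusing any position where `start` begins again.
    no_restart = rf"(?:(?!{s}).)*?"
    # DOTALL lets "." cross line breaks, since the region spans several lines.
    return re.compile(rf"{s}{no_restart}{k}{no_restart}{e}", re.DOTALL)
```

Calling `build_pattern("This is an example Text", "UniqueEndTag", "Keyword").search(html)` would then skip an earlier start text that is followed by another start text before the keyword appears.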




    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • limegreenman900 Member Posts: 26 Contributor II

    Hi Martin,


    I know; normally a regex would be the best solution if my documents had some structure that let me distinguish between the different start texts. However, I don't know whether a very complex regex with multiline lookahead and lookbehind will run into performance issues, as I have a lot of documents.


    I doubt that XSLT would work, as my text has no unique tags but is randomly formatted with inline <span> classes that do not necessarily share similar attributes.


    To get back to my original question: are you aware of any operator within the IE plugin that could address this problem? Or is this really something that I will have to do with "Cut Document" and the Regular Region Parameter?



  • limegreenman900 Member Posts: 26 Contributor II

    Or is there any possibility that I could extract one text as a reference and "train" RapidMiner to detect this part in all other files based on its high similarity to the reference?
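As a rough illustration of the reference-text idea, here is a Python sketch that scores candidate regions against one extracted reference and keeps the closest match. It uses plain string similarity from the standard library's `difflib`, not a trained model, so it stands in for "training" only loosely. The function name `most_similar` is hypothetical, and the candidates are assumed to have been produced beforehand, for example by cutting each document at every occurrence of the start text.

```python
from difflib import SequenceMatcher

def most_similar(reference, candidates):
    """Return (best_candidate, score) using difflib's similarity ratio,
    or (None, 0.0) if nothing scores above zero. (Hypothetical helper.)"""
    best, best_score = None, 0.0
    for candidate in candidates:
        # ratio() is 1.0 for identical strings and 0.0 for nothing in common.
        score = SequenceMatcher(None, reference, candidate).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

A minimum score threshold would probably be needed in practice, so that documents without the wanted region do not return an arbitrary best match.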
