"Intelligent Text Extraction"

limegreenman900 Member Posts: 26 Contributor I
edited June 2019 in Help

Hi everyone,

 

This is probably a very basic question. I am trying to extract text from various locally stored HTML files. The main structure of the part that I want to extract from each document is similar but not 100% identical. Is there any way to define a start text and an end text (i.e. 2-3 words that are always at the beginning or end) AND define some "keywords" that must appear between the start and end text, to tell RapidMiner that it is extracting the correct text? The problem I am encountering at the moment with "Cut Document" and its Regular Region Parameter is that the start of my text CAN occur a few times before the part that I actually want.

 

Example:

<td style=" width:52.50%; text-align:left; " class="ta_10"><span class="ta_11">This is an example Text </span></td>

.....

<td style=" width:100.00%; text-align:left; " class="ta_30"><span class="ta_31">This is an example Text</span></td>

...

<td style=" width:100.00%; text-align:left; " class="ta_10"><span class="ta_11">Keyword </span></td>

...

<td style=" width:100.00%; text-align:left; " class="ta_10"><ix:abc contextRef="Hypercube_cfwd_Set1" name="ns:UniqueEndTag" format="ixt2:date" xmlns:ix="http://www.xbrl.org">UniqueEndTag</ix:nonNumeric></td>

 

So what I need is the second "This is an example Text" as the starting point and all the HTML down to "UniqueEndTag". With "Cut Document" I cannot write a regex that distinguishes between the first and second occurrence of my starting text, because the beginning of each HTML string can be completely different. I do have some unique words that could identify the region I want to extract (in my example, "Keyword"). I experimented with the Information Extraction plugin, since it supports annotation, but I couldn't figure out how to apply it to my purpose.

 

Is there something like an "Intelligent Text Extraction" operator in RapidMiner? Any other suggestions are welcome!

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,321  RM Data Scientist

    Hi,

     

    This seems tricky. My approach would be either a (tricky) regex or something like HTML to XML followed by Process XSLT.

     

     

    ~Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
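
    The regex route mentioned above could look roughly like the sketch below. It is written in Python only for illustration; the marker strings are copied from the example in the question, and the function names build_pattern and extract_region are hypothetical. The idea is a negative lookahead that stops the match from stepping over a second start marker before the keyword has been seen, so the match begins at the last start marker in front of the keyword.

```python
import re

# Marker strings copied from the example in the question; adjust as needed.
START = "This is an example Text"
KEYWORD = "Keyword"
END = "UniqueEndTag"


def build_pattern(start: str, keyword: str, end: str) -> re.Pattern:
    """Compile a pattern for: start, keyword (no other start in between), end."""
    s, k, e = (re.escape(part) for part in (start, keyword, end))
    # (?s)              let "." match line breaks as well
    # (?:(?!START).)*?  advance one character at a time, but never across
    #                   another START, so the match begins at the last START
    #                   that precedes the keyword
    # .*?END            then continue lazily to the first END marker
    return re.compile(rf"(?s){s}(?:(?!{s}).)*?{k}.*?{e}")


def extract_region(html: str, pattern: re.Pattern):
    """Return the matched region, or None if the document has no such region."""
    match = pattern.search(html)
    return match.group(0) if match else None
```

    The constructs used here (the inline (?s) flag, negative lookahead, lazy quantifiers) are also part of Java's regex syntax, so a pattern of this shape may be usable as a regular region expression in "Cut Document", although that has not been verified here. Note that in the example from the question, the first "UniqueEndTag" sits inside the name attribute, so the match would end there.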
  • limegreenman900 Member Posts: 26 Contributor I

    Hi Martin,

     

    I know. Normally a regex would be the best solution if I had some structure that let me distinguish between my different start texts. However, I don't know whether a very complex regex with multiline lookahead and lookbehind will run into performance issues, as I have a lot of documents.
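
    If regex performance is the worry, one alternative is to avoid lookarounds entirely and locate the three markers with plain substring searches, working backwards from the end marker. The following is a minimal Python sketch under that assumption; the function name find_region and the marker values are hypothetical and only mirror the example in the question.

```python
def find_region(html: str, start: str, keyword: str, end: str):
    """Return the text from the last start marker before the keyword
    through the end marker, or None if no such region exists."""
    end_pos = html.find(end)
    while end_pos != -1:
        # Last keyword that lies completely before this end marker.
        kw_pos = html.rfind(keyword, 0, end_pos)
        if kw_pos != -1:
            # Last start marker that lies completely before that keyword.
            start_pos = html.rfind(start, 0, kw_pos)
            if start_pos != -1:
                return html[start_pos:end_pos + len(end)]
        # No usable keyword/start in front of this end marker: try the next one.
        end_pos = html.find(end, end_pos + 1)
    return None
```

    Because the search runs backwards from the keyword, earlier repetitions of the start text are skipped without any regex backtracking. When the first end marker already has a keyword and a start marker in front of it, this amounts to three substring scans per document.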

     

    As for XSLT, I doubt that it would work, as my text has no unique tags but is randomly formatted with inline <span> classes, which do not necessarily share similar attributes.

     

    To get back to my original question: are you aware of any operator within the IE plugin that could address this problem? Or is this really something that I will have to do with "Cut Document" and the Regular Region Parameter?

     

     

  • limegreenman900 Member Posts: 26 Contributor I

    Or is there any possibility that I could extract one text as a reference and "train" RapidMiner to detect this part in all other files based on high similarity?
