"Intelligent Text Extraction"
This is probably a very basic question. I am trying to extract text from various locally stored HTML files. The part of the text that I want to extract has a similar, but not 100% identical, structure in each document. Is it possible to define a start text and an end text (i.e. 2-3 words that always appear at the beginning or end) AND some "keywords" that must occur between the start and end text, so that RapidMiner knows it is extracting the correct text? The problem I am encountering at the moment with "Cut Document", and the Regular Region parameter therein, is that the start text CAN occur several times before the text part that I actually want.
<td style=" width:52.50%; text-align:left; " class="ta_10"><span class="ta_11">This is an example Text </span></td>
<td style=" width:100.00%; text-align:left; " class="ta_30"><span class="ta_31">This is an example Text</span></td>
<td style=" width:100.00%; text-align:left; " class="ta_10"><span class="ta_11">Keyword </span></td>
<td style=" width:100.00%; text-align:left; " class="ta_10"><ix:abc contextRef="Hypercube_cfwd_Set1" name="ns:UniqueEndTag" format="ixt2:date" xmlns:ix="http://www.xbrl.org">UniqueEndTag</ix:nonNumeric></td>
So what I need is the second "This is an example Text" as the starting point and all the HTML down to "UniqueEndTag". If I use "Cut Document", the problem is that I cannot write a regex that distinguishes between the first and second occurrence of my starting text, as the beginning of each HTML string can be completely different. I do have some unique words that could identify the region I want to extract (in my example, "Keyword"). I was playing with the Information Extraction plugin, since it lets me do some annotation, but I couldn't figure out how it would work for my purpose.
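To make the intended behaviour concrete, here is a minimal sketch of it in plain Python, outside RapidMiner. It is only an illustration under my own assumptions: the function name `extract_region` and the shortened sample HTML are hypothetical, and this is not an existing RapidMiner operator. The idea is to match from a start text to the end text while forbidding another start text in between, and then accept the match only if all keywords occur inside it.

```python
import re


def extract_region(html, start, end, keywords):
    """Return the text from the last `start` before `end` up to and
    including `end`, provided every keyword occurs in between.
    Returns None if no such region exists. (Hypothetical helper.)"""
    s = re.escape(start)
    # Lazily consume characters, but never step onto a position where
    # another occurrence of the start text begins.
    no_start = rf"(?:(?!{s}).)*?"
    pattern = re.compile(rf"{s}{no_start}{re.escape(end)}", re.DOTALL)
    for match in pattern.finditer(html):
        region = match.group(0)
        # Keep the candidate only if it contains all required keywords.
        if all(keyword in region for keyword in keywords):
            return region
    return None


# Shortened version of the example above (attributes removed).
sample = (
    "<td><span>This is an example Text </span></td>\n"
    "<td><span>This is an example Text</span></td>\n"
    "<td><span>Keyword </span></td>\n"
    "<td>UniqueEndTag</td>"
)

region = extract_region(
    sample, "This is an example Text", "UniqueEndTag", ["Keyword"]
)
print(region)
```

With this sample the first occurrence of the start text is skipped, because a second occurrence lies between it and the end text, so the returned region begins at the second one. Whether the same lookahead pattern can be used directly in a "Cut Document" regex is something I have not verified.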
Is there something like an "Intelligent Text Extraction" operator in RapidMiner? Any other suggestions are welcome!