Multi line Feature Extraction, excluding "Quoted" comments

JasonTheKnight Member Posts: 2 Contributor I
Hello everyone - I hope this is the right place for my query!

To summarise my request, I am looking at the following page:


I am attempting to produce a data file in which each post can be coded as positive / negative, and which can then be used to train a learner. Obviously this is a very small post, but I'm just doing this as an example!

I have used SplitSegmenter on the page:

<operator name="Root" class="Process" expanded="yes">
    <operator name="SplitSegmenter" class="SplitSegmenter">
        <parameter key="output" value="D:\work\process\split"/>
        <parameter key="split_expression" value="table id=&quot;post"/>
        <parameter key="texts" value="D:\work\process"/>
    </operator>
</operator>

This gives me various postings in separate files, which is good.

I then need to use Feature Extraction to get the post information into a format that can be loaded into Excel.

<operator name="Root" class="Process" expanded="yes">
    <operator name="FeatureExtraction" class="FeatureExtraction" breakpoints="after">
        <list key="attributes">
          <parameter key="post2" value="//h:div[@class=&quot;postTemplate&quot;]/text()"/>
        </list>
        <parameter key="default_content_encoding" value="windows-1252"/>
        <parameter key="default_content_language" value="en-gb"/>
        <parameter key="default_content_type" value="htm"/>
        <list key="texts">
          <parameter key="text1" value="D:\Work\process\split"/>
        </list>
    </operator>
</operator>

The problem is that Feature Extraction seems to capture text only until it reaches a <br/> or another (nested) <div>, and then it stops. So for the first post I get only "Hi,", for the second post only "IMHO - yes. The risks seem weighted on the downside for sterling.", and so on. I need the entire post, from the beginning of the div to the end of the div, whatever else is in there.

This also leads to the question - is there a way to exclude the "quoted" comments (where they've quoted the original post) so that we can prevent the original post contents being read by the SVM multiple times?

Many thanks!



    land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Jason,
    I think this is due to the way your XPath query is formulated. You will probably have to reformulate it to get the complete content. As far as I know, /text() only delivers the div's own textual content and not the content of its subnodes, but I'm not an expert in this area.
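
To illustrate the point about /text(), here is a minimal Python sketch using only the standard library. It is not RapidMiner code, and both the sample markup and the "quote" class name are assumptions made for illustration, since the real page's markup is not shown in this thread. It contrasts taking only the first text node (which matches the "Hi," result) with walking the whole div, and shows one possible way to skip a quoted block.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one split post; the real markup may differ.
# The class name "quote" is an assumption, not taken from the page.
SAMPLE = (
    '<div class="postTemplate">Hi,<br/>'
    '<div class="quote">Original post text</div>'
    'I agree with the above.<br/>Thanks</div>'
)


def first_text_node(elem):
    """Text before the first child element only (what the truncated result looks like)."""
    return (elem.text or "").strip()


def iter_text(elem, skip_class=None):
    """Yield every text fragment under elem, skipping subtrees whose class equals skip_class."""
    if elem.text:
        yield elem.text
    for child in elem:
        if skip_class is None or child.get("class") != skip_class:
            yield from iter_text(child, skip_class)
        # Text following a child belongs to the parent, so it is kept even if the child is skipped.
        if child.tail:
            yield child.tail


def full_text(elem, skip_class=None):
    """Join all fragments with single spaces and normalise whitespace."""
    return " ".join(" ".join(iter_text(elem, skip_class)).split())


div = ET.fromstring(SAMPLE)
print(first_text_node(div))                 # only the leading fragment
print(full_text(div))                       # whole post, quote included
print(full_text(div, skip_class="quote"))   # whole post, quote excluded
```

In XPath 1.0 terms, converting a node-set to a string uses only its first node, which is consistent with the truncated output. An expression such as string(//h:div[@class="postTemplate"]) returns all descendant text of the first matching div instead; whether the FeatureExtraction operator accepts that form is untested here.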
