Multi line Feature Extraction, excluding "Quoted" comments

JasonTheKnight Member Posts: 2 Contributor I
Hello everyone - I hope this is the right place for my query!

To summarise my request, I am looking at the following page:

http://forums.moneysavingexpert.com/showthread.html?t=678451

I am attempting to get a data file that can be coded with positive / negative which can then be used to train a learner. Obviously this is a very small post, but I'm just doing this as an example!

I have used SplitSegmenter on the page:

<operator name="Root" class="Process" expanded="yes">
    <operator name="SplitSegmenter" class="SplitSegmenter">
        <parameter key="output" value="D:\work\process\split"/>
        <parameter key="split_expression" value="table id=&quot;post"/>
        <parameter key="texts" value="D:\work\process"/>
    </operator>
</operator>

This gives me various postings in separate files, which is good.

I then need to use Feature Extraction to get the post information into a format that can be loaded into Excel.

<operator name="Root" class="Process" expanded="yes">
    <operator name="FeatureExtraction" class="FeatureExtraction" breakpoints="after">
        <list key="attributes">
          <parameter key="post2" value="//h:div[@class=&quot;postTemplate&quot;]/text()"/>
        </list>
        <parameter key="default_content_encoding" value="windows-1252"/>
        <parameter key="default_content_language" value="en-gb"/>
        <parameter key="default_content_type" value="htm"/>
        <list key="texts">
          <parameter key="text1" value="D:\Work\process\split"/>
        </list>
    </operator>
</operator>


The problem is that Feature Extraction seems to work only until it reaches a <br/> or another (nested) <div>, and then it stops. So for the first post I get only "Hi,", for the second post only "IMHO - yes. The risks seem weighted on the downside for sterling." and so on. I need the entire post, from the beginning of the div to the end of the div, whatever else is in there.

This also leads to the question - is there a way to exclude the "quoted" comments (where they've quoted the original post) so that we can prevent the original post contents being read by the SVM multiple times?

Many thanks!

Jason

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Jason,
    I think this is due to how your XPath query is formulated. You will probably have to reformulate it to get the complete content. As far as I know, /text() only returns the text nodes that are direct children of the selected element, not the content of sub-nodes, but I'm not an expert in this area.
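
To illustrate the difference outside RapidMiner, here is a minimal Python sketch using only the standard library. The sample markup, the helper names and the "quote" class are all made up for the illustration; the real forum page will use different markup for quoted posts, so the class name would need to be checked against the page source. In plain XPath 1.0 the analogous reformulations would be `//text()` (all descendant text nodes) or `string(...)` around the div, but whether the FeatureExtraction operator accepts these is an assumption that would need testing.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one split post file.
# Only "Hi," comes from the thread; the rest is invented sample text.
SAMPLE = """
<html>
  <body>
    <div class="postTemplate">Hi,<br/>Is now a good time to buy euros?<div class="quote">Originally posted text</div>Thanks!</div>
  </body>
</html>
"""


def direct_text(div):
    """Return only the text before the first child element.

    This mirrors what the thread describes: extraction stops at the first <br/> or nested <div>.
    """
    return (div.text or "").strip()


def full_text(div, skip_class=None):
    """Collect text from the div and all of its descendants.

    If skip_class is given, any descendant whose class attribute equals it
    is left out (the text following that element is still kept).
    """
    parts = []

    def walk(node, is_root):
        if not is_root and skip_class is not None and node.get("class") == skip_class:
            return  # skip this subtree; the caller still appends its tail text
        if node.text:
            parts.append(node.text)
        for child in node:
            walk(child, False)
            if child.tail:
                parts.append(child.tail)

    walk(div, True)
    # Normalise whitespace so the result fits on one line (e.g. one spreadsheet cell).
    return " ".join(" ".join(parts).split())


root = ET.fromstring(SAMPLE)
post = root.find(".//div[@class='postTemplate']")

print(direct_text(post))
print(full_text(post))
print(full_text(post, skip_class="quote"))
```

The first call returns only "Hi,", the second returns the whole post including the quoted block, and the third drops the quoted block while keeping the text around it.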

    Greetings,
      Sebastian