"Information Retrieval and weighting by HTML tags"

simon_knoll · November 2009

Hello,
is there a possibility within RapidMiner to weight extracted terms by the HTML or XML tags that enclose them?

Example:
"<h1>Stock Quotes</h1>"

is rated higher than

"<h4>Phone number</h4>"

regards
Simon Knoll

land · November 2009

Hi Simon,
I added this today to the new Text Processing Extension of RapidMiner 5. The current plugin does not support it, so you may have to wait until we release RapidMiner 5...

Greetings,
Sebastian

simon_knoll · February 2010

Hello Sebastian,
could you do me a favor and show me a short example of how I can apply weights for HTML tags, or tell me which operators I need?

regards,
Simon Knoll

land · February 2010

Hi Simon,
the Text Processing Extension contains an operator for extracting information with XPath queries. It's called Generate Extract. If you have stored the contents of a web page in an ExampleSet, you can use this operator to extract the content of an h4 tag as a new attribute. If you take a look at the current version of the Process Documents from Data operator, it allows you to select the attributes from which the text should be taken. In this list, you can also assign a weight to each attribute. Combining these two things should suit your needs.
If this does not prove helpful, we could think of implementing some sort of weight applier that assigns weights to tokens if they fulfill some condition.

Greetings,
Sebastian
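
As a rough illustration of the idea of weighting terms by the tag that encloses them, here is a minimal Python sketch that works outside RapidMiner. It is not how the Text Processing Extension does it. The weight values, the TAG_WEIGHTS table and the function name weighted_terms are assumptions made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical per-tag weights; the values are illustrative assumptions.
TAG_WEIGHTS = {"h1": 3.0, "h4": 1.5}
DEFAULT_WEIGHT = 1.0


class TagWeightedCounter(HTMLParser):
    """Accumulate term weights according to the innermost enclosing tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []            # stack of currently open tags
        self.term_weights = Counter()  # token -> summed weight

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy HTML.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        current = self.open_tags[-1] if self.open_tags else None
        weight = TAG_WEIGHTS.get(current, DEFAULT_WEIGHT)
        for token in data.lower().split():
            self.term_weights[token] += weight


def weighted_terms(html_text):
    """Return a dict mapping each lower-cased token to its summed weight."""
    parser = TagWeightedCounter()
    parser.feed(html_text)
    parser.close()
    return dict(parser.term_weights)


print(weighted_terms("<h1>Stock Quotes</h1><h4>Phone number</h4>"))
```

With these assumed weights, the terms from the h1 heading end up with a higher weight than the terms from the h4 heading, which is the behavior asked for in the first post.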

simon_knoll · February 2010

thanks for your help.

But I've got some problems with the "Generate Extract" operator. More precisely, I'm not getting any results, or rather I'm getting empty results :-)
Maybe I'm using it the wrong way.


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <parameter key="parallelize_main_process" value="true"/>
    <process expanded="true" height="746" width="1091">
      <operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document" width="90" x="313" y="165">
        <parameter key="text" value="&lt;html&gt;&#10;&lt;title&gt;Hallo Titel&lt;/title&gt;&#10;&lt;h4&gt;Hallo Überschrift 3&lt;/h4&gt;&#10;&lt;h3&gt;Hallo Überschrift 3&lt;/h3&gt;&#10;&lt;p&gt;&lt;h4&gt;Ein H4&lt;/h4&gt; &lt;span&gt;in einem Paragraph&lt;/span&gt;&lt;/p&gt;&#10;&lt;/html&gt;"/>
      </operator>
      <operator activated="true" class="text:process_documents" expanded="true" height="94" name="Process Documents" width="90" x="581" y="75">
        <process expanded="true" height="724" width="770">
          <connect from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:generate_extract" expanded="true" height="60" name="Generate Extract" width="90" x="782" y="75">
        <parameter key="source_attribute" value="source_ATTR"/>
        <parameter key="query_type" value="XPath"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries"/>
        <list key="xpath_queries">
          <parameter key="title_html" value="//h:title/text()"/>
        </list>
        <list key="namespaces"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Generate Extract" to_port="Example Set"/>
      <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

regards,
simon

land · March 2010

Hi Simon,
the problem with your setup is that the source attribute does not exist. My problem with that is that the operator does not complain about it, but instead simply doesn't deliver anything. I changed that behavior...
To get the text into an attribute, you can uncheck the create_word_vector parameter of the Process Documents operator and check keep_text instead. Then a new attribute called text will be added containing the text. You can select this attribute in the Generate Extract operator, and then it works as shown below:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <parameter key="parallelize_main_process" value="true"/>
    <process expanded="true" height="746" width="1091">
      <operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document" width="90" x="112" y="75">
        <parameter key="text" value="&lt;html&gt;&#10;&lt;title&gt;Hallo Titel&lt;/title&gt;&#10;&lt;h4&gt;Hallo Überschrift 3&lt;/h4&gt;&#10;&lt;h3&gt;Hallo Überschrift 3&lt;/h3&gt;&#10;&lt;p&gt;&lt;h4&gt;Ein H4&lt;/h4&gt; &lt;span&gt;in einem Paragraph&lt;/span&gt;&lt;/p&gt;&#10;&lt;/html&gt;"/>
      </operator>
      <operator activated="true" class="text:process_documents" expanded="true" height="94" name="Process Documents" width="90" x="246" y="75">
        <parameter key="create_word_vector" value="false"/>
        <parameter key="keep_text" value="true"/>
        <process expanded="true" height="724" width="770">
          <connect from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:generate_extract" expanded="true" height="60" name="Generate Extract" width="90" x="380" y="75">
        <parameter key="source_attribute" value="text"/>
        <parameter key="query_type" value="XPath"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries"/>
        <list key="xpath_queries">
          <parameter key="title_html" value="//h:title/text()"/>
        </list>
        <list key="namespaces"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Generate Extract" to_port="Example Set"/>
      <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Greetings,
Sebastian
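
The two points in this answer (the source attribute must exist, and the query runs against the text stored in that attribute) can be sketched in plain Python. This is only an analogy using the standard library, not the operator's implementation: the function name generate_extract and the list-of-dicts stand-in for an ExampleSet are assumptions, and ElementTree's limited path syntax (".//title") replaces the "//h:title/text()" query.

```python
import xml.etree.ElementTree as ET


def generate_extract(examples, source_attribute, new_attribute, path):
    """Add new_attribute to each example: text of the first element matching path.

    `examples` is a list of dicts standing in for an ExampleSet (assumption).
    """
    result = []
    for example in examples:
        if source_attribute not in example:
            # Fail loudly instead of silently delivering nothing.
            raise KeyError(f"source attribute {source_attribute!r} does not exist")
        root = ET.fromstring(example[source_attribute])
        match = root.find(path)
        extracted = match.text if match is not None else None
        result.append({**example, new_attribute: extracted})
    return result


examples = [{"text": "<html><title>Hallo Titel</title><h4>Ein H4</h4></html>"}]
print(generate_extract(examples, "text", "title_html", ".//title"))
```

Passing "source_ATTR" as the source attribute raises a KeyError here, which mirrors the changed behavior described above where a missing attribute is reported instead of producing empty results.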

simon_knoll · March 2010

Thanks, now it works for me too. But I still have some questions.


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <parameter key="logverbosity" value="3"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="1"/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <parameter key="parallelize_main_process" value="false"/>
    <process expanded="true" height="629" width="950">
      <operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document" width="90" x="112" y="255">
        <parameter key="text" value="&lt;html&gt;&#10;&#9;&lt;a href=&quot;1&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;2&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;3&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;4&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;5&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;6&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;7&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;8&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;9&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;0&quot;&gt;Details&lt;/a&gt;&#10;&lt;/html&gt;&#10;"/>
        <parameter key="add label" value="false"/>
        <parameter key="label_type" value="0"/>
      </operator>
      <operator activated="true" class="text:process_documents" expanded="true" height="94" name="Process Documents" width="90" x="447" y="255">
        <parameter key="create_word_vector" value="false"/>
        <parameter key="vector_creation" value="0"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="0"/>
        <parameter key="prunde_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_rank" value="5.0"/>
        <parameter key="prune_above_rank" value="5.0"/>
        <parameter key="datamanagement" value="7"/>
        <parameter key="parallelize_vector_creation" value="false"/>
        <process expanded="true" height="629" width="950">
          <connect from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:generate_extract" expanded="true" height="60" name="Generate Extract" width="90" x="648" y="255">
        <parameter key="source_attribute" value="text"/>
        <parameter key="query_type" value="XPath"/>
        <list key="string_machting_queries"/>
        <parameter key="attribute_type" value="Nominal"/>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries"/>
        <list key="xpath_queries">
          <parameter key="DetailsPage" value="//h:a[text()='Details']/@href"/>
        </list>
        <list key="namespaces"/>
        <parameter key="ignore_CDATA" value="true"/>
        <parameter key="assume_html" value="true"/>
        <parameter key="value_seperator" value=";"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Generate Extract" to_port="Example Set"/>
      <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Why am I getting just one result here, and not every href entry separated by ";"?
regards
simon

land · March 2010

Hi Simon,
as the operator documentation tries to say, if a query results in an enumeration of items, for example "en,de,fr", then these values are separated using the given characters. But in any case you have to enter the search expression more than once to specify more than one attribute name. Where should the operator store the second value if you enter only one attribute?

Greetings,
Sebastian

simon_knoll · March 2010

hello sebastian,
unfortunately I don't understand your suggestion. What I want to achieve is the following:
Given this "HTML" code


<html>
	<a href="1">Details</a>
	<a href="2">Details</a>
	<a href="3">Details</a>
	<a href="4">Details</a>
	<a href="5">Details</a>
	<a href="6">Details</a>
	<a href="7">Details</a>
	<a href="8">Details</a>
	<a href="9">Details</a>
	<a href="0">Details</a>
</html>

I want to extract all the href values (1,2,3,4,5,6,7,8,9,0).
Now if I use the following XPath expression

//a/@href

then from the XPath point of view this query returns all the hrefs.
To check this, you can simply test it at http://www.mizar.dk/XPath/Default.aspx

So my question now is: how can I achieve that in RapidMiner?

land · March 2010

Hi,
how do you want to store the href values after retrieving them? Should each href be a single example, or do you want to have multiple attributes?
This is important, because the two approaches differ completely.

Greetings,
Sebastian

simon_knoll · March 2010

for me, both ways would be interesting, as i have to extract different features for different purposes.

Greetings,
Simon

land · March 2010

Hi Simon,
sorry for the late answer, but I simply didn't find the time to answer questions here in the forum in the meantime. Here's a process that shows how both ways work:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="296" width="480">
      <operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document" width="90" x="3" y="45">
        <parameter key="text" value="&lt;html&gt;&#10;&#9;&lt;a href=&quot;1&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;2&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;3&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;4&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;5&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;6&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;7&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;8&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;9&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;0&quot;&gt;Details&lt;/a&gt;&#10;&lt;/html&gt;"/>
      </operator>
      <operator activated="true" class="text:documents_to_data" expanded="true" height="76" name="Documents to Data" width="90" x="112" y="120">
        <parameter key="text_attribute" value="text"/>
      </operator>
      <operator activated="true" class="multiply" expanded="true" height="94" name="Multiply" width="90" x="246" y="120"/>
      <operator activated="true" class="text:process_document_from_data" expanded="true" height="76" name="Process Documents from Data" width="90" x="380" y="210">
        <parameter key="create_word_vector" value="false"/>
        <list key="specify_weights"/>
        <process expanded="true" height="585" width="904">
          <operator activated="true" class="text:cut_document" expanded="true" height="60" name="Cut Document" width="90" x="112" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="unimportant" value="//a/@href"/>
            </list>
            <list key="namespaces"/>
            <parameter key="assume_html" value="false"/>
            <process expanded="true" height="585" width="904">
              <operator activated="true" class="text:extract_information" expanded="true" height="60" name="Extract Information" width="90" x="45" y="30">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries">
                  <parameter key="hrefNumber" value="(.*)"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
              </operator>
              <connect from_port="segment" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:generate_extract" expanded="true" height="60" name="Generate Extract" width="90" x="380" y="75">
        <parameter key="source_attribute" value="text"/>
        <parameter key="query_type" value="XPath"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries"/>
        <list key="xpath_queries">
          <parameter key="AttributeName1" value="//a[1]"/>
          <parameter key="AttributeName2" value="//a[2]"/>
        </list>
        <list key="namespaces"/>
        <parameter key="assume_html" value="false"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Generate Extract" to_port="Example Set"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 2"/>
      <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Please keep in mind the restriction that each example of an example set must have the same attributes, so creating attributes depending on the content of a text cannot be done!

Greetings,
Sebastian
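
The difference between the two layouts in the process above (one example per href versus a fixed set of attributes in a single example) can be sketched in plain Python. This is only an analogy, not what the operators do internally. The helper names are assumptions, and unlike the "//a[1]" and "//a[2]" queries, which select the link elements, this sketch stores the href values.

```python
import xml.etree.ElementTree as ET


def hrefs(html_text):
    """Return the href value of every <a> element, in document order."""
    root = ET.fromstring(html_text)
    return [a.get("href") for a in root.iter("a") if a.get("href") is not None]


def one_example_per_href(html_text):
    """Layout 1: each href becomes its own example (row)."""
    return [{"hrefNumber": value} for value in hrefs(html_text)]


def fixed_attributes(html_text, names):
    """Layout 2: one example whose attributes are fixed in advance.

    Only as many hrefs as there are names are kept, because every example
    must have the same attributes regardless of the document's content.
    """
    values = hrefs(html_text)
    return {
        name: (values[i] if i < len(values) else None)
        for i, name in enumerate(names)
    }


html = '<html><a href="1">Details</a><a href="2">Details</a><a href="3">Details</a></html>'
print(one_example_per_href(html))
print(fixed_attributes(html, ["AttributeName1", "AttributeName2"]))
```

The second layout shows why the attribute names have to be listed up front: a document with more links than names simply loses the extra values, and one with fewer links gets missing values.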

simon_knoll · March 2010

Hi Sebastian,
thank you, this really helped me.
Do you know where I can find the AttributeWeights and AttributeWeightsApplier operators in the RapidMiner GUI?

greetings,
simon

land · March 2010

Hi,
there are several weighting operators available in the Modeling / Attribute Weighting group. You can then use the Scale by Weights operator to apply these weights.

Greetings,
Sebastian

simon_knoll · April 2010

Hi Sebastian,
thank you, but I could not figure out how I can "create" weights for different attributes and pipe them, for instance, to the "Scale by Weights" operator.

best regards
simon

land · April 2010

Hi,
take a look at the Data to Weights operator. With it you can convert an example set to a weight vector. You could create an example set containing these weights, for example with the logging functionality, and finally turn the log into an ExampleSet by using the Log to Data operator.

Greetings,
Sebastian

simon_knoll · April 2010

Hey Sebastian,
thank you for your answer, but I don't get it.
So I have a process like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input>
      <location/>
    </input>
    <output>
      <location/>
      <location/>
    </output>
    <macros/>
  </context>
  <operator activated="true" class="process" expanded="true" name="Process">
    <process expanded="true" height="546" width="1016">
      <operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document" width="90" x="45" y="165">
        <parameter key="text" value="&lt;html&gt;&#13;&#10;&#9;&lt;head&gt;&lt;title&gt;Der Titel ist sehr toll&lt;/title&gt;&lt;/head&gt;&#13;&#13;&#10;&#9;&lt;a href=&quot;http://f12010.info&quot;&gt;formel1&lt;/a&gt;&#13;&#10;&#9;&#13;&lt;a href=&quot;http://dsds-2009.info&quot;&gt;und einen dritten link&lt;/a&gt;&#13;&#10;&#9;&lt;a href=&quot;http://simonknoll.com&quot;&gt;semmel&lt;/a&gt;&#13;&#10;&#9;&lt;title&gt;Wir Haben auch einen zweitet Titel&lt;/title&gt;&#10;&lt;/html&gt;"/>
        <parameter key="label_type" value="numeric"/>
      </operator>
      <operator activated="true" class="multiply" expanded="true" height="94" name="Multiply" width="90" x="179" y="165"/>
      <operator activated="true" class="text:process_documents" expanded="true" height="94" name="Process Documents (2)" width="90" x="313" y="255">
        <parameter key="create_word_vector" value="false"/>
        <process expanded="true">
          <operator activated="true" class="text:cut_document" expanded="true" height="60" name="Cut Document (2)" width="90" x="394" y="30">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="html_linktext" value="//h:a/text()"/>
            </list>
            <list key="namespaces"/>
            <process expanded="true">
              <operator activated="true" class="text:extract_information" expanded="true" height="60" name="Extract Information (2)" width="90" x="394" y="30">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries">
                  <parameter key="use_it" value="(.*)"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
              </operator>
              <connect from_port="segment" to_op="Extract Information (2)" to_port="document"/>
              <connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document (2)" to_port="document"/>
          <connect from_op="Cut Document (2)" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_documents" expanded="true" height="94" name="Process Documents" width="90" x="313" y="30">
        <parameter key="create_word_vector" value="false"/>
        <process expanded="true">
          <operator activated="true" class="text:cut_document" expanded="true" height="60" name="Cut Document" width="90" x="246" y="165">
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="html_title" value="//h:title/text()"/>
            </list>
            <list key="namespaces"/>
            <process expanded="true">
              <operator activated="true" class="text:extract_information" expanded="true" height="60" name="Extract Information" width="90" x="246" y="30">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries">
                  <parameter key="use_it" value="(.*)"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
              </operator>
              <connect from_port="segment" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="generate_id" expanded="true" height="76" name="Generate ID" width="90" x="447" y="30"/>
      <operator activated="true" class="generate_id" expanded="true" height="76" name="Generate ID (2)" width="90" x="447" y="255"/>
      <operator activated="true" class="union" expanded="true" height="76" name="Union" width="90" x="581" y="120"/>
      <connect from_op="Create Document" from_port="output" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Process Documents (2)" to_port="documents 1"/>
      <connect from_op="Process Documents (2)" from_port="example set" to_op="Generate ID (2)" to_port="example set input"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Union" to_port="example set 1"/>
      <connect from_op="Generate ID (2)" from_port="example set output" to_op="Union" to_port="example set 2"/>
      <connect from_op="Union" from_port="union" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

In this process I extracted some features from an HTML document (for simplicity, generated in this process by the "Create Document" operator).
These extracted features result in the following example set:


Row No.  id   query_key      use_it
-----------------------------------------------------------------
1        1.0  html_title     Der Titel ist sehr toll
2        2.0  html_title     Wir Haben auch einen zweitet Titel
3        1.0  html_linktext  formel1
4        2.0  html_linktext  und einen dritten link
5        3.0  html_linktext  semmel

Now my question: how can I add weights for the different features that I extracted (e.g. weight html_title with 2 and html_linktext with 1), which could then result in an example set like this (or however a weighting looks; I added a weight column just to get the point across):


Row No.  id   query_key      use_it                              weight
---------------------------------------------------------------------------------
1        1.0  html_title     Der Titel ist sehr toll             2
2        2.0  html_title     Wir Haben auch einen zweitet Titel  2
3        1.0  html_linktext  formel1                             1
4        2.0  html_linktext  und einen dritten link              1
5        3.0  html_linktext  semmel                              1

thanks in advance
simon

TobiasMalbrecht · April 2010

Hi Simon,

if this weight should depend only on the query_key, this is no problem. Simply use the Generate Attributes operator with if(query_key="html_title",2,1) as the expression. Of course, you can nest the if(...,...,...) expressions as you like.

Kind regards,
Tobias
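
The effect of that if expression can be sketched in plain Python. This is only an analogy for what Generate Attributes would compute, not RapidMiner code. The function name add_weight and the list-of-dicts stand-in for the example set are assumptions. The rows and the weights 2 and 1 are taken from the example in the thread.

```python
def add_weight(examples, weights, default=1):
    """Return copies of the examples with a 'weight' attribute added.

    The weight is looked up by query_key, falling back to `default`,
    which mirrors if(query_key="html_title",2,1).
    """
    return [
        {**example, "weight": weights.get(example["query_key"], default)}
        for example in examples
    ]


examples = [
    {"id": 1.0, "query_key": "html_title", "use_it": "Der Titel ist sehr toll"},
    {"id": 2.0, "query_key": "html_title", "use_it": "Wir Haben auch einen zweitet Titel"},
    {"id": 1.0, "query_key": "html_linktext", "use_it": "formel1"},
    {"id": 2.0, "query_key": "html_linktext", "use_it": "und einen dritten link"},
    {"id": 3.0, "query_key": "html_linktext", "use_it": "semmel"},
]

for row in add_weight(examples, {"html_title": 2}):
    print(row["query_key"], row["weight"], row["use_it"])
```

Adding more entries to the weights dictionary plays the same role as nesting further if(...,...,...) expressions.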

simon_knoll · July 2010

Thank you for your advice.

My question now is: how can I feed a k-means algorithm with this data if I want to cluster the documents based on the extracted features? If I just pass the resulting ExampleSet as input, it clusters every single example on its own. But I want to cluster the documents, not the extractions.
Any advice?

best regards
simon

simon_knoll · July 2010

Maybe I should post a screenshot of an example set.

Here I have an ExampleSet with several examples describing 2 different objects.
Now if I want to apply a clustering algorithm to this and cluster these 2 objects (in reality there are obviously more than just 2 objects) rather than every single example, how do I have to do it?

best regards
simon knoll
