"Information Retrieval and weighting by HTML tags"

simon_knoll Member Posts: 40 Contributor II
edited May 2019 in Help
Hello,
is there a possibility within RapidMiner to weight extracted terms by the HTML or XML tags that contain them?

Example:
"<h1>Stock Quotes</h1>"

is rated higher than

"<h4>Phone number</h4>"

regards
Simon Knoll
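
For readers who want to try the idea outside RapidMiner, here is a minimal Python sketch of weighting terms by the tag that encloses them. It is not how RapidMiner does it; the weight table `TAG_WEIGHTS`, the class name and the whitespace tokenizer are assumptions made only for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical weights per tag; the values are illustrative assumptions only.
TAG_WEIGHTS = {"h1": 4.0, "h2": 3.0, "h3": 2.0, "h4": 1.5}
DEFAULT_WEIGHT = 1.0


class TagWeightedTerms(HTMLParser):
    """Sum a weight per term, based on the innermost enclosing tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []                     # stack of currently open tags
        self.term_weights = defaultdict(float)  # term -> summed weight

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray end tags; otherwise pop up to the matching start tag.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        weight = DEFAULT_WEIGHT
        if self.open_tags:
            weight = TAG_WEIGHTS.get(self.open_tags[-1], DEFAULT_WEIGHT)
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for term in data.lower().split():
            self.term_weights[term] += weight


def weigh_terms(html):
    parser = TagWeightedTerms()
    parser.feed(html)
    parser.close()
    return dict(parser.term_weights)


weights = weigh_terms("<h1>Stock Quotes</h1><h4>Phone number</h4>")
print(weights)
```

With these assumed weights, the terms from the h1 heading end up with a higher weight than the terms from the h4 heading, which is the behavior asked for above.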
Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Simon,
    I added this today to the new Text Processing Extension of RapidMiner 5. The current plugin does not support it, so you may have to wait until we release RapidMiner 5...

    Greetings,
      Sebastian
  • simon_knoll Member Posts: 40 Contributor II
    Hello Sebastian,
    could you do me a favor and show me a short example of how I can apply weights for HTML tags, or tell me which operators I need?

    regards,
    Simon Knoll
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Simon,
    the Text Processing Extension contains an operator for extracting information with XPath queries. It's called Generate Extract. If you have stored the contents of a web page in an ExampleSet, you can use this operator to extract the content of an h4 tag as a new attribute. The current version of the Process Documents from Data operator lets you select the attributes from which the text should be taken. In that list you can also assign a weight to each attribute. Combining these two things should suit your needs.
    If this does not prove helpful, we could think about implementing some sort of weight applier that assigns weights to tokens if they fulfill some condition.

    Greetings,
      Sebastian
  • simon_knoll Member Posts: 40 Contributor II
    Thanks for your help.

    But I have a problem with the "Generate Extract" operator: I'm only getting empty results :-)
    Maybe I'm using it in the wrong way:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <parameter key="parallelize_main_process" value="true"/>
        <process expanded="true" height="746" width="1091">
          <operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document" width="90" x="313" y="165">
            <parameter key="text" value="&lt;html&gt;&#10;&lt;title&gt;Hallo Titel&lt;/title&gt;&#10;&lt;h4&gt;Hallo Überschrift 3&lt;/h4&gt;&#10;&lt;h3&gt;Hallo Überschrift 3&lt;/h3&gt;&#10;&lt;p&gt;&lt;h4&gt;Ein H4&lt;/h4&gt; &lt;span&gt;in einem Paragraph&lt;/span&gt;&lt;/p&gt;&#10;&lt;/html&gt;"/>
          </operator>
          <operator activated="true" class="text:process_documents" expanded="true" height="94" name="Process Documents" width="90" x="581" y="75">
            <process expanded="true" height="724" width="770">
              <connect from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:generate_extract" expanded="true" height="60" name="Generate Extract" width="90" x="782" y="75">
            <parameter key="source_attribute" value="source_ATTR"/>
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="title_html" value="//h:title/text()"/>
            </list>
            <list key="namespaces"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_op="Generate Extract" to_port="Example Set"/>
          <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    regards,
    simon
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Simon,
    the problem with your setup is that the source attribute does not exist. My problem with that is that the operator does not complain about it but simply doesn't deliver anything. I have changed that behavior...
    To get the text into an attribute, you can uncheck the create_word_vector parameter of the Process Documents operator and check keep_text instead. A new attribute called text is then added, containing the text. You can select it in the Generate Extract operator, and then it works as below:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <parameter key="parallelize_main_process" value="true"/>
        <process expanded="true" height="746" width="1091">
          <operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document" width="90" x="112" y="75">
            <parameter key="text" value="&lt;html&gt;&#10;&lt;title&gt;Hallo Titel&lt;/title&gt;&#10;&lt;h4&gt;Hallo Überschrift 3&lt;/h4&gt;&#10;&lt;h3&gt;Hallo Überschrift 3&lt;/h3&gt;&#10;&lt;p&gt;&lt;h4&gt;Ein H4&lt;/h4&gt; &lt;span&gt;in einem Paragraph&lt;/span&gt;&lt;/p&gt;&#10;&lt;/html&gt;"/>
          </operator>
          <operator activated="true" class="text:process_documents" expanded="true" height="94" name="Process Documents" width="90" x="246" y="75">
            <parameter key="create_word_vector" value="false"/>
            <parameter key="keep_text" value="true"/>
            <process expanded="true" height="724" width="770">
              <connect from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:generate_extract" expanded="true" height="60" name="Generate Extract" width="90" x="380" y="75">
            <parameter key="source_attribute" value="text"/>
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="title_html" value="//h:title/text()"/>
            </list>
            <list key="namespaces"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_op="Generate Extract" to_port="Example Set"/>
          <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Greetings,
      Sebastian
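
The two points above (query an attribute that actually exists, and fail loudly when it does not) can be sketched in plain Python. This is only an illustration, not the operator's implementation: `generate_extract` is a hypothetical stand-in, the example row is a plain dict, and `xml.etree.ElementTree` supports just a subset of XPath, so the path is written as `.//title` instead of `//h:title/text()`.

```python
import xml.etree.ElementTree as ET


def generate_extract(example, source_attribute, path):
    """Hypothetical stand-in: run one path query on one attribute of a row."""
    if source_attribute not in example:
        # Complain instead of silently delivering nothing.
        raise KeyError(f"source attribute {source_attribute!r} does not exist")
    root = ET.fromstring(example[source_attribute])
    node = root.find(path)
    return None if node is None else node.text


# A row whose "text" attribute holds the page (shortened sample from the thread).
example = {
    "text": "<html>\n<title>Hallo Titel</title>\n"
            "<h4>Hallo Überschrift 3</h4>\n</html>"
}

title = generate_extract(example, "text", ".//title")
print(title)
```

Calling `generate_extract(example, "source_ATTR", ".//title")` raises a `KeyError` in this sketch, which mirrors the changed behavior described above.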
  • simon_knoll Member Posts: 40 Contributor II
    Thanks, now it works for me too. But I still have some questions.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <parameter key="logverbosity" value="3"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="1"/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="parallelize_main_process" value="false"/>
        <process expanded="true" height="629" width="950">
          <operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document" width="90" x="112" y="255">
            <parameter key="text" value="&lt;html&gt;&#10;&#9;&lt;a href=&quot;1&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;2&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;3&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;4&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;5&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;6&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;7&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;8&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;9&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;0&quot;&gt;Details&lt;/a&gt;&#10;&lt;/html&gt;&#10;"/>
            <parameter key="add label" value="false"/>
            <parameter key="label_type" value="0"/>
          </operator>
          <operator activated="true" class="text:process_documents" expanded="true" height="94" name="Process Documents" width="90" x="447" y="255">
            <parameter key="create_word_vector" value="false"/>
            <parameter key="vector_creation" value="0"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="keep_text" value="true"/>
            <parameter key="prune_method" value="0"/>
            <parameter key="prunde_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_rank" value="5.0"/>
            <parameter key="prune_above_rank" value="5.0"/>
            <parameter key="datamanagement" value="7"/>
            <parameter key="parallelize_vector_creation" value="false"/>
            <process expanded="true" height="629" width="950">
              <connect from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:generate_extract" expanded="true" height="60" name="Generate Extract" width="90" x="648" y="255">
            <parameter key="source_attribute" value="text"/>
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Nominal"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="DetailsPage" value="//h:a[text()='Details']/@href"/>
            </list>
            <list key="namespaces"/>
            <parameter key="ignore_CDATA" value="true"/>
            <parameter key="assume_html" value="true"/>
            <parameter key="value_seperator" value=";"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_op="Generate Extract" to_port="Example Set"/>
          <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Why am I getting just one result here, and not every href entry separated by ";"?
    regards
    simon
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Simon,
    as the operator documentation tries to say, if a query results in an enumeration of items, for example "en,de,fr", then these values are separated using the given characters. In any case, you have to enter the search expression more than once, each time with its own attribute name, to get more than one attribute. Where should the operator store the second value if you enter only one attribute?

    Greetings,
      Sebastian
  • simon_knoll Member Posts: 40 Contributor II
    Hello Sebastian,
    unfortunately I don't understand your suggestion. What I want to achieve is the following.
    Given this "html" code

    <html>
    <a href="1">Details</a>
    <a href="2">Details</a>
    <a href="3">Details</a>
    <a href="4">Details</a>
    <a href="5">Details</a>
    <a href="6">Details</a>
    <a href="7">Details</a>
    <a href="8">Details</a>
    <a href="9">Details</a>
    <a href="0">Details</a>
    </html>
    I want to extract all the href values (1,2,3,4,5,6,7,8,9,0).
    If I use the following XPath expression
    //a/@href
    then, from the XPath point of view, the query returns all the hrefs.
    You can check this at http://www.mizar.dk/XPath/Default.aspx

    So my question is: how can I achieve that in RapidMiner?
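
For comparison, this is a small Python sketch of what the query is expected to return for the sample page. It uses only the standard library; since `xml.etree.ElementTree` does not support attribute steps such as `/@href`, the attribute is read from each matched element instead. The function name `extract_hrefs` is made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample page from the post above (well-formed, so an XML parser accepts it).
PAGE = "<html>" + "".join(
    f'<a href="{n}">Details</a>' for n in "1234567890"
) + "</html>"


def extract_hrefs(page, link_text="Details"):
    """Return the href of every <a> element whose text equals link_text."""
    root = ET.fromstring(page)
    return [
        a.get("href")
        for a in root.iter("a")
        if a.text == link_text and a.get("href") is not None
    ]


hrefs = extract_hrefs(PAGE)
joined = ";".join(hrefs)  # one value, entries separated by ";"
print(joined)  # 1;2;3;4;5;6;7;8;9;0
```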

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    how do you want to store the href values after retrieving them? Should each href be a single example, or do you want multiple attributes?
    This is important because the two approaches differ completely.

    Greetings,
      Sebastian
  • simon_knoll Member Posts: 40 Contributor II
    For me both ways would be interesting, as I have to extract different features for different purposes.

    Greetings,
    Simon
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Simon,
    sorry for the late answer, I simply didn't find the time to answer questions here in the forum in the meantime. Here's a process that shows how both ways work:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="296" width="480">
          <operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document" width="90" x="3" y="45">
            <parameter key="text" value="&lt;html&gt;&#10;&#9;&lt;a href=&quot;1&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;2&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;3&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;4&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;5&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;6&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;7&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;8&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;9&quot;&gt;Details&lt;/a&gt;&#10;&#9;&lt;a href=&quot;0&quot;&gt;Details&lt;/a&gt;&#10;&lt;/html&gt;"/>
          </operator>
          <operator activated="true" class="text:documents_to_data" expanded="true" height="76" name="Documents to Data" width="90" x="112" y="120">
            <parameter key="text_attribute" value="text"/>
          </operator>
          <operator activated="true" class="multiply" expanded="true" height="94" name="Multiply" width="90" x="246" y="120"/>
          <operator activated="true" class="text:process_document_from_data" expanded="true" height="76" name="Process Documents from Data" width="90" x="380" y="210">
            <parameter key="create_word_vector" value="false"/>
            <list key="specify_weights"/>
            <process expanded="true" height="585" width="904">
              <operator activated="true" class="text:cut_document" expanded="true" height="60" name="Cut Document" width="90" x="112" y="30">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="unimportant" value="//a/@href"/>
                </list>
                <list key="namespaces"/>
                <parameter key="assume_html" value="false"/>
                <process expanded="true" height="585" width="904">
                  <operator activated="true" class="text:extract_information" expanded="true" height="60" name="Extract Information" width="90" x="45" y="30">
                    <parameter key="query_type" value="Regular Expression"/>
                    <list key="string_machting_queries"/>
                    <list key="regular_expression_queries">
                      <parameter key="hrefNumber" value="(.*)"/>
                    </list>
                    <list key="regular_region_queries"/>
                    <list key="xpath_queries"/>
                    <list key="namespaces"/>
                  </operator>
                  <connect from_port="segment" to_op="Extract Information" to_port="document"/>
                  <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_segment" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <connect from_port="document" to_op="Cut Document" to_port="document"/>
              <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:generate_extract" expanded="true" height="60" name="Generate Extract" width="90" x="380" y="75">
            <parameter key="source_attribute" value="text"/>
            <parameter key="query_type" value="XPath"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="AttributeName1" value="//a[1]"/>
              <parameter key="AttributeName2" value="//a[2]"/>
            </list>
            <list key="namespaces"/>
            <parameter key="assume_html" value="false"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Generate Extract" to_port="Example Set"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 2"/>
          <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    Please keep in mind the restriction that every example of an example set must have the same attributes, so creating attributes depending on the content of a text cannot be done!

    Greetings,
      Sebastian
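
The difference between the two ways can be sketched in Python with plain lists and dicts. This is only an illustration of the two result shapes, not of the operators themselves; the function names are made up, and storing the href (rather than the element content) in the fixed attributes is an assumption for the example.

```python
import xml.etree.ElementTree as ET

PAGE = (
    '<html><a href="1">Details</a><a href="2">Details</a>'
    '<a href="3">Details</a></html>'
)


def one_example_per_href(page):
    """Way 1: cut the document, so every href becomes its own example (row)."""
    root = ET.fromstring(page)
    return [{"hrefNumber": a.get("href")} for a in root.iter("a")]


def fixed_attributes(page, names):
    """Way 2: one example with a fixed list of attributes (columns).

    The attribute names are decided up front; links beyond the list are
    dropped and missing links yield None, so every example keeps the same
    attributes regardless of the page content.
    """
    anchors = list(ET.fromstring(page).iter("a"))
    row = {}
    for position, name in enumerate(names):
        has_link = position < len(anchors)
        row[name] = anchors[position].get("href") if has_link else None
    return row


rows = one_example_per_href(PAGE)
single_row = fixed_attributes(PAGE, ["AttributeName1", "AttributeName2"])
print(rows)
print(single_row)
```

The second function shows why the attribute names have to be listed explicitly: the number of columns cannot grow with the number of links found in a page.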
  • simon_knoll Member Posts: 40 Contributor II
    Hi Sebastian,
    thank you, this really helped me.
    Do you know where I can find the AttributeWeights and AttributeWeightsApplier operators in the RapidMiner GUI?

    greetings,
    simon
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    there are several weighting operators available in the Modeling / Attribute Weighting group. You can then use the Scale by Weights operator to apply these weights.

    Greetings,
      Sebastian
  • simon_knoll Member Posts: 40 Contributor II
    Hi Sebastian,
    thank you, but I have not figured out how I can "create" weights for different attributes and pipe them, for instance, to the "Scale by Weights" operator.

    best regards
    simon
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    take a look at the Data to Weights operator. With it you can convert an example set into a weight vector. You could create an example set holding these weights, for example with the logging functionality, and then turn the log into an ExampleSet by using the Log to Data operator.

    Greetings,
      Sebastian
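
As a rough picture of what this chain does, here is a Python sketch under stated assumptions: an example set is a list of dicts, the weight table has the hypothetical columns `attribute` and `weight`, and the two functions only imitate the idea behind Data to Weights and Scale by Weights; they are not RapidMiner's implementation.

```python
def data_to_weights(weight_rows, name_key="attribute", weight_key="weight"):
    """Turn a table of (attribute name, weight) rows into a weight vector."""
    return {row[name_key]: float(row[weight_key]) for row in weight_rows}


def scale_by_weights(examples, weights, default=1.0):
    """Multiply every numeric attribute value by the weight of its attribute."""
    return [
        {name: value * weights.get(name, default) for name, value in ex.items()}
        for ex in examples
    ]


# Hypothetical weight table, e.g. assembled by hand or from a log.
weight_rows = [
    {"attribute": "title_terms", "weight": 2},
    {"attribute": "link_terms", "weight": 1},
]
# Hypothetical numeric example set with one example.
examples = [{"title_terms": 3.0, "link_terms": 5.0}]

weights = data_to_weights(weight_rows)
scaled = scale_by_weights(examples, weights)
print(scaled)  # [{'title_terms': 6.0, 'link_terms': 5.0}]
```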
  • simon_knoll Member Posts: 40 Contributor II
    Hey Sebastian,
    thank you for your answer, but I don't get it.
    I have a process like this:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="546" width="1016">
          <operator activated="true" class="text:create_document" expanded="true" height="60" name="Create Document" width="90" x="45" y="165">
            <parameter key="text" value="&lt;html&gt;&#13;&#10;&#9;&lt;head&gt;&lt;title&gt;Der Titel ist sehr toll&lt;/title&gt;&lt;/head&gt;&#13;&#13;&#10;&#9;&lt;a href=&quot;http://f12010.info&quot;&gt;formel1&lt;/a&gt;&#13;&#10;&#9;&#13;&lt;a href=&quot;http://dsds-2009.info&quot;&gt;und einen dritten link&lt;/a&gt;&#13;&#10;&#9;&lt;a href=&quot;http://simonknoll.com&quot;&gt;semmel&lt;/a&gt;&#13;&#10;&#9;&lt;title&gt;Wir Haben auch einen zweitet Titel&lt;/title&gt;&#10;&lt;/html&gt;"/>
            <parameter key="label_type" value="numeric"/>
          </operator>
          <operator activated="true" class="multiply" expanded="true" height="94" name="Multiply" width="90" x="179" y="165"/>
          <operator activated="true" class="text:process_documents" expanded="true" height="94" name="Process Documents (2)" width="90" x="313" y="255">
            <parameter key="create_word_vector" value="false"/>
            <process expanded="true">
              <operator activated="true" class="text:cut_document" expanded="true" height="60" name="Cut Document (2)" width="90" x="394" y="30">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="html_linktext" value="//h:a/text()"/>
                </list>
                <list key="namespaces"/>
                <process expanded="true">
                  <operator activated="true" class="text:extract_information" expanded="true" height="60" name="Extract Information (2)" width="90" x="394" y="30">
                    <parameter key="query_type" value="Regular Expression"/>
                    <list key="string_machting_queries"/>
                    <list key="regular_expression_queries">
                      <parameter key="use_it" value="(.*)"/>
                    </list>
                    <list key="regular_region_queries"/>
                    <list key="xpath_queries"/>
                    <list key="namespaces"/>
                  </operator>
                  <connect from_port="segment" to_op="Extract Information (2)" to_port="document"/>
                  <connect from_op="Extract Information (2)" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_segment" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <connect from_port="document" to_op="Cut Document (2)" to_port="document"/>
              <connect from_op="Cut Document (2)" from_port="documents" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:process_documents" expanded="true" height="94" name="Process Documents" width="90" x="313" y="30">
            <parameter key="create_word_vector" value="false"/>
            <process expanded="true">
              <operator activated="true" class="text:cut_document" expanded="true" height="60" name="Cut Document" width="90" x="246" y="165">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="html_title" value="//h:title/text()"/>
                </list>
                <list key="namespaces"/>
                <process expanded="true">
                  <operator activated="true" class="text:extract_information" expanded="true" height="60" name="Extract Information" width="90" x="246" y="30">
                    <parameter key="query_type" value="Regular Expression"/>
                    <list key="string_machting_queries"/>
                    <list key="regular_expression_queries">
                      <parameter key="use_it" value="(.*)"/>
                    </list>
                    <list key="regular_region_queries"/>
                    <list key="xpath_queries"/>
                    <list key="namespaces"/>
                  </operator>
                  <connect from_port="segment" to_op="Extract Information" to_port="document"/>
                  <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_segment" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <connect from_port="document" to_op="Cut Document" to_port="document"/>
              <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="generate_id" expanded="true" height="76" name="Generate ID" width="90" x="447" y="30"/>
          <operator activated="true" class="generate_id" expanded="true" height="76" name="Generate ID (2)" width="90" x="447" y="255"/>
          <operator activated="true" class="union" expanded="true" height="76" name="Union" width="90" x="581" y="120"/>
          <connect from_op="Create Document" from_port="output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Process Documents (2)" to_port="documents 1"/>
          <connect from_op="Process Documents (2)" from_port="example set" to_op="Generate ID (2)" to_port="example set input"/>
          <connect from_op="Process Documents" from_port="example set" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Union" to_port="example set 1"/>
          <connect from_op="Generate ID (2)" from_port="example set output" to_op="Union" to_port="example set 2"/>
          <connect from_op="Union" from_port="union" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    In this process I extract some features from an HTML document (for simplicity generated here by the "Create Document" operator).
    The extracted features result in the following example set:

    Row No.  id   query_key      use_it
    -----------------------------------------------------------------
    1        1.0  html_title     Der Titel ist sehr toll
    2        2.0  html_title     Wir Haben auch einen zweitet Titel
    3        1.0  html_linktext  formel1
    4        2.0  html_linktext  und einen dritten link
    5        3.0  html_linktext  semmel
    Now my question: how can I add a weighting for the different features that I extracted (e.g. weight html_title with 2 and html_linktext with 1)? This could then result in an example set like the following (or however a weighting looks; I added a weight column just to make the point):

    Row No.  id   query_key      use_it                              weight
    ---------------------------------------------------------------------------------
    1        1.0  html_title     Der Titel ist sehr toll             2
    2        2.0  html_title     Wir Haben auch einen zweitet Titel  2
    3        1.0  html_linktext  formel1                             1
    4        2.0  html_linktext  und einen dritten link              1
    5        3.0  html_linktext  semmel                              1
    thanks in advance
    simon
  • TobiasMalbrecht Moderator, Employee, Member Posts: 295 RM Product Management
    Hi Simon,

    if this weight should depend only on the query_key, this is no problem. Simply use the Generate Attributes operator with if(query_key="html_title",2,1) as the expression. Of course, you can nest the if(...,...,...) expressions as you like.

    Kind regards,
    Tobias
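
The effect of that expression can be sketched in Python on the example set from the post above. The rows are a plain list of dicts and `add_weight` is a made-up helper; it only mirrors the if(...) logic and is not the Generate Attributes operator.

```python
rows = [
    {"id": 1.0, "query_key": "html_title", "use_it": "Der Titel ist sehr toll"},
    {"id": 2.0, "query_key": "html_title", "use_it": "Wir Haben auch einen zweitet Titel"},
    {"id": 1.0, "query_key": "html_linktext", "use_it": "formel1"},
    {"id": 2.0, "query_key": "html_linktext", "use_it": "und einen dritten link"},
    {"id": 3.0, "query_key": "html_linktext", "use_it": "semmel"},
]


def add_weight(rows):
    """Add a weight attribute: 2 for html_title rows, 1 for everything else."""
    return [
        dict(row, weight=2 if row["query_key"] == "html_title" else 1)
        for row in rows
    ]


weighted = add_weight(rows)
for row in weighted:
    print(row["query_key"], row["use_it"], row["weight"])
```

More query keys could be handled by chaining further conditions, which corresponds to nesting the if expressions.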
  • simon_knoll Member Posts: 40 Contributor II
    Thank you for your advice.

    My question now is how I can feed a k-means algorithm with this data if I want to cluster the documents by the extracted features. If I just pass the resulting ExampleSet as input, it clusters every single example on its own. But I want to cluster the documents, not the extractions.
    Any advice?

    best regards
    simon
  • simon_knoll Member Posts: 40 Contributor II
    Maybe a screenshot of an example set helps.

    Here I have an ExampleSet with several examples describing 2 different objects.
    If I want to apply a clustering algorithm to this and cluster these 2 objects (in reality there are obviously more than 2) rather than every single example, how do I have to do it?

    image

    best regards
    simon knoll