Generate Aggregation - Problem with non-existent attributes

colo Member Posts: 236 Maven
edited September 2019 in Help
Hello everybody,

I'm trying to build some custom scoring functions (mainly with the "Generate Attributes" operator). Among other things I need to do calculations on word occurrences. At the moment I have an example set with all the attributes from a previously created word vector. Now I want to sum up the occurrences of some similar words and store this value in a new attribute. As far as I know this can easily be done with the "Generate Aggregation" operator.

Setting "regular_expression" as the attribute filter type and using an expression for those similar words seemed to work well at first. Adding some more expressions finally led to an error: "AttributeFactory: cannot create attribute with value type 'attribute_value' (0)!". This message appears when a regular expression for the attribute search doesn't produce any matches. Is this behavior intended?

Since I don't know whether words matching my regex patterns are present in the documents to be tested, I would prefer a default value (0 for my sum aggregation). How can I avoid this problem? Do I need to check for the existence of the desired attributes beforehand (perhaps with "Select Attributes", somehow counting the size of the resulting attribute set)? I hope my problem is understandable; it's more of a general question than a process-specific problem. I would appreciate any hints and help.

Thanks in advance!
Matthias
Answers

  • haddock Member Posts: 849 Maven
    Hi,

    Context vectors can be used to work out which words are similar to which other words, in the same way that they can be used to postulate similarity between documents.

  • colo Member Posts: 236 Maven
    Hey,

    thank you for this additional information. Perhaps context vectors might become useful later...

    But for now I still need a good way to use "Generate Aggregation" with the sum or count function on a set of attributes defined by a simple regex. It may well happen that no matching attributes are found, so I need a way to avoid the errors resulting from this case. I tried to solve this by checking beforehand whether any attributes match the regex, but the way I chose can't be the right one (hopefully).
    I used "Select Attributes" with the same regex to get the matching attributes. I didn't find a simple way to count attributes, though I suspect there is one that I overlooked. To be able to carry on for the moment, I added an empty dummy attribute to avoid the same problem in the following "Generate Aggregation" (filter type: all, aggregation function: count), which should simply count all existing regular attributes. Combined with some mad "Multiply" connections and "Branch" operators this leads to the desired result (adding a sum aggregation attribute only if terms matching the regex exist in the word vector). But this definitely isn't the way to go, so please help me clean up my process again ;)

    Thanks,
    Matthias
  • haddock Member Posts: 849 Maven
    Hi Again,

    Here's how to count the attributes that fit the bill:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="353" width="808">
          <operator activated="true" class="generate_data" expanded="true" height="60" name="Generate Data" width="90" x="160" y="68">
            <parameter key="target_function" value="simple non linear classification"/>
          </operator>
          <operator activated="true" class="work_on_subset" expanded="true" height="76" name="Work on Subset" width="90" x="293" y="66">
            <parameter key="attribute_filter_type" value="regular_expression"/>
            <parameter key="regular_expression" value=".*3|.*4"/>
            <process expanded="true" height="353" width="808">
              <operator activated="true" class="extract_macro" expanded="true" height="60" name="Extract Macro" width="90" x="179" y="30">
                <parameter key="macro" value="blob"/>
                <parameter key="macro_type" value="number_of_attributes"/>
              </operator>
              <operator activated="true" class="log" expanded="true" height="76" name="Log" width="90" x="313" y="30">
                <list key="log">
                  <parameter key="Atts" value="operator.Extract Macro.value.macro_value"/>
                </list>
              </operator>
              <connect from_port="exampleSet" to_op="Extract Macro" to_port="example set"/>
              <connect from_op="Extract Macro" from_port="example set" to_op="Log" to_port="through 1"/>
              <connect from_op="Log" from_port="through 1" to_port="example set"/>
              <portSpacing port="source_exampleSet" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Work on Subset" to_port="example set"/>
          <connect from_op="Work on Subset" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Just to be clear about my previous reference to context vectors: the example set that results from the RM document processes is a context vector, because it has a document identifier field and a word vector. If you can specify up front which similarities you are interested in, you can use a regex; if you can't, you can use the context vectors to work out some form of similarity for you.
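
For readers who want the logic of the process above without opening RapidMiner, here is a minimal sketch in Python. It is not RapidMiner code: the function names, the dictionary standing in for one example, and the attribute names are all hypothetical, and it assumes the regex has to match the whole attribute name. It counts the attributes selected by the regex and only sums when the count is greater than zero, returning a default otherwise.

```python
import re

def matching_attributes(attribute_names, pattern):
    """Return the attribute names that match the regex as a whole."""
    regex = re.compile(pattern)
    return [name for name in attribute_names if regex.fullmatch(name)]

def sum_matching(example, pattern, default=0):
    """Sum the values of all matching attributes, or return `default`
    when the regex selects no attribute at all."""
    names = matching_attributes(example, pattern)
    if not names:  # the case that makes the aggregation fail in the thread
        return default
    return sum(example[name] for name in names)

# Hypothetical example row; the attribute names are illustrative only.
example = {"att1": 1.0, "att2": 2.0, "att3": 3.0, "att4": 4.0}

count = len(matching_attributes(example, ".*3|.*4"))  # same regex as in the process
total = sum_matching(example, ".*3|.*4")
missing = sum_matching(example, "word_.*")  # nothing matches, so the default is used
```

In the process itself, the count is what "Extract Macro" stores in the macro, and the check for an empty match would be done by a branch on that macro.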

  • colo Member Posts: 236 Maven
    Thank you haddock,

    I was expecting to find some counting feature within the "Select Attributes" operator and didn't think of "Extract Macro" in this context. I think it takes some practical experience to get familiar with the important operators and to get a feeling for using them efficiently. As soon as you solve one problem there is a new one already waiting (new topic). Hope you have some patience with beginners ;)
  • haddock Member Posts: 849 Maven
    Hi Colo,

    Glad to have been of some use  ;D Given the large number of operators, most users only become familiar with the ones appropriate for their domain, which tends to mean they are restricted by their preconceptions. That's why I bother with this forum: it makes me look into the other areas. Only rarely do you get acknowledgement, let alone thanks!
    You wrote: "Hope you have some patience with beginners"
    I am a grumpy old fart and don't bother to answer those who can't be bothered to help themselves - unlike the RM staffers who are saintly in their tolerance, young, and actually know something... bastards  :D
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    to continue our holy mission to bring some enlightenment to those who are no bastards or blessed with knowledge, let me add:
    If an operation can be expected to fail from time to time, it might be appropriate to wrap it in the Handle Exception operator, which catches the error and lets the process continue. All you have to do is ensure that the rest of the process still runs fine when the operation has failed. Usually a process branch helps a lot on this mission...

    Greetings,
      Sebastian
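
The idea behind this suggestion can be sketched in Python as a plain try/except with a fallback value. Again this is only an illustration under assumptions, not RapidMiner code: `generate_aggregation` is a hypothetical stand-in that fails when no attribute matches (as the operator does in this thread), and `NoMatchingAttributes` is an invented exception type.

```python
import re

class NoMatchingAttributes(Exception):
    """Hypothetical error raised when the regex selects no attribute."""

def generate_aggregation(example, pattern):
    """Stand-in for the failing step: sums matching attributes and
    raises if the regex does not match any attribute name."""
    regex = re.compile(pattern)
    values = [value for name, value in example.items() if regex.fullmatch(name)]
    if not values:
        raise NoMatchingAttributes(pattern)
    return sum(values)

def safe_aggregation(example, pattern, fallback=0):
    """Try the aggregation; on failure continue with the fallback value."""
    try:
        return generate_aggregation(example, pattern)
    except NoMatchingAttributes:
        return fallback

# Hypothetical word-vector row.
row = {"cat": 2, "cats": 3, "tree": 1}
found = safe_aggregation(row, "cats?")      # sums "cat" and "cats"
not_found = safe_aggregation(row, "dogs?")  # falls back to 0
```

The fallback plays the role of whatever the rest of the process needs in order to keep running, for example a sum attribute filled with 0.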