How to compare two sets of attribute values?

Anusha Member Posts: 19 Maven
Hi All!

I have a dataset with 7 columns. One column is "no", and the other 6 columns form two sets of three.
set-1: the three columns "attr1_1", "attr1_2", "attr1_3".
set-2: the other three columns "attr2_1", "attr2_2", "attr2_3".
I want to compare these two sets of columns: if any column in the first set matches any column in the second set, I need to set a flag value to "1".

sample Input & Output:

Input:
no    attr1_1            attr1_2            attr1_3            attr2_1            attr2_2            attr2_3
234   "klo","12","78"    "jkl","13","78"    "jkl","14","89"    "klo","12","78"    "hj","31","4"      "kl","9","0"
456   "klo","12","78"    "klo","12","78"    "ko","12","78"     "jkl","13","78"    "jkl","13","78"    "hj","31","4"


output:
no    attr1_1            attr1_2            attr1_3            attr2_1            attr2_2            attr2_3            flag
234   "klo","12","78"    "jkl","13","78"    "jkl","14","89"    "klo","12","78"    "hj","31","4"      "kl","9","0"       1
456   "klo","12","78"    "klo","12","78"    "ko","12","78"     "jkl","13","78"    "jkl","13","78"    "hj","31","4"      0



In the first row ("234"), attr1_1 ("klo","12","78") matches attr2_1 ("klo","12","78"), so the output flag becomes "1".
In the second row ("456"), none of the set-1 columns (attr1) match any of the set-2 columns (attr2), so the flag is "0".


Could anyone help me in solving this?

Thanks in Advance!

Best Answer

  • Edin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist
    Solution Accepted
    I did not execute the process but could directly see a problem with your expression.
    if(%{loop_attribute}==%{loop_attribute1},1,0) 
    needs to be
    if(#{loop_attribute}==#{loop_attribute1},1,Flag)
    Explanation:
    - A macro written with % is replaced by its String value, which here is the attribute name. A macro written with # is treated as an Attribute name, so the Attribute's value is used.
    - With 1,0, a Flag value of 1 set in an earlier iteration can be overwritten by 0. That is why you need 1,Flag.
    Happy Mining,
    Edin
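The two points in this answer can be checked outside RapidMiner. The following is a minimal Python sketch, not RapidMiner code: it mimics the nested loops on row "234" from the question, and the helper names `run_loops`, `by_name` and `by_value` are made up for illustration. Comparing names stands in for the % form of the macro, comparing cell values stands in for the # form.

```python
# Row "234" from the sample input; each cell is kept as a raw string.
row = {
    "attr1_1": '"klo","12","78"',
    "attr1_2": '"jkl","13","78"',
    "attr1_3": '"jkl","14","89"',
    "attr2_1": '"klo","12","78"',
    "attr2_2": '"hj","31","4"',
    "attr2_3": '"kl","9","0"',
}
set1 = ["attr1_1", "attr1_2", "attr1_3"]
set2 = ["attr2_1", "attr2_2", "attr2_3"]


def run_loops(compare, keep_previous):
    """Mimic the nested Loop Attributes and return the final flag."""
    flag = 0
    for a in set1:        # outer loop, macro loop_attribute
        for b in set2:    # inner loop, macro loop_attribute1
            if compare(a, b):
                flag = 1
            elif not keep_previous:
                flag = 0  # the "1,0" form resets an earlier match
    return flag


def by_name(a, b):
    return a == b  # like %{...}: compares the attribute names themselves


def by_value(a, b):
    return row[a] == row[b]  # like #{...}: compares the cell values


print(run_loops(by_name, keep_previous=False))   # 0: the names never match
print(run_loops(by_value, keep_previous=False))  # 0: match found, then overwritten
print(run_loops(by_value, keep_previous=True))   # 1: the match survives
```

Only the last variant, which combines value comparison with keeping the previous flag, reports the match between attr1_1 and attr2_1.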



Answers

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi,

    you are also doing a cartesian join here (matching every row with every other row from the second example set), do I understand this correctly?

    Are you relying on the last number (attr1_1 = attr2_1) here or are you accepting matches in different attribute "numbers"?

    I would use Cartesian Product first and then one or two Loop Attributes operators depending on the comparison logic.

    For the "attribute number is relevant" case, I'd loop over the attr1_.+ (regular expression) attributes, use Generate Macro to change attr1 to attr2, and compare the current macro value with the generated matching comparison value. Then use Generate Attributes with flag = if (%{attr} == %{comparison}, 1, max(flag, 0)). (You would pre-create the flag attribute with the value 0.)

    Regards,

    Balázs
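The "attribute number is relevant" case described above can be sketched in Python to show how it differs from an any-to-any comparison. This is an illustration only, assuming each example is a dict of cell strings and that cell values (not names) are compared; `paired_flag` is a hypothetical helper, and the second row is invented to show a match in a different position.

```python
import re


def paired_flag(row):
    """Return 1 if some attr1_k equals the attr2_k with the same suffix k."""
    flag = 0  # the pre-created flag attribute
    for name in row:
        if re.fullmatch(r"attr1_.+", name):
            partner = name.replace("attr1", "attr2", 1)  # the Generate Macro step
            flag = 1 if row[name] == row.get(partner) else max(flag, 0)
    return flag


# Row "234" from the sample input: attr1_1 equals attr2_1.
row_234 = {
    "attr1_1": '"klo","12","78"',
    "attr1_2": '"jkl","13","78"',
    "attr1_3": '"jkl","14","89"',
    "attr2_1": '"klo","12","78"',
    "attr2_2": '"hj","31","4"',
    "attr2_3": '"kl","9","0"',
}
# Hypothetical row (not from the thread): the only match is attr1_1 vs attr2_2.
shifted = {
    "attr1_1": "x", "attr1_2": "y", "attr1_3": "z",
    "attr2_1": "a", "attr2_2": "x", "attr2_3": "b",
}

print(paired_flag(row_234))  # 1
print(paired_flag(shifted))  # 0
```

The second row would be flagged by an any-to-any comparison, so the choice between the two variants changes the result.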
  • Anusha Member Posts: 19 Maven
    Hi @BalazsBarany,

    Thanks for the response.

    For your information:
    you are also doing a cartesian join here (matching every row with every other row from the second example set), do I understand this correctly?
    I'm not doing cartesian join here.

    Are you relying on the last number (attr1_1 = attr2_1) here or are you accepting matches in different attribute "numbers"?
    No, any column value from set-1 may match any column value from set-2.

    Why do you want to use Cartesian Product here? I don't understand this.

    I have used 2 Loop Attributes operators, and in each one I selected a column set using a regular expression. I haven't used Generate Macro because Loop Attributes already has the parameter "attribute name macro". Inside the 2nd Loop Attributes, I used Generate Attributes with the condition flag = if (%{attr1} == %{attr2}, 1, 0), where "attr1" is the attribute name macro of the 1st Loop Attributes operator and "attr2" is the attribute name macro of the 2nd. But it's not working as per my requirement.
    I'm getting the results in IOObjectCollection folders, with 2 example sets in each folder. Every example set has only these 5 columns: attr1_1, attr1_2, attr2_1, attr2_2, flag, and the values in all example sets are the same.

    In the final output I need all columns together with the flag, not as separate example sets. How can I get this?

    Thanks in Advance.


  • Anusha Member Posts: 19 Maven
    For my requirement, I have developed a hardcoded process using Generate Attributes, with an if condition that covers all combinations.
    In my example above there are 3 attributes in set-1 and 3 attributes in set-2, so the condition is if(attr1_1==attr2_1 || attr1_1== attr2_2 || attr1_1==attr2_3 || attr1_2== attr2_1 || attr1_2== attr2_2 || attr1_2==attr2_3 || attr1_3== attr2_1 || attr1_3== attr2_2 || attr1_3==attr2_3, "1","0").
    It works fine, but I don't want this static condition, because I may have more columns in each set. How can I achieve this?

    can anyone help me, please?
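The hardcoded condition above asks whether the two sets of values share at least one element, which does not depend on how many columns each set has. A minimal Python sketch of that idea, for illustration only (it is not a RapidMiner process, and `any_match_flag` is a hypothetical name):

```python
def any_match_flag(row, set1, set2):
    """Return "1" if any set-1 value equals any set-2 value, else "0"."""
    values1 = {row[name] for name in set1}
    values2 = {row[name] for name in set2}
    return "1" if values1 & values2 else "0"


names = ["attr1_1", "attr1_2", "attr1_3", "attr2_1", "attr2_2", "attr2_3"]
set1, set2 = names[:3], names[3:]

# The two sample rows from the question.
row_234 = dict(zip(names, ['"klo","12","78"', '"jkl","13","78"', '"jkl","14","89"',
                           '"klo","12","78"', '"hj","31","4"', '"kl","9","0"']))
row_456 = dict(zip(names, ['"klo","12","78"', '"klo","12","78"', '"ko","12","78"',
                           '"jkl","13","78"', '"jkl","13","78"', '"hj","31","4"']))

print(any_match_flag(row_234, set1, set2))  # "1"
print(any_match_flag(row_456, set1, set2))  # "0"
```

Adding more columns to either set only lengthens the two name lists; the comparison itself stays the same.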
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi,

    nested Loop Attributes operators are a good way to achieve what you want. Just be careful how you set them up. You'll need to check "reuse results" if you're working on the current example set instead of generating new ones. If it is not working, set a breakpoint after the Generate Attributes that sets your "flag" attribute and check step by step.

    I assumed Cartesian Product because you are combining two sets into one and then comparing each element of Set 1 with each element of Set 2. This is the main use case for a cartesian product (cartesian join).

    If you want to keep the data separate, you can use nested Loop Examples, always filter for the current example (e.g. with Filter Example Range), do the comparison there (but that will also need Loop Attributes if it needs to be generic) and set up the resulting example set.

    Regards,
    Balázs
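The effect of "reuse results" mentioned above can be pictured with a toy model. This is only a sketch in Python, not RapidMiner code, and it rests on an assumption: with the option off, every iteration starts from the unchanged input and the loop returns one result per iteration, while with the option on, each iteration receives the previous iteration's output. The names `loop_attributes` and `mark_if_target` are made up for illustration.

```python
def loop_attributes(data, columns, body, reuse_results):
    """Toy model of a loop over columns (hypothetical, not RapidMiner's API)."""
    if reuse_results:
        for column in columns:
            data = body(dict(data), column)  # output feeds the next iteration
        return data                          # one combined result
    # Each iteration sees a fresh copy of the input; results pile up in a list.
    return [body(dict(data), column) for column in columns]


def mark_if_target(data, column, target='"klo","12","78"'):
    """Set flag to 1 when the column holds the target value, else keep it."""
    if data[column] == target:
        data["flag"] = 1
    return data


row = {"attr1_1": '"klo","12","78"', "attr1_2": '"jkl","13","78"', "flag": 0}
columns = ["attr1_1", "attr1_2"]

separate = loop_attributes(row, columns, mark_if_target, reuse_results=False)
combined = loop_attributes(row, columns, mark_if_target, reuse_results=True)

print(len(separate))     # 2: one result per iteration
print(combined["flag"])  # 1: a single result that carries the flag forward
```

Under that assumption, a collection of near-identical example sets as output is a hint that the option is off in at least one of the loops.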
  • Anusha Member Posts: 19 Maven
    Hi @BalazsBarany,

    Thanks for the reply.

    Even after using Loop Attributes and Generate Attributes, I'm not getting the required result. The flag value is "0" even though there is a match between the set-1 and set-2 attributes.
  • Edin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist
    Hi @Anusha ,

    the approach Balázs proposes should work.
    You basically just need 4 Operators:
    - Generate Attributes to create Flag with value 0
    - Loop Attributes only over the set1 Attributes (in your example using RegEx attr1_.*)
    - inside, Loop Attributes over all comparison sets (in your example using RegEx attr.* with the except expression attr1_.*)
    - inside the second Loop Attributes, Generate Attributes where you compare both loop attributes (make sure to override the default macro name) and overwrite Flag with 1 or the previous Flag value
    It is important to check "reuse results".

    Happy Mining,
    Edin
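The four operators listed above can be mimicked in plain Python to check the logic on the sample data. This is a minimal sketch, not a RapidMiner process: it assumes each example is a dict and that the comparison uses cell values, and `add_flag` with its parameters `include` and `first_set` is a hypothetical name.

```python
import re


def add_flag(rows, include=r"attr.*", first_set=r"attr1_.*"):
    """Add a Flag column: 1 if any first-set value equals any other attr value."""
    for row in rows:
        names = list(row)
        row["Flag"] = 0  # step 1: create Flag with value 0
        # Step 2: outer loop over the attributes matching attr1_.*
        outer = [n for n in names if re.fullmatch(first_set, n)]
        # Step 3: inner loop over attr.* except attr1_.*
        inner = [n for n in names
                 if re.fullmatch(include, n) and not re.fullmatch(first_set, n)]
        for a in outer:
            for b in inner:
                # Step 4: overwrite Flag with 1 or with its previous value.
                row["Flag"] = 1 if row[a] == row[b] else row["Flag"]
    return rows


names = ["attr1_1", "attr1_2", "attr1_3", "attr2_1", "attr2_2", "attr2_3"]
rows = [
    {"no": 234, **dict(zip(names, ['"klo","12","78"', '"jkl","13","78"',
                                   '"jkl","14","89"', '"klo","12","78"',
                                   '"hj","31","4"', '"kl","9","0"']))},
    {"no": 456, **dict(zip(names, ['"klo","12","78"', '"klo","12","78"',
                                   '"ko","12","78"', '"jkl","13","78"',
                                   '"jkl","13","78"', '"hj","31","4"']))},
]

for row in add_flag(rows):
    print(row["no"], row["Flag"])  # 234 1, then 456 0
```

Because the columns are selected by pattern, the same code covers sets with more than three columns each.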
  • Anusha Member Posts: 19 Maven
    edited May 2021
    Hi @Edin_Klapic & @BalazsBarany,

    I have followed the same procedure, but I'm not getting the flag value "1" even though values of the set-1 attributes match values of the set-2 attributes.

    Please find the process below.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.9.000">
      <operator activated="true" class="retrieve" compatibility="9.9.000" expanded="true" height="68" name="Retrieve" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Local Repository/data/drawing_after_split"/>
      </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="9.9.000">
      <operator activated="true" class="concurrency:loop_attributes" compatibility="9.9.000" expanded="true" height="82" name="Loop Attributes" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="regular_expression"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="regular_expression" value="attr1_.*"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="numeric_condition" value="&gt;0"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="attribute_name_macro" value="loop_attribute"/>
        <parameter key="reuse_results" value="true"/>
        <parameter key="enable_parallel_execution" value="true"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:loop_attributes" compatibility="9.9.000" expanded="true" height="82" name="Loop Attributes (3)" width="90" x="179" y="34">
            <parameter key="attribute_filter_type" value="regular_expression"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="regular_expression" value="attr.*"/>
            <parameter key="use_except_expression" value="true"/>
            <parameter key="except_regular_expression" value="attr1_.*"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="attribute_name_macro" value="loop_attribute1"/>
            <parameter key="reuse_results" value="true"/>
            <parameter key="enable_parallel_execution" value="true"/>
            <process expanded="true">
              <operator activated="true" class="generate_attributes" compatibility="9.9.000" expanded="true" height="82" name="Generate Attributes (3)" width="90" x="179" y="34">
                <list key="function_descriptions">
                  <parameter key="isdocup" value="if(%{loop_attribute}==%{loop_attribute1},1,0)"/>
                </list>
                <parameter key="keep_all" value="true"/>
              </operator>
              <connect from_port="input 1" to_op="Generate Attributes (3)" to_port="example set input"/>
              <connect from_op="Generate Attributes (3)" from_port="example set output" to_port="output 1"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="input 1" to_op="Loop Attributes (3)" to_port="input 1"/>
          <connect from_op="Loop Attributes (3)" from_port="output 1" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
    </process>

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi,

    I don't get a valid process when pasting your XML into RapidMiner, but I looked into the parameter values.

    You are using this expression: isdocup = if(%{loop_attribute}==%{loop_attribute1},1,0)

    However, if isdocup was already 1 but is reset later to 0, that's what you get as the end result. 

    Try something like: max(isdocup, if(%{loop_attribute}==%{loop_attribute1},1,0))

    So if isdocup was ever 1 in the current row, it stays that way.

    Regards,
    Balázs
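The max(...) form suggested above and the "1, previous flag" form from the accepted answer accumulate the flag in the same way. A small Python check of that equivalence, for illustration only: the function names are hypothetical, and the sketch covers just the accumulation, not the choice between % and # macros.

```python
def keep_with_if(flag, matched):
    """The if(match, 1, Flag) form: keep the old flag when there is no match."""
    return 1 if matched else flag


def keep_with_max(flag, matched):
    """The max(flag, if(match, 1, 0)) form."""
    return max(flag, 1 if matched else 0)


# Both forms agree for every combination of old flag and comparison result.
for flag in (0, 1):
    for matched in (False, True):
        assert keep_with_if(flag, matched) == keep_with_max(flag, matched)
print("both forms agree")
```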