
find common attributes among two examplesets

Hamedf Member Posts: 3 Contributor I
Good day!
I have two example sets with many attributes, and the two attribute sets are not completely the same.
I want to find the attributes they have in common and filter both example sets so that only the common attributes remain.
The examples (values) are not important; only the structure of the data sets matters.

Regards

Answers

  • kayman Member Posts: 662 Unicorn
    Maybe use the superset option? 

    This allows you to merge the two datasets, and then you filter out the ones which are not common.

    One way to do this would be to generate an identifier for both sets (e.g. generate attribute set1 and set2 for both respectively), then create a superset, filter cases that have both set1 and set2, and finally remove empty attributes.

    It's a bit hard to explain without a better understanding of the actual data, but it's a quick and dirty way to achieve this.
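
To make the superset idea concrete, here is a minimal sketch in plain Python (not a RapidMiner process). It rests on one reading of the suggestion above: after both sets are padded to the union of attributes, an attribute that is entirely empty in one of the padded sets was not present in that set originally. The list-of-dicts representation and the names `superset`, `common_attributes`, `docs_a` and `docs_b` are hypothetical and only serve the illustration.

```python
# Each example set is modelled as a list of dicts (attribute name -> value).
# This is an illustration in plain Python, not RapidMiner's Superset operator.

def superset(set_a, set_b):
    """Give both sets the union of attributes, padding gaps with None."""
    names = []
    for row in set_a + set_b:
        for name in row:
            if name not in names:
                names.append(name)

    def pad(rows):
        return [{name: row.get(name) for name in names} for row in rows]

    return pad(set_a), pad(set_b), names


def common_attributes(set_a, set_b):
    """Keep the attributes that are non-empty in both padded sets."""
    padded_a, padded_b, names = superset(set_a, set_b)

    def has_values(rows, name):
        return any(row[name] is not None for row in rows)

    return [n for n in names if has_values(padded_a, n) and has_values(padded_b, n)]


# Hypothetical keyword counts for two small document sets.
docs_a = [{"engine": 1, "valve": 0, "piston": 2}]
docs_b = [{"valve": 3, "sensor": 1, "engine": 0}]
print(common_attributes(docs_a, docs_b))  # ['engine', 'valve']
```

One caveat of this reading: an attribute that exists in a set but contains only missing values would be dropped as well, so it is safe only when every original attribute has at least one value.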
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Yes, that works. Or just create an identifier (Generate ID) and do an inner join.
  • tamberge Member Posts: 6 Contributor II
    edited May 2019

    Hi kayman, hi sgenzer,

    I have the same issue, but I find it hard to apply the hints you have given.

    In my case I have two example sets. Both are keyword-document matrices, i.e. text data converted to structured data, in which each attribute is a keyword that appears in the set of documents and each example represents a document.

    Now I want to find out which keywords (not examples/documents) the two matrices have in common.

    I tried both of the approaches described above, but neither was sufficient.

    Is there anything I have to keep in mind when doing that?

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Hi @tamberge, I think you're going to have to give us some actual data and a process XML to play with on this. It's really hard (at least for me) to understand your situation without it.

    Scott

  • tamberge Member Posts: 6 Contributor II
    edited May 2019

    Hi @sgenzer, sorry. Sure, please find the XML code and the data enclosed.

    If there is anything wrong with the uploading format, please let me know!

    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.000-BETA">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.3.000-BETA" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.3.000-BETA" expanded="true" height="68" name="Retrieve PreppedDatabased_TF_00" width="90" x="45" y="238">
            <parameter key="repository_entry" value="//20190923_Outlier Detection/01_Data/012_Single/PreppedDatabased_TF_00"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="9.3.000-BETA" expanded="true" height="82" name="Generate ID (2)" width="90" x="246" y="238">
            <parameter key="create_nominal_ids" value="false"/>
            <parameter key="offset" value="47"/>
          </operator>
          <operator activated="true" class="retrieve" compatibility="9.3.000-BETA" expanded="true" height="68" name="Retrieve PreppedDatabase" width="90" x="45" y="34">
            <parameter key="repository_entry" value="//20190503_PatentDataNLP/001_Data/PreppedDatabase"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="9.3.000-BETA" expanded="true" height="82" name="Generate ID" width="90" x="246" y="34">
            <parameter key="create_nominal_ids" value="false"/>
            <parameter key="offset" value="0"/>
          </operator>
          <operator activated="true" class="superset" compatibility="9.3.000-BETA" expanded="true" height="82" name="Superset" width="90" x="447" y="34">
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <connect from_op="Retrieve PreppedDatabased_TF_00" from_port="output" to_op="Generate ID (2)" to_port="example set input"/>
          <connect from_op="Generate ID (2)" from_port="example set output" to_op="Superset" to_port="example set 2"/>
          <connect from_op="Retrieve PreppedDatabase" from_port="output" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Superset" to_port="example set 1"/>
          <connect from_op="Superset" from_port="superset 1" to_port="result 1"/>
          <connect from_op="Superset" from_port="superset 2" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    



  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    OK, there's probably a cleaner way to do this, but this works :smile:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.000-BETA">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.3.000-BETA" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.3.000-BETA" expanded="true" height="68" name="Retrieve PreppedDatabase (2)" width="90" x="45" y="34">
            <parameter key="repository_entry" value="//LocalRepository/PreppedDatabase"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="9.3.000-BETA" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
            <parameter key="attribute_filter_type" value="value_type"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="numeric"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="transpose" compatibility="9.3.000-BETA" expanded="true" height="82" name="Transpose" width="90" x="313" y="34"/>
          <operator activated="true" class="select_attributes" compatibility="9.3.000-BETA" expanded="true" height="82" name="Select Attributes (3)" width="90" x="447" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="id"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="retrieve" compatibility="9.3.000-BETA" expanded="true" height="68" name="Retrieve PreppedDatabased_TF_00" width="90" x="45" y="238">
            <parameter key="repository_entry" value="//LocalRepository/PreppedDatabased_TF_00"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="9.3.000-BETA" expanded="true" height="82" name="Select Attributes (2)" width="90" x="179" y="238">
            <parameter key="attribute_filter_type" value="value_type"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="numeric"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="transpose" compatibility="9.3.000-BETA" expanded="true" height="82" name="Transpose (2)" width="90" x="313" y="238"/>
          <operator activated="true" class="select_attributes" compatibility="9.3.000-BETA" expanded="true" height="82" name="Select Attributes (4)" width="90" x="447" y="238">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="id"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="concurrency:join" compatibility="9.3.000-BETA" expanded="true" height="82" name="Join" width="90" x="648" y="136">
            <parameter key="remove_double_attributes" value="true"/>
            <parameter key="join_type" value="inner"/>
            <parameter key="use_id_attribute_as_key" value="true"/>
            <list key="key_attributes"/>
            <parameter key="keep_both_join_attributes" value="false"/>
          </operator>
          <connect from_op="Retrieve PreppedDatabase (2)" from_port="output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Transpose" to_port="example set input"/>
          <connect from_op="Transpose" from_port="example set output" to_op="Select Attributes (3)" to_port="example set input"/>
          <connect from_op="Select Attributes (3)" from_port="example set output" to_op="Join" to_port="left"/>
          <connect from_op="Retrieve PreppedDatabased_TF_00" from_port="output" to_op="Select Attributes (2)" to_port="example set input"/>
          <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Transpose (2)" to_port="example set input"/>
          <connect from_op="Transpose (2)" from_port="example set output" to_op="Select Attributes (4)" to_port="example set input"/>
          <connect from_op="Select Attributes (4)" from_port="example set output" to_op="Join" to_port="right"/>
          <connect from_op="Join" from_port="join" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
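
The logic of the process above can be sketched in plain Python, as an illustration rather than a replacement for the process. Transposing turns each attribute name into a row, and an inner join on that name keeps only the names that occur in both sets. The sketch omits the numeric-type filter applied by the first Select Attributes step and adds the projection the original question asked for. The list-of-dicts representation and the names `attribute_names`, `inner_join_on_name`, `project`, `matrix_a` and `matrix_b` are hypothetical.

```python
# Each example set is modelled as a list of dicts (attribute name -> value).
# Plain-Python analogue of the process above; it does not call RapidMiner.

def attribute_names(example_set):
    """Analogue of Transpose plus keeping only 'id': one entry per attribute."""
    names = []
    for row in example_set:
        for name in row:
            if name not in names:
                names.append(name)
    return names


def inner_join_on_name(left_names, right_names):
    """Analogue of the inner Join: keep names present on both sides."""
    right = set(right_names)
    return [name for name in left_names if name in right]


def project(example_set, names):
    """Restrict an example set to the given attributes (values untouched)."""
    return [{name: row[name] for name in names} for row in example_set]


# Hypothetical keyword-document matrices: rows are documents, keys are keywords.
matrix_a = [
    {"engine": 1, "valve": 0, "piston": 2},
    {"engine": 0, "valve": 1, "piston": 0},
]
matrix_b = [{"valve": 3, "sensor": 1, "engine": 0}]

shared = inner_join_on_name(attribute_names(matrix_a), attribute_names(matrix_b))
print(shared)                     # ['engine', 'valve']
print(project(matrix_b, shared))  # [{'engine': 0, 'valve': 3}]
```

The join result lists the shared keywords in the order of the left input; applying `project` to each matrix then leaves two sets with identical structure.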
    

