Combining Data Sets

ari_31321ari_31321 Member Posts: 4 Newbie
Hi, I am new at rapidminer and i am wondering if there is a solution to the following question

Is there a possibility to match the following Data Set1 and Set2 with the logic, that the Needed Data Set can be obtained?
- "?" in Set2 should be replaced by the corresponding values of Set1
- if there is a unambiguous assignment such as in row 2, only the corresponding value should be added
- if there is a ambiguous assignment such as in row 1 and 3, all possible values should be added

Data Set1:
Attr_A       Attr_B       Attr_C
A_111     B_123         C_999
A_111     B_234         C_999
A_111     B_345         C_222
A_222     B_456         C_111
A_222     B_567         C_222

Data Set2:
Attr_1    Attr_A       Attr_B       Attr_C
 ...          A_111       ?                ?
 ...          A_111       B_234        ?
 ...          A_222       ?                ?

Data Set Needed:
Attr_1    Attr_A         Attr_B                                  Attr_C
 ...           A_111        B_123, B_234, B_345          C_999, C_222
 ...           A_111         B_234                                   C_999
 ...           A_222        B_456, B_567                       C_111, C_222

Answers

  • MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,
    I think what you want is first Append Data_set1 and Data_set 2 with an Append (SuperSet) operator. Afterwards you use Aggregate and Concat on Attr_b and Attr_C?

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • ari_31321ari_31321 Member Posts: 4 Newbie
    Hi, thanks for the quick answer! 

    Data Set1 is some kind of a "Relation Table", where the relations between article numbers are shown
    (e.g. A_111 is related to B_123 and C_999, but there is also a relation between A_111 , B_234, C_999 and so on).

    Therefore Data Set1 won't have the same size as Data Set2. The Append Operator won't work in this case. 
    Maybe I am wrong with the joins operator group in general for this task. But so far I have no solution how Data Set2 could "make use" of Data Set1 to fill the missing cells in respect of the relation on the article numbers as they are shown in Data Set1.

    Thanks in advance for help!


  • MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,
    that is why i would recommend the operator Append SuperSet, which also works if one example set is the superset of the other. It is part of Operator Toolbox extension.

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • ari_31321ari_31321 Member Posts: 4 Newbie
    Thanks for explaining again. I haven't had that extension installed yet.
    Unfortunatelly your suggested solution does not work in my case.

    I try again to explain my question again, maybe that helps:

    Data Set2 should use Data Set1 (as a relation table) to fill the missing values in Data Set2/ or create the new Data Set Needed. 
    That means e.g. that Data Set2 should "look up" in Data Set1 and recognize that value A_111 is related to B_123, B_234, B_345. So the missing value in  Attr. B of the Needed Data Set should be "B_123, B_234, B_345". (The same question with Att C for the Needed Data Set).
    As a further aspect Data Set2 should also "understand" if it contains already the values A_111 and B_234, that those two values in combination have only a relation with C_999 (see relation table/ Data Set1).

    I hope the explanation helps to understand my objective. Sorry for the long explanation.

    Thanks again!
  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    This is a multiple step operation.  First you are going to have to aggregate to get all the values of B and C associated with a given value of A.  You can then concatenate these into a single cell.  You can then use that to join back with the original dataset with all the missings.

    Of course, you may want to take a step back and ask whether you actually need this data structure, or whether another structure would be suitable.  If you are doing anything with machine learning, this is going to be an awkward structure to use. join based on attribute A.  You have a couple of alternatives:
    • you could generate a dataset with only one row for each unique value of A and all associated values of B and C (this is actually easier to generate)
    • you could generate a dataset with each unique combination of A, B, and C 

    Generally speaking having multiple distinct values in a single db cell is harder for computers to deal with.
    Also, you said that Martin's approach would not work, but I am not sure why you said that, did you actually try it out?

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • ari_31321ari_31321 Member Posts: 4 Newbie
    Thanks for explaining again in more detail! I could find a way to solve it using your further explication.



Sign In or Register to comment.