image

🎉 🎉 RAPIDMINER 9.10 IS OUT!!! 🎉🎉

Download the latest version helping analytics teams accelerate time-to-value for streaming and IIOT use cases.

CLICK HERE TO DOWNLOAD

HOW CAN I UNITE TWO DIFFERENT DATASETS

mariakmariak Member Posts: 5 Newbie
Hi!
Im new here and I wanted to ask how can I possibly unite two different datasets, when I don't have an attribute in neither of them that can be recognised as an id attribute.

Thank you

Answers

  • rfuentealbarfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 563   Unicorn
    Hello @mariak

    There are several combinations of datasets, but they all share something in common: they must have at least one attribute that glues them.

    If for example:

    Dataset A { id, name, description }
    Dataset B { id, date of birth }

    You can use "Join" to define those, and configure if it's an inner (the new joined dataset will contain only data for the ids that are in both A and B), left (the new joined dataset will contain all the data from Dataset A and will put blanks on the data missing from Dataset B), right (same but B to A), or outer (will put all the data together and put blanks on the fields that don't exist).

    If for example:

    Dataset A { id, name, description }
    Dataset B { id, name, description }

    You can use "Append" to put Dataset B below Dataset A.

    There are others:

    - Set Minus will substract dataset B from dataset A.
    - Intersect will filter dataset A with values from dataset B (so, the intersection)
    - Union is the same as Append, but not duplicating the results (I believe)

    So, if you give us an example, we can come up with something that fits your needs.

    All the best,

    Rod.












    mariak
  • mariakmariak Member Posts: 5 Newbie
    edited July 2
    Hello and thank you for your reply!

    I can give you an example. I want to unite two different datasets.
    It's about questionnaires taken from students from two different schools which are enrolled to Math and  Language courses.

    Moreover some of the students are enrolled in both courses, but there is no apparent way to find out who are they, bc id attribute is missing. Only clue is, we have some attributes like school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet which can be a clue, but  how can this be a strong one?


     So, we have  data attributes which include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Also, the two datasets provide info regarding the performance in two distinct subjects: Mathematics and Portuguese language . Plus In [Cortez and Silva, 2008], the two datasets were modelled under binary/five-level classification and regression tasks.

    As I already said, in both datasets id attribute is missing. 

    All of the attributes are named the same  but there' s a problem: There are in both datasets attributes labeled under G1, G2, G3( grade1, grade2, grade3)  which represent different values, since in the first dataset G1, G2, G3 are referred to grades for three periods for "Portuguese" and in the other dataset we have exact same labels (G1, G2, G3) which refer to Math grades. 
    When those two are united they 're merged and, for example, I cannot tell which G1 is for Portuguese or Math.

    Moreover we have 382 students from both schools who are enrolled in both courses :

    [d3=merge(d1,d2,by=c("school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet")) print(nrow(d3)) # 382 students]

    So an another question is : how can I trace down those students, since there is no id attribute there?  I tried to isolate them with Select Attribute by [school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet] 

    but, to begin with, this can't be a safe way,  plus I need all 32 attributes for my analysis (or at least half of them), since I want to operate correlation matrix, k means, decision trees.

    i'm so sorry for this big essay. I tried to be clear enough. Thank you so much in advance
    :)
  • rfuentealbarfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 563   Unicorn
    Hello @mariak,

    First of all, you can use a Rename operator to rename the columns G1, G2 and G3 to Math G1, Math G2 and Math G3 (or Portuguese G1, Portuguese G2 and Portuguese G3, where it's needed). That rename should be done before the merge, ofc., so you can get that data without confusion.

    Dataset1 + Dataset2 should be 382 students, right?

    If Dataset1 (Math) has:

    school, sex, age, address, famsize, pstatus, medu, fedu, mjob, reason, nursery, internet

    And Dataset2 (Portuguese) has:

    school, sex, age, address, famsize, pstatus, medu, fedu, mjob, reason, nursery, internet

    I would use kind of a list of Joins with all these attributes and see if the result is 382:

    The Left join between Math and Portuguese will tell you all the students that are in Math but not in Portuguese.
    The Right join between Math and Portuguese will tell you all the students that are in Portuguese but not in Math.

    The Inner join between Math and Portuguese will tell you all the students that might be in both <- if this is 382, great! And this solves your question: you shouldn't Select Attributes but rather join by those, because that is why you're missing the rest of the 32 attributes to begin with.

    Hope this helps.
    lionelderkrikor
Sign In or Register to comment.