HOW CAN I UNITE TWO DIFFERENT DATASETS

mariak · July 2021

Hi!
Im new here and I wanted to ask how can I possibly unite two different datasets, when I don't have an attribute in neither of them that can be recognised as an id attribute.

Thank you

rfuentealba · July 2021

Hello @mariak

There are several combinations of datasets, but they all share something in common: they must have at least one attribute that glues them.

If for example:

Dataset A { id, name, description }
Dataset B { id, date of birth }

You can use "Join" to define those, and configure if it's an inner (the new joined dataset will contain only data for the ids that are in both A and

, left (the new joined dataset will contain all the data from Dataset A and will put blanks on the data missing from Dataset

, right (same but B to A), or outer (will put all the data together and put blanks on the fields that don't exist).

If for example:

Dataset A { id, name, description }
Dataset B { id, name, description }

You can use "Append" to put Dataset B below Dataset A.

There are others:

- Set Minus will substract dataset B from dataset A.
- Intersect will filter dataset A with values from dataset B (so, the intersection)
- Union is the same as Append, but not duplicating the results (I believe)

So, if you give us an example, we can come up with something that fits your needs.

All the best,

Rod.

mariak · July 2021

Hello and thank you for your reply!

I can give you an example. I want to unite two different datasets.
It's about questionnaires taken from students from two different schools which are enrolled to Math and Language courses.

Moreover some of the students are enrolled in both courses, but there is no apparent way to find out who are they, bc id attribute is missing. Only clue is, we have some attributes like school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet which can be a clue, but how can this be a strong one?

So, we have data attributes which include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Also, the two datasets provide info regarding the performance in two distinct subjects: Mathematics and Portuguese language . Plus In [Cortez and Silva, 2008], the two datasets were modelled under binary/five-level classification and regression tasks.

As I already said, in both datasets id attribute is missing.

All of the attributes are named the same but there' s a problem: There are in both datasets attributes labeled under G1, G2, G3( grade1, grade2, grade3) which represent different values, since in the first dataset G1, G2, G3 are referred to grades for three periods for "Portuguese" and in the other dataset we have exact same labels (G1, G2, G3) which refer to Math grades.
When those two are united they 're merged and, for example, I cannot tell which G1 is for Portuguese or Math.

Moreover we have 382 students from both schools who are enrolled in both courses :

[d3=merge(d1,d2,by=c("school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet")) print(nrow(d3)) # 382 students]

So an another question is : how can I trace down those students, since there is no id attribute there? I tried to isolate them with Select Attribute by [school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet]

but, to begin with, this can't be a safe way, plus I need all 32 attributes for my analysis (or at least half of them), since I want to operate correlation matrix, k means, decision trees.

i'm so sorry for this big essay. I tried to be clear enough. Thank you so much in advance

rfuentealba · July 2021

Hello @mariak,

First of all, you can use a Rename operator to rename the columns G1, G2 and G3 to Math G1, Math G2 and Math G3 (or Portuguese G1, Portuguese G2 and Portuguese G3, where it's needed). That rename should be done before the merge, ofc., so you can get that data without confusion.

Dataset1 + Dataset2 should be 382 students, right?

If Dataset1 (Math) has:

school, sex, age, address, famsize, pstatus, medu, fedu, mjob, reason, nursery, internet

And Dataset2 (Portuguese) has:

school, sex, age, address, famsize, pstatus, medu, fedu, mjob, reason, nursery, internet

I would use kind of a list of Joins with all these attributes and see if the result is 382:

The Left join between Math and Portuguese will tell you all the students that are in Math but not in Portuguese.
The Right join between Math and Portuguese will tell you all the students that are in Portuguese but not in Math.

The Inner join between Math and Portuguese will tell you all the students that might be in both <- if this is 382, great! And this solves your question: you shouldn't Select Attributes but rather join by those, because that is why you're missing the rest of the 32 attributes to begin with.

Hope this helps.

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

HOW CAN I UNITE TWO DIFFERENT DATASETS

Answers