Specific case for Joining multiple datasets

Sunnyboy_nhSunnyboy_nh Member Posts: 10 Newbie

Hi everyboday,

In a data science project I have recieved  4 cleaned datasets on an intersting topic healthy diet to protects ourselselves against Covid-19. I have already imported these 4 data sets in Rapidminer Studio. before I analyse them with machine learning, modelling and statistical preditions I would like to join and aggrgate those 4 datasets into one single dataset. This is where I ahve encountered an underestanding problem how to go further....

All of those 4 cleaned datasets have exact the same 32 Columns and 170 rows. Only the information delivered in these tables as following  are different in their  %-value:

  1. Fat_Supply_Quantity_Data.csv

  2. Food_Supply_kcal_Data.csv

  3. Food_Supply_Quantity_kg_Data.csv

  4. Protein_Supply_Quantity_Data

How do you see a possibility to join those 4 datasets to one single datase,t although all have the same number of 32 columns and same 170 rows OR should I look at each of 4 datasets separately and process them seperated from eachtother? Can you support me with your insights?

Tagged:

Answers

  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Hi @Sunnyboy_nh,

    A priori, if your 4 datasets have strictly the same attributes names you need the Append operator.
    Thus the resulting dataset will have 680 rows (170x4 rows) x 32 columns.

    After creating this resulting dataset, you can begin by submitting these dataset to AutoModel to create automatically 
    Machine Learning models based on your data.

    Good luck for your study. 

    Regards,

    Lionel
  • Sunnyboy_nhSunnyboy_nh Member Posts: 10 Newbie
    Thanks Lionel for the tip with Append instead of Join !
    I will try that but before hand another question rises now..... The same number of 170 rows in each dataset are the "Country" names  which are also the exact same  from where the data has been taken. How do I go about the new added rows/examples after Append ? Should I then rename each new added 170 batch of rows/examples according to the data values or what?

    And finally is it possible to use Append for on all multiple 4 datesets or is it only possible to do Append two at the the time?

    Regards
    Nader 
     
  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    It matters how your data is organized and saying it is data in the same number of rows and columns doesn't clarify.   Are these 170 individuals who have 32x4 different observations about their diets?   Is this a time series dataset?  Or is it the same 32 attributes for 170x4 different people?  Or something else entirely?
    As Lionel says, if it is the same 32 attributes for 680 different people then Append should do the trick.  But if it is something different they you may need another type of Join or perhaps even Merge, and you may also need to pivot or transform your data.  Generally you need to think about the structure of the data you want to have at the end and work backwards.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Sunnyboy_nhSunnyboy_nh Member Posts: 10 Newbie
    Hi Telcontar120,

    Thanks for your further differntiation.
    Those 170 rows/examples as I wrote back to Lionel are 170 countries around the globe where the data has been taken from for the same 32 Coulumn/Attributes but data has been taken for 4 different observations  as you see here again:

     1. Fat_Supply_Quantity_Data.csv

      2. Food_Supply_kcal_Data.csv

      3. Food_Supply_Quantity_kg_Data.csv

      4. Protein_Supply_Quantity_Data.csv

    So the rows are not people but the same countries and the attributes are the same parameters but from 4 different aspects or observation in 4 seperated datasets which deliver different %-values of information in each datasets.

    SO do you still think Append will do the trick or if so should the new added rows be renamed with a suffix since they are all the same countries ?


    Regards

    Nader

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    You don't want Append in this case, since it sounds like you want to end up with a dataset that has 170 rows with 128 attributes, not 680x32.  The attributes in each file are similar but actually not the same.  
    So you should choose Join and then add each file's attributes one by one.
    But you are going to want to rename your attributes first so it is clear which file they come from.  You can do that with the Rename by Replacing operator.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Sunnyboy_nhSunnyboy_nh Member Posts: 10 Newbie
    edited May 2020


    Thanks, this last recommendation sounds good I think makes more sense not to append but join the datesets and rename their attributes/column by their files  before jouning at the end to one single dats set of 128 attributes x 170 rows.

    But since i have 4 datasers and operator Join only joibs 2 datasets at a time should I do the Join in 3 stages in series or is there a better easier way to join multiple datasets in the begining?

    Regards
    Nader
Sign In or Register to comment.