how to combine 2 example sets where both have some missing values?

781194025 Member Posts: 32 Contributor I
edited July 2019 in Help

This has been frustrating the !!!$@#Q$ OUT OF ME!!!

2 different sets, similar attributes, IDs match up fine... so why isn't there an option to combine them and use the value from whichever one isn't missing?!

  • Join left? NOPE!

  • Join right? NOPE!

  • Join outer? NOPE!

  • Append? NOPE

  • Union? NOPE!

  • Superset? NOPE!

What am I missing?!?! HELP!

 

The datasets look like this:

       A     B     C
1      x     y     ?
2      x     y     ?
3      ?     z     z

 

Answers

  • Andrew RapidMiner Certified Expert, RapidMiner Certified Master, Member Posts: 47 Guru

    Hello 

     

    Could you provide the 2 example sets that are the input and also what you need as the output? Currently, I'm only seeing one example set as input.

     

    Regards

     

    Andrew

  • FBT Member Posts: 106 Unicorn

    As @Andrew said, a complete example makes it easier for community members to help you. Having said that, I believe the "Union Append" building block that another user created could help to solve your issue.

  • 781194025 Member Posts: 32 Contributor I
    For example, the 2 example sets differ only in the results for Sentiment and SentimentScore, which are missing. How do I get them to merge into a single row?
  • Andrew RapidMiner Certified Expert, RapidMiner Certified Master, Member Posts: 47 Guru

    Still not clear - if you could include the 2 example sets you have as input and a manually created example of the desired output you will find that you will get more rapid and focussed help.

     

    Andrew

  • 781194025 Member Posts: 32 Contributor I
    Sure. So part of the problem is I got my data all 'mixed up' and I've got a bunch of files with the same information but perhaps 1 or 2 attributes different, missing, or even with some data inside, and I need to merge them.

    For example, I run the same dataset through an API, but because I run out of API calls some examples are missing. So I run the data through a different API while I wait, which also runs out and leaves some missing fields.

    Then I run the APIs again later, or run a 3rd API, and end up with a bunch of different files with almost all the same info but some missing examples, and no easy way to just 'merge' the values that aren't missing when I'm trying to combine them all!!! (And I can't simply remove the attribute because some of the rows DO have data.)

    Union Append is "nice", but it creates a similar problem where I have 2 rows, sometimes with exactly the same data! Which makes processing hard...

    Hope somebody can gimme advice!
  • 781194025 Member Posts: 32 Contributor I
    http://pan.baidu.com/s/1c2Irq6W -- can download the example set from here
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Good morning @781194025 - OK, I see what you are asking and yes, "Union Append" will not help unless the attributes have the same name and type. You will need to rename your attributes, such as "polarity_header" and "polarity", so that they have the same name before Union Append. They will also need to be the same type (i.e. both nominal). IMO the operators are not allowing you to do what you want for a good reason: you would never want data science software to treat two attributes from different example sets as the same without the user ensuring that they are indeed the same.

     

    Scott

     

  • 781194025 Member Posts: 32 Contributor I
    It's not about the types. Union Append just doesn't fill in missing fields. It simply appends, even when the ID is the same. I don't want two rows with the same ID but different data damnit!!

    I just want to join 2 example sets by ID and have the missing field come from whichever set has the information!!
  • 781194025 Member Posts: 32 Contributor I
    Basically it should 'split data by groups', and then 'merge' the groups into a single example!
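The "group by ID, then merge each group into one example" idea described here can be sketched outside RapidMiner. This is only a minimal Python illustration, not a RapidMiner operator: it assumes each example set is a list of dicts, that the key attribute is called `id`, and that a missing value is represented as `None`. The function name `coalesce_by_id` and the sample rows are hypothetical.

```python
def coalesce_by_id(*example_sets, key="id"):
    """Merge rows that share the same key; the first non-missing value wins."""
    merged = {}
    for rows in example_sets:
        for row in rows:
            # One output row per distinct key value (the "group").
            target = merged.setdefault(row[key], {key: row[key]})
            for attr, value in row.items():
                # Only fill an attribute that is absent or still missing (None).
                if target.get(attr) is None:
                    target[attr] = value
    return list(merged.values())


# Hypothetical sample data: same IDs, different attributes missing in each set.
set_a = [
    {"id": 1, "A": "x", "B": "y", "C": None},
    {"id": 3, "A": None, "B": "z", "C": "z"},
]
set_b = [
    {"id": 1, "A": "x", "B": None, "C": "w"},
    {"id": 3, "A": "v", "B": "z", "C": None},
]

result = coalesce_by_id(set_a, set_b)
for row in result:
    print(row)
```

When both sets hold different non-missing values for the same ID and attribute, this sketch silently keeps the value from the earlier set; whether that is acceptable depends on the data.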
  • 781194025 Member Posts: 32 Contributor I
    Hello,

    I have a first table : Computer, Info 1, Info 2
    and a second table : Computer, Info 3, Info 4

    I need to merge all the data in a same table to obtain :
    Computer, Info 1, Info 2, Info 3, Info 4

    Example: In the first table we have:

    S64xxx, A1, A2
    S64xxx, B1, B2
    S64xxx, C1, C2

    In the second table we have:

    S64xxx, D1, D2
    S64xxx, E1, E2

    I want to have this result:

    S64xxx, A1, A2, D1, D2
    S64xxx, B1, B2, E1, E2
    S64xxx, C1, C2, ?, ?

    Thanks for help
  • kludikovsky Member Posts: 30 Maven

    Just try to help.

     

    The information you provide is not consistent.

    In the example above you don't have a unique key for the entries to match (the element Computer has the value S64xxx in every row). According to what you show, the 'unique key' would be the position of the record.

    On the other hand, in the previous statements you talk about running the same dataset against various APIs, which in turn returned different values. Now you'd like to combine all of them back into a simple set again. So I assume that Info 1 and Info 3, and respectively Info 2 and Info 4, are the same (have the same purpose).

    So the first point would be to clarify whether your data has a unique key.

    If you have a unique key, try the following:

    1. Filter the examples from the first set that have empty entries.
    2. Take this set and join it with the second set (use the join that keeps all entries from the first).
    3. Filter those that now have values for Info 3 & 4.
    4. Select Attributes and Rename.
    5. Append (or Union) them with the 'rest' from the first filter (those that are already OK).
    6. Do whatever you need to do with the remainder from the second filter.
      (Before, you talked about another API, so possibly repeat steps 2-6.)
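The steps above can be sketched in plain Python to show the data flow. This is a rough illustration only, not a RapidMiner process: it assumes rows are dicts, missing values are `None`, the unique key is called `id`, and Info3/Info4 in the second set mean the same as Info1/Info2 in the first. The names `fill_from_second` and `is_complete` are hypothetical.

```python
def is_complete(row, attrs):
    """True if none of the given attributes is missing (None) in the row."""
    return all(row.get(a) is not None for a in attrs)


def fill_from_second(first, second, key, attrs, rename):
    # Step 1: split the first set into complete and incomplete examples.
    complete = [r for r in first if is_complete(r, attrs)]
    incomplete = [r for r in first if not is_complete(r, attrs)]

    # Step 2: left join of the incomplete examples with the second set.
    lookup = {r[key]: r for r in second}
    filled, remainder = [], []
    for row in incomplete:
        match = lookup.get(row[key], {})
        candidate = dict(row)
        # Step 4: take the second set's attributes under the first set's names.
        for src, dst in rename.items():
            if candidate.get(dst) is None:
                candidate[dst] = match.get(src)
        # Step 3: keep the ones that now have values; the rest stay open.
        if is_complete(candidate, attrs):
            filled.append(candidate)
        else:
            remainder.append(candidate)

    # Step 5: append the repaired examples to the ones that were already OK.
    # Step 6: the remainder is returned so another source can be tried.
    return complete + filled, remainder


first = [
    {"id": 1, "Info1": "a", "Info2": "b"},
    {"id": 2, "Info1": None, "Info2": None},
    {"id": 3, "Info1": "c", "Info2": None},
]
second = [
    {"id": 2, "Info3": "d", "Info4": "e"},
    {"id": 3, "Info3": None, "Info4": None},
]

done, todo = fill_from_second(
    first, second, key="id",
    attrs=("Info1", "Info2"),
    rename={"Info3": "Info1", "Info4": "Info2"},
)
```

Passing `todo` back in as `first` together with the output of a third source repeats steps 2-6. The original row order is not preserved in `done`.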

     

    In case you don't have a unique key (or something that you can use as one), you need to check that the sequence is the same in both sets (and ensure this); then you possibly have another option: use the 'ID' / 'Row No.' as the unique key. But be 1000% sure the sets are in the same sequence.
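A small Python sketch of that fallback: number the rows of both sets, but only after checking that they line up. It assumes rows are dicts, that both sets must have the same length, and that at least one attribute is shared between them and can be compared row by row; the name `number_rows_if_aligned` and the attribute `row_no` are hypothetical.

```python
def number_rows_if_aligned(first, second, shared_attrs):
    """Add a 1-based row number as key, refusing if the sets do not line up."""
    if len(first) != len(second):
        raise ValueError("sets differ in length; row position cannot serve as key")
    for i, (a, b) in enumerate(zip(first, second), start=1):
        for attr in shared_attrs:
            if a.get(attr) != b.get(attr):
                raise ValueError(f"row {i}: {attr!r} differs, sets are not aligned")

    def numbered(rows):
        return [{"row_no": i, **row} for i, row in enumerate(rows, start=1)]

    return numbered(first), numbered(second)


first = [
    {"Computer": "S64xxx", "Info1": "A1"},
    {"Computer": "S64xxx", "Info1": "B1"},
]
second = [
    {"Computer": "S64xxx", "Info3": "D1"},
    {"Computer": "S64xxx", "Info3": "E1"},
]

numbered_first, numbered_second = number_rows_if_aligned(
    first, second, shared_attrs=("Computer",)
)
```

The check can only catch misalignment that shows up in the shared attributes. If every row carries the same value there, as with S64xxx here, it proves very little, so the ordering still has to be guaranteed by how the files were produced.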

     

    Otherwise it might become very tricky.

    Hope that helps.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hello @781194025 - so as you can see, several people including myself are trying to help you as best as we can, but clearly there is a disconnect between our help and you solving your problem.  I must say that I cannot help further without really seeing 1) your XML process, and 2) your dataset.  This is the norm here on the community.  If you could please post them in this thread (rather than us going to some random link to download, etc...), I'd be happy to try to help further.


    Scott

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    @781194025

     

    Okay, I see you've spread yourself around the forum quite a bit, but this thread seems to be the main one. 

     

    Can I just summarize your problem a little more clearly based on my understanding?

    • You are pulling data from multiple sources
    • You are also sometimes pulling the same data through multiple API calls
    • You want to join these datasets together
    • You don't want to repeat records that you have pulled through already

    If this is correct then the following would do it:

    • Union Append (or just Append if the fields are the same)
    • Remove Duplicates
    • Join (if you have two datasets and want to combine them, or otherwise need a join)
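As a rough picture of what the first two steps do to the data, here is a plain-Python sketch of "append, then remove duplicates". It is not the RapidMiner operators themselves: it assumes rows are dicts with hashable values, that missing values are `None`, and that duplicates are detected by comparing every attribute. The function name `append_and_dedupe` and the sample batches are hypothetical.

```python
def append_and_dedupe(*example_sets):
    """Stack all rows, then drop rows that are identical in every attribute."""
    seen = set()
    result = []
    for rows in example_sets:
        for row in rows:
            # Order-independent fingerprint of the whole row.
            fingerprint = tuple(sorted(row.items()))
            if fingerprint not in seen:
                seen.add(fingerprint)
                result.append(row)
    return result


batch_1 = [
    {"id": 1, "Sentiment": "pos"},
    {"id": 2, "Sentiment": None},
]
batch_2 = [
    {"id": 1, "Sentiment": "pos"},
    {"id": 2, "Sentiment": "neg"},
]

combined = append_and_dedupe(batch_1, batch_2)
for row in combined:
    print(row)
```

In this sketch the exact repeat for id 1 disappears, but id 2 still appears twice because one copy has a missing Sentiment and the other does not. Rows like that would still need to be merged per ID after this step.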

    The Union Append is probably not necessary, but it seems you're really keen to use it so let's keep it in.  :smileylol:

    I'm not going to give you a full walkthrough here as the forum seems to have become quite clogged.  But feel free to PM me and we can chat further if you're struggling. 

     

     
