how to combine 2 example sets where both have some missing values?

781194025 · October 2017

This has been frustrating the !!!$@#Q$ OUT OF ME!!!

2 different sets, similar attributes, IDs match up fine.. so why isn't there an option to combine them and use the example from whichever one isn't missing?!??!

Join left? NOPE!
Join right? NOPE!
Join outer? NOPE!
Append? NOPE
Union? NOPE!
Superset? NOPE!

What am I missing?!?! HELP!

The datasets look like this:

   A  B  C
1  x  y  ?
2  x  y  ?
3  ?  z  z

Andrew · October 2017

Hello

Could you provide the 2 example sets that are the input, and also what you need as the output? Currently, I'm only seeing one example set as input.

Regards

Andrew

FBT · October 2017

As @Andrew said, a complete example makes it easier for community members to help you. Having said that, I believe the Union Append building block that another user created could help solve your issue.

781194025 · October 2017

For example, the 2 example sets differ only in Sentiment and SentimentScore, which are missing. How do I get them to merge into a single row?

Andrew · October 2017

Still not clear - if you could include the 2 example sets you have as input and a manually created example of the desired output you will find that you will get more rapid and focussed help.

Andrew

781194025 · October 2017

Sure. So part of the problem is I got my data all 'mixed up' and I've got a bunch of files with the same information but perhaps 1 or 2 attributes that are different, missing, or even with some data inside, and I need to merge them.

For example, I run the same dataset thru an API, but because I run out of API calls some examples are missing. So I run the data thru a different API while I wait, which also runs out and leaves some missing fields.

Then I run the APIs again later, or run a 3rd API, and end up with a bunch of different files with almost all the same info but some missing examples and no easy way to just 'merge' the values that aren't missing when I'm trying to combine them all!!! (And I can't simply remove the attribute because some of the rows DO have data.)

Union Append is "nice", but it creates a similar problem where I have 2 rows, sometimes with exactly the same data! Which makes processing hard...

Hope somebody can gimme advice!

781194025 · October 2017

http://pan.baidu.com/s/1c2Irq6W -- can download the example set from here

sgenzer · October 2017

good morning @781194025 - ok I see what you are asking and yes, "Union Append" will not help unless the attributes are the same name and type. You will need to rename your attributes such as "polarity_header" and "polarity" so that they are both the same before Union Append. And they will need to also be the same type (i.e. both nominal). IMO the operators are not allowing you to do what you want for a good reason - you would never want data science software to treat two attributes from different example sets as the same without the user ensuring that they are indeed the same.

Scott

781194025 · October 2017

It's not about the types. Union Append just doesn't fill in missing fields. It simply appends, even when the ID is the same. I don't want two rows with the same ID but different data damnit!!

I just want to join 2 example sets by ID and have the missing field come from whichever set has the information!!

781194025 · October 2017

Basically it should 'split data by groups', and then 'merge' the groups into a single example!
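As a rough illustration of that "group by ID, then merge each group" idea outside of RapidMiner, here is a minimal Python sketch. It assumes missing values are encoded as "?" and that every row carries an "id" field; the sample rows and the helper name `merge_by_id` are made up for the example and are not taken from the poster's files.

```python
MISSING = "?"  # assumption: missing values are encoded as "?"

def merge_by_id(rows, id_key="id"):
    """Collapse rows that share an ID, keeping the first non-missing value per attribute."""
    merged = {}
    for row in rows:
        target = merged.setdefault(row[id_key], {})
        for name, value in row.items():
            # Only overwrite when nothing usable has been stored yet.
            if target.get(name, MISSING) == MISSING:
                target[name] = value
    return list(merged.values())

# Hypothetical sample data: each set is missing values the other one has.
set_one = [
    {"id": 1, "A": "x", "B": "y", "C": "?"},
    {"id": 2, "A": "x", "B": "?", "C": "?"},
]
set_two = [
    {"id": 1, "A": "?", "B": "y", "C": "z"},
    {"id": 2, "A": "?", "B": "y", "C": "z"},
]

# Append both sets first, then merge the rows of each ID group.
combined = merge_by_id(set_one + set_two)
print(combined)
```

If two sets hold different non-missing values for the same ID and attribute, this sketch keeps whichever it sees first, so the order in which the sets are appended matters.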

781194025 · October 2017

Hello,

I have a first table: Computer, Info 1, Info 2
and a second table: Computer, Info 3, Info 4

I need to merge all the data into one table to obtain:
Computer, Info 1, Info 2, Info 3, Info 4

Example: in the first table we have

S64xxx, A1, A2
S64xxx, B1, B2
S64xxx, C1, C2

In the second table we have:

S64xxx, D1, D2
S64xxx, E1, E2

I want to have this result:

S64xxx, A1, A2, D1, D2
S64xxx, B1, B2, E1, E2
S64xxx, C1, C2, ?, ?

Thanks for help

kludikovsky · October 2017

Just trying to help.

The information you provide is not consistent.

In the example above you don't have a unique key (element Computer, data S64xxx) for the entries to match. According to what you show, the 'unique key' would be the position of the record.

On the other hand, in the previous statements you talk about running the same dataset against various APIs, which in turn returned different values. Now you'd like to combine all of them back into a simple set again. So I assume that Info 1 and Info 3, and likewise Info 2 and Info 4, are the same (have the same purpose).

So the first point would be to clarify whether your data has a unique key.

If you have a unique key, try the following:

1. Filter the examples from the first set which have empty entries.
2. Take this set and join it with the second set (use the join that keeps all entries from the first).
3. Filter those that now have values for Info 3 & 4.
4. Select Attributes and Rename.
5. Append (or Union) them with the 'rest' from the first Filter (those that are already OK).
6. Do whatever you need to do with the remainder from the second filter.
7. (You talked about another API before, so possibly repeat steps 2-6.)
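To make the sequence of steps a bit more concrete, here is a hedged Python sketch of the same logic rather than a RapidMiner process. It assumes "?" marks a missing value and that both sets share a unique key; the function `fill_from_second`, the attribute names, and the sample rows are hypothetical.

```python
MISSING = "?"  # assumption: how a missing value is encoded

def fill_from_second(first, second, key, rename):
    """Fill gaps in `first` from `second`.

    `rename` maps an attribute name in `first` to its counterpart in `second`.
    Returns (rows that are complete, rows that still have gaps).
    """
    fields = list(rename)

    def is_complete(row):
        return all(row[f] != MISSING for f in fields)

    done = [r for r in first if is_complete(r)]       # filter: already OK
    todo = [r for r in first if not is_complete(r)]   # filter: has empty entries
    lookup = {r[key]: r for r in second}
    remainder = []
    for row in todo:
        other = lookup.get(row[key], {})              # join that keeps all rows of `first`
        patched = dict(row)
        for f in fields:                              # select and rename
            if patched[f] == MISSING:
                patched[f] = other.get(rename[f], MISSING)
        # Rows that are now filled get appended to the 'rest'; others stay behind.
        (done if is_complete(patched) else remainder).append(patched)
    return done, remainder

# Hypothetical sample data.
first = [
    {"id": 1, "sentiment": "pos", "score": "0.9"},
    {"id": 2, "sentiment": "?", "score": "?"},
    {"id": 3, "sentiment": "?", "score": "?"},
]
second = [
    {"id": 2, "polarity": "neg", "polarity_score": "0.2"},
    {"id": 3, "polarity": "?", "polarity_score": "?"},
]
done, remainder = fill_from_second(
    first, second, key="id",
    rename={"sentiment": "polarity", "score": "polarity_score"},
)
print(done)
print(remainder)
```

The `remainder` list is what you would feed into the next round against a third source.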

In case you don't have a unique key (or something that you can use as one), you need to check that the sequence is the same (and ensure this); then you possibly have another option: use the 'ID' / 'Row No.' as the unique key. But be 1000% sure the sets are in the same sequence.

Otherwise it might become very tricky.
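For the Computer / Info example earlier in the thread, using the row position as the key could look roughly like the Python sketch below. It is only valid under the assumption that both tables really are in the same order; `merge_by_position` and its parameters are names invented for this illustration.

```python
def merge_by_position(first, second, extra_width=2, missing="?"):
    """Pair rows by their position; pad with `missing` when `second` runs out."""
    merged = []
    for i, left in enumerate(first):
        if i < len(second):
            right = second[i][1:]  # drop the repeated Computer column
        else:
            right = (missing,) * extra_width
        merged.append(left + right)
    return merged

first = [("S64xxx", "A1", "A2"), ("S64xxx", "B1", "B2"), ("S64xxx", "C1", "C2")]
second = [("S64xxx", "D1", "D2"), ("S64xxx", "E1", "E2")]

for row in merge_by_position(first, second):
    print(", ".join(row))
```

Extra rows in the second table beyond the length of the first are ignored in this sketch.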

Hope that helps.

sgenzer · October 2017

hello @781194025 - so as you can see, several people including myself are trying to help you as best as we can, but clearly there is a disconnect between our help and you solving your problem. I must say that I cannot help further without really seeing 1) your XML process, and 2) your dataset. This is the norm here on the community. If you could please post them in this thread (rather than us going to some random link to download, etc...), I'd be happy to try to help further.

Scott

JEdward · October 2017

@781194025

Okay, I see you've spread yourself around the forum quite a bit, but this thread seems to be the main one.

Can I just summarize your problem a little more clearly based on my understanding?

You are pulling data from multiple sources
You are also sometimes pulling the same data through multiple API calls
You want to join these datasets together
You don't want to repeat records that you have pulled through already

If this is correct then the following would do it:

Union Append (or just Append if the fields are the same)
Remove Duplicates
Join (if you have two datasets and want to combine them, or if you simply want to do a join)

The Union Append is probably not necessary, but it seems you're really keen to use it so let's keep it in. :smileylol:
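For readers who want to see the append-then-deduplicate part in code form, here is a minimal Python sketch. It is not how the RapidMiner operators are implemented; the helper `append_and_dedupe` and the sample rows are assumptions made for illustration.

```python
def append_and_dedupe(*example_sets):
    """Append all sets, then drop rows that are exact duplicates (first one wins)."""
    seen = set()
    result = []
    for rows in example_sets:
        for row in rows:
            # A row's identity is its full set of (attribute, value) pairs.
            fingerprint = tuple(sorted(row.items()))
            if fingerprint not in seen:
                seen.add(fingerprint)
                result.append(row)
    return result

# Hypothetical sample data from two API pulls.
pull_one = [
    {"id": 1, "text": "a", "sentiment": "pos"},
    {"id": 2, "text": "b", "sentiment": "?"},
]
pull_two = [
    {"id": 1, "text": "a", "sentiment": "pos"},
    {"id": 2, "text": "b", "sentiment": "neg"},
]
rows = append_and_dedupe(pull_one, pull_two)
print(rows)
```

In this sketch only exact duplicates are dropped. The two rows for id 2 differ in `sentiment`, so both survive, and rows like that still need a merge by ID afterwards.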

I'm not going to give you a full walkthrough here as the forum seems to have become quite clogged. But feel free to PM me and we can chat further if you're struggling.

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

how to combine 2 example sets where both have some missing values?

What am I missing?!?! HELP!

Answers