"Compare two customer databases"
xiaobo_sxb
Member Posts: 17 Contributor II
Hi
I have two customer tables that contain customer information such as name, address, and phone number. Most of the customers appear in both tables. I'd like to match the records that refer to the same customer by comparing these fields. Each table has more than 10K records. Does anybody know how to do this in RapidMiner?
Best Regards
Steven
Answers
The Join operator lets you join tables together. You could also use a distance to similarity approach to see what records are close to one another.
Regards,
Andrew
Thank you for your reply. I still have some questions about your proposal.
First, the Join operator requires the two data sets to share the same ID (the key). In my case, I don't have a common ID.
The "Data to Similarity" operator is still not good enough. First, it creates a cross join across all the rows; both of my tables have more than 10K rows, so I doubt the performance will be acceptable. Second, even if I have the similarity score, I don't know what threshold to use to decide that two rows are the same customer. Is it possible to generate a probability, that is, a percentage of confidence that two customers are actually the same one?
Regards
Steven
Well, if there is no common ID, then there is obviously no way to use Join.
Actually, a better operator would be Cross Distances, which lets you select the top k nearest records. The threshold depends entirely on your data, so I can't answer that.
Performance may not be that bad; you have to try it.
Regards,
Andrew
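For readers who want to see the "top k nearest" idea outside RapidMiner, here is a minimal Python sketch. It is not the Cross Distances operator itself: it swaps in an averaged string similarity from the standard library's difflib, and the field names, the sample records, and the helper names (normalize, record_similarity, top_k_matches) are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
import heapq


def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(value).lower().split())


def record_similarity(a, b, fields=("name", "address", "phone")):
    """Average per-field string similarity in [0, 1] (illustrative measure, not RapidMiner's)."""
    scores = [
        SequenceMatcher(None, normalize(a.get(f, "")), normalize(b.get(f, ""))).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def top_k_matches(table_a, table_b, k=1):
    """For each record in table_a, return the k most similar records in table_b."""
    result = []
    for i, rec_a in enumerate(table_a):
        scored = ((record_similarity(rec_a, rec_b), j) for j, rec_b in enumerate(table_b))
        best = heapq.nlargest(k, scored)  # keeps only the k highest scores
        result.append((i, [(j, score) for score, j in best]))
    return result


# Made-up sample data; the field names are assumptions.
table_a = [
    {"name": "John Smith", "address": "12 Main St", "phone": "555-0101"},
    {"name": "Mary Jones", "address": "4 Oak Ave", "phone": "555-0202"},
]
table_b = [
    {"name": "Mary  Jones", "address": "4 Oak Avenue", "phone": "555-0202"},
    {"name": "Jon Smith", "address": "12 Main Street", "phone": "555-0101"},
    {"name": "Alan Brown", "address": "9 Elm Rd", "phone": "555-0303"},
]

for index_a, candidates in top_k_matches(table_a, table_b, k=1):
    print(index_a, candidates)
```

This still compares every row of one table with every row of the other, so the performance concern raised above applies to it as well. Any cutoff applied to the scores would have to be tuned against pairs that were checked by hand; the sketch does not suggest one.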
An outer join is the join type you'll want to use in your case to spot customers that are part of only one of the tables.
Best regards,
Marius
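As a rough illustration of what an outer join reports, here is a small Python sketch. Since the tables in this thread share no ID, it builds a hypothetical key from the normalized name and the digits of the phone number; that key, the field names, the sample rows, and the helpers make_key and full_outer_join are assumptions, not RapidMiner behavior.

```python
def make_key(record):
    """Hypothetical join key: normalized name plus digits-only phone number."""
    name = " ".join(record["name"].lower().split())
    phone = "".join(ch for ch in record["phone"] if ch.isdigit())
    return (name, phone)


def full_outer_join(table_a, table_b):
    """Return (key, row_from_a, row_from_b) tuples; a missing side is None.

    Rows that share a key within one table overwrite each other here,
    so this sketch assumes keys are unique within each table.
    """
    index_a = {make_key(r): r for r in table_a}
    index_b = {make_key(r): r for r in table_b}
    all_keys = sorted(index_a.keys() | index_b.keys())
    return [(key, index_a.get(key), index_b.get(key)) for key in all_keys]


# Made-up sample data; the field names are assumptions.
table_a = [
    {"name": "John Smith", "phone": "555-0101"},
    {"name": "Mary Jones", "phone": "555 0202"},
]
table_b = [
    {"name": "mary jones", "phone": "(555) 0202"},
    {"name": "Alan Brown", "phone": "555-0303"},
]

for key, row_a, row_b in full_outer_join(table_a, table_b):
    if row_a is None:
        status = "only in B"
    elif row_b is None:
        status = "only in A"
    else:
        status = "in both"
    print(key, status)
```

An exact key like this only catches records that agree after normalization; spelling variants such as "Jon" and "John" would still land in the "only in" groups and would need a similarity-based comparison.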