Deduplicate names in RapidMiner with Rosette
New Rosette API endpoint and RapidMiner operator for data cleansing
Recognizing and reconciling duplicate records is a common headache of database management, especially when the differences are subtle and likely to be missed by most deduping systems. If your database contains duplicates that differ by misspellings, nicknames, initials, or non-Latin scripts, you may be missing connections and keeping your agents and team members from the information they need.
Rosette API launched a new /dedupe endpoint, which uses our industry-leading fuzzy name matching to connect database records that contain moderate, or “fuzzy,” variations. Unlike deduplicators that can only pick out exact matches, Rosette lets you find and reconcile similar records for cleaner databases. To make this functionality more easily accessible, we simultaneously released a “Deduplicate Names” operator for RapidMiner Studio, which uses the new endpoint under the hood.
The Rosette Deduplicate Names operator identifies candidate duplicates in a list of names by assigning “group IDs” to groups of matching names. The operator can process lists of up to 10,000 English names and assigns group IDs based on a user-specified match threshold. The threshold sets the minimum similarity score required for two names to be considered duplicates. To set it, click the operator and enter a value between 0 and 1 in the “Threshold” field. We recommend starting with a threshold of 0.8 and experimenting with higher or lower values depending on your use case and results.
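To make the role of the threshold concrete, here is a minimal sketch of threshold-based grouping. It is not Rosette's algorithm: the `similarity` function is a stand-in built on Python's `difflib`, and the single-linkage grouping (two names share a group if a chain of above-threshold matches connects them) is an assumption made for illustration. The names `similarity` and `assign_group_ids` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; NOT Rosette's name matching."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def assign_group_ids(names: list[str], threshold: float = 0.8) -> list[int]:
    """Return one group ID per name; pairs scoring >= threshold are merged."""
    parent = list(range(len(names)))  # union-find forest, one node per name

    def find(i: int) -> int:
        # Walk up to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once and merge the groups of matching names.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(names[i], names[j]) >= threshold:
                parent[find(j)] = find(i)

    # Relabel group roots as consecutive integers in order of first appearance.
    labels: dict[int, int] = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(names))]
```

In this sketch, raising the threshold toward 1 splits groups apart, and lowering it merges more names together, which is the trade-off to explore when tuning the operator's “Threshold” field.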
Given a list of names as input, the operator outputs one cluster ID (the integer group ID) per name, in no particular order. The output can then be sorted by cluster ID to bring possible duplicate names together. For example:
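The sorting step can be sketched in a few lines of Python. The names and cluster IDs below are made-up illustrative data, not real operator output, and `group_by_cluster` is a hypothetical helper.

```python
from itertools import groupby
from operator import itemgetter


def group_by_cluster(names, cluster_ids):
    """Sort rows by cluster ID and return {cluster_id: [names...]}."""
    rows = sorted(zip(cluster_ids, names), key=itemgetter(0))
    return {
        cid: [name for _, name in group]
        for cid, group in groupby(rows, key=itemgetter(0))
    }


# Illustrative data only: one cluster ID per input name, in input order.
names = ["Bob Jones", "Elizabeth Warren", "Robert Jones", "Liz Warren"]
cluster_ids = [7, 3, 7, 3]

for cid, members in group_by_cluster(names, cluster_ids).items():
    print(cid, members)
```

Because the sort is keyed on the cluster ID alone and Python's sort is stable, names within each cluster keep their original input order.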
Further refine your results with additional fields
When you submit a name-deduplication request in RapidMiner, you need only input a list of names; however, if you know the entity type, you can set it to person (default), location, or organization to improve accuracy.
The Rosette API /deduplication endpoint also supports additional language and script fields beyond those offered in RapidMiner to further improve your results.
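As a rough illustration of how these optional fields might travel with each name, the sketch below assembles a request body as a plain Python dictionary. The field names (`names`, `text`, `entityType`, `language`, `script`, `threshold`) and the helper `build_dedupe_payload` are assumptions made for this example; check the Rosette API documentation for the actual request schema before relying on them. No network call is made.

```python
import json


def build_dedupe_payload(names, threshold=0.8, entity_type="PERSON",
                         language=None, script=None):
    """Build a JSON-serializable request body (field names are assumptions)."""
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    entries = []
    for text in names:
        entry = {"text": text, "entityType": entity_type}
        # Only include the optional hints when the caller supplies them.
        if language is not None:
            entry["language"] = language
        if script is not None:
            entry["script"] = script
        entries.append(entry)
    return {"names": entries, "threshold": threshold}


payload = build_dedupe_payload(["John Smith", "Jon Smith"], language="eng", script="Latn")
print(json.dumps(payload, indent=2))
```

Leaving the optional hints out of the body when they are unknown, rather than sending empty values, keeps the default case identical to a names-only request.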
Try it yourself
If you need to process large volumes of records or would prefer not to send your data to the cloud, talk to our sales team about custom solutions and on-premises deployments.