mystery science data mining problem 3000
MBA_Data_Miner
Member Posts: 21 Contributor II
In the not too distant future-
I have an idea I would like to try out ... but I have no idea what operators could accomplish this:
What I would like to do is scan a database (currently a flat file) and find any records with matching fields.
Then assign a "relationship ID" to each matching record to help find relationships in the data.
(Later I would also like to include fuzzy matching above a certain match threshold, using Jaccard similarity or something similar.)
Any thoughts?
Best regards, J.
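For anyone who wants to prototype the exact-match part outside RapidMiner first, here is a minimal Python sketch. It assumes records are plain dicts, that "matching" means an identical value in the same field after trimming and lower-casing, and that the "relationship ID" belongs to the record rather than to the individual field. The function name `assign_relationship_ids` and the sample field names are made up for illustration.

```python
def assign_relationship_ids(records, fields):
    """Give records that share a value in any of `fields` the same ID.

    Links are transitive: if A matches B on one field and B matches C
    on another, all three end up with the same relationship ID.
    """
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # (field, normalized value) -> index of first record with it
    for idx, rec in enumerate(records):
        for field in fields:
            value = rec.get(field)
            if value is None or str(value).strip() == "":
                continue  # empty values never create a relationship
            key = (field, str(value).strip().lower())
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    # Relabel the roots as consecutive IDs starting at 1.
    ids = {}
    return [ids.setdefault(find(i), len(ids) + 1) for i in range(len(records))]


if __name__ == "__main__":
    rows = [
        {"name": "Ann", "phone": "555"},
        {"name": "Bob", "phone": "555"},
        {"name": "bob", "phone": "777"},
        {"name": "Cy", "phone": "999"},
    ]
    print(assign_relationship_ids(rows, ["name", "phone"]))
```

The first three rows are chained together through a shared phone number and a shared name, so they get one ID and the last row gets its own.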
Answers
Just thought I'd revive this topic. Fuzzy matching is possible with the Cross Distances operator. Break the field or fields you want to compare into n-grams, then compute the distances between records to find the ones in your dataset that are closest to each other.
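To make the n-gram idea concrete, here is a rough Python sketch of the same approach. It is not the Cross Distances operator itself, just a plain illustration that assumes character trigrams, Jaccard similarity, and a brute-force comparison of every pair. The function names and the 0.5 threshold are placeholders.

```python
from itertools import combinations


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased, whitespace-normalized string."""
    text = " ".join(text.lower().split())
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0  # treat two empty values as unrelated rather than identical
    return len(a & b) / len(a | b)


def fuzzy_pairs(values, threshold=0.5, n=3):
    """Return (i, j, score) for every pair of values scoring at or above the threshold."""
    grams = [char_ngrams(v, n) for v in values]
    pairs = []
    for i, j in combinations(range(len(values)), 2):
        score = jaccard(grams[i], grams[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    names = ["John Smith", "Jon Smith", "Mary Jones"]
    for i, j, score in fuzzy_pairs(names, threshold=0.4):
        print(names[i], "<->", names[j], round(score, 2))
```

Comparing every pair is quadratic in the number of records, so for a large flat file you would probably want to restrict comparisons first, for example to records that share at least one n-gram.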
Regarding programmatically discovering relationships: I just stumbled upon this fantastic-sounding project that RapidMiner is working on with the University of Mannheim.
"Key idea
Analysts increasingly have the problem that they know that some data which they need for a project is available somewhere on the Web or in the corporate intranet, but they are unable to find the data. The goal of the 'Data Search for Data Mining' (DS4DM) project is to extend the data mining plattform Rapidminer with data search and data integration functionalities which enable analysts to find relevant data in potentially very large data corpora, and to semi-automatically integrate the discovered data with existing local data."
You want entity relationships back to your database? How about all of Wikipedia?
ds4dm.de/en/about/
http://ub-madoc.bib.uni-mannheim.de/40718/1/DataSearchDemo.pdf