mystery science data mining problem 3000

MBA_Data_Miner Member Posts: 21 Contributor II
In the not too distant future-

I have an idea I would like to try out ... but I have no idea what operators could accomplish this:

What I would like to do is scan a database (currently a flat file) and find any records with matching fields.

Then assign a "relationship ID" to each field to help find relationships in the data.
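
One way this could be sketched in plain Python (outside RapidMiner) is with a union-find structure: any two records that share a non-empty value in one of the compared fields end up in the same group, and groups chain together across several records. This is only a rough sketch under assumptions: the relationship ID is attached to each record, matching is exact after lowercasing and trimming, and the function name `assign_relationship_ids` and the sample fields (`email`, `phone`) are made up for illustration.

```python
def assign_relationship_ids(records, fields):
    """Give records that share any non-empty field value the same relationship ID."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # (field, normalized value) -> index of first record with it
    for idx, rec in enumerate(records):
        for field in fields:
            value = str(rec.get(field) or "").strip().lower()
            if not value:
                continue  # empty values never link records
            key = (field, value)
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    # Number the groups 1, 2, 3, ... in order of first appearance.
    ids = {}
    result = []
    for idx in range(len(records)):
        root = find(idx)
        ids.setdefault(root, len(ids) + 1)
        result.append(ids[root])
    return result


# Hypothetical sample data, not taken from any real database.
records = [
    {"name": "Ann Lee", "email": "ann@example.com", "phone": "555-0100"},
    {"name": "A. Lee", "email": "ann@example.com", "phone": "555-0199"},
    {"name": "Bob Roy", "email": "bob@example.com", "phone": "555-0199"},
    {"name": "Cy Dunn", "email": "cy@example.com", "phone": "555-0142"},
]
relationship_ids = assign_relationship_ids(records, ["email", "phone"])
print(relationship_ids)
```

Here the first three records share one ID because the first two share an email and the second and third share a phone number, even though the first and third have no field in common.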


(Later I would like to include fuzzy matching as well, above a certain match threshold, using Jaccard similarity or something similar.)
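
For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. A minimal Python sketch of a thresholded fuzzy match might look like the following, assuming the strings are compared as sets of character trigrams; the helper names and the 0.5 threshold are illustrative choices, not anything prescribed by RapidMiner.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, whitespace-normalized string."""
    cleaned = " ".join(text.lower().split())
    if len(cleaned) < n:
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard similarity |A & B| / |A | B| of the two strings' n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0  # treat two empty strings as non-matching
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def is_fuzzy_match(a, b, threshold=0.5, n=3):
    """True when the similarity reaches the chosen (assumed) threshold."""
    return jaccard(a, b, n) >= threshold


print(jaccard("Jon Smith", "John Smith"))
print(is_fuzzy_match("Jon Smith", "Jane Doe"))
```

The threshold would need tuning on real data: too low and unrelated records get linked, too high and small typos are missed.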


Any thoughts?

Best regards, J.

Answers

  • MBA_Data_Miner Member Posts: 21 Contributor II
    Also, to clarify: my "database" is nothing special, likely just a small Microsoft Access database for testing purposes. The flat file would be a query output from the .accdb file, probably a CSV or Excel file for simplicity.
  • MBA_Data_Miner Member Posts: 21 Contributor II
    Update: Additional research online has revealed that my mystery problem is called "entity resolution". Does anyone have experience with this? The project would also be on a flat Excel file rather than an Access file.
  • MBA_Data_Miner Member Posts: 21 Contributor II
    Actually, RapidMiner should offer an entity resolution extension. It would be really useful. The basic idea is to programmatically discover relationships in your data. It goes beyond simple duplicate matching between two records and could extend across several records...
  • MBA_Data_Miner Member Posts: 21 Contributor II
    I am wondering if some sort of nearest-neighbors clustering would be appropriate for this type of problem... Any thoughts?
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Hi there,

    Just thought I'd revive this topic. Fuzzy matching is possible with the Cross Distances operator. Break the field or fields you want to compare into n-grams, then calculate which other records in your dataset are closest.

    Regarding programmatically discovering relationships: I just stumbled upon this fantastic-sounding project that RapidMiner is working on alongside the University of Mannheim.

    "Key idea

    Analysts increasingly have the problem that they know that some data which they need for a project is available somewhere on the Web or in the corporate intranet, but they are unable to find the data. The goal of the 'Data Search for Data Mining' (DS4DM) project is to extend the data mining plattform Rapidminer with data search and data integration functionalities which enable analysts to find relevant data in potentially very large data corpora, and to semi-automatically integrate the discovered data with existing local data."

    You want entity relationships back to your database? How about all of Wikipedia?

    ds4dm.de/en/about/
    http://ub-madoc.bib.uni-mannheim.de/40718/1/DataSearchDemo.pdf
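
The n-gram and cross-distance suggestion in the last answer can also be tried outside RapidMiner. The sketch below is not the Cross Distances operator itself, just a rough Python stand-in under assumptions: each value is turned into a vector of character bigram counts, every pair of records is compared with cosine distance, and each record is paired with its closest other record. The function names and the sample company names are invented for illustration, and the all-pairs comparison is only practical for small datasets.

```python
import math
from collections import Counter


def ngram_counts(text, n=2):
    """Count the character n-grams of a lowercased, whitespace-normalized string."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_distance(a, b):
    """1 - cosine similarity of two n-gram count vectors (1.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def nearest_records(values, n=2):
    """For each value, return (index of closest other value, distance).

    Needs at least two values; compares all pairs, so cost grows quadratically.
    """
    vectors = [ngram_counts(v, n) for v in values]
    nearest = []
    for i, vec in enumerate(vectors):
        best = min(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: cosine_distance(vec, vectors[j]),
        )
        nearest.append((best, cosine_distance(vec, vectors[best])))
    return nearest


# Hypothetical sample values.
names = ["Acme Corporation", "ACME Corp.", "Globex Inc", "Globex Incorporated"]
for name, (j, dist) in zip(names, nearest_records(names)):
    print(f"{name!r} -> {names[j]!r} (distance {dist:.2f})")
```

Pairs whose distance falls under a chosen cutoff could then be fed into a grouping step to produce relationship IDs across several records.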
