Can RapidMiner de-identify or make data anonymous?
CraigBostonUSA
Employee, Member Posts: 34 RM Team Member
Best Answer
sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
yup. Try "Obfuscate".
Scott
Answers
I think "Obfuscate" will mask some variables, but that's all it does. Look into specialized software for anonymization (k-anonymity, l-diversity, etc.). I have one in mind:
I've used it and it is very powerful:
http://arx.deidentifier.org/
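To make the two terms above concrete, here is a minimal sketch in plain Python of how k-anonymity and the simplest ("distinct") form of l-diversity can be measured on a table. It is not ARX's API and not a RapidMiner operator; the function name `k_and_l`, the column names, and the records are all made up for illustration.

```python
from collections import defaultdict


def k_and_l(rows, quasi_identifiers, sensitive):
    """Return (k, l) for a list of dict records.

    k = size of the smallest group of rows sharing the same
        quasi-identifier values (k-anonymity)
    l = smallest number of distinct sensitive values found in
        any such group (distinct l-diversity)
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row[sensitive])
    k = min(len(values) for values in groups.values())
    l = min(len(set(values)) for values in groups.values())
    return k, l


# Made-up records, for illustration only.
records = [
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "021**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "021**", "age": "40-49", "diagnosis": "flu"},
]

k, l = k_and_l(records, ["zip", "age"], "diagnosis")
print(f"k = {k}, l = {l}")
```

In this toy table every combination of zip and age appears at least twice, but one group contains only a single diagnosis, so anyone known to be in that group has their diagnosis revealed even though no name is present. That gap is what l-diversity is meant to catch.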
Obfuscate will anonymize nominal attribute names and nominal data values.
I agree that this is not extremely robust and would recommend a proper hashing algorithm of some kind if you truly want to protect your data and/or you have numerical values. Often I will create a hash myself using a form of public key crypto rather than use obfuscate.
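As a rough illustration of the hashing idea, here is a minimal Python sketch. Note that it swaps in a keyed hash (HMAC-SHA256 from the standard library) rather than the public key approach mentioned above, and it is not what the Obfuscate operator does internally. The function name `pseudonymize`, the key, and the sample IDs are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace a value with a keyed SHA-256 digest (HMAC).

    The value is converted to text first, so numerical values
    can be handled the same way as nominal ones.
    """
    message = str(value).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


# Hypothetical key: in practice it would be generated randomly
# and stored separately from the data.
SECRET_KEY = b"replace-with-a-long-random-secret"

# Made-up identifiers, one of them numeric.
student_ids = ["S-1001", "S-1002", "S-1001", 20417]
tokens = [pseudonymize(s, SECRET_KEY) for s in student_ids]

for original, token in zip(student_ids, tokens):
    print(original, "->", token[:16], "...")
```

The same input always maps to the same token, so joins and group-bys still work on the hashed column. The key matters: with a plain unkeyed hash, short or predictable values such as student IDs can often be recovered simply by hashing every candidate value and comparing.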
Thanks for the suggestion for that software, @earmijo - I will need to check that out.
Scott
Hi Scott. I am currently learning about the subject. Completely masking PII (personal identifiers) is standard practice. The key problem with anonymization is what to do with quasi-identifiers. If you completely mask them, the dataset is rendered useless. Finding the right balance between usefulness and anonymity seems to be the goal. (Some researchers believe this compromise is not possible. See for instance The False Promise of Anonymization: "Data can be either useful or perfectly anonymous but never both.") This video gives a nice intro to the main issues:
https://www.youtube.com/watch?v=O3hxp117EHs
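The trade-off around quasi-identifiers can be sketched in a few lines of Python. This is only a toy illustration under made-up assumptions: the records, the column names, and the choice of 10-year age bands and 3-digit zip prefixes are all hypothetical, and real tools search over many possible generalizations rather than applying one fixed rule.

```python
from collections import Counter


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())


def generalize(row):
    """Coarsen the quasi-identifiers: 10-year age bands, 3-digit zip prefix."""
    low = (row["age"] // 10) * 10
    return {
        **row,
        "age": f"{low}-{low + 9}",
        "zip": row["zip"][:3] + "**",
    }


# Made-up records, for illustration only.
records = [
    {"age": 31, "zip": "02139", "grade": "A"},
    {"age": 34, "zip": "02134", "grade": "B"},
    {"age": 38, "zip": "02138", "grade": "A"},
    {"age": 42, "zip": "02141", "grade": "C"},
    {"age": 47, "zip": "02142", "grade": "B"},
]

quasi_identifiers = ["age", "zip"]
k_raw = smallest_group(records, quasi_identifiers)

coarse = [generalize(r) for r in records]
k_coarse = smallest_group(coarse, quasi_identifiers)

print("smallest group before generalization:", k_raw)
print("smallest group after generalization: ", k_coarse)
```

Before generalization every row is unique on age and zip, so each person can be singled out. After generalization the smallest group has two rows, but exact ages and full zip codes are gone, which is the loss of usefulness described above.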
That is very interesting, @earmijo. Thank you for sharing. I worked a lot with PII-sensitive data when I was freelancing (mostly FERPA compliance here in the USA - protecting PII of student data from schools) and I like the way you phrase this tough quandary. Food for thought.
Scott
Hey @earmijo,
looks like a lot of brain food for me. Since your linked library is Java - have you considered embedding it into RM as operators?
Cheers,
Martin
Dortmund, Germany
Hi @mschmitz
That would be awesome (an extension linking both applications). Unfortunately, it is beyond my skills. I know how to drive the car, but I have no idea what's under the hood :-)
Hi,
an important question first - how good is your Java?
Best,
Martin
Dortmund, Germany
But clearly the best option would be to develop a full extension using their API toolkit (they have a lot of options)!
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Excellent idea, @Telcontar120. Noted. @mschmitz and I are making some (very slow) progress on this project. The delay is purely my fault - I am an abysmal Java programmer and the rest of the dev team is completely booked with RM 8.0. We will get there!
Scott