Can RapidMiner de-identify or make data anonymous?

CraigBostonUSA Administrator, Employee, Member Posts: 34 RM Team Member
edited December 2018 in Help

Is it possible to anonymize or de-identify data with RapidMiner?

 

Thanks!

Best Answer

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Solution Accepted

    Yup. Try the "Obfuscate" operator.


    Scott

Answers

  • earmijo Member Posts: 270 Unicorn

    I think "Obfuscate" will mask some variables, but that's all it does. For real anonymization (k-anonymity, l-diversity, etc.), look into specialized software. Here is one I have used; it is very powerful:

     

    http://arx.deidentifier.org/

     

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Obfuscate will anonymize nominal attribute names and nominal data values:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve Iris" width="90" x="179" y="136">
    <parameter key="repository_entry" value="//Samples/data/Iris"/>
    </operator>
    <operator activated="true" class="productivity:obfuscate" compatibility="7.6.001" expanded="true" height="82" name="Obfuscate" width="90" x="380" y="136">
    <parameter key="use_local_random_seed" value="true"/>
    </operator>
    <connect from_op="Retrieve Iris" from_port="output" to_op="Obfuscate" to_port="example set input"/>
    <connect from_op="Obfuscate" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    I agree that this is not very robust. If you truly want to protect your data, or you have numerical values, I would recommend a proper hashing algorithm of some kind. I often create a hash myself using a form of public-key crypto rather than use Obfuscate.
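As a rough illustration of the hashing idea, here is a minimal Python sketch that replaces an identifier column with a keyed hash. It uses HMAC-SHA256 from the standard library, which is a keyed hash and not the public-key approach mentioned above; the function name `pseudonymize`, the key, and the sample records are all hypothetical. Note that this is pseudonymization: the same input always maps to the same token, so records stay linkable.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return the HMAC-SHA256 digest of value as a 64-character hex string."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key and records, for illustration only.
# In practice the key must be long, random, and stored away from the data.
SECRET_KEY = b"replace-with-a-long-random-secret"

records = [
    {"student_id": "S-1001", "grade": 3.7},
    {"student_id": "S-1002", "grade": 2.9},
    {"student_id": "S-1001", "grade": 3.9},
]

# Replace the identifier with its keyed hash; other fields are left untouched.
masked = [
    {**row, "student_id": pseudonymize(row["student_id"], SECRET_KEY)}
    for row in records
]
```

Using a secret key matters here: a plain unkeyed hash of a short identifier can often be reversed by hashing every plausible value and comparing.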

     

    Thanks for the suggestion for that software, @earmijo - I will need to check that out.

     

    Scott

  • earmijo Member Posts: 270 Unicorn

    Hi Scott. I am currently learning about the subject. Completely masking PII (personally identifiable information) is standard practice. The key problem with anonymization is what to do with quasi-identifiers: if you completely mask them, the dataset is rendered useless. Finding the right balance between usefulness and anonymity seems to be the goal. (Some researchers believe this compromise is not possible; see, for instance, The False Promise of Anonymization: "Data can be either useful or perfectly anonymous but never both.") A nice intro to the main issues is in this video:

     

    https://www.youtube.com/watch?v=O3hxp117EHs
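To make the quasi-identifier trade-off concrete, here is a small Python sketch, not taken from ARX or RapidMiner, that measures k-anonymity as the size of the smallest group of rows sharing the same quasi-identifier values. The helper names `k_anonymity` and `generalize`, the column names, and the toy rows are all made up for illustration.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    combination of quasi-identifier values (0 for an empty dataset)."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


def generalize(row):
    """Coarsen the quasi-identifiers: 10-year age band and 3-digit ZIP prefix."""
    decade = row["age"] // 10 * 10
    return {**row, "age": f"{decade}-{decade + 9}", "zip": row["zip"][:3] + "**"}


# Hypothetical toy data; "age" and "zip" are treated as quasi-identifiers.
rows = [
    {"age": 34, "zip": "02139", "diagnosis": "flu"},
    {"age": 36, "zip": "02138", "diagnosis": "asthma"},
    {"age": 41, "zip": "02446", "diagnosis": "flu"},
    {"age": 47, "zip": "02445", "diagnosis": "diabetes"},
]
quasi_identifiers = ["age", "zip"]

k_raw = k_anonymity(rows, quasi_identifiers)
k_generalized = k_anonymity([generalize(r) for r in rows], quasi_identifiers)
```

In this toy data every raw row is unique on age and ZIP, while the generalized rows fall into groups of two. The cost is precision: exact ages and ZIP codes are gone, which is the usefulness-versus-anonymity balance described above.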

     

     

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    That is very interesting, @earmijo. Thank you for sharing. I worked a lot with PII-sensitive data when I was freelancing (mostly FERPA compliance here in the USA, protecting the PII of student data from schools), and I like the way you phrase this tough quandary. Food for thought.

     

    Scott

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hey @earmijo,

    Looks like a lot of brain food for me :). Since the library you linked is Java, have you considered embedding it into RM as operators?

     

    Cheers,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • earmijo Member Posts: 270 Unicorn

    Hi @mschmitz

     

    That would be awesome (an extension linking both applications). Unfortunately, it is beyond my skills. I know how to drive the car, but I have no idea what's under the hood :-) 

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

     

    an important question first - how good is your Java? :)

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    @sgenzer, maybe another, simpler option would be to add this one to the API dev list, since it looks like some of the basics would be interoperable that way: http://arx.deidentifier.org/api/

    But clearly the best option would be to develop a full extension using their API toolkit (they have a lot of options)!
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Excellent idea, @Telcontar120. Noted. @mschmitz and I are making some (very slow) progress on this project. The delay is purely my fault: I am an abysmal Java programmer, and the rest of the dev team is completely booked with RM 8.0. We will get there!

     

    Scott

     

     
