Remove Duplicate Examples
Hi,
I'm working on some genetics data. I have 6151 examples and 157 attributes. My attributes are patient IDs and my examples are gene names. My goal is to transpose the matrix table. Here is a sample of my data set:
My problem now is that I can't use the "Transpose" operator because there are duplicate row/example names. To transpose the table, the example names need to be unique, because they become the attribute names. I want to find all the examples that share the same name and edit their names. I was thinking about doing a loop, but I don't really know where to start or which operators to use to change the row names. Can somebody give me some advice on how to achieve this?
Thank you!
Best Answers
cdaponte Member Posts: 29 Maven
You can use the Remove Duplicates operator and select the output that shows you the duplicate examples. Once you have the duplicates, you can rename them with the "Rename" or "Replace" operator.
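For readers who want to inspect the duplicates outside RapidMiner first, here is a minimal Python sketch of the same idea: it lists every name that occurs more than once, together with the row positions where it appears. The gene names are made up for illustration, and `find_duplicates` is a hypothetical helper, not a RapidMiner function.

```python
from collections import defaultdict

def find_duplicates(names):
    """Map each repeated name to the list of row positions where it occurs."""
    positions = defaultdict(list)
    for idx, name in enumerate(names):
        positions[name].append(idx)
    # Keep only the names that appear more than once.
    return {name: rows for name, rows in positions.items() if len(rows) > 1}

# Made-up gene names, for illustration only.
gene_names = ["BRCA1", "TP53", "BRCA1", "EGFR", "TP53", "BRCA1"]
duplicates = find_duplicates(gene_names)
print(duplicates)
```

The resulting positions tell you which rows would need a new name before transposing.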
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi @gracewei,
Nice challenge, but honestly, I don't see a way to do this automatically with RapidMiner's native operators...
... however, there is a (relatively) simple solution using a Python script.
Basically, the script appends a number to each duplicated name, and this number is incremented with each further occurrence of that name.
Concretely, the output example set looks like this:
After executing this process, all the values of the "gene_name" attribute are unique, and thus you can transpose your example set.
To execute this process, you need to:
- Install Python on your computer
- Install the Python Scripting extension in RapidMiner (from the Marketplace)
The process:
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.4.000-BETA">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.4.000-BETA" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" breakpoints="after" class="read_excel" compatibility="9.4.000-BETA" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
        <parameter key="excel_file" value="D:\Lionel\Formations_DataScience\Rapidminer\Tests_Rapidminer\Rename_Duplicates\Rename_Duplicates.xlsx"/>
        <parameter key="sheet_selection" value="sheet number"/>
        <parameter key="sheet_number" value="1"/>
        <parameter key="imported_cell_range" value="A1"/>
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="first_row_as_names" value="true"/>
        <list key="annotations"/>
        <parameter key="date_format" value=""/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="read_all_values_as_polynominal" value="false"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="gene_name.true.polynominal.attribute"/>
          <parameter key="1" value="Target.true.integer.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="9.3.000" expanded="true" height="103" name="Execute Python" width="90" x="313" y="34">
        <parameter key="script" value="import pandas&#10;from collections import Counter # Counter counts the number of occurrences of each item&#10;from itertools import tee, count&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def uniquify(seq, suffs = count(1)):&#10;    &quot;&quot;&quot;Make all the items unique by adding a suffix (1, 2, etc).&#10;&#10;    `seq` is mutable sequence of strings.&#10;    `suffs` is an optional alternative suffix iterable.&#10;    &quot;&quot;&quot;&#10;    not_unique = [k for k,v in Counter(seq).items() if v&gt;1] # so we have: ['name', 'zip']&#10;    # suffix generator dict - e.g., {'name': &lt;my_gen&gt;, 'zip': &lt;my_gen&gt;}&#10;    suff_gens = dict(zip(not_unique, tee(suffs, len(not_unique))))&#10;    for idx,s in enumerate(seq):&#10;        try:&#10;            suffix = str(next(suff_gens[s]))&#10;        except KeyError:&#10;            # s was unique&#10;            continue&#10;        else:&#10;            seq[idx] += suffix&#10;&#10;def rm_main(data):&#10;    mylist = data['gene_name']&#10;    uniquify(mylist, (f'_{x!s}' for x in range(1, 100)))&#10;    data['gene_name'] = mylist&#10;    # connect 2 output ports to see the results&#10;    return data"/>
        <parameter key="notebook_cell_tag_filter" value=""/>
        <parameter key="use_default_python" value="true"/>
        <parameter key="package_manager" value="conda (anaconda)"/>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Execute Python" to_port="input 1"/>
      <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
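Because the script is stored inside an XML attribute of the process, its logic is hard to read there. The following is a standalone sketch of the same renaming idea on a plain Python list, so it can be tried without RapidMiner. It is a simplified rewrite, not the exact script from the process: the helper name `add_suffixes` and the sample names are assumptions for illustration. As in the process, every occurrence of a repeated name gets a suffix (`_1`, `_2`, ...), while names that occur only once are left unchanged.

```python
from collections import Counter

def add_suffixes(names):
    """Return a copy of `names` where each repeated name gets _1, _2, ..."""
    totals = Counter(names)   # how often each name occurs overall
    seen = Counter()          # how often each name has been seen so far
    result = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            result.append(f"{name}_{seen[name]}")
        else:
            result.append(name)  # unique names stay unchanged
    return result

# Made-up gene names, for illustration only.
gene_names = ["BRCA1", "TP53", "BRCA1", "EGFR", "BRCA1"]
print(add_suffixes(gene_names))
# ['BRCA1_1', 'TP53', 'BRCA1_2', 'EGFR', 'BRCA1_3']
```

Two caveats: the script in the process draws its suffixes from `range(1, 100)`, so it assumes that no name occurs more than 99 times, and neither version checks whether a suffixed name such as `BRCA1_1` already exists in the data.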
Hope this will help in the future ...
Regards,
Lionel
Answers
You can use RapidMiner operators to transpose and rename. The tricky part is creating a new list of names without duplicates.
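To illustrate why the names must be unique first, here is a small Python sketch of the transpose step on its own. It is not the RapidMiner Transpose operator. The function `transpose`, the output key `"id"`, and the sample values are assumptions made up for this example. Each row name becomes a column name in the result, so the sketch refuses to run when names repeat.

```python
def transpose(rows, id_key="gene_name"):
    """Turn rows keyed by `id_key` into one row per remaining column."""
    ids = [row[id_key] for row in rows]
    if len(set(ids)) != len(ids):
        # Row names become column names after transposing,
        # so duplicates would overwrite each other.
        raise ValueError("row names must be unique before transposing")
    columns = [key for key in rows[0] if key != id_key]
    return [
        {"id": col, **{row[id_key]: row[col] for row in rows}}
        for col in columns
    ]

# Made-up values; the names were already made unique with _1/_2 suffixes.
rows = [
    {"gene_name": "BRCA1_1", "patient_A": 0.5, "patient_B": 1.2},
    {"gene_name": "BRCA1_2", "patient_A": 0.7, "patient_B": 0.9},
    {"gene_name": "TP53", "patient_A": 2.1, "patient_B": 1.8},
]
for row in transpose(rows):
    print(row)
```

After this step, each patient is one row and each (now unique) gene name is one column.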
After transformation,
Cheers,
YY
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts