How to remove near duplicates, e.g. egg and eggs
Hi,
I am a newbie and I apologize if the question is trivial.
I want to know how to find near-duplicate entries in one attribute. For instance, I want to treat "tomato soup" and "tomatoe soup" as duplicates, and likewise "egg" and "eggs". In your solution, kindly add a screenshot of the operators that you recommend using.
Please help.
Best Answers
-
rfuentealba RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
Hello @Khiba,
I don't know what your data looks like, but here is a process that does what you want. You need the Text Processing extension to run this process (and probably the Operator Toolbox if you run RapidMiner 9.1 or earlier).
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="utility:create_exampleset" compatibility="9.2.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="34">
        <parameter key="generator_type" value="comma separated text"/>
        <parameter key="number_of_examples" value="100"/>
        <parameter key="use_stepsize" value="false"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="input_csv_text" value="id,text&#10;1,tomato tomatoes egg eggs soup salad&#10;2,potato potatoes person people sausage sausages"/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="9.2.000" expanded="true" height="82" name="Set Role" width="90" x="179" y="34">
        <parameter key="attribute_name" value="id"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="9.2.000" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="34">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="nominal"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="file_path"/>
        <parameter key="block_type" value="single_value"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="single_value"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="447" y="34">
        <parameter key="select_attributes_and_weights" value="false"/>
        <list key="specify_weights"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="581" y="34">
        <parameter key="create_word_vector" value="true"/>
        <parameter key="vector_creation" value="TF-IDF"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34">
            <parameter key="mode" value="non letters"/>
            <parameter key="characters" value=".:"/>
            <parameter key="language" value="English"/>
            <parameter key="max_token_length" value="3"/>
          </operator>
          <operator activated="true" class="text:stem_snowball" compatibility="8.1.000" expanded="true" height="68" name="Stem (Snowball)" width="90" x="179" y="34">
            <parameter key="language" value="English"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
Hope this helps,
Rodrigo.
-
rfuentealba RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
Forgot to explain the process. I apologise, I'm a bit sick.
What I did was:
- Convert the data to documents (as required for text processing).
- Process each document, and inside:
  - Tokenize (convert the text into tokens).
  - Stem (I used the Snowball stemming algorithm).
- The result you want is the word list, which you can convert back to data or use however you need. It is also returned as an example set.
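For anyone who wants to see the same idea outside RapidMiner, the steps above can be sketched in a few lines of Python. This is only a rough illustration, not the process itself: the function names (`tokenize`, `crude_stem`, `normalise`, `group_near_duplicates`) are hypothetical, and `crude_stem` is a very simple suffix stripper standing in for Snowball, not the real Snowball algorithm. It assumes lower-casing the text is acceptable and that entries whose stemmed tokens match should count as near duplicates.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split on runs of non-letter characters (similar in spirit to 'non letters' mode)."""
    return [tok for tok in re.split(r"[^A-Za-z]+", text.lower()) if tok]


def crude_stem(token):
    """Strip a trailing 'es', 's' or 'e'.

    Assumption: this is a stand-in for a real stemmer such as Snowball and
    will get many English words wrong; it only illustrates the idea.
    """
    for suffix in ("es", "s", "e"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalise(entry):
    """Tokenize an entry, stem every token, and join the stems back together."""
    return " ".join(crude_stem(tok) for tok in tokenize(entry))


def group_near_duplicates(entries):
    """Group entries that share the same normalised form."""
    groups = defaultdict(list)
    for entry in entries:
        groups[normalise(entry)].append(entry)
    return dict(groups)


if __name__ == "__main__":
    sample = ["tomato soup", "tomatoe soup", "egg", "eggs", "salad"]
    for key, members in group_near_duplicates(sample).items():
        print(key, "<-", members)
```

Entries that land in the same group can then be treated as one value, for example by keeping the first member of each group. A real stemmer will behave differently from this stand-in on irregular words, so check the groups before removing anything.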
All the best,
Rodrigo.
Answers