Handling duplicated columns, but with text!

Nada_Faisal1991 · September 2020

Hello there, fellow miners,

I'm a RapidMiner beginner, and I am trying to detect and then delete duplicated columns in an example set that holds text rather than numbers.
With numbers it was easy: removing correlated attributes did the job perfectly.
With text, things got complicated. Is there a way to either a) do something similar to the correlation removal for numbers, or b) convert the text to numbers while keeping the columns intact, rather than splitting them by value the way the "Nominal to Numerical" operator does?

Thank you.

MartinLiebig · September 2020

Hi @Nada_Faisal1991 ,

If you want to delete exact duplicate texts, you can use the Remove Duplicates operator. If you want to remove texts that are merely similar, such as "RapidMiner is great!" and "rapidminer is great", then it gets a bit trickier.
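To make the "similar texts" case concrete, here is a minimal Python sketch. It assumes that "similar" only means "equal after ignoring case, punctuation and extra whitespace", which is just one possible definition and not something built into Remove Duplicates. The helper names `normalize` and `remove_near_duplicates` are made up for this illustration.

```python
import string

def normalize(text):
    """Lowercase the text, strip ASCII punctuation and collapse whitespace."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())

def remove_near_duplicates(texts):
    """Keep the first text for each normalized form, preserving input order."""
    seen = set()
    kept = []
    for text in texts:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept

texts = ["RapidMiner is great!", "rapidminer is great", "RapidMiner is slow"]
kept = remove_near_duplicates(texts)
print(kept)  # ['RapidMiner is great!', 'RapidMiner is slow']
```

Anything fuzzier than this (typos, reordered words) would need a real similarity measure, which is where it gets tricky.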

Best,

Martin

Nada_Faisal1991 · September 2020

mschmitz, thank you for the response, but Filter Example did not work for me, or at least I could not figure out how to make it work.

Suppose I have a table with the following text content:

-------------------------
| 1     | 2     | 3     |
-------------------------
| Tree  | Fruit | Tree  |
| Fruit | Fruit | Fruit |
| Fruit | True  | Fruit |
| Tree  | Tree  | Tree  |
-------------------------

I need to either remove column 3 or at least detect that column 3 is an exact duplicate of column 1.

Thank you all for your wisdom.

Vanlal · September 2020

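Hi,
You can use the process below. It transposes the example set so that columns become rows, applies Remove Duplicates, transposes back, and finally drops the id attribute with Select Attributes: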

<?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="utility:create_exampleset" compatibility="9.6.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="136">
        <parameter key="generator_type" value="comma separated text"/>
        <parameter key="number_of_examples" value="100"/>
        <parameter key="use_stepsize" value="false"/>
        <list key="function_descriptions"/>
        <parameter key="add_id_attribute" value="false"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="input_csv_text" value="1,2,3&#10;Tree,Fruit,Tree&#10;Fruit,Fruit,Fruit&#10;Fruit,True,Fruit&#10;Tree,Tree,Tree"/>
        <parameter key="column_separator" value=","/>
        <parameter key="parse_all_as_nominal" value="false"/>
        <parameter key="decimal_point_character" value="."/>
        <parameter key="trim_attribute_names" value="true"/>
      </operator>
      <operator activated="true" class="transpose" compatibility="9.6.000" expanded="true" height="82" name="Transpose" width="90" x="179" y="136"/>
      <operator activated="true" class="remove_duplicates" compatibility="9.6.000" expanded="true" height="103" name="Remove Duplicates" width="90" x="313" y="136">
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="treat_missing_values_as_duplicates" value="false"/>
      </operator>
      <operator activated="true" class="transpose" compatibility="9.6.000" expanded="true" height="82" name="Transpose (2)" width="90" x="447" y="136"/>
      <operator activated="true" class="select_attributes" compatibility="9.6.000" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="136">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="id"/>
        <parameter key="attributes" value="id"/>
        <parameter key="regular_expression" value="id"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Transpose" to_port="example set input"/>
      <connect from_op="Transpose" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
      <connect from_op="Remove Duplicates" from_port="example set output" to_op="Transpose (2)" to_port="example set input"/>
      <connect from_op="Transpose (2)" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
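For anyone who wants to see the transpose / remove duplicates / transpose back idea outside RapidMiner, here is a rough Python sketch on the same sample data. It is only an illustration of the logic, not a translation of the operators: the function name `drop_duplicate_columns` is made up, and keeping the first copy of each duplicated column is an assumption of this sketch.

```python
def drop_duplicate_columns(header, rows):
    """Return header and rows with exact duplicate columns removed.

    The first occurrence of each distinct column is kept and the
    original column order is preserved.
    """
    columns = list(zip(*rows))  # transpose: one tuple per column
    seen = set()
    keep = []  # indices of the columns that survive
    for index, column in enumerate(columns):
        if column not in seen:
            seen.add(column)
            keep.append(index)
    # "Transpose back" by selecting the surviving indices from every row.
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows

header = ["1", "2", "3"]
rows = [
    ["Tree", "Fruit", "Tree"],
    ["Fruit", "Fruit", "Fruit"],
    ["Fruit", "True", "Fruit"],
    ["Tree", "Tree", "Tree"],
]
new_header, new_rows = drop_duplicate_columns(header, rows)
print(new_header)  # ['1', '2']
```

Column 3 is dropped because its values match column 1 row for row; column 2 stays because it differs in at least one row.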

Hope this solves it.

Nada_Faisal1991 · September 2020

Vanlal, thank you so much, that worked!

Altair RapidMiner Community