Handling duplicated columns, but with text!

Nada_Faisal1991 Member Posts: 3 Newbie
edited September 2020 in Help
Hello there, fellow miners,

I'm a RapidMiner beginner, and I am trying to detect and then delete duplicated columns in an example set that holds text rather than numbers.
With numbers it was easy: removing correlated attributes did the job perfectly.
But things got complicated with text. Is there a way to either a) do something similar to the correlation removal for numbers, or b) convert the text to numbers while keeping the columns intact, rather than splitting them by value as the "Nominal to Numerical" operator does?

Thank you. :)

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    If you want to delete exact duplicate texts, you can use the Remove Duplicates operator. If you want to remove similar texts, such as
    "RapidMiner is great!" and "rapidminer is great", then it gets a bit more tricky.
    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
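
For the "similar texts" case, one common approach is to normalize each text before comparing. The sketch below is plain Python, not a RapidMiner operator; the helper names `normalize` and `drop_near_duplicates` are made up for illustration, and it assumes that case, ASCII punctuation, and extra whitespace are the only differences that should be ignored.

```python
import string

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())

def drop_near_duplicates(texts):
    """Keep the first text for each normalized form, preserving order."""
    seen = set()
    kept = []
    for text in texts:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept

texts = ["RapidMiner is great!", "rapidminer is great", "Something else"]
print(drop_near_duplicates(texts))
# ['RapidMiner is great!', 'Something else']
```

Anything fuzzier than this (typos, reordered words) would need a similarity measure and a threshold, which is where it really does get tricky.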
  • Nada_Faisal1991 Member Posts: 3 Newbie
    edited September 2020
    mschmitz, thank you for the response, but Filter Example did not work for me, or at least I did not know how to make it work.

    If I have a table with the following text content :

    |  1      |  2      |  3      |
    -------------------------------
    |  Tree   |  Fruit  |  Tree   |
    |  Fruit  |  Fruit  |  Fruit  |
    |  Fruit  |  True   |  Fruit  |
    |  Tree   |  Tree   |  Tree   |
    -------------------------------

    I need to remove column 3, or at least detect that column 3 is an exact duplicate of column 1.

    Thank you all for your wisdom.
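
To make the "detect" half of the question concrete, here is a minimal Python sketch (outside RapidMiner) that reports which columns are exact copies of an earlier one. The function name `find_duplicate_columns` and the dict-of-lists layout are assumptions for illustration; the data is the table above.

```python
def find_duplicate_columns(table):
    """Map each duplicated column name to the first column with identical values."""
    first_seen = {}   # tuple of column values -> name of first column holding them
    duplicates = {}   # duplicate column name -> name of the column it copies
    for name, values in table.items():
        key = tuple(values)
        if key in first_seen:
            duplicates[name] = first_seen[key]
        else:
            first_seen[key] = name
    return duplicates

table = {
    "1": ["Tree", "Fruit", "Fruit", "Tree"],
    "2": ["Fruit", "Fruit", "True", "Tree"],
    "3": ["Tree", "Fruit", "Fruit", "Tree"],
}
print(find_duplicate_columns(table))
# {'3': '1'}
```

The comparison is exact and case-sensitive, so "Tree" and "tree" would count as different values.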
  • Vanlal Member Posts: 12 Contributor II
    Hi,
      You can use the process below:
    <?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="utility:create_exampleset" compatibility="9.6.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="136">
            <parameter key="generator_type" value="comma separated text"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="input_csv_text" value="1,2,3&#10;Tree,Fruit,Tree&#10;Fruit,Fruit,Fruit&#10;Fruit,True,Fruit&#10;Tree,Tree,Tree"/>
            <parameter key="column_separator" value=","/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="transpose" compatibility="9.6.000" expanded="true" height="82" name="Transpose" width="90" x="179" y="136"/>
          <operator activated="true" class="remove_duplicates" compatibility="9.6.000" expanded="true" height="103" name="Remove Duplicates" width="90" x="313" y="136">
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="treat_missing_values_as_duplicates" value="false"/>
          </operator>
          <operator activated="true" class="transpose" compatibility="9.6.000" expanded="true" height="82" name="Transpose (2)" width="90" x="447" y="136"/>
          <operator activated="true" class="select_attributes" compatibility="9.6.000" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="136">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="id"/>
            <parameter key="attributes" value="id"/>
            <parameter key="regular_expression" value="id"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="true"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <connect from_op="Create ExampleSet" from_port="output" to_op="Transpose" to_port="example set input"/>
          <connect from_op="Transpose" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
          <connect from_op="Remove Duplicates" from_port="example set output" to_op="Transpose (2)" to_port="example set input"/>
          <connect from_op="Transpose (2)" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    
    Hope this solves it.
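
The idea behind the process, as I read it, is: transpose so that columns become rows, remove duplicate rows, then transpose back and drop the helper `id` column. A rough Python equivalent of that idea (not what RapidMiner runs internally; `remove_duplicate_columns` is a hypothetical helper) looks like this:

```python
def remove_duplicate_columns(header, rows):
    """Drop columns whose values exactly repeat an earlier column."""
    # Step 1, "Transpose": each column becomes one record, tagged with its name.
    columns = list(zip(header, zip(*rows)))
    # Step 2, "Remove Duplicates": compare the values only, ignoring the name.
    seen = set()
    unique = []
    for name, values in columns:
        if values not in seen:
            seen.add(values)
            unique.append((name, values))
    # Step 3, "Transpose" again: turn the surviving records back into columns.
    new_header = [name for name, _ in unique]
    new_rows = [list(row) for row in zip(*(values for _, values in unique))]
    return new_header, new_rows

header = ["1", "2", "3"]
rows = [
    ["Tree", "Fruit", "Tree"],
    ["Fruit", "Fruit", "Fruit"],
    ["Fruit", "True", "Fruit"],
    ["Tree", "Tree", "Tree"],
]
new_header, new_rows = remove_duplicate_columns(header, rows)
print(new_header)
# ['1', '2']
```

As in the process, the first occurrence of each column is the one that is kept.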
  • Nada_Faisal1991 Member Posts: 3 Newbie
    edited September 2020
    Vanlal, thank you so much, that worked :)