How to apply "Remove Duplicates" on each attribute of an example set?

SylvainM · May 2018

Hello everyone,

I'm a total beginner with Rapid Miner, and I apologize if my question is an obvious one or if it has been posted under another title. I work in Humanities, so I'm usually not very familiar with computing. However, Rapid Miner has given me amazing results so far, and I'd like to dig further.

I have a dateset of songs in which, for each Title, I have all the Artists involved in an column, all the Songwriters in a second column, and all the Producers in a third. I have been able to split these columns in order to have only a single name per box (Artist_1, Artist_2, Songwriter_1,...). The issue I have is that, for each song, an artist can also be the songwriter and/or the producer. I would like to get the name of each person involved only once.

As I have 60.000 songs, I first thought to do a Loop to automatically select one song after the other and apply "Remove Duplicates" for each of them. I would have expected a collection of individual sets, i.e. each song with only each name once. But I can't figure out how to automatically select only one song after another...

I hope my description is clear enough.

Thank you for your help :-)

Edin_Klapic · May 2018

Hi @SylvainM,

Done

Happy Mining,

Edin

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="8.1.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
        <parameter key="csv_file" value="C:\Users\EdinKlapic\Desktop\Song Data-Tableau 1.csv"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="encoding" value="windows-1252"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Title.true.polynominal.attribute"/>
          <parameter key="1" value="Imprint.true.polynominal.attribute"/>
          <parameter key="2" value="Artist.true.polynominal.attribute"/>
          <parameter key="3" value="Label.true.polynominal.attribute"/>
          <parameter key="4" value="Chart Weeks.true.integer.attribute"/>
          <parameter key="5" value="Debut.true.polynominal.attribute"/>
          <parameter key="6" value="Debut Pos.true.integer.attribute"/>
          <parameter key="7" value="Peak.true.polynominal.attribute"/>
          <parameter key="8" value="Peak Pos.true.integer.attribute"/>
          <parameter key="9" value="Album title.true.polynominal.attribute"/>
          <parameter key="10" value="Album release date.true.polynominal.attribute"/>
          <parameter key="11" value="Songwriters.true.polynominal.attribute"/>
          <parameter key="12" value="Producers.true.polynominal.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="8.1.003" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="34">
        <list key="function_descriptions">
          <parameter key="Year" value="prefix([Album release date],4)"/>
        </list>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="8.1.003" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="313" y="34">
        <list key="function_descriptions">
          <parameter key="People" value="concat(Artist,&quot;,&quot;,Songwriters)"/>
        </list>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="8.1.003" expanded="true" height="82" name="Select Attributes (2)" width="90" x="447" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Title|Year|People"/>
      </operator>
      <operator activated="true" class="loop_examples" compatibility="8.1.003" expanded="true" height="103" name="Loop Examples" width="90" x="581" y="34">
        <process expanded="true">
          <operator activated="true" class="filter_example_range" compatibility="8.1.003" expanded="true" height="82" name="Filter Example Range" width="90" x="45" y="34">
            <parameter key="first_example" value="%{example}"/>
            <parameter key="last_example" value="%{example}"/>
          </operator>
          <operator activated="true" class="work_on_subset" compatibility="8.1.003" expanded="true" height="82" name="Work on Subset" width="90" x="179" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="People"/>
            <process expanded="true">
              <operator activated="true" class="split" compatibility="8.1.003" expanded="true" height="82" name="Split (2)" width="90" x="45" y="34"/>
              <operator activated="true" class="trim" compatibility="8.1.003" expanded="true" height="82" name="Trim (2)" width="90" x="179" y="34"/>
              <operator activated="true" class="transpose" compatibility="8.1.003" expanded="true" height="82" name="Transpose (2)" width="90" x="313" y="34"/>
              <operator activated="true" class="select_attributes" compatibility="8.1.003" expanded="true" height="82" name="Select Attributes (3)" width="90" x="45" y="187">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="id"/>
                <parameter key="invert_selection" value="true"/>
                <parameter key="include_special_attributes" value="true"/>
              </operator>
              <operator activated="true" class="remove_duplicates" compatibility="8.1.003" expanded="true" height="103" name="Remove Duplicates (2)" width="90" x="179" y="187"/>
              <operator activated="true" class="aggregate" compatibility="8.1.003" expanded="true" height="82" name="Aggregate (2)" width="90" x="313" y="187">
                <list key="aggregation_attributes">
                  <parameter key="att_1" value="concatenation"/>
                </list>
                <parameter key="only_distinct" value="true"/>
              </operator>
              <connect from_port="exampleSet" to_op="Split (2)" to_port="example set input"/>
              <connect from_op="Split (2)" from_port="example set output" to_op="Trim (2)" to_port="example set input"/>
              <connect from_op="Trim (2)" from_port="example set output" to_op="Transpose (2)" to_port="example set input"/>
              <connect from_op="Transpose (2)" from_port="example set output" to_op="Select Attributes (3)" to_port="example set input"/>
              <connect from_op="Select Attributes (3)" from_port="example set output" to_op="Remove Duplicates (2)" to_port="example set input"/>
              <connect from_op="Remove Duplicates (2)" from_port="example set output" to_op="Aggregate (2)" to_port="example set input"/>
              <connect from_op="Aggregate (2)" from_port="example set output" to_port="example set"/>
              <portSpacing port="source_exampleSet" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="rename" compatibility="8.1.003" expanded="true" height="82" name="Rename" width="90" x="313" y="34">
            <parameter key="old_name" value="concat(att_1)"/>
            <parameter key="new_name" value="People"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <connect from_port="example set" to_op="Filter Example Range" to_port="example set input"/>
          <connect from_op="Filter Example Range" from_port="example set output" to_op="Work on Subset" to_port="example set"/>
          <connect from_op="Work on Subset" from_port="example set" to_op="Rename" to_port="example set input"/>
          <connect from_op="Rename" from_port="example set output" to_port="output 1"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_example set" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="append" compatibility="8.1.003" expanded="true" height="82" name="Append" width="90" x="715" y="34"/>
      <operator activated="true" class="split" compatibility="8.1.003" expanded="true" height="82" name="Split (3)" width="90" x="849" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="People"/>
        <parameter key="split_pattern" value="\|"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Generate Attributes (2)" to_port="example set input"/>
      <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
      <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Loop Examples" to_port="example set"/>
      <connect from_op="Loop Examples" from_port="output 1" to_op="Append" to_port="example set 1"/>
      <connect from_op="Append" from_port="merged set" to_op="Split (3)" to_port="example set input"/>
      <connect from_op="Split (3)" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Edin_Klapic · May 2018

Hi @SylvainM,

Good to hear that you made it work

The key is that Remove Duplicates as well as Aggregate only work "vertically" and compare the full cell content.

You were on the right track by splitting the data - the only missing piece is to use Transpose to make a column out of the list of names.

Then you can easily apply Remove Duplicates and can reaggregate them afterwards.

The only drawback of this solution is that you need to use Loop Examples because this can take some time depending on the number of data rows.

Edit:

Do you know about the Breakpoints in RapidMiner? You can put one right after every Operator and inspect each intermediate result.

The easiest way to achieve this is to select the Debug Mode (Enable all Breakpoints) in the Edit menu of the toolbar.

You can also enable them manually by either rightclicking on the Operator and select Breakpoint after or leftclicking an Operator and press F7.

Happy Mining,

Edin

lionelderkrikor · May 2018

Hi @SylvainM,

To better understand can you share your dataset (or an extract of it) and

an example of what you want to get ?

Regards,

Lionel

SylvainM · May 2018

Thanks Lionel for you help!

You will find attached an Excel sheet and its csv version.

Sylvain

lionelderkrikor · May 2018

Hi @SylvainM,

to well understand, you want that :

1° example :

8

Body Like A Back Road

MCA Nashville

Sam Hunt

?

8.0

2017/02/18

2.0

2017/02/25

1.0

?

2017/02/02

Sam Hunt, Zach Crowell, Shane McAnally, Josh Osborne

Zach Crowell

become :

8

Body Like A Back Road

MCA Nashville

Sam Hunt

?

8.0

2017/02/18

2.0

2017/02/25

1.0

?

2017/02/02

Zach Crowell, Shane McAnally, Josh Osborne

Zach Crowell

and

2° example :

14

Coal Miner's Daughter

Columbia

Loretta Lynn, Sheryl Crow, Miranda Lambert

?

1.0

2010/12/04

55.0

2010/12/04

55.0

Coal Miner's Daughter: A Tribute to Loretta Lynn

2010/11/09

Loretta Lynn

?

become :

14

Coal Miner's Daughter

Columbia

Loretta Lynn, Sheryl Crow, Miranda Lambert

?

1.0

2010/12/04

55.0

2010/12/04

55.0

Coal Miner's Daughter: A Tribute to Loretta Lynn

2010/11/09

?

did I understand well ?

Regards,

Lionel

SylvainM · May 2018

Hi Lionel,

That's exactly the idea!

What I've done so far is that I've transformed my data as follows:

First example :

from

8

Body Like A Back Road

MCA Nashville

Sam Hunt

?

8.0

2017/02/18

2.0

2017/02/25

1.0

?

2017/02/02

Sam Hunt, Zach Crowell, Shane McAnally, Josh Osborne

Zach Crowell

to

8

Body Like A Back Road

2017

Sam Hunt

?

Sam Hunt

Zach Crowell

Shane McAnally

Josh Osborne

Zach Crowell

?

Or as a second example:

from

14

Coal Miner's Daughter

Columbia

Loretta Lynn, Sheryl Crow, Miranda Lambert

?

1.0

2010/12/04

55.0

2010/12/04

55.0

Coal Miner's Daughter: A Tribute to Loretta Lynn

2010/11/09

Loretta Lynn

?

to

14

Coal Miner's Daughter

2010

Loretta Lynn

Sheryl Crow

Miranda Lambert

Loretta Lynn

?

What I would like to automatically get is something like :

8

Body Like A Back Road

2017

Sam Hunt

Zach Crowell

Shane McAnally

Josh Osborne

and

14

Coal Miner's Daughter

2010

Loretta Lynn

Sheryl Crow

Miranda Lambert

I hope it's a little bit more clear...

Thank you so much for your help!

Sylvain

kayman · May 2018

Can't you use the aggregate attribute to do this? You just select the attributes you need and want to keep as they are, and anything that you would like to combine you group using the concatenate option. This gives you your unique values and everything else concatenated in one or more attributes.

In a second step you could split the concatenated values again if needed.

Or, you generate an ID, aggregate with the fields you need and add the 'first' ID, and then you can create joins using your dataset with limited ID's and your original dataset.

SylvainM · May 2018

Thanks Kayman for the answer, but unfortunately, it wouldn't work.

1. My number of rows is over 60.000,

2. I have to remove all the duplicates within each row,

3. But the attribution (column) of these duplicates vary from a row to another:

4. So I have to automaticaly apply "Remote Duplicates" for each row.

Issues:

1. I can't figure out how to do a "Remote Duplicates" iteration which would select a row after the other,

2. And the dimension of the result will not consistent from a row to the other (some may have only 2 names in it, others may have up to 10 names).

Partial solution:

1. I was therefore looking to get a collection of individual sets for each row that I would then Aggregate.

I hope I'm clear enough... Sorry if it's not the case!

But thanks for your help :-)

Sylvain

lionelderkrikor · May 2018

Hi @SylvainM,

Interesting but difficult task ......

My idea was to transpose your dataset and then use Remove Duplicates.

However after applying Remove Duplicates operator, there are still some duplicates and i don't know why... :

Has someone an idea ?

Regards,

Lionel

SylvainM · May 2018

Hi Lionel,

I'm stuck at the exact same place. I think it's comes from the fact that Remove Duplicates doesn't delete the redundant data itself, but the box in which it is. As the number of duplicated datas is not consistent from a song to another, it creates a "dimensionality" conflict between the rows (some rows may be 5 boxes long, others may be 9 boxes long...).

That's the reason why I was thinking about isolating each row from the original dataset (a sort of Split), apply Remove Duplicates to each of them, and then Aggregate the collection together. And that's what I cannot do... :-/

Thanks again,

Sylvain

SylvainM · May 2018

Thanks Edin!

I'm not super good with programing, so it will require me some time before being able to test it. I will confirm the problem Solved as soon as it works on my computer ;-)

Thanks a lot!

Sylvain

SylvainM · May 2018

It took me time, but it works! Now, I should carefully try to understand why :smileyvery-happy:

Thanks a lot, Edin :-)

Sylvain

SylvainM · May 2018

Great explanation and advice! You are answering a question I had yesterday in mind : Breakpoints are definitively something I will use a lot!

Thanks again for all your help!

Sylvain

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

How to apply "Remove Duplicates" on each attribute of an example set?

Best Answers

Answers