Options

How to apply "Remove Duplicates" on each attribute of an example set?

SylvainMSylvainM Member Posts: 18 Maven

Hello everyone,

 

I'm a total beginner with Rapid Miner, and I apologize if my question is an obvious one or if it has been posted under another title. I work in Humanities, so I'm usually not very familiar with computing. However, Rapid Miner has given me amazing results so far, and I'd like to dig further.

 

I have a dateset of songs in which, for each Title, I have all the Artists involved in an column, all the Songwriters in a second column, and all the Producers in a third. I have been able to split these columns in order to have only a single name per box (Artist_1, Artist_2, Songwriter_1,...). The issue I have is that, for each song, an artist can also be the songwriter and/or the producer. I would like to get the name of each person involved only once.

 

As I have 60.000 songs, I first thought to do a Loop to automatically select one song after the other and apply "Remove Duplicates" for each of them. I would have expected a collection of individual sets, i.e. each song with only each name once. But I can't figure out how to automatically select only one song after another... 

 

I hope my description is clear enough.

 

Thank you for your help :-)

Best Answers

  • Options
    Edin_KlapicEdin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist
    Solution Accepted

    Hi @SylvainM,

     

    Done :)

     

    Happy Mining,

    Edin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="8.1.003" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
    <parameter key="csv_file" value="C:\Users\EdinKlapic\Desktop\Song Data-Tableau 1.csv"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="encoding" value="windows-1252"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Title.true.polynominal.attribute"/>
    <parameter key="1" value="Imprint.true.polynominal.attribute"/>
    <parameter key="2" value="Artist.true.polynominal.attribute"/>
    <parameter key="3" value="Label.true.polynominal.attribute"/>
    <parameter key="4" value="Chart Weeks.true.integer.attribute"/>
    <parameter key="5" value="Debut.true.polynominal.attribute"/>
    <parameter key="6" value="Debut Pos.true.integer.attribute"/>
    <parameter key="7" value="Peak.true.polynominal.attribute"/>
    <parameter key="8" value="Peak Pos.true.integer.attribute"/>
    <parameter key="9" value="Album title.true.polynominal.attribute"/>
    <parameter key="10" value="Album release date.true.polynominal.attribute"/>
    <parameter key="11" value="Songwriters.true.polynominal.attribute"/>
    <parameter key="12" value="Producers.true.polynominal.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="8.1.003" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="34">
    <list key="function_descriptions">
    <parameter key="Year" value="prefix([Album release date],4)"/>
    </list>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="8.1.003" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="313" y="34">
    <list key="function_descriptions">
    <parameter key="People" value="concat(Artist,&quot;,&quot;,Songwriters)"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.1.003" expanded="true" height="82" name="Select Attributes (2)" width="90" x="447" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Title|Year|People"/>
    </operator>
    <operator activated="true" class="loop_examples" compatibility="8.1.003" expanded="true" height="103" name="Loop Examples" width="90" x="581" y="34">
    <process expanded="true">
    <operator activated="true" class="filter_example_range" compatibility="8.1.003" expanded="true" height="82" name="Filter Example Range" width="90" x="45" y="34">
    <parameter key="first_example" value="%{example}"/>
    <parameter key="last_example" value="%{example}"/>
    </operator>
    <operator activated="true" class="work_on_subset" compatibility="8.1.003" expanded="true" height="82" name="Work on Subset" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="People"/>
    <process expanded="true">
    <operator activated="true" class="split" compatibility="8.1.003" expanded="true" height="82" name="Split (2)" width="90" x="45" y="34"/>
    <operator activated="true" class="trim" compatibility="8.1.003" expanded="true" height="82" name="Trim (2)" width="90" x="179" y="34"/>
    <operator activated="true" class="transpose" compatibility="8.1.003" expanded="true" height="82" name="Transpose (2)" width="90" x="313" y="34"/>
    <operator activated="true" class="select_attributes" compatibility="8.1.003" expanded="true" height="82" name="Select Attributes (3)" width="90" x="45" y="187">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="id"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <operator activated="true" class="remove_duplicates" compatibility="8.1.003" expanded="true" height="103" name="Remove Duplicates (2)" width="90" x="179" y="187"/>
    <operator activated="true" class="aggregate" compatibility="8.1.003" expanded="true" height="82" name="Aggregate (2)" width="90" x="313" y="187">
    <list key="aggregation_attributes">
    <parameter key="att_1" value="concatenation"/>
    </list>
    <parameter key="only_distinct" value="true"/>
    </operator>
    <connect from_port="exampleSet" to_op="Split (2)" to_port="example set input"/>
    <connect from_op="Split (2)" from_port="example set output" to_op="Trim (2)" to_port="example set input"/>
    <connect from_op="Trim (2)" from_port="example set output" to_op="Transpose (2)" to_port="example set input"/>
    <connect from_op="Transpose (2)" from_port="example set output" to_op="Select Attributes (3)" to_port="example set input"/>
    <connect from_op="Select Attributes (3)" from_port="example set output" to_op="Remove Duplicates (2)" to_port="example set input"/>
    <connect from_op="Remove Duplicates (2)" from_port="example set output" to_op="Aggregate (2)" to_port="example set input"/>
    <connect from_op="Aggregate (2)" from_port="example set output" to_port="example set"/>
    <portSpacing port="source_exampleSet" spacing="0"/>
    <portSpacing port="sink_example set" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="rename" compatibility="8.1.003" expanded="true" height="82" name="Rename" width="90" x="313" y="34">
    <parameter key="old_name" value="concat(att_1)"/>
    <parameter key="new_name" value="People"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <connect from_port="example set" to_op="Filter Example Range" to_port="example set input"/>
    <connect from_op="Filter Example Range" from_port="example set output" to_op="Work on Subset" to_port="example set"/>
    <connect from_op="Work on Subset" from_port="example set" to_op="Rename" to_port="example set input"/>
    <connect from_op="Rename" from_port="example set output" to_port="output 1"/>
    <portSpacing port="source_example set" spacing="0"/>
    <portSpacing port="sink_example set" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="append" compatibility="8.1.003" expanded="true" height="82" name="Append" width="90" x="715" y="34"/>
    <operator activated="true" class="split" compatibility="8.1.003" expanded="true" height="82" name="Split (3)" width="90" x="849" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="People"/>
    <parameter key="split_pattern" value="\|"/>
    </operator>
    <connect from_op="Read CSV" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Generate Attributes (2)" to_port="example set input"/>
    <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
    <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Loop Examples" to_port="example set"/>
    <connect from_op="Loop Examples" from_port="output 1" to_op="Append" to_port="example set 1"/>
    <connect from_op="Append" from_port="merged set" to_op="Split (3)" to_port="example set input"/>
    <connect from_op="Split (3)" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
  • Options
    Edin_KlapicEdin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist
    Solution Accepted

    Hi @SylvainM,

     

    Good to hear that you made it work :)

     

    The key is that Remove Duplicates as well as Aggregate only work "vertically" and compare the full cell content.

    You were on the right track by splitting the data - the only missing piece is to use Transpose to make a column out of the list of names.

    Then you can easily apply Remove Duplicates and can reaggregate them afterwards.

    The only drawback of this solution is that you need to use Loop Examples because this can take some time depending on the number of data rows.

     

    Edit:

    Do you know about the Breakpoints in RapidMiner? You can put one right after every Operator and inspect each intermediate result.

    The easiest way to achieve this is to select the Debug Mode (Enable all Breakpoints) in the Edit menu of the toolbar.

    You can also enable them manually by either rightclicking on the Operator and select Breakpoint after or leftclicking an Operator and press F7.

     

    Happy Mining,

    Edin

Answers

  • Options
    lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @SylvainM,

     

    To better understand can you share your dataset (or an extract of it) and 

    an example of what you want to get ?

     

    Regards,

     

    Lionel

  • Options
    SylvainMSylvainM Member Posts: 18 Maven

    Thanks Lionel for you help!

     

    You will find attached an Excel sheet and its csv version. 

    Sylvain

  • Options
    lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @SylvainM,

     

    to well understand, you want that : 

     

    1° example : 

    8 Body Like A Back Road MCA Nashville Sam Hunt ? 8.0 2017/02/18 2.0 2017/02/25 1.0 ? 2017/02/02 Sam Hunt, Zach Crowell, Shane McAnally, Josh Osborne Zach Crowell

     

    become : 

     

    8 Body Like A Back Road MCA Nashville Sam Hunt ? 8.0 2017/02/18 2.0 2017/02/25 1.0 ? 2017/02/02  Zach Crowell, Shane McAnally, Josh Osborne Zach Crowell

     

    and 

    2° example :

     

    14 Coal Miner's Daughter Columbia Loretta Lynn, Sheryl Crow, Miranda Lambert ? 1.0 2010/12/04 55.0 2010/12/04 55.0 Coal Miner's Daughter: A Tribute to Loretta Lynn 2010/11/09 Loretta Lynn ?

     

    become : 

     

    14 Coal Miner's Daughter Columbia Loretta Lynn, Sheryl Crow, Miranda Lambert ? 1.0 2010/12/04 55.0 2010/12/04 55.0 Coal Miner's Daughter: A Tribute to Loretta Lynn 2010/11/09 ? ?

     

    did I understand well ?

     

    Regards,

     

     

    Lionel

  • Options
    SylvainMSylvainM Member Posts: 18 Maven

    Hi Lionel,

     

    That's exactly the idea!

     

    What I've done so far is that I've transformed my data as follows:

     

    First example : 

    from

    8 Body Like A Back Road MCA Nashville Sam Hunt ? 8.0 2017/02/18 2.0 2017/02/25 1.0 ? 2017/02/02 Sam Hunt, Zach Crowell, Shane McAnally, Josh Osborne Zach Crowell

     

    to 

    8 Body Like A Back Road 2017 Sam Hunt ? ? Sam Hunt

    Zach Crowell

    Shane McAnally

    Josh Osborne

    Zach Crowell

    ? ? ?

     

    Or as a second example:

     

    from

    14 Coal Miner's Daughter Columbia Loretta Lynn, Sheryl Crow, Miranda Lambert ? 1.0 2010/12/04 55.0 2010/12/04 55.0 Coal Miner's Daughter: A Tribute to Loretta Lynn 2010/11/09 Loretta Lynn ?

     

    to 

    14 Coal Miner's Daughter 2010 Loretta Lynn Sheryl Crow Miranda Lambert Loretta Lynn ? ? ? ? ? ? ?

     

     

    What I would like to automatically get is something like :

     

    8 Body Like A Back Road 2017 Sam Hunt

    Zach Crowell

    Shane McAnally

    Josh Osborne

     

    and

     

    14 Coal Miner's Daughter 2010 Loretta Lynn Sheryl Crow Miranda Lambert

     

     

    I hope it's a little bit more clear...

     

    Thank you so much for your help!

    Sylvain

  • Options
    kaymankayman Member Posts: 662 Unicorn

    Can't you use the aggregate attribute to do this? You just select the attributes you need and want to keep as they are, and anything that you would like to combine you group using the concatenate option. This gives you your unique values and everything else concatenated in one or more attributes.

    In a second step you could split the concatenated values again if needed.

     

    Or, you generate an ID, aggregate with the fields you need and add the 'first' ID, and then you can create joins using your dataset with limited ID's and your original dataset.

  • Options
    SylvainMSylvainM Member Posts: 18 Maven

    Thanks Kayman for the answer, but unfortunately, it wouldn't work. 

     

    1. My number of rows is over 60.000,

    2. I have to remove all the duplicates within each row,

    3. But the attribution (column) of these duplicates vary from a row to another:

    4. So I have to automaticaly apply "Remote Duplicates" for each row.

     

    Issues: 

    1. I can't figure out how to do a "Remote Duplicates" iteration which would select a row after the other,

    2. And the dimension of the result will not consistent from a row to the other (some may have only 2 names in it, others may have up to 10 names).

     

    Partial solution:

    1. I was therefore looking to get a collection of individual sets for each row that I would then Aggregate.

     

    I hope I'm clear enough... Sorry if it's not the case!

     

    But thanks for your help :-)

    Sylvain

  • Options
    lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @SylvainM,

     

    Interesting but difficult task ......

     

    My idea was to transpose your dataset and then use Remove Duplicates.

    However after applying Remove Duplicates operator, there are still some duplicates and i don't know why... : 

    Duplicate_song.png

     

    Has someone an idea ?

     

    Regards,

     

    Lionel

     

     

     

  • Options
    SylvainMSylvainM Member Posts: 18 Maven

    Hi Lionel,

     

    I'm stuck at the exact same place. I think it's comes from the fact that Remove Duplicates doesn't delete the redundant data itself, but the box in which it is. As the number of duplicated datas is not consistent from a song to another, it creates a "dimensionality" conflict between the rows (some rows may be 5 boxes long, others may be 9 boxes long...). 

    That's the reason why I was thinking about isolating each row from the original dataset (a sort of Split), apply Remove Duplicates to each of them, and then Aggregate the collection together. And that's what I cannot do... :-/ 

     

    Thanks again,

    Sylvain 

  • Options
    SylvainMSylvainM Member Posts: 18 Maven

    Thanks Edin!

     

    I'm not super good with programing, so it will require me some time before being able to test it. I will confirm the problem Solved as soon as it works on my computer ;-) 

     

    Thanks a lot!

    Sylvain

  • Options
    SylvainMSylvainM Member Posts: 18 Maven

    It took me time, but it works! Now, I should carefully try to understand why :smileyvery-happy:

     

    Thanks a lot, Edin :-) 

    Sylvain

  • Options
    SylvainMSylvainM Member Posts: 18 Maven

    Great explanation and advice! You are answering a question I had yesterday in mind : Breakpoints are definitively something I will use a lot!

     

    Thanks again for all your help!

    Sylvain

Sign In or Register to comment.