Adding a column that performs a distinct count

darrenvermaakdarrenvermaak Member Posts: 4 Contributor I
edited November 2018 in Help

Hi all,

Very new to Rapidminer!

I am needing to add a column into my table that runs a distinct count.

This is a sample of my table currently:

 

CaseNumber     Date
1A 1/1/2015
1A 1/2/2015
1A 1/2/2015
2A 2/1/2015
3A 3/1/2015
2A 4/1/2015

 

The Distinct Count would perform a distinct count on Case Number so:

CaseNumber  Date DistinctCount
1A  1/1/2015    1
1A  1/2/2015  0
1A  1/2/2015  0
2A  2/1/2015  1
2A  3/1/2015  0
3A  4/1/2015  1

 

Basically I just want to count how many unique case numbers there are.

So casenumber 1A occurs 3 times, but it's still just the 1 case number.

Same thing for casenumber 2A, it occurs twice, but still just the 1 case number.

Answers

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Hi,


    That's pretty simple to do if you use the Aggregate Operator and select "Only Distinct"

     

    Here's a sample process attached.

    <?xml version="1.0" encoding="UTF-8"?><process version="7.2.002">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.2.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.2.002" expanded="true" height="68" name="Retrieve Products" width="90" x="246" y="85">
    <parameter key="repository_entry" value="//Samples/data/Products"/>
    </operator>
    <operator activated="true" class="aggregate" compatibility="7.2.002" expanded="true" height="82" name="Aggregate" width="90" x="447" y="85">
    <list key="aggregation_attributes">
    <parameter key="Product Name" value="count"/>
    </list>
    <parameter key="group_by_attributes" value="Product Name"/>
    <parameter key="only_distinct" value="true"/>
    </operator>
    <connect from_op="Retrieve Products" from_port="output" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
  • darrenvermaakdarrenvermaak Member Posts: 4 Contributor I

    Hi Thomas,

    Thanks for the reply.

    I might missing something here, but I don't want just the distinct count. I want to keep all my existing data and add a new column that performs a distinct count on my CaseNumber field.

    When I run the Aggregate Operator and select "Only Distinct", I'm give just the distinct count and nothing else.

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Why not just use the "Remove Duplicate" operator?  Then your dataset will only contain the non-duplicated entries, and the total record count will equal the distinct record count.   You can specify the fields that define a unique record in that operator, so you can use both case number and date or any other combination of attributes.

     

    Best,

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • darrenvermaakdarrenvermaak Member Posts: 4 Contributor I

    There are a number of additional fields in this dataset that I need. 

    In this dataset, "casenumber" refers to a specific survey of about 50 questions, therefore there will be duplicate casenumbers because there are a number of different questions for each casenumber.

    Unfortunatley removing duplicates is not an option.

  • darrenvermaakdarrenvermaak Member Posts: 4 Contributor I

    I found a way to do this:

     

    Capture.PNG

     

Sign In or Register to comment.