The RapidMiner community is on read-only mode until further notice. Technical support via cases will continue to work as is. For any urgent licensing related requests from Students/Faculty members, please use the Altair academic forum here.

Easiest way to remove empty rows and attributes?

kaymankayman Member Posts: 662 Unicorn
edited December 2018 in Help

 

Hi,

What would be the easiest way to remove empty columns / rows in a dataset?

My data is imported from source data that I can't easily modify before load so I prefer to deal with it within rapidminer.

 

What I'm looking at is a simple way to remove both Col1 and Row 1 in below example, but in reality I can have more than a few empty rows and columns in much more heavy tables. I've tried to use the filter missing / no missing attributes but it basically removes everything as soon as it finds a missing value so not exactly what I need.

 

  Col1 Col2 Col3
1 ? ? ?
2 ? ? Some data
3 ? ? Some data
4 ? ? Some data
5 ? Some data ?

 

For now I have a fairly heavy workflow where I first create an id, then multiply my data and have one set where I replace all missing values with a 0, everything else with 1 and then use aggregates to sum and remove every row where sum = 0. Next I loop through all my attributes, do something similar and remove all columns where the aggregated sum is 0 again. 

 

It does the trick but seems a bit overkill, so I'm wondering if I'm missing some easy way to deal with this.

Best Answer

  • tftemmetftemme Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 164 RM Research
    Solution Accepted

    Hi @kayman, Hi @mschmitz,

     

    From my point of view this would be a handy addition to the Toolbox.

    I will have a look, if I find time for this.

     

    Best regards,
    Fabian

Answers

  • MaerkliMaerkli Member Posts: 84 Guru

    Hallo Kayman,

     

    Could Select Attributes be a solution?

    Maerkli

  • kaymankayman Member Posts: 662 Unicorn

    Hi Maerkly.

     

    I don't thinks so, the behaviour of the 'no_missing_values' seems to be that as soon as one value is missing you (rightfully) get a hit, so in this case also Col2 and Col3 would be impacted since they have at least one missing value. 

     

    I'd need kind of the opposite, as soon as there is a value it's ok, otherwise skip. Or something like 'if all values are missing' skip, otherwise keep.

  • MaerkliMaerkli Member Posts: 84 Guru

    I am back, Kayman. It is just a suggestion: have a look at RM Studio/Repository/Training Resources/ETL-Advanced/Aggregations/Hotel App Aggregations & Pivoting Solution. With Pivot, you get another perspective on your data and you can apply another treatment like Replace Missing Values. Keep us posted.

    Maerkli

  • Telcontar120Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    I think you may be making this more complicated than necessary.  There are two simple ETL operators that can solve these issues. Remove Useless Attributes will do the trick for you to get rid of attributes (columns) that are all missing (either nominal or numerical), and Filter Examples has a pre-defined condition in the dropdown for "no missing attributes" that should get rid of all your blank examples (rows).

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • kaymankayman Member Posts: 662 Unicorn

    Not really, as the missing attributes operator just looks if there is at least one missing value, where I need to have the option to see if ALL values are missing. I don't care for the occasional missing value, just if my whole example or column is missing. 

     

    So if I use the missing attributes operator and loop it through my collection, only the sets without any single missing attribute are returned, which is what the operator promised but not what I wanted to see...

  • Telcontar120Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    @kayman Got it, but I think it can still be done with the two operators I describe.  In Filter Examples you just have to set the conditions to each attribute to be "is missing" with "match all" logic and then invert the operator.  No looping is needed with either.  I think that the the attached process does exactly what you want to do (using an exampleset similar to your original sample) with just the two operators I referenced.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" breakpoints="after" class="operator_toolbox:create_exampleset" compatibility="1.5.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
    <parameter key="generator_type" value="comma_separated_text"/>
    <list key="function_descriptions"/>
    <list key="numeric_series_configuration"/>
    <list key="date_series_configuration"/>
    <list key="date_series_configuration (interval)"/>
    <parameter key="input_csv_text" value="Row,Att1,Att2,Att3&#10;1,4,7,&#10;2,6,,&#10;3,5,8,&#10;,,,&#10;5,2,,&#10;"/>
    </operator>
    <operator activated="true" breakpoints="after" class="remove_useless_attributes" compatibility="9.0.003" expanded="true" height="82" name="Remove Useless Attributes" width="90" x="246" y="34"/>
    <operator activated="true" breakpoints="after" class="filter_examples" compatibility="9.0.003" expanded="true" height="103" name="Filter Examples" width="90" x="380" y="34">
    <parameter key="invert_filter" value="true"/>
    <list key="filters_list">
    <parameter key="filters_entry_key" value="Row.is_missing."/>
    <parameter key="filters_entry_key" value="Att1.is_missing."/>
    <parameter key="filters_entry_key" value="Att2.is_missing."/>
    </list>
    </operator>
    <connect from_op="Create ExampleSet" from_port="output" to_op="Remove Useless Attributes" to_port="example set input"/>
    <connect from_op="Remove Useless Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Telcontar120Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    And if you don't want to declare a separate condition for each attribute (maybe you have a lot of attributes in your real dataset) then you can get around that pretty easily by using Generate Aggregation to count the nonmissings and then filter purely based on that, as in the attached process.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" breakpoints="after" class="operator_toolbox:create_exampleset" compatibility="1.5.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
    <parameter key="generator_type" value="comma_separated_text"/>
    <list key="function_descriptions"/>
    <list key="numeric_series_configuration"/>
    <list key="date_series_configuration"/>
    <list key="date_series_configuration (interval)"/>
    <parameter key="input_csv_text" value="Row,Att1,Att2,Att3,Att4&#10;1,4,7,,&quot;yes&quot;&#10;2,6,,,&quot;no&quot;&#10;3,5,8,,&quot;yes&quot;&#10;,,,,&#10;5,,2,,&quot;green&quot;&#10;"/>
    </operator>
    <operator activated="true" breakpoints="after" class="trim" compatibility="9.0.003" expanded="true" height="82" name="Trim" width="90" x="246" y="34"/>
    <operator activated="true" breakpoints="after" class="remove_useless_attributes" compatibility="9.0.003" expanded="true" height="82" name="Remove Useless Attributes" width="90" x="380" y="34"/>
    <operator activated="true" breakpoints="after" class="generate_aggregation" compatibility="9.0.003" expanded="true" height="82" name="Generate Aggregation" width="90" x="514" y="34">
    <parameter key="attribute_name" value="count"/>
    <parameter key="aggregation_function" value="count"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="9.0.003" expanded="true" height="103" name="Filter Examples" width="90" x="648" y="34">
    <list key="filters_list">
    <parameter key="filters_entry_key" value="count.gt.0"/>
    </list>
    </operator>
    <connect from_op="Create ExampleSet" from_port="output" to_op="Trim" to_port="example set input"/>
    <connect from_op="Trim" from_port="example set output" to_op="Remove Useless Attributes" to_port="example set input"/>
    <connect from_op="Remove Useless Attributes" from_port="example set output" to_op="Generate Aggregation" to_port="example set input"/>
    <connect from_op="Generate Aggregation" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • kaymankayman Member Posts: 662 Unicorn

    Yeah, but my data is unfortunatly not that simple :-)

    I do not know up front how many attributes I have, let alone the name, so I'm back to looping through my content again.

     

    Also, the remove useless atributes has the unwanted side effect that when my column actually contains too many times the same content it will also be removed.

    So if I have something like

     

    COL1,COL2,COL3

    ?,1,2

    ?,2,2

     

    both COL1 and COL3 will be removed. Which is correct behaviour for the operator but again a bit too greedy for my purpose.

    Seems afterall that for a simple problem it requires a complex solution... 

     

  • MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,525 RM Data Scientist

    Hi,

    i usually go for something like aggregate without missings - #examples and check if this is 0. If yes, i do data to weights and got something.

     

    Maybe this is not too straight foward and we should put something into toolbox for it? 

    BR,

    Martin

     

    CC: @tftemme

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • kaymankayman Member Posts: 662 Unicorn

    Hi Martin, that's my current strategy also (and even having some mongo in the middle :-))

    I was just wondering if there were easier ways, as this looks to me like quite a common scenario.

     

    I can help myself, but if you can add it in the toolbox it would be handy indeed.

  • kaymankayman Member Posts: 662 Unicorn
    Works like a dream, thanks a lot !
Sign In or Register to comment.