
deleting files based on file size

Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
edited December 2018 in Help

I'd like to get RapidMiner to look in a specific directory and automatically delete files based on their file size (i.e., delete files below a minimum size), but I don't see any options around checking file size in the Loop File or related operators.  I also see there is a Delete File operator but it seems to require you to point to a specific file by name.

Is this functionality to use file size present elsewhere, or am I missing some other way of handling it?  Or is this not an option within RapidMiner?  Thanks!

 

Brian T.
Lindon Ventures 
Data Science Consulting from Certified RapidMiner Experts

Best Answer

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Solution Accepted

    Hi @Telcontar120,

     

    I think it will be very difficult to perform this task with RapidMiner's native operators.

    So I propose once again a solution with a Python script (and a Loop Files operator).

    To run this process, you have to set:

     - the minimum file size (in bytes) in the Set Macros operator; files smaller than this value are deleted.

    Remove_files.png

     - the path where your files are stored, in the Loop Files operator parameters.
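
For readers who want to try the idea outside RapidMiner first, here is a minimal standalone sketch of the same logic: scan one directory and remove regular files below a size threshold given in bytes. The function name `delete_small_files` and the `dry_run` flag are hypothetical names chosen for illustration; they are not part of the attached process.

```python
import os
import tempfile


def delete_small_files(directory, min_size_bytes, dry_run=False):
    """Delete regular files in `directory` that are smaller than `min_size_bytes`.

    Returns the sorted names of the files that were deleted
    (or that would be deleted when dry_run is True).
    """
    candidates = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Only plain files are considered; subdirectories and symlinks are skipped.
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_size < min_size_bytes:
                candidates.append(entry.path)

    # Delete after the scan has finished so the listing is not modified mid-iteration.
    if not dry_run:
        for path in candidates:
            os.remove(path)
    return sorted(os.path.basename(path) for path in candidates)


if __name__ == "__main__":
    # Demo on a throwaway directory so no real data is touched.
    with tempfile.TemporaryDirectory() as tmp:
        for name, size in [("small.csv", 10), ("big.csv", 2000)]:
            with open(os.path.join(tmp, name), "wb") as fh:
                fh.write(b"x" * size)
        print(delete_small_files(tmp, 1000))  # prints ['small.csv']
        print(sorted(os.listdir(tmp)))        # prints ['big.csv']
```

Running it once with `dry_run=True` is a cheap way to check which files would go before anything is actually removed.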

     

    I hope it helps,

     

    Regards and happy deleting

     

    Lionel

Answers

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn

    Hello @Telcontar120,

     

    Is your machine running Linux or UNIX, or does it otherwise have access to findutils? If so, you can execute this command:

     

    find /home/telcontar120/.RapidMiner/path/for/your/data -type f -size +900k -size -1000k -iname "*.csv"

     

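    Here -type f restricts the search to regular files, -size +900k matches files larger than 900k, -size -1000k matches files smaller than 1000k, and -iname "*.csv" matches names ending in .csv regardless of case. As written, the command only lists the matching files; with GNU find, appending -delete removes them.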
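
On a machine without findutils, roughly the same filter can be sketched in Python. This is an illustration under stated assumptions, not an exact port: the function name `find_files_by_size` is hypothetical, and sizes are compared in exact bytes, whereas find rounds sizes up to whole units, so results can differ right at the boundaries.

```python
import fnmatch
import os


def find_files_by_size(root, min_bytes, max_bytes, pattern="*.csv"):
    """Return sorted paths under `root` whose size lies strictly between
    `min_bytes` and `max_bytes` and whose name matches `pattern`,
    ignoring case (similar in spirit to find's -iname)."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Lower-case both sides to get a case-insensitive match on every platform.
            if not fnmatch.fnmatchcase(name.lower(), pattern.lower()):
                continue
            path = os.path.join(dirpath, name)
            # Keep regular files only, like -type f (symlinks are skipped).
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if min_bytes < os.path.getsize(path) < max_bytes:
                matches.append(path)
    return sorted(matches)
```

For example, `find_files_by_size(some_dir, 900 * 1024, 1000 * 1024)` lists the CSV files between 900 KiB and 1000 KiB under `some_dir`; each returned path could then be passed to `os.remove` to delete it.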

     

    Hope this helps.

     

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi again @Telcontar120,

     

    I forgot to attach the process in my last post:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="set_macros" compatibility="9.0.000-BETA" expanded="true" height="82" name="Set min size" width="90" x="179" y="34">
    <list key="macros">
    <parameter key="minSize" value="1000000"/>
    </list>
    </operator>
    <operator activated="true" class="concurrency:loop_files" compatibility="9.0.000-BETA" expanded="true" height="103" name="Loop Files" width="90" x="313" y="34">
    <parameter key="directory" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Remove_files"/>
    <parameter key="filter_type" value="regex"/>
    <parameter key="enable_macros" value="true"/>
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="9.0.000-BETA" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="179" y="34">
    <list key="attribute_values">
    <parameter key="fold" value="%{folder_name}"/>
    <parameter key="file" value="%{file_name}"/>
    <parameter key="ext" value="%{file_type}"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_concatenation" compatibility="9.0.000-BETA" expanded="true" height="82" name="Generate Concatenation" width="90" x="313" y="34">
    <parameter key="first_attribute" value="fold"/>
    <parameter key="second_attribute" value="file"/>
    <parameter key="separator" value="/"/>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="103" name="Execute Python" width="90" x="648" y="34">
    <parameter key="script" value="import os&#10;&#10;minimumSize = %{minSize}&#10;&#10;def rm_main(data):&#10;&#10;    path = data.iloc[0,3]&#10;&#10;    if os.path.getsize(path) &lt; minimumSize:&#10;        os.remove(path)&#10;    return data"/>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Generate Concatenation" to_port="example set input"/>
    <connect from_op="Generate Concatenation" from_port="example set output" to_op="Execute Python" to_port="input 1"/>
    <connect from_op="Execute Python" from_port="output 1" to_port="output 1"/>
    <connect from_op="Execute Python" from_port="output 2" to_port="output 2"/>
    <portSpacing port="source_file object" spacing="0"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    <portSpacing port="sink_output 3" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Set min size" from_port="through 1" to_op="Loop Files" to_port="input 1"/>
    <connect from_op="Loop Files" from_port="output 1" to_port="result 1"/>
    <connect from_op="Loop Files" from_port="output 2" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

    Regards,

     

    Lionel

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    @rfuentealba unfortunately no Linux here, just a simple Windows machine.

    @lionelderkrikor thanks for the Python script, that should do the trick!

    Too bad there is no native RapidMiner operator for handling this.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    P.S.  I've added this as a new product idea in that forum, so if you think being able to deal with file size inside RapidMiner natively would be a helpful feature, please go over there and vote for that idea!

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Yep, I've already opened it for voting and created an internal ticket for the dev team. Thanks @Telcontar120 for the suggestion!

     

    Scott

     
