deleting files based on file size

Telcontar120 · July 2018

I'd like to get RapidMiner to look in a specific directory and automatically delete files based on their file size (i.e., delete files below a minimum size), but I don't see any options around checking file size in the Loop File or related operators. I also see there is a Delete File operator but it seems to require you to point to a specific file by name.

Is this functionality to use file size present elsewhere, or am I missing some other way of handling it? Or is this not an option within RapidMiner? Thanks!

lionelderkrikor · July 2018

Hi @Telcontar120,

I think this will be very difficult to do with RapidMiner's native operators.

So once again I propose a solution using a Python script (inside a Loop Files operator).

To run the process, you need to set:

- the minimum file size (in bytes) in the Set Macros operator; files smaller than this will be deleted.

- the path to the directory containing your files, in the Loop Files operator parameters.

I hope it helps,

Regards and happy deleting

Lionel

rfuentealba · July 2018

Hello @Telcontar120,

Is your machine running Linux or UNIX, or does it otherwise have access to findutils? If so, you can execute this command:

find /home/telcontar120/.RapidMiner/path/for/your/data -type f -size +900k -size -1000k -iname "*.csv"

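Here -type f restricts the search to regular files, -size +900k matches files larger than 900k, -size -1000k matches files smaller than 1000k, and -iname "*.csv" matches the file name case-insensitively. As written, the command only lists the matching files; with GNU find, appending -delete removes them.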
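
On a machine without findutils, the same filter can be approximated in Python. This is only a minimal sketch: the function name `find_by_size` and its parameters are made up for illustration, sizes are compared in plain bytes (find's `k` unit rounds sizes up to whole blocks, so edge cases may differ), and it only lists matches without deleting anything.

```python
import fnmatch
import os


def find_by_size(root, min_bytes, max_bytes, pattern="*.csv"):
    """Return paths of regular files under root whose size is strictly
    between min_bytes and max_bytes and whose name matches pattern,
    ignoring case (hypothetical helper, loosely mirroring the find call)."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Lower-case both sides so the match ignores case on every OS.
            if not fnmatch.fnmatchcase(name.lower(), pattern.lower()):
                continue
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue
            size = os.path.getsize(path)
            if min_bytes < size < max_bytes:
                matches.append(path)
    return sorted(matches)
```

For example, `find_by_size(r"C:\some\folder", 900 * 1024, 1000 * 1024)` would list the CSV files between roughly 900 KiB and 1000 KiB below that (placeholder) folder.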

Hope this helps.

lionelderkrikor · July 2018

Hi again @Telcontar120,

I forgot to attach the process in my last post:

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="set_macros" compatibility="9.0.000-BETA" expanded="true" height="82" name="Set min size" width="90" x="179" y="34">
        <list key="macros">
          <parameter key="minSize" value="1000000"/>
        </list>
      </operator>
      <operator activated="true" class="concurrency:loop_files" compatibility="9.0.000-BETA" expanded="true" height="103" name="Loop Files" width="90" x="313" y="34">
        <parameter key="directory" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Remove_files"/>
        <parameter key="filter_type" value="regex"/>
        <parameter key="enable_macros" value="true"/>
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="9.0.000-BETA" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="179" y="34">
            <list key="attribute_values">
              <parameter key="fold" value="%{folder_name}"/>
              <parameter key="file" value="%{file_name}"/>
              <parameter key="ext" value="%{file_type}"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_concatenation" compatibility="9.0.000-BETA" expanded="true" height="82" name="Generate Concatenation" width="90" x="313" y="34">
            <parameter key="first_attribute" value="fold"/>
            <parameter key="second_attribute" value="file"/>
            <parameter key="separator" value="/"/>
          </operator>
          <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="103" name="Execute Python" width="90" x="648" y="34">
            <parameter key="script" value="import os&#10;&#10;minimumSize = %{minSize}&#10;&#10;def rm_main(data):&#10;&#10;  path = data.iloc[0,3]&#10;&#10;  if os.path.getsize(path) &lt; minimumSize:&#10;    os.remove(path)&#10;  return data"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Generate Concatenation" to_port="example set input"/>
          <connect from_op="Generate Concatenation" from_port="example set output" to_op="Execute Python" to_port="input 1"/>
          <connect from_op="Execute Python" from_port="output 1" to_port="output 1"/>
          <connect from_op="Execute Python" from_port="output 2" to_port="output 2"/>
          <portSpacing port="source_file object" spacing="0"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <portSpacing port="sink_output 3" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Set min size" from_port="through 1" to_op="Loop Files" to_port="input 1"/>
      <connect from_op="Loop Files" from_port="output 1" to_port="result 1"/>
      <connect from_op="Loop Files" from_port="output 2" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
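
For anyone who wants to try the idea outside RapidMiner first, the core of the embedded script can be sketched as a standalone Python function. This is an illustration under assumptions, not the process itself: the name `delete_small_files` and the `dry_run` switch are hypothetical additions, and only files directly inside the given directory are considered.

```python
import os


def delete_small_files(directory, min_size, dry_run=True):
    """Find regular files directly inside directory that are smaller than
    min_size bytes and return their paths, sorted.
    With dry_run=True (the default) nothing is removed."""
    affected = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip subdirectories and anything that is not a regular file.
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_size < min_size:
                affected.append(entry.path)
    if not dry_run:
        # Same test as in the process: size < minimum, then os.remove.
        for path in affected:
            os.remove(path)
    return sorted(affected)
```

Calling it once with the default `dry_run=True` shows which files would go; calling it again with `dry_run=False` actually removes them.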

Regards,

Lionel

Telcontar120 · July 2018

@rfuentealba unfortunately no Linux here, just a simple Windows machine.

@lionelderkrikor thanks for the Python script, that should do the trick!

Too bad there is no native RapidMiner operator for handling this.

Telcontar120 · July 2018

P.S. I've added this as a new product idea in that forum, so if you think being able to deal with file size inside RapidMiner natively would be a helpful feature, please go over there and vote for that idea!

sgenzer · July 2018

Yep, I've already opened it for voting and created an internal ticket for the dev team. Thanks @Telcontar120 for the suggestion!

Scott
