"how to save the clustered result into two folders"

huaiyanggongzi · October 2012

I have a set of documents stored in a single folder. I run an unsupervised clustering algorithm such as K-means to build two groups; the workflow I created is below. Is there a way to split the original folder into two folders based on the clustering result? In other words, I want the files belonging to cluster 1 in one folder and the files belonging to cluster 2 in another.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
    <parameter key="parallelize_main_process" value="true"/>
    <process expanded="true" height="370" width="656">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.1.002" expanded="true" height="76" name="Process Documents from Files (2)" width="90" x="45" y="75">
        <list key="text_directories">
          <parameter key="NotResponsive" value="D:\User1\datamining\Data\training Sets"/>
        </list>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="vector_creation" value="Term Frequency"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="5"/>
        <parameter key="prune_above_absolute" value="5000000"/>
        <parameter key="parallelize_vector_creation" value="true"/>
        <process expanded="true" height="380" width="674">
          <operator activated="true" class="text:tokenize" compatibility="5.1.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.002" expanded="true" height="60" name="Transform Cases (2)" width="90" x="180" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="30"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.1.002" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="514" y="120"/>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="k_means" compatibility="5.1.011" expanded="true" height="76" name="Clustering" width="90" x="305" y="84"/>
      <connect from_op="Process Documents from Files (2)" from_port="example set" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

MariusHelf · October 2012

Hi,

First, filter the clustered example set by the "cluster" attribute with Filter Examples. Then use "Loop Values" to loop over the "metadata_path" attribute. Loop Values sets an iteration macro to the current value, in this case the path of the document, and you can use that macro as the "file" parameter of Move File. The second parameter of Move File, i.e. where the file is moved to, is up to you and depends on the cluster value.

Of course, instead of manually filtering each cluster value in the first step, you could use a second Loop Values to loop over the cluster values.

Best,
Marius
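
For readers who want to see the logic of the steps above outside RapidMiner, here is a minimal Python sketch. It is not a RapidMiner process, and it rests on assumptions that are not in the thread: the clustered example set has been exported to a CSV file, that file has one column holding the full document path and one holding the cluster label (the column names "metadata_path" and "cluster" mirror the attribute names mentioned above), and the function name move_by_cluster is made up for illustration.

```python
import csv
import shutil
from pathlib import Path


def move_by_cluster(csv_path, output_root,
                    path_column="metadata_path", cluster_column="cluster"):
    """Move every file listed in the CSV into a subfolder named after its cluster.

    Assumption: each row holds the full path of one document and its cluster
    label. The column names are parameters because an export may name them
    differently.
    """
    output_root = Path(output_root)
    moved = {}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            source = Path(row[path_column])
            label = row[cluster_column]
            # One folder per cluster value, created on first use.
            target_dir = output_root / label
            target_dir.mkdir(parents=True, exist_ok=True)
            target = target_dir / source.name
            shutil.move(str(source), str(target))
            moved.setdefault(label, []).append(target)
    return moved
```

Calling move_by_cluster with the exported CSV and an output folder of your choice creates one subfolder per distinct cluster value and returns a dictionary mapping each cluster label to the new file locations. Because the files are moved rather than copied, it is worth trying it on a copy of the folder first.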

roya67 · November 2012

Hi, could you please explain this in more detail? I don't have the Move File operator. And what is Loop Values? Thanks.

MariusHelf · November 2012

Hi,

If you don't have the Move File operator, please update RapidMiner to the latest version (5.2.008). You'll find an explanation of Loop Values in this thread.

Best, Marius
