"how to save the clustered result into two folders"

huaiyanggongzi Member Posts: 39 Contributor II
edited June 2019 in Help
I have a set of documents stored in a single folder. I run an unsupervised clustering algorithm such as k-means to split them into two groups. Here is the process I created. Is there a way to split the original folder into two folders based on the clustering result? In other words, I want to put the files belonging to cluster 1 into one folder and the files belonging to cluster 2 into another.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
    <parameter key="parallelize_main_process" value="true"/>
    <process expanded="true" height="370" width="656">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.1.002" expanded="true" height="76" name="Process Documents from Files (2)" width="90" x="45" y="75">
        <list key="text_directories">
          <parameter key="NotResponsive" value="D:\User1\datamining\Data\training Sets"/>
        </list>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="vector_creation" value="Term Frequency"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="5"/>
        <parameter key="prune_above_absolute" value="5000000"/>
        <parameter key="parallelize_vector_creation" value="true"/>
        <process expanded="true" height="380" width="674">
          <operator activated="true" class="text:tokenize" compatibility="5.1.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.1.002" expanded="true" height="60" name="Transform Cases (2)" width="90" x="180" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="30"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.1.002" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="514" y="120"/>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="k_means" compatibility="5.1.011" expanded="true" height="76" name="Clustering" width="90" x="305" y="84"/>
      <connect from_op="Process Documents from Files (2)" from_port="example set" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    First, filter the clustered data set on the "cluster" attribute with Filter Examples. Then use Loop Values to loop over the "metadata_path" attribute. Loop Values creates an iteration macro that holds the current value, which in this case is the path of the document. You can use that macro as the "file" parameter of Move File. The second parameter of Move File, the target location, is up to you and should depend on the cluster value.

    Of course, instead of manually filtering each cluster value in the first step, you could use a second Loop Values to loop over the cluster values.

    Best,
    Marius
  • roya67 Member Posts: 10 Contributor II
    Hi, could you please explain this in more detail? I don't have the Move File operator. What is Loop Values? Thanks.
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    If you don't have the Move File operator, please update RapidMiner to the latest version (5.2.008). You'll find an explanation of Loop Values in this thread.

    Best, Marius
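
For readers who, like roya67, are unsure what the two loops do, here is a minimal sketch of the same logic in plain Python. It is not a RapidMiner process and does not use any RapidMiner API. It only mirrors the steps Marius describes: loop over the cluster values, keep the rows of the current cluster, then loop over the document paths and move each file. The function name `move_by_cluster`, the `(path, cluster)` row layout and the cluster labels `cluster_0` / `cluster_1` are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def move_by_cluster(rows, target_root):
    """Move each file into a sub-folder of target_root named after its cluster.

    rows: iterable of (metadata_path, cluster) pairs, for example exported
    from the clustered example set (hypothetical layout).
    Returns a dict mapping each cluster value to the list of new file paths.
    """
    rows = list(rows)  # the rows are scanned once per cluster
    target_root = Path(target_root)
    moved = {}
    # Outer loop: one pass per distinct cluster value
    # (the role of the second Loop Values).
    for cluster in sorted({c for _, c in rows}):
        dest_dir = target_root / cluster
        dest_dir.mkdir(parents=True, exist_ok=True)
        # Filter step: keep only the documents of the current cluster
        # (the role of Filter Examples).
        paths = [Path(p) for p, c in rows if c == cluster]
        # Inner loop: one iteration per document path
        # (the role of Loop Values over "metadata_path").
        for src in paths:
            dest = dest_dir / src.name
            shutil.move(str(src), str(dest))  # the role of Move File
            moved.setdefault(cluster, []).append(dest)
    return moved
```

Calling `move_by_cluster(rows, "D:/some/output/folder")` with one `(path, cluster)` pair per document would leave one sub-folder per cluster under the output folder. In RapidMiner itself, the current path comes from the iteration macro that Loop Values sets instead of a Python loop variable.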