RapidMiner

K-means clustering over 8000 text files

Regular Contributor

K-means clustering over 8000 text files

Hi, I'm new to this platform. I want to use k-means to cluster 8000 text files that contain the tags of 8000 images. Is this possible with RapidMiner or not? And if it's possible, what values of k and max runs should I choose?

 

Regards


22 REPLIES
Community Manager

Re: K-means clustering over 8000 text files

[ Edited ]

Yes, you can do that with RapidMiner. But just to be sure: the texts don't contain actual images, like JPGs or PNGs, do they? If you want to do image mining, you have to install the Image Mining extension.

 

With respect to the optimal number of clusters: I usually use X-Means to figure that out automatically.
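X-Means grows k automatically, starting from a minimum k and splitting clusters while a quality criterion (BIC) keeps improving. Outside RapidMiner you can approximate the same idea by scanning candidate k values and keeping the one with the best silhouette score. A self-contained sketch in plain Python on synthetic blob data (an illustration of the idea, not RapidMiner's implementation):

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, restarts=10, max_iters=100, seed=0):
    """Plain Lloyd's k-means; keeps the labels of the best of several random restarts."""
    rng = random.Random(seed)
    best_sse, best_labels = None, None
    for _ in range(restarts):
        centroids = rng.sample(points, k)
        labels = [0] * len(points)
        for _ in range(max_iters):
            labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
            new = []
            for c in range(k):
                members = [p for p, l in zip(points, labels) if l == c]
                new.append(tuple(sum(xs) / len(members) for xs in zip(*members))
                           if members else centroids[c])
            if new == centroids:   # converged to a fixed point
                break
            centroids = new
        sse = sum(dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
        if best_sse is None or sse < best_sse:
            best_sse, best_labels = sse, labels
    return best_labels

def silhouette(points, labels):
    """Mean silhouette coefficient; higher means better-separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # silhouette of a singleton is defined as 0
        a = sum(dist(p, q) for q in own if q is not p) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for lo, other in clusters.items() if lo != l)
        total += (b - a) / max(a, b)
    return total / len(points)

# Three well-separated synthetic blobs; scan k and keep the best silhouette.
rng = random.Random(1)
data = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
        for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(15)]
scores = {k: silhouette(data, kmeans(data, k, seed=k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k)  # expect 3 for these blobs
```

X-Means uses BIC rather than silhouette and splits clusters incrementally instead of rerunning from scratch, but the outcome is the same kind of data-driven choice of k.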

 

Here's a sample process that will get you started. You will need to install the Text Mining extension to do this.

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="34">
        <parameter key="connection" value="NewConnection"/>
        <parameter key="query" value="rapidminer"/>
        <parameter key="language" value="en"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.4.000" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.4.000" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.4.001" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
        <parameter key="prune_method" value="percentual"/>
        <parameter key="prune_below_percent" value="5.0"/>
        <parameter key="prune_above_percent" value="50.0"/>
        <parameter key="prune_below_absolute" value="100"/>
        <parameter key="prune_above_absolute" value="500"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:replace_tokens" compatibility="7.4.001" expanded="true" height="68" name="Replace Tokens (2)" width="90" x="45" y="34">
            <list key="replace_dictionary">
              <parameter key="http.*" value="link"/>
            </list>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="7.4.001" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
            <parameter key="characters" value=" .!;:[,"/>
          </operator>
          <operator activated="true" class="text:transform_cases" compatibility="7.4.001" expanded="true" height="68" name="Transform Cases" width="90" x="313" y="34"/>
          <operator activated="true" class="text:filter_by_length" compatibility="7.4.001" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="34"/>
          <operator activated="true" class="text:filter_tokens_by_content" compatibility="7.4.001" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="581" y="34">
            <parameter key="string" value="link"/>
            <parameter key="invert condition" value="true"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="7.4.001" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="715" y="34"/>
          <connect from_port="document" to_op="Replace Tokens (2)" to_port="document"/>
          <connect from_op="Replace Tokens (2)" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
          <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="x_means" compatibility="7.4.000" expanded="true" height="82" name="X-Means" width="90" x="581" y="34">
        <parameter key="numerical_measure" value="CosineSimilarity"/>
        <parameter key="divergence" value="SquaredEuclideanDistance"/>
      </operator>
      <connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="X-Means" to_port="example set"/>
      <connect from_op="X-Means" from_port="cluster model" to_port="result 1"/>
      <connect from_op="X-Means" from_port="clustered set" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

 

Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Regular Contributor

Re: K-means clustering over 8000 text files

Hello, thank you for the reply.

I started by installing the Text Processing extension from Extensions and Updates, then dragged in the "Process Documents from Files" operator and used "Edit List" next to the "text directories" label to choose the files I want to run the clustering algorithm on. Then I opened the "Process Documents from Files" operator (by double-clicking) and inserted the "Extract Content" operator into its subprocess, followed by a "Tokenize" operator after "Extract Content". Then I went back to the Main Process and dragged in the standard k-Means operator after the "Process Documents from Files" operator. I set k = 89 and max runs = 8000. Finally I pressed the "Play" button; it has now been running for 5 hours 16 minutes and still hasn't finished. Is that normal? Why hasn't the run finished yet?
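A side note on why a run like this can take so long: in RapidMiner's k-Means, "max runs" is the number of random restarts, so max runs = 8000 repeats the entire clustering up to 8000 times, and the cost of each run grows with the number of documents, clusters, and distinct terms. A back-of-the-envelope cost model (the vocabulary width and iteration cap below are assumptions for illustration, not measurements of this dataset):

```python
# Rough cost model for Lloyd's k-means on a dense document-term matrix.
n_docs = 8000        # number of text files
k = 89               # clusters, as chosen above
n_terms = 20000      # hypothetical vocabulary width with no pruning (assumption)
iterations = 100     # assumed optimization steps per run
restarts = 8000      # "max runs" = number of random restarts in RapidMiner k-Means

ops_per_assignment_pass = n_docs * k * n_terms
total_ops = ops_per_assignment_pass * iterations * restarts
print(f"{total_ops:.2e} distance terms")  # ~1.1e16 with these assumptions
```

Even at billions of operations per second, a number like that means days, which is why pruning the vocabulary and reducing max runs matter so much here.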

 

 

Regarding your comment that you usually use X-Means to figure out the optimal number of clusters automatically: can you explain how I can do that?

 

 

Best Regards.

 

 

Regular Contributor

Re: K-means clustering over 8000 text files

[ Edited ]

This is the screenshot. Could you help me, please?

rapid.png


Community Manager

Re: K-means clustering over 8000 text files

It's quite possible that it could take 10 hours; it's hard to say without knowing how wide your dataset got from the text processing. I would consider doing pruning and getting your dataset fully text-processed before you do the clustering; that way you can speed up the process. Why do you need 89 clusters anyway?
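Pruning here means dropping terms that occur in too few or too many documents before clustering, which directly shrinks the width of the document-term matrix. The sample process earlier in this thread prunes terms appearing in fewer than 5% or more than 50% of documents; the same idea in plain Python on a toy corpus (the documents below are made up for illustration):

```python
from collections import Counter

# Toy corpus: each document is a list of tokens.
docs = [
    ["cat", "sat", "mat"],
    ["cat", "dog", "mat"],
    ["cat", "bird", "tree"],
    ["cat", "fish", "mat"],
]

n = len(docs)
df = Counter(t for doc in docs for t in set(doc))  # document frequency per term
# Keep terms that appear in between 5% and 50% of documents.
keep = {t for t, c in df.items() if 0.05 * n <= c <= 0.50 * n}
pruned = [[t for t in doc if t in keep] for doc in docs]
print(sorted(keep))  # "cat" (in 100% of docs) and "mat" (75%) are pruned away
```

Terms that appear almost everywhere carry little clustering signal, and extremely rare terms mostly add width, so both ends are safe to cut.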

Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Regular Contributor

Re: K-means clustering over 8000 text files

You mean it's good to do the text processing first and then the clustering?

 

I chose k = 89 because it's the closest integer to the square root of 8000. Would you mind telling me how I can choose it automatically? Also, if the laptop restarts, does the run start over or resume from where it left off?

 

Many thanks.
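For reference, the rule of thumb used above is k ≈ √n, which gives √8000 ≈ 89.4, hence 89; a common variant is k ≈ √(n/2). Neither heuristic looks at the data itself, which is why a data-driven method such as X-Means is usually preferable:

```python
import math

n_docs = 8000
print(round(math.sqrt(n_docs)))      # 89  (k ≈ sqrt(n) rule of thumb)
print(round(math.sqrt(n_docs / 2)))  # 63  (k ≈ sqrt(n/2) variant)
```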

Community Manager

Re: K-means clustering over 8000 text files

Just use the X-Means operator and set the k limits. The default has a min of 2 and a max of 60.

 

What I would do is put a Store operator right after the exa port of the Process Documents from Files operator. This way you can save the processed text and inspect it. You could also try a Sample operator to take a random sample of maybe 500 rows to see how long those take to process.
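The sampling advice can be sketched outside RapidMiner as: take a random sample, time the processing on it, and extrapolate to the full set. The helper below is hypothetical, and linear extrapolation understates k-means cost (which also grows with vocabulary width), so treat the estimate as a lower bound:

```python
import random
import time

def estimate_full_runtime(items, process, sample_size=500, seed=42):
    """Time `process` on a random sample and extrapolate linearly to all items."""
    sample = random.Random(seed).sample(items, min(sample_size, len(items)))
    start = time.perf_counter()
    process(sample)
    elapsed = time.perf_counter() - start
    return elapsed * len(items) / len(sample)

# Toy usage: the "processing" here is just a sum, standing in for clustering.
items = list(range(8000))
estimate = estimate_full_runtime(items, sum)
```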

 

In cases like this we usually suggest you use a RapidMiner Server on a dedicated box with lots of memory and cores. Of course, that presupposes that you have a license that will unlock the cores and memory on the Server.

Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Regular Contributor

Re: K-means clustering over 8000 text files

rapid.png

The output still hasn't appeared. Is that possible?

Community Manager

Re: K-means clustering over 8000 text files

Based on it being 2% done, you'll have to wait about 98 days for it to finish.

 

You must have a very, very wide dataset. Did you try the sampling as I proposed? You might have to do some heavy pruning of your text files too.

Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Regular Contributor

Re: K-means clustering over 8000 text files

Really, no; I'm a beginner at this. Would you mind explaining the steps to use sampling?

 

Thanks in advance.