"Text Mining - Clustering Task - DISCOVER THE CONTENT OF EACH CLUSTER"

Marcello_Sandi · June 2009

Hi,

My problem is unsupervised learning because, as I said, my BOW has exactly 2290 attributes and 1572 examples. It doesn't have any labels, just descriptors extracted from the texts and one attribute holding the document name, which I set as the label.

I need to find the optimal number of clusters first, and I built that model to discover it. I didn't know that the RapidMiner KMeans already had an implementation that deals with the local minimum problem.

As an aside, what kind of algorithm/theory do you use in this case? I only need to put a reference in my thesis and explain it.

So, I let "ParameterIteration" run over an interval of desired cluster counts, and excluded the "RandomOptimizer" because it's not necessary. Do you have another suggestion?

Finally, I want to measure the quality of my clusters. Using "ParameterIteration" I can generate a scatter plot from "ClusterCentroidEvaluator" and see the relation between the AVG and DB distances for each cluster. Do you have any other suggestion?
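To make the two measures concrete, here is a minimal Python sketch of an average within-centroid distance and the Davies-Bouldin index. It is written outside RapidMiner and is not taken from "ClusterCentroidEvaluator", whose exact definitions and sign conventions may differ; the toy points and all function names are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def avg_within_distance(points, center):
    """Mean Euclidean distance from each point to its cluster centroid."""
    return sum(math.dist(p, center) for p in points) / len(points)

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of vectors).

    Lower is better: compact clusters whose centroids lie far apart.
    Assumes at least two clusters with distinct centroids.
    """
    centers = [centroid(c) for c in clusters]
    scatter = [avg_within_distance(c, m) for c, m in zip(clusters, centers)]
    worst_ratios = []
    for i in range(len(clusters)):
        ratios = [
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(len(clusters)) if j != i
        ]
        worst_ratios.append(max(ratios))
    return sum(worst_ratios) / len(clusters)

# Hypothetical 2-D toy data: the same four points grouped two different ways.
tight = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
mixed = [[[0.0, 0.0], [10.0, 0.0]], [[0.0, 2.0], [10.0, 2.0]]]

db_tight = davies_bouldin(tight)
db_mixed = davies_bouldin(mixed)
print(db_tight, db_mixed)
```

The grouping that keeps nearby points together gets the lower index, which is the behaviour one would look for when comparing different numbers of clusters.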

The problem, in this case, is that there are a lot of attributes, i.e., a lot of descriptors.

I want to label or characterize each cluster.

I would be very grateful and happy for any help.

Marcello Sandi

land · June 2009

Hi Marcello,
your setup seems well suited to your case. For cluster characterization, an understandable classification model is usually used. For example, use the one rule learner or a tree with a small depth.

KMeans is restarted as often as specified, and the solution with the minimal intra-cluster distance is chosen, if I remember correctly.
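The restart idea can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not RapidMiner's actual code: it uses a basic Lloyd-style k-means, Euclidean distance, and the average distance to the assigned centroid as the score; the function names and the toy data are hypothetical.

```python
import math
import random

def kmeans_once(points, k, rng, max_iter=100):
    """One Lloyd-style k-means run from a random initialization."""
    centers = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(max_iter):
        new_assign = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assign == assign:
            break  # assignments stopped changing: converged
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # Score: average distance of each point to its assigned centroid.
    score = sum(math.dist(p, centers[a]) for p, a in zip(points, assign)) / len(points)
    return score, centers, assign

def kmeans_restarts(points, k, runs=10, seed=0):
    """Run k-means `runs` times and keep the run with the smallest score."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(runs)), key=lambda r: r[0])

# Hypothetical toy data: two well-separated groups of three points each.
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
best_score, best_centers, best_assign = kmeans_restarts(data, k=2)
print(best_score, best_assign)
```

Each run can end in a different local minimum depending on its random start, so keeping the best-scoring run reduces the dependence on initialization.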

Greetings,
Sebastian

Marcello_Sandi · June 2009

Hi Sebastian,

You are the man.....

If possible, please set up an example model for me. I'm still not able to do it alone.

About KMeans, if you can say a bit more: "...the solution with the minimal intra cluster distance is chose..."

Could you tell me what this solution is called? Just the name of the algorithm or the theory is enough for me.

Thanks for all,
Marcello

land · June 2009

Hi Marcello,
you simply have to change the cluster attribute's role to label and then use the learner. I think you will be able to set up this process on your own.
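As a rough illustration of "cluster as label, then a simple learner", here is a small one-rule learner in Python. It is a sketch under assumptions, not RapidMiner's implementation: attributes are reduced to "term present or not", the cluster ids play the role of the label, and the documents, terms and function name are made up.

```python
from collections import Counter, defaultdict

def one_rule(rows, labels, attribute_names):
    """Pick the single attribute whose values best predict the cluster label.

    rows: list of dicts mapping attribute -> discrete value.
    Returns (attribute, rule, training_errors); rule maps value -> label.
    """
    best = None
    for attr in attribute_names:
        # Count how often each label occurs for each value of this attribute.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[attr]][label] += 1
        # Each value predicts its most frequent label.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[attr]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

# Hypothetical documents: term presence flags, with cluster ids used as labels.
rows = [
    {"the": True, "goal": True, "vote": False},
    {"the": True, "goal": True, "vote": False},
    {"the": True, "goal": False, "vote": True},
    {"the": False, "goal": False, "vote": True},
]
labels = ["cluster_0", "cluster_0", "cluster_1", "cluster_1"]

attr, rule, errors = one_rule(rows, labels, ["the", "goal", "vote"])
print(attr, rule, errors)
```

The chosen attribute and its value-to-cluster mapping then read as a short description of what separates the clusters.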

"Choosing the solution with the minimal average intra cluster distance" describes the algorithm very well. I don't think there's a special name for this three-liner.

Greetings,
Sebastian

Marcello_Sandi · June 2009

Sebastian,

I built this model. Is it good?

<operator name="Processo de Optimização do Centroid do KMeans" class="Process" expanded="yes">
    <description text="#ylt#p#ygt#This process shows how restarts can be performed in order to find the optimal clustering independent of the initialization. #ylt#/p#ygt#"/>
    <parameter key="logverbosity" value="warning"/>
    <operator name="Gerar Dados" class="OperatorChain" expanded="yes">
        <operator name="Light SN txRelev" class="ExampleSource">
            <parameter key="attributes" value="/home/msandi/workspace/modelos/light/sn_10_txRelev/light_sn_10_txRelev.aml"/>
        </operator>
        <operator name="Filtrando Cluster" class="AttributeFilter">
            <parameter key="condition_class" value="attribute_name_filter"/>
            <parameter key="parameter_string" value="label"/>
            <parameter key="invert_filter" value="true"/>
            <parameter key="apply_on_special" value="true"/>
        </operator>
    </operator>
    <operator name="KMeans Distância Euclidiana" class="KMeans">
        <parameter key="k" value="3"/>
    </operator>
    <operator name="Marcando Cluster id como Rótulo" class="ChangeAttributeRole">
        <parameter key="name" value="cluster"/>
        <parameter key="target_role" value="label"/>
    </operator>
    <operator name="XValidationParallel" class="XValidationParallel" expanded="yes">
        <parameter key="keep_example_set" value="true"/>
        <parameter key="create_complete_model" value="true"/>
        <parameter key="number_of_threads" value="4"/>
        <operator name="DecisionTreeParallel" class="DecisionTreeParallel">
            <parameter key="criterion" value="gini_index"/>
            <parameter key="number_of_threads" value="4"/>
        </operator>
        <operator name="Testando o Modelo" class="OperatorChain" expanded="yes">
            <operator name="Aplicando o Modelo" class="ModelApplier">
                <list key="application_parameters">
                </list>
            </operator>
            <operator name="Performance" class="Performance">
            </operator>
        </operator>
    </operator>
</operator>

With this model I want to describe the clusters. It generated a tree with just three levels and six attributes. I already set the confidence parameter to a high level and nothing changed.
I need more attributes to describe each cluster. I am not concerned with accuracy in this case.

Please, could you give me another suggestion?

Thanks for all,
Marcello

land · June 2009

Hi,
the only option is to try different learning schemes. I'm sorry, but any further suggestions would require a look at the data, and that would definitely be beyond the scope of this forum.

Greetings,
Sebastian
