"Clustering and writing back results into MySQL Database"

comsysto · November 2009

Hi,

I did a webinar a few weeks ago.
Since then I have run a text clustering and got 7 clusters of texts.
Now I want to write the result back into the MySQL database, so that each text article is assigned its cluster.

For example.
t1 is cluster 1
t2 is cluster 2
t3 is cluster 1

I have a new column in my table, and every article (text) should get a cluster.
But how do I do this? I didn't find such an option in the DatabaseExampleSetWriter.

Regards
Stefan

land · November 2009

Hi Stefan,
Unfortunately, I don't understand what your problem is. The DatabaseExampleSetWriter writes the example set into a table in your database. If you only want a subset of the example set in your database, you have to filter out the unwanted attributes first, for example with the AttributeFilter.

Greetings,
Sebastian

comsysto · November 2009

Hi Sebastian,

When I run KMeans, I get the cluster model. In the folder view there are 9 different clusters into which the articles are grouped.
In each cluster I can see the IDs of the text files.
I want to write the assigned cluster for each article into the database. I created a new column in the database, and the Cluster_x that RapidMiner assigned should be written into it.
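
Outside RapidMiner, the write-back described here boils down to one UPDATE per article. The following is only a rough sketch of that idea, not the DatabaseExampleSetWriter itself: it uses Python's built-in sqlite3 as a stand-in for MySQL, the table and ID column names are taken from the process posted later in this thread, and the column name `CLUSTER_LABEL` and the helper `write_clusters` are hypothetical.

```python
import sqlite3


def write_clusters(conn, assignments, table="DIM_ARTICLE",
                   id_col="ARTICLE_ID", cluster_col="CLUSTER_LABEL"):
    """Store one cluster label per article id.

    assignments maps article id -> cluster label.
    The table and column names are inserted into the SQL text directly,
    so they must come from trusted code, never from user input.
    CLUSTER_LABEL is a hypothetical column name.
    """
    sql = f"UPDATE {table} SET {cluster_col} = ? WHERE {id_col} = ?"
    with conn:  # commits on success, rolls back on error
        conn.executemany(sql, [(label, article_id)
                               for article_id, label in assignments.items()])


# In-memory stand-in for the real database (assumption: MySQL in the thread).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DIM_ARTICLE ("
             "ARTICLE_ID INTEGER PRIMARY KEY, CONTENT TEXT, CLUSTER_LABEL TEXT)")
conn.executemany("INSERT INTO DIM_ARTICLE (ARTICLE_ID, CONTENT) VALUES (?, ?)",
                 [(1, "t1"), (2, "t2"), (3, "t3")])

# t1 -> cluster 1, t2 -> cluster 2, t3 -> cluster 1, as in the first post.
write_clusters(conn, {1: "cluster_1", 2: "cluster_2", 3: "cluster_1"})

rows = conn.execute("SELECT ARTICLE_ID, CLUSTER_LABEL FROM DIM_ARTICLE "
                    "ORDER BY ARTICLE_ID").fetchall()
print(rows)
```

The sketch only works if the article ID travels together with the cluster label, so that each label can be matched to its row.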

Regards
Stefan

land · November 2009

Hi,
If this is the case, what is the problem with using the DatabaseExampleSetWriter?

Greetings,
Sebastian

comsysto · November 2009

Hi Sebastian,

Yeah!!! It worked. It was as simple as you said.

But I see a strange effect.
At first I extracted each article from the database into a text file and placed it in a subdirectory named after its category from the database, because every article is categorized by its poster.
If I cluster the text files, I get a different result than when I cluster the same articles from the database. Do you have any idea why this is happening?

Regards
Stefan

land · November 2009

Hi,
I have some suspicions, but this would be only a guess. Please post your process here, otherwise I cannot see what you are doing at all.

Greetings,
Sebastian

comsysto · November 2009

Hi Sebastian,

I made some screenshots:

http://img691.imageshack.us/img691/1376/database1.jpg

http://img690.imageshack.us/img690/6488/text1d.jpg

If this is not enough information, please let me know what you need.

Regards
Stefan

land · November 2009

Hi,
Could you please post the XML of your process? I cannot check the parameters of the operators from the images.

Greetings,
Sebastian

comsysto · November 2009

Hi Sebastian,

OK, here is the XML for the database input:

<operator name="Root" class="Process" expanded="yes">
  <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
    <parameter key="database_url" value="jdbc:mysql://localhost:3306/commdwh"/>
    <parameter key="username" value="profiler"/>
    <parameter key="password" value="IVMwe4nxke2qk62hBnNkLg=="/>
    <parameter key="query" value="SELECT `CONTENT` FROM `DIM_ARTICLE` where ARTICLE_ID &lt;&gt; -1"/>
  </operator>
  <operator name="StringTextInput" class="StringTextInput" expanded="yes">
    <parameter key="filter_nominal_attributes" value="true"/>
    <parameter key="vector_creation" value="TermFrequency"/>
    <list key="namespaces">
    </list>
    <parameter key="create_text_visualizer" value="true"/>
    <operator name="StringTokenizer" class="StringTokenizer">
    </operator>
    <operator name="GermanStopwordFilter" class="GermanStopwordFilter">
    </operator>
    <operator name="TokenLengthFilter" class="TokenLengthFilter">
    </operator>
    <operator name="GermanStemmer" class="GermanStemmer">
    </operator>
  </operator>
  <operator name="Nominal2Binominal" class="Nominal2Binominal" activated="no">
  </operator>
  <operator name="Nominal2Numerical" class="Nominal2Numerical">
  </operator>
  <operator name="KMeans" class="KMeans">
    <parameter key="k" value="9"/>
  </operator>
  <operator name="ExampleVisualizer" class="ExampleVisualizer">
  </operator>
  <operator name="AttributeFilter" class="AttributeFilter">
    <parameter key="condition_class" value="attribute_name_filter"/>
    <parameter key="parameter_string" value="cluster"/>
  </operator>
  <operator name="DatabaseExampleSetWriter" class="DatabaseExampleSetWriter">
    <parameter key="database_url" value="jdbc:mysql://localhost:3306/commdwh"/>
    <parameter key="username" value="profiler"/>
    <parameter key="password" value="IVMwe4nxke2qk62hBnNkLg=="/>
    <parameter key="table_name" value="DIM_CLUSTER"/>
    <parameter key="overwrite_mode" value="overwrite"/>
    <parameter key="set_default_varchar_length" value="true"/>
  </operator>
</operator>

And here is the XML for the text file input:

<operator name="Root" class="Process" expanded="yes">
  <parameter key="logfile" value="C:\Dokumente und Einstellungen\Administrator\Eigene Dateien\rm_workspace\test.log"/>
  <operator name="TextInput" class="TextInput" expanded="yes">
    <list key="texts">
      <parameter key="3D Visualisierung" value="C:\xampp\htdocs\pentaho\3D Visualisierung"/>
      <parameter key="Events" value="C:\xampp\htdocs\pentaho\Events"/>
      <parameter key="Facility Management" value="C:\xampp\htdocs\pentaho\Facility Management"/>
      <parameter key="Innenarchitektur" value="C:\xampp\htdocs\pentaho\Innenarchitektur"/>
      <parameter key="Jobs" value="C:\xampp\htdocs\pentaho\Jobs"/>
      <parameter key="Landschaftsarchitektur" value="C:\xampp\htdocs\pentaho\Landschaftsarchitektur"/>
      <parameter key="Lichtplanung" value="C:\xampp\htdocs\pentaho\Lichtplanung"/>
      <parameter key="Produkte" value="C:\xampp\htdocs\pentaho\Produkte"/>
      <parameter key="Stadtplanung" value="C:\xampp\htdocs\pentaho\Stadtplanung"/>
      <parameter key="Studium &amp; Ausbildung" value="C:\xampp\htdocs\pentaho\Studium &amp; Ausbildung"/>
      <parameter key="Wettbewerbe" value="C:\xampp\htdocs\pentaho\Wettbewerbe"/>
      <parameter key="News" value="C:\xampp\htdocs\pentaho\News"/>
      <parameter key="Architektur" value="C:\xampp\htdocs\pentaho\Architektur"/>
    </list>
    <parameter key="default_content_language" value="german"/>
    <parameter key="vector_creation" value="TermFrequency"/>
    <parameter key="output_word_list" value="C:\Dokumente und Einstellungen\Administrator\Eigene Dateien\rm_workspace\training_words.list"/>
    <parameter key="id_attribute_type" value="long"/>
    <list key="namespaces">
    </list>
    <parameter key="create_text_visualizer" value="true"/>
    <operator name="StringTokenizer" class="StringTokenizer">
    </operator>
    <operator name="GermanStopwordFilter" class="GermanStopwordFilter">
    </operator>
    <operator name="TokenLengthFilter" class="TokenLengthFilter">
      <parameter key="min_chars" value="3"/>
    </operator>
    <operator name="GermanStemmer" class="GermanStemmer">
    </operator>
  </operator>
  <operator name="KMeans" class="KMeans">
    <parameter key="k" value="9"/>
  </operator>
  <operator name="ExampleVisualizer" class="ExampleVisualizer">
  </operator>
</operator>

Regards
Stefan

land · November 2009

Hi,
It seems to me that you are working on different data: the TextInput operator, which reads from files, uses only the texts themselves, whereas when loading from the database you are using every available nominal attribute for clustering. You should use the Nominal2String operator to declare the text attribute as a string and then uncheck the "filter_nominal_attributes" parameter. Then only the text is used, and not every other nominal attribute such as label, path and so on.
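
To illustrate why extra nominal attributes can change a clustering, here is a small sketch in plain Python (not RapidMiner code). It assumes term-frequency vectors and Euclidean distance, and the two example documents and the helper names are made up for illustration. Whether this is what actually happened in the processes above is not confirmed in this thread.

```python
import math
from collections import Counter


def tf_vector(text, vocabulary):
    """Term-frequency vector of a text over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]


def one_hot(value, categories):
    """Numeric encoding of a nominal value: 1.0 at its category, else 0.0."""
    return [1.0 if value == c else 0.0 for c in categories]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two made-up articles with identical text but different categories.
docs = [
    {"text": "neue stadtplanung wettbewerb", "category": "Stadtplanung"},
    {"text": "neue stadtplanung wettbewerb", "category": "Wettbewerbe"},
]
vocabulary = sorted({t for d in docs for t in d["text"].split()})
categories = sorted({d["category"] for d in docs})

# Text only: what a file-based text input would see.
text_only = [tf_vector(d["text"], vocabulary) for d in docs]

# Text plus an encoded nominal attribute: extra dimensions enter the distance.
with_nominal = [tf_vector(d["text"], vocabulary) + one_hot(d["category"], categories)
                for d in docs]

d_text = euclidean(text_only[0], text_only[1])
d_mixed = euclidean(with_nominal[0], with_nominal[1])
print(d_text, d_mixed)
```

With text only, the two articles are indistinguishable. Once the category is part of the vector they are a positive distance apart, so a distance-based method such as KMeans may group them differently.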

Greetings,
Sebastian

comsysto · December 2009

Hi Sebastian,

I made the changes you suggested, but I still see the same effect with the database process.

<operator name="Root" class="Process" expanded="yes">
  <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
    <parameter key="database_url" value="jdbc:mysql://localhost:3306/commdwh"/>
    <parameter key="username" value="profiler"/>
    <parameter key="password" value="IVMwe4nxke2qk62hBnNkLg=="/>
    <parameter key="query" value="SELECT `CONTENT` FROM `DIM_ARTICLE` where ARTICLE_ID &lt;&gt; -1"/>
  </operator>
  <operator name="Nominal2String" class="Nominal2String">
  </operator>
  <operator name="StringTextInput" class="StringTextInput" expanded="yes">
    <parameter key="vector_creation" value="TermFrequency"/>
    <list key="namespaces">
    </list>
    <parameter key="create_text_visualizer" value="true"/>
    <operator name="StringTokenizer" class="StringTokenizer">
    </operator>
    <operator name="GermanStopwordFilter" class="GermanStopwordFilter">
    </operator>
    <operator name="TokenLengthFilter" class="TokenLengthFilter">
    </operator>
    <operator name="GermanStemmer" class="GermanStemmer">
    </operator>
  </operator>
  <operator name="Nominal2Binominal" class="Nominal2Binominal" activated="no">
  </operator>
  <operator name="Nominal2Numerical" class="Nominal2Numerical">
  </operator>
  <operator name="KMeans" class="KMeans">
    <parameter key="k" value="9"/>
  </operator>
  <operator name="ExampleVisualizer" class="ExampleVisualizer">
  </operator>
  <operator name="AttributeFilter" class="AttributeFilter">
    <parameter key="condition_class" value="attribute_name_filter"/>
    <parameter key="parameter_string" value="cluster"/>
  </operator>
  <operator name="DatabaseExampleSetWriter" class="DatabaseExampleSetWriter">
    <parameter key="database_url" value="jdbc:mysql://localhost:3306/commdwh"/>
    <parameter key="username" value="profiler"/>
    <parameter key="password" value="IVMwe4nxke2qk62hBnNkLg=="/>
    <parameter key="table_name" value="DIM_CLUSTER"/>
    <parameter key="overwrite_mode" value="overwrite"/>
    <parameter key="set_default_varchar_length" value="true"/>
  </operator>
</operator>

Regards
Stefan

land · December 2009

Hi,
There's no obvious error in your process. Did you set a breakpoint after loading the data and check whether the attribute definitions are the same in both processes?

Greetings,
Sebastian

comsysto · December 2009

Hi Sebastian,

Now it's working. I don't know why :-)
Just one more question: is it possible to see how close a neighbor is in KMeans?
For example, whether the text with id1 is closer to the cluster center than the text with id2 from cluster_1?

Regards
Stefan

land · December 2009

Hi,
You could build the pairwise similarity table using the ExampleSet2Similarity or ExampleSet2SimilarityExampleSet operator. This will list all pairwise distances. If you choose Euclidean distance, this is the same measure as used in KMeans.
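
As a rough illustration of what such a distance table contains, here is a sketch in plain Python rather than with the RapidMiner operators. It assumes each cluster center is the mean of its members, which holds for a converged k-means. The example vectors and the helper names are made up.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a list of equally long vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distances_to_centroid(examples, assignments):
    """Distance of every example to the center of its own cluster.

    examples maps id -> vector, assignments maps id -> cluster label.
    """
    members = {}
    for ex_id, label in assignments.items():
        members.setdefault(label, []).append(examples[ex_id])
    centers = {label: centroid(vectors) for label, vectors in members.items()}
    return {ex_id: euclidean(vec, centers[assignments[ex_id]])
            for ex_id, vec in examples.items()}


def pairwise_distances(examples):
    """Euclidean distance for every unordered pair of example ids."""
    ids = sorted(examples)
    return {(a, b): euclidean(examples[a], examples[b])
            for i, a in enumerate(ids) for b in ids[i + 1:]}


# Made-up vectors; in practice these would be the term-frequency vectors.
examples = {"id1": [1.0, 0.0], "id2": [5.0, 0.0],
            "id3": [0.0, 0.0], "id4": [10.0, 10.0]}
assignments = {"id1": "cluster_1", "id2": "cluster_1",
               "id3": "cluster_1", "id4": "cluster_2"}

dist = distances_to_centroid(examples, assignments)
pairs = pairwise_distances(examples)
print(dist["id1"] < dist["id2"])  # is id1 closer to its cluster center than id2?
print(pairs[("id1", "id2")])
```

Sorting the examples of one cluster by their value in `dist` gives the ranking asked about in the question above.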

Greetings,
Sebastian
