
"Clustering and writing back results into MySQL Database"

comsysto Member Posts: 7 Contributor II
edited May 2019 in Help
Hi,

I did a webinar a few weeks ago.
I then did text clustering and got 7 clusters of texts.
Now I want to write the result back into the MySQL database, so that each text article is assigned a cluster.

For example:
t1 is cluster 1
t2 is cluster 2
t3 is cluster 1

I have a new column in my table, and every article (text) should get a cluster.
But how do I do this? In the DatabaseExampleSetWriter I didn't find such an option.

Regards
Stefan




Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Stefan,
    unfortunately I don't understand where your problem is. The DatabaseExampleSetWriter will write the example set as a table into your database. If you only want a subset of the example set in your database, you will have to filter out the undesired attributes first. You could use the AttributeFilter, for example.

    Greetings,
      Sebastian
  • comsysto Member Posts: 7 Contributor II
    Hi Sebastian,

    when I run KMeans, I get the cluster model. In the folder view there are 9 different clusters into which the articles are classified.
    In each cluster I can see the IDs of the text files.
    So I want to write the assigned cluster for each article into the database. I created a new column in the database; the Cluster_x that RapidMiner assigned should be written into it.

    Regards
    Stefan
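
The goal described above, one cluster label per article in an existing table, can be sketched outside RapidMiner as a plain SQL update keyed by the article ID. This is only a minimal sketch under several assumptions: an in-memory SQLite database stands in for the MySQL instance, the column name `CLUSTER_LABEL` and the helper `write_clusters` are hypothetical, and the cluster assignments are hard-coded from the t1/t2/t3 example in the first post. Note that mapping labels back to rows needs a key such as `ARTICLE_ID` to be carried along with the text.

```python
import sqlite3


def write_clusters(conn, assignments):
    """Store one cluster label per article, keyed by ARTICLE_ID.

    `assignments` maps article IDs to cluster labels. CLUSTER_LABEL is a
    hypothetical column name, not one taken from the thread.
    """
    conn.executemany(
        "UPDATE DIM_ARTICLE SET CLUSTER_LABEL = ? WHERE ARTICLE_ID = ?",
        [(label, article_id) for article_id, label in assignments.items()],
    )
    conn.commit()


# In-memory SQLite stands in for the MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DIM_ARTICLE ("
    "ARTICLE_ID INTEGER PRIMARY KEY, CONTENT TEXT, CLUSTER_LABEL TEXT)"
)
conn.executemany(
    "INSERT INTO DIM_ARTICLE (ARTICLE_ID, CONTENT) VALUES (?, ?)",
    [(1, "t1"), (2, "t2"), (3, "t3")],
)

# Assignments as in the example from the first post.
write_clusters(conn, {1: "cluster_1", 2: "cluster_2", 3: "cluster_1"})

rows = conn.execute(
    "SELECT ARTICLE_ID, CLUSTER_LABEL FROM DIM_ARTICLE ORDER BY ARTICLE_ID"
).fetchall()
print(rows)
```

Within RapidMiner itself, the thread's approach of writing the example set with the DatabaseExampleSetWriter covers the same need; the sketch only illustrates the keyed write-back idea.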
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    and if this is the case, where's your problem using the DatabaseExampleSetWriter?

    Greetings,
      Sebastian
  • comsysto Member Posts: 7 Contributor II
    Hi Sebastian,

    Yeah!!! It worked. It was as simple as you said.

    But I see a strange effect.
    At first I extracted each article from the database into a text file and placed each text file in a subdirectory given by the database, because every article is categorized by the poster.
    If I cluster the text files, I get a different result than when clustering the same articles from the database. Do you have any idea why this is happening?

    Regards
    Stefan
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I have some suspicions, but this would be only a guess. Please post your process here, otherwise I cannot see what you are doing at all.

    Greetings,
      Sebastian
  • comsysto Member Posts: 7 Contributor II
    Hi Sebastian,

    So I made some screenshots:

    http://img691.imageshack.us/img691/1376/database1.jpg

    http://img690.imageshack.us/img690/6488/text1d.jpg

    If that's not enough information, please let me know what you need.

    Regards
    Stefan
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    please post the XML of your process. I cannot check the operators' parameters from the images.

    Greetings,
      Sebastian
  • comsysto Member Posts: 7 Contributor II
    Hi Sebastian,

    OK, here is the XML for the database input:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
            <parameter key="database_url" value="jdbc:mysql://localhost:3306/commdwh"/>
            <parameter key="username" value="profiler"/>
            <parameter key="password" value="IVMwe4nxke2qk62hBnNkLg=="/>
            <parameter key="query" value="SELECT `CONTENT` FROM `DIM_ARTICLE` where ARTICLE_ID &lt;&gt; -1"/>
        </operator>
        <operator name="StringTextInput" class="StringTextInput" expanded="yes">
            <parameter key="filter_nominal_attributes" value="true"/>
            <parameter key="vector_creation" value="TermFrequency"/>
            <list key="namespaces">
            </list>
            <parameter key="create_text_visualizer" value="true"/>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="GermanStopwordFilter" class="GermanStopwordFilter">
            </operator>
            <operator name="TokenLengthFilter" class="TokenLengthFilter">
            </operator>
            <operator name="GermanStemmer" class="GermanStemmer">
            </operator>
        </operator>
        <operator name="Nominal2Binominal" class="Nominal2Binominal" activated="no">
        </operator>
        <operator name="Nominal2Numerical" class="Nominal2Numerical">
        </operator>
        <operator name="KMeans" class="KMeans">
            <parameter key="k" value="9"/>
        </operator>
        <operator name="ExampleVisualizer" class="ExampleVisualizer">
        </operator>
        <operator name="AttributeFilter" class="AttributeFilter">
            <parameter key="condition_class" value="attribute_name_filter"/>
            <parameter key="parameter_string" value="cluster"/>
        </operator>
        <operator name="DatabaseExampleSetWriter" class="DatabaseExampleSetWriter">
            <parameter key="database_url" value="jdbc:mysql://localhost:3306/commdwh"/>
            <parameter key="username" value="profiler"/>
            <parameter key="password" value="IVMwe4nxke2qk62hBnNkLg=="/>
            <parameter key="table_name" value="DIM_CLUSTER"/>
            <parameter key="overwrite_mode" value="overwrite"/>
            <parameter key="set_default_varchar_length" value="true"/>
        </operator>
    </operator>


    And here is the XML for the text file input:

    <operator name="Root" class="Process" expanded="yes">
        <parameter key="logfile" value="C:\Dokumente und Einstellungen\Administrator\Eigene Dateien\rm_workspace\test.log"/>
        <operator name="TextInput" class="TextInput" expanded="yes">
            <list key="texts">
              <parameter key="3D Visualisierung" value="C:\xampp\htdocs\pentaho\3D Visualisierung"/>
              <parameter key="Events" value="C:\xampp\htdocs\pentaho\Events"/>
              <parameter key="Facility Management" value="C:\xampp\htdocs\pentaho\Facility Management"/>
              <parameter key="Innenarchitektur" value="C:\xampp\htdocs\pentaho\Innenarchitektur"/>
              <parameter key="Jobs" value="C:\xampp\htdocs\pentaho\Jobs"/>
              <parameter key="Landschaftsarchitektur" value="C:\xampp\htdocs\pentaho\Landschaftsarchitektur"/>
              <parameter key="Lichtplanung" value="C:\xampp\htdocs\pentaho\Lichtplanung"/>
              <parameter key="Produkte" value="C:\xampp\htdocs\pentaho\Produkte"/>
              <parameter key="Stadtplanung" value="C:\xampp\htdocs\pentaho\Stadtplanung"/>
              <parameter key="Studium &amp; Ausbildung" value="C:\xampp\htdocs\pentaho\Studium &amp; Ausbildung"/>
              <parameter key="Wettbewerbe" value="C:\xampp\htdocs\pentaho\Wettbewerbe"/>
              <parameter key="News" value="C:\xampp\htdocs\pentaho\News"/>
              <parameter key="Architektur" value="C:\xampp\htdocs\pentaho\Architektur"/>
            </list>
            <parameter key="default_content_language" value="german"/>
            <parameter key="vector_creation" value="TermFrequency"/>
            <parameter key="output_word_list" value="C:\Dokumente und Einstellungen\Administrator\Eigene Dateien\rm_workspace\training_words.list"/>
            <parameter key="id_attribute_type" value="long"/>
            <list key="namespaces">
            </list>
            <parameter key="create_text_visualizer" value="true"/>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="GermanStopwordFilter" class="GermanStopwordFilter">
            </operator>
            <operator name="TokenLengthFilter" class="TokenLengthFilter">
                <parameter key="min_chars" value="3"/>
            </operator>
            <operator name="GermanStemmer" class="GermanStemmer">
            </operator>
        </operator>
        <operator name="KMeans" class="KMeans">
            <parameter key="k" value="9"/>
        </operator>
        <operator name="ExampleVisualizer" class="ExampleVisualizer">
        </operator>
    </operator>

    Regards
    Stefan
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    it seems to me that you are working on different data: while the text input operator reading from files uses only the texts themselves, you are using every available nominal attribute for clustering when loading from the database. You should use the Nominal2String operator to declare the text attribute as a string and then uncheck the "filter_nominal_attributes" parameter. Then only the text is used, and not every other nominal attribute, like label, path and so on.

    Greetings,
      Sebastian
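
Why extra nominal attributes change a clustering can be seen on two toy documents. The following is a minimal sketch, not a reproduction of what the RapidMiner operators do internally: the vocabulary, the two documents, the raw-count term frequencies and the one-hot encoding of the category are all assumptions made for illustration. Two documents with identical text have distance 0 on the text features alone, but a non-zero distance once a differing nominal attribute is appended to the vectors.

```python
import math
from collections import Counter


def term_frequencies(text, vocabulary):
    """Raw term counts over a fixed vocabulary (illustrative, unnormalized)."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]


def one_hot(value, categories):
    """Encode a nominal value as a 0/1 vector (illustrative encoding)."""
    return [1.0 if value == category else 0.0 for category in categories]


# Hypothetical toy data: same text, different category.
vocabulary = ["licht", "planung", "stadt"]
categories = ["Lichtplanung", "Stadtplanung"]
doc_a = {"text": "licht planung licht", "category": "Lichtplanung"}
doc_b = {"text": "licht planung licht", "category": "Stadtplanung"}

# Features from the text only, as when reading plain text files.
text_only_a = term_frequencies(doc_a["text"], vocabulary)
text_only_b = term_frequencies(doc_b["text"], vocabulary)

# Features when an additional nominal attribute is included as well.
with_nominal_a = text_only_a + one_hot(doc_a["category"], categories)
with_nominal_b = text_only_b + one_hot(doc_b["category"], categories)

d_text = math.dist(text_only_a, text_only_b)
d_mixed = math.dist(with_nominal_a, with_nominal_b)
print(d_text, d_mixed)
```

Because KMeans groups examples by distance, any attribute that changes the distances can change the resulting clusters, which is consistent with the explanation given above.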
  • comsysto Member Posts: 7 Contributor II
    Hi Sebastian,

    I changed what you suggested, but I still get the same effect with the database process.

    <operator name="Root" class="Process" expanded="yes">
        <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
            <parameter key="database_url" value="jdbc:mysql://localhost:3306/commdwh"/>
            <parameter key="username" value="profiler"/>
            <parameter key="password" value="IVMwe4nxke2qk62hBnNkLg=="/>
            <parameter key="query" value="SELECT `CONTENT` FROM `DIM_ARTICLE` where ARTICLE_ID &lt;&gt; -1"/>
        </operator>
        <operator name="Nominal2String" class="Nominal2String">
        </operator>
        <operator name="StringTextInput" class="StringTextInput" expanded="yes">
            <parameter key="vector_creation" value="TermFrequency"/>
            <list key="namespaces">
            </list>
            <parameter key="create_text_visualizer" value="true"/>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
            <operator name="GermanStopwordFilter" class="GermanStopwordFilter">
            </operator>
            <operator name="TokenLengthFilter" class="TokenLengthFilter">
            </operator>
            <operator name="GermanStemmer" class="GermanStemmer">
            </operator>
        </operator>
        <operator name="Nominal2Binominal" class="Nominal2Binominal" activated="no">
        </operator>
        <operator name="Nominal2Numerical" class="Nominal2Numerical">
        </operator>
        <operator name="KMeans" class="KMeans">
            <parameter key="k" value="9"/>
        </operator>
        <operator name="ExampleVisualizer" class="ExampleVisualizer">
        </operator>
        <operator name="AttributeFilter" class="AttributeFilter">
            <parameter key="condition_class" value="attribute_name_filter"/>
            <parameter key="parameter_string" value="cluster"/>
        </operator>
        <operator name="DatabaseExampleSetWriter" class="DatabaseExampleSetWriter">
            <parameter key="database_url" value="jdbc:mysql://localhost:3306/commdwh"/>
            <parameter key="username" value="profiler"/>
            <parameter key="password" value="IVMwe4nxke2qk62hBnNkLg=="/>
            <parameter key="table_name" value="DIM_CLUSTER"/>
            <parameter key="overwrite_mode" value="overwrite"/>
            <parameter key="set_default_varchar_length" value="true"/>
        </operator>
    </operator>


    Regards
    Stefan
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    there's no obvious error in your process. Did you set a breakpoint after loading the data and check whether the attribute definitions were the same?

    Greetings,
      Sebastian
  • comsysto Member Posts: 7 Contributor II
    Hi Sebastian,

    now it's working. I don't know why :-)
    Just another question: is it possible to see how close a neighbor is in KMeans?
    For example, whether the text with id1 is closer to the cluster center than the text with id2 from cluster_1?


    Regards
    Stefan
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    you could build the pairwise similarity table using the ExampleSet2Similarity or ExampleSet2SimilarityExampleSet operator. This will list all pairwise distances. If you choose Euclidean distance, this is the same measure as used in KMeans.

    Greetings,
      Sebastian
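
What such a distance table contains can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the output format of the operators named above: the example vectors and IDs are made up, the helper names are hypothetical, and the cluster center is taken as the mean of the cluster's members. It lists the Euclidean distance of each example to the center and between every pair of examples.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distances_to_centroid(examples):
    """Euclidean distance of each example to the mean of all examples."""
    center = centroid(list(examples.values()))
    return {ex_id: math.dist(vec, center) for ex_id, vec in examples.items()}


def pairwise_distances(examples):
    """Euclidean distance for every unordered pair of example IDs."""
    ids = sorted(examples)
    return {
        (a, b): math.dist(examples[a], examples[b])
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
    }


# Hypothetical members of one cluster, as ID -> feature vector.
cluster_1 = {"id1": [1.0, 0.0], "id2": [4.0, 0.0], "id3": [1.0, 0.0]}

to_center = distances_to_centroid(cluster_1)
pairs = pairwise_distances(cluster_1)
print(to_center)
print(pairs)
```

With these made-up vectors, id1 lies closer to the center than id2, which is the kind of comparison asked about in the question.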