Text Classification using Text Plugin

pserpser Member Posts: 8 Contributor II
edited September 2019 in Help


I am trying to classify texts stored in a database. I'd like to describe some of the problems I ran into and the questions that came up. Since they address different topics, I decided to split the post into three parts. In this one I ask for your opinion: how would you design an experiment for text classification with RapidMiner? If anyone has built a similar experiment, I would be very grateful if they could describe the setup they used.

The setup I have in mind at the moment is something like this:

<operator name="Root" class="Process" expanded="yes">
    <operator name="DatabaseExampleSource" class="DatabaseExampleSource">
        <parameter key="database_url" value="www.example.net"/>
        <parameter key="username" value="example"/>
    </operator>
    <operator name="StringTextInput" class="StringTextInput" expanded="no">
        <parameter key="default_content_encoding" value="UTF-8"/>
        <parameter key="default_content_type" value="html"/>
        <parameter key="filter_nominal_attributes" value="true"/>
        <parameter key="input_word_list" value="example.wordlist"/>
        <list key="namespaces"/>
        <parameter key="prune_above" value="5%"/>
        <parameter key="prune_below" value="3"/>
        <parameter key="remove_original_attributes" value="true"/>
        <operator name="StringTokenizer" class="StringTokenizer"/>
        <operator name="GermanStopwordFilter" class="GermanStopwordFilter"/>
        <operator name="TokenLengthFilter" class="TokenLengthFilter">
            <parameter key="max_chars" value="25"/>
            <parameter key="min_chars" value="3"/>
        </operator>
        <operator name="GermanStemmer" class="GermanStemmer"/>
    </operator>
    <operator name="XValidation" class="XValidation" expanded="yes">
        <parameter key="create_complete_model" value="true"/>
        <parameter key="number_of_validations" value="5"/>
        <operator name="W-NaiveBayesMultinomialUpdateable" class="W-NaiveBayesMultinomialUpdateable"/>
        <operator name="Testing" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier">
                <list key="application_parameters"/>
                <parameter key="keep_model" value="true"/>
            </operator>
            <operator name="ClassificationPerformance" class="ClassificationPerformance">
                <list key="class_weights"/>
                <parameter key="classification_error" value="true"/>
                <parameter key="correlation" value="true"/>
                <parameter key="keep_example_set" value="true"/>
            </operator>
        </operator>
    </operator>
</operator>

This is only the part that learns the model. Normally, a part that applies the model to unlabeled data would follow. Later on, I'd like to create the wordlist from the database entries (at the moment I work with a given wordlist) and use the UpdateModel operator to update the model incrementally with new labeled data. More about this in my other posts in "Problems and Support".
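To make the incremental-update idea a bit more concrete, here is a minimal sketch in plain Python (not RapidMiner or Weka code). It assumes a multinomial naive Bayes model that stores only counts, so new labeled texts can be added later without retraining from scratch. The names `tokenize`, `IncrementalMultinomialNB`, `update` and `predict`, the tiny stopword list and the example texts are all hypothetical. The token length limits mirror the `min_chars` and `max_chars` values above; stemming and pruning are left out.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text, min_chars=3, max_chars=25, stopwords=frozenset()):
    """Lowercase, split into runs of letters, drop stopwords and tokens outside the length range."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_chars <= len(t) <= max_chars and t not in stopwords]


class IncrementalMultinomialNB:
    """Multinomial naive Bayes that keeps only counts, so it can be updated one text at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant
        self.class_docs = Counter()              # label -> number of training texts
        self.word_counts = defaultdict(Counter)  # label -> token -> count
        self.vocabulary = set()                  # all tokens seen during training

    def update(self, tokens, label):
        """Add one labeled text to the model (this is the incremental step)."""
        self.class_docs[label] += 1
        self.word_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def predict(self, tokens):
        """Return the label with the highest log posterior, or None if the model is empty."""
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            total_tokens = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # ignore tokens never seen in training
                score += math.log(
                    (counts[token] + self.alpha) / (total_tokens + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Illustrative usage with made-up texts and a made-up stopword list.
STOPWORDS = frozenset({"der", "die", "das", "und", "ist"})

model = IncrementalMultinomialNB()
model.update(tokenize("Der Motor und das Getriebe", stopwords=STOPWORDS), "auto")
model.update(tokenize("Die Suppe ist heiss", stopwords=STOPWORDS), "kochen")

# Later, a new labeled text arrives and is folded into the existing counts.
model.update(tokenize("Das Getriebe ist defekt", stopwords=STOPWORDS), "auto")

print(model.predict(tokenize("Getriebe defekt", stopwords=STOPWORDS)))
```

Because the model state is just counts, the order in which labeled texts arrive does not change the result. That property is what makes an update step cheap compared with relearning on the full data.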

