Classification of different comments

mavi16ab Member Posts: 13 Contributor I
edited December 2018 in Help

Hey RapidMiner community!

I've spent the last 10 hours or so trying to do what I believe is a very simple task, yet I can't seem to get it right.

I have an Excel file with the columns ActionType | CreatedDate | ActorName | TextValue | Category. The file has around 14,000 rows.

I have manually entered a Category for some of the rows, based on the TextValue, which is a Facebook comment. I need RapidMiner to assign a Category to the remaining rows, based on their TextValue.

What is the best way to do this?

Thanks a lot!

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @mavi16ab, welcome to the community. I'm happy to help, but can you please give me a little more to go on? If you could post your current XML process (see "Read Before Posting" on the right) and some sample rows of data, that would make things go better.

    Scott

  • mavi16ab Member Posts: 13 Contributor I

    Hey @sgenzer

    Thanks a lot for taking the time to answer my question! Tbh, I'm not really sure how to post the process as an XML file.

    As for my data, it looks like this in the spreadsheet:

    [Image: Skærmbillede 2017-11-22 kl. 17.54.04.png]

    As you can see, I choose a corresponding Category based on the TextValue. I need RapidMiner to do the same for all the comments I have not categorized yet.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Ah, I see. OK, that's pretty standard text mining.

    For posting XML, the instructions are shown here when you post a message:

    [Image: Screen Shot 2017-11-22 at 12.00.21 PM.png]

    And can you just attach that spreadsheet to a post? You can do it here:

    [Image: Screen Shot 2017-11-22 at 12.00.21 PM.png]

  • mavi16ab Member Posts: 13 Contributor I

    @sgenzer alright, this is what I got for you.

     

    The process:

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve PokemonGo Data" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//PokemonGo/Data/PokemonGo Data"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="7.6.001" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="34">
    <list key="filters_list">
    <parameter key="filters_entry_key" value="Category.is_not_missing."/>
    </list>
    </operator>
    <operator activated="true" class="concurrency:cross_validation" compatibility="7.6.001" expanded="true" height="145" name="Cross Validation" width="90" x="380" y="34">
    <process expanded="true">
    <operator activated="true" class="k_nn" compatibility="7.6.001" expanded="true" height="82" name="k-NN" width="90" x="112" y="34"/>
    <connect from_port="training set" to_op="k-NN" to_port="training set"/>
    <connect from_op="k-NN" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
    <list key="class_weights"/>
    </operator>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Retrieve PokemonGo Data" from_port="output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Cross Validation" from_port="model" to_port="result 1"/>
    <connect from_op="Cross Validation" from_port="example set" to_port="result 2"/>
    <connect from_op="Cross Validation" from_port="test result set" to_port="result 3"/>
    <connect from_op="Cross Validation" from_port="performance 1" to_port="result 4"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    </process>
    </operator>
    </process>

    And I have attached the spreadsheet for you.

    Thanks man.

     

     

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Haha, PokemonGo. Nice. Well, this should get you going in the right direction...

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve PokemonGoData" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//RapidMiner OneDrive/random community stuff/PokemonGoData"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="7.6.001" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="34">
    <list key="filters_list">
    <parameter key="filters_entry_key" value="Category.is_not_missing."/>
    </list>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="TextValue"/>
    </operator>
    <operator activated="true" class="nominal_to_date" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Date" width="90" x="447" y="34">
    <parameter key="attribute_name" value="CreatedDate"/>
    <parameter key="date_format" value="MM/dd/yy HH.mm"/>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="581" y="34">
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_method" value="percentual"/>
    <parameter key="prune_below_percent" value="2.0"/>
    <parameter key="prune_above_percent" value="35.0"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="34"/>
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="313" y="34"/>
    <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="34">
    <parameter key="min_chars" value="2"/>
    </operator>
    <connect from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
    <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="715" y="34">
    <parameter key="attribute_name" value="Category"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="concurrency:cross_validation" compatibility="7.6.001" expanded="true" height="145" name="Cross Validation" width="90" x="849" y="34">
    <process expanded="true">
    <operator activated="false" class="k_nn" compatibility="7.6.001" expanded="true" height="82" name="k-NN" width="90" x="112" y="289"/>
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="112" y="34"/>
    <connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
    <connect from_op="Decision Tree" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
    <list key="class_weights"/>
    </operator>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Retrieve PokemonGoData" from_port="output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Nominal to Date" to_port="example set input"/>
    <connect from_op="Nominal to Date" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Cross Validation" from_port="model" to_port="result 1"/>
    <connect from_op="Cross Validation" from_port="example set" to_port="result 2"/>
    <connect from_op="Cross Validation" from_port="test result set" to_port="result 3"/>
    <connect from_op="Cross Validation" from_port="performance 1" to_port="result 4"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    </process>
    </operator>
    </process>

    Good luck!

     

    Scott

     

    EDIT: I should have added the Apply Model part.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.6.001" expanded="true" height="68" name="Retrieve PokemonGoData" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//RapidMiner OneDrive/random community stuff/PokemonGoData"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="TextValue"/>
    </operator>
    <operator activated="true" class="nominal_to_date" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Date" width="90" x="313" y="34">
    <parameter key="attribute_name" value="CreatedDate"/>
    <parameter key="date_format" value="MM/dd/yy HH.mm"/>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_method" value="percentual"/>
    <parameter key="prune_below_percent" value="2.0"/>
    <parameter key="prune_above_percent" value="35.0"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="34"/>
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="313" y="34"/>
    <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="34">
    <parameter key="min_chars" value="2"/>
    </operator>
    <connect from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
    <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="declare_missing_value" compatibility="7.6.001" expanded="true" height="82" name="Declare Missing Value" width="90" x="581" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Category"/>
    <parameter key="mode" value="nominal"/>
    <parameter key="nominal_value" value="?"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role" width="90" x="715" y="34">
    <parameter key="attribute_name" value="Category"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="7.6.001" expanded="true" height="103" name="Filter Examples" width="90" x="849" y="136">
    <list key="filters_list">
    <parameter key="filters_entry_key" value="Category.is_not_missing."/>
    </list>
    </operator>
    <operator activated="true" class="concurrency:cross_validation" compatibility="7.6.001" expanded="true" height="145" name="Cross Validation" width="90" x="1050" y="34">
    <process expanded="true">
    <operator activated="false" class="k_nn" compatibility="7.6.001" expanded="true" height="82" name="k-NN" width="90" x="112" y="289"/>
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.001" expanded="true" height="82" name="Decision Tree" width="90" x="112" y="34"/>
    <connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
    <connect from_op="Decision Tree" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
    <list key="class_weights"/>
    </operator>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="1184" y="238">
    <list key="application_parameters"/>
    </operator>
    <connect from_op="Retrieve PokemonGoData" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Nominal to Date" to_port="example set input"/>
    <connect from_op="Nominal to Date" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="Declare Missing Value" to_port="example set input"/>
    <connect from_op="Declare Missing Value" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Filter Examples" from_port="unmatched example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Cross Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
    <connect from_op="Cross Validation" from_port="performance 1" to_port="result 3"/>
    <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 2"/>
    <connect from_op="Apply Model (2)" from_port="model" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    </process>
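
For readers who want to see what the inner steps of Process Documents from Data are doing (Transform Cases, Tokenize, Filter Stopwords, Filter Tokens by Length), here is a rough plain-Python approximation. It is only a sketch under assumptions, not what RapidMiner runs internally: lowercasing and splitting on non-letters are assumed defaults, the stopword set is a tiny hypothetical stand-in, and only the `min_chars` value of 2 is taken from the XML above. The word-vector weighting and percentual pruning done by the operator are not reproduced.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real list is much longer.
STOPWORDS = {"a", "an", "the", "is", "it", "to", "and", "of", "in", "i", "this"}

MIN_CHARS = 2  # taken from <parameter key="min_chars" value="2"/>


def preprocess(text):
    """Lowercase, split on non-letters, drop stopwords and short tokens."""
    lowered = text.lower()                    # Transform Cases (assumed: lower case)
    # Tokenize (assumed: split on non-letters). Simplification: only a-z count
    # as letters here, so characters such as "æ" would also act as separators.
    tokens = re.split(r"[^a-z]+", lowered)
    tokens = [t for t in tokens if t]         # discard empty strings from the split
    tokens = [t for t in tokens if t not in STOPWORDS]   # Filter Stopwords (English)
    return [t for t in tokens if len(t) >= MIN_CHARS]    # Filter Tokens (by Length)


def term_counts(text):
    """Count how often each remaining token occurs in one comment."""
    counts = {}
    for token in preprocess(text):
        counts[token] = counts.get(token, 0) + 1
    return counts


if __name__ == "__main__":
    comment = "This game is the BEST, I love it!!"  # made-up example comment
    print(preprocess(comment))
    print(term_counts(comment))
```

Each comment ends up as a bag of word counts, which is the kind of table the learner inside Cross Validation is trained on.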

     

     

  • mavi16ab Member Posts: 13 Contributor I

    @sgenzer

     

    Alright, I tried to import your XML code, but when trying to use it I get a few errors. Do I need any plugins?

     

    This is what I get:

    [Image: Skærmbillede 2017-11-22 kl. 18.29.09.png]

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    You need to install the free "Text Processing" extension.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • mavi16ab Member Posts: 13 Contributor I

    @sgenzer

     

    As a complete rookie to all this, how do I improve the accuracy? As it stands now, it's about 60%, which is a bit better than what I achieved. How do I train it?

     

    Thanks for taking the time.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Thanks @Telcontar120 for answering the question about the Text Processing extension. :)

    As for how to train the model, that's a much bigger question than one thread can answer. I just wrote that process to whet your appetite. :) I would strongly suggest you a) go through the built-in tutorials in RapidMiner Studio, and b) go through the "Getting Started with RapidMiner" YouTube video series to begin to answer that. RapidMiner makes data science fast and simple, but it does not do it for you. We're always here to help on the community when you have questions.

    [Image: Screen Shot 2017-11-22 at 6.45.18 PM.png, "go here for tutorials"]

    Good luck!

    Scott

    [Image: Screen Shot 2017-11-22 at 6.47.06 PM.png, "YouTube playlist"]

  • mavi16ab Member Posts: 13 Contributor I

    @sgenzer

     

    Sorry for contacting you through my thread, but I could find no way to PM you.

    I have been working on this and reading everything I can about RapidMiner, text mining, and classification in general, but no matter how many things I try, I CAN'T get an accuracy above 36%. Please help a desperate student in need.

    EDIT: Nvm, I needed to be logged in before I could PM you. My bad.

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    You're getting a bad model because your data is all over the place. I just loaded @sgenzer's process and your CSV file, @mavi16ab.

    Looking at the categories, 56% of your dataset is labelled "Other". What does that mean? Some of the other categories are so small that the model is suffering from a highly imbalanced dataset. There are also all kinds of missing data points. I would suggest doing some missing value replacement where you can and trying to balance the dataset a bit.

    Just by getting rid of the missing values and cleaning up the dataset I get almost 60% accuracy. Text Processing is a lot of fun, but there are so many ways to mess up your model. It requires a lot of up-front thinking.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.002">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="7.6.002" expanded="true" height="68" name="Retrieve PokemonGoData (2)" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//Local Repository/data/PokemonGoData"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.6.002" expanded="true" height="82" name="Select Attributes" width="90" x="112" y="238">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Category|CreatedDate|TextValue"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="7.6.002" expanded="true" height="103" name="Filter Examples (2)" width="90" x="246" y="238">
    <parameter key="condition_class" value="no_missing_attributes"/>
    <list key="filters_list"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="7.6.002" expanded="true" height="82" name="Nominal to Text" width="90" x="380" y="238">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="TextValue"/>
    </operator>
    <operator activated="true" class="nominal_to_date" compatibility="7.6.002" expanded="true" height="82" name="Nominal to Date" width="90" x="313" y="34">
    <parameter key="attribute_name" value="CreatedDate"/>
    <parameter key="date_format" value="MM/dd/yy HH.mm"/>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_method" value="percentual"/>
    <parameter key="prune_below_percent" value="2.0"/>
    <parameter key="prune_above_percent" value="35.0"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="34"/>
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="313" y="34"/>
    <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="34">
    <parameter key="min_chars" value="2"/>
    </operator>
    <connect from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
    <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="declare_missing_value" compatibility="7.6.002" expanded="true" height="82" name="Declare Missing Value" width="90" x="581" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Category"/>
    <parameter key="mode" value="nominal"/>
    <parameter key="nominal_value" value="?"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.6.002" expanded="true" height="82" name="Set Role" width="90" x="715" y="34">
    <parameter key="attribute_name" value="Category"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="7.6.002" expanded="true" height="103" name="Filter Examples" width="90" x="849" y="136">
    <list key="filters_list">
    <parameter key="filters_entry_key" value="Category.is_not_missing."/>
    </list>
    </operator>
    <operator activated="true" class="concurrency:cross_validation" compatibility="7.6.002" expanded="true" height="145" name="Cross Validation" width="90" x="1050" y="34">
    <process expanded="true">
    <operator activated="false" class="k_nn" compatibility="7.6.002" expanded="true" height="82" name="k-NN" width="90" x="112" y="289"/>
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.002" expanded="true" height="82" name="Decision Tree" width="90" x="112" y="34"/>
    <connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
    <connect from_op="Decision Tree" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.6.002" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.6.002" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
    <list key="class_weights"/>
    </operator>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="apply_model" compatibility="7.6.002" expanded="true" height="82" name="Apply Model (2)" width="90" x="1184" y="238">
    <list key="application_parameters"/>
    </operator>
    <connect from_op="Retrieve PokemonGoData (2)" from_port="output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Filter Examples (2)" to_port="example set input"/>
    <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Nominal to Date" to_port="example set input"/>
    <connect from_op="Nominal to Date" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="Declare Missing Value" to_port="example set input"/>
    <connect from_op="Declare Missing Value" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Filter Examples" from_port="unmatched example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Cross Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
    <connect from_op="Cross Validation" from_port="performance 1" to_port="result 3"/>
    <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 2"/>
    <connect from_op="Apply Model (2)" from_port="model" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    </process>
  • mavi16abmavi16ab Member Posts: 13 Contributor I

    Hey @Thomas_Ott

     

    It means so much that you took the time to go through my thread - what an awesome community this is! Regarding the dataset, I figured the balance of categories was very skewed, which is why I corrected it between my replies - I should obviously have noted this. The "Other" category is kind of a "catch-all", so that if a post doesn't fit in any of the categories it goes to "Other". To be honest, I don't have much knowledge of optimizing the data, as it's actually for a school business project and I have no prior experience within this field, which is why I have tried to learn as much as possible these last few weeks (to no avail).

     

    I did manage to get an accuracy of 70% myself, but then it just placed everything in the "Other" category, which was obviously not the point.

     

    Perhaps you could guide me through the steps you have taken?

     

    Thanks again man, means a lot!

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    The biggest problem I see is the number of classes you have relative to the small sample size of each class. What I would look at is either 1) consolidating the classes into maybe a total of three or four classes, or 2) getting more examples for each class. The learner is making too broad a generalization for your data set, so it lumps everything into the "Other" category.
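
    A minimal sketch of option 1, written in plain Python rather than as RapidMiner operators. Apart from "Spam" and "Other", every category name and the whole `MERGE_MAP` mapping are made up for illustration; the real categories in the data set would need their own mapping.

```python
from collections import Counter

# Hypothetical mapping from fine-grained categories to a few broad ones.
# Labels not listed here are kept unchanged.
MERGE_MAP = {
    "Spam": "Other",
    "Bug": "Negative",
    "Complaint": "Negative",
    "Praise": "Positive",
    "Thanks": "Positive",
}

def consolidate(labels, merge_map):
    """Replace each label by its broad category, if one is defined."""
    return [merge_map.get(label, label) for label in labels]

# Made-up labels standing in for the Category column.
labels = ["Other", "Spam", "Bug", "Complaint", "Praise", "Thanks", "Other"]
merged = consolidate(labels, MERGE_MAP)

print(Counter(labels))   # class counts before merging
print(Counter(merged))   # class counts after merging
```

    With fewer classes, each class gets more training examples, which is the point of the suggestion above.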

     

    Also, a word of caution. Don't rely solely on the 'accuracy' performance. It can be misleading when you have imbalanced datasets. Look at the precision and recall stats in the confusion matrix too. They will help you identify which classes are being classified correctly and which are not. That, in itself, can be a clue to help you build a better model.
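
    To illustrate why accuracy alone can mislead, here is a small plain-Python sketch. The labels and counts are invented (7 "Other", 2 "Complaint", 1 "Praise"), and the "model" simply predicts "Other" for everything; it is not taken from the attached process or data.

```python
from collections import Counter

def per_class_metrics(actual, predicted):
    """Return (accuracy, {label: (precision, recall)}) for two label lists."""
    # Each (actual, predicted) pair is one cell of the confusion matrix.
    cells = Counter(zip(actual, predicted))
    labels = sorted(set(actual) | set(predicted))
    accuracy = sum(cells[(c, c)] for c in labels) / len(actual)
    metrics = {}
    for c in labels:
        true_pos = cells[(c, c)]
        predicted_c = sum(cells[(a, c)] for a in labels)  # column total
        actual_c = sum(cells[(c, p)] for p in labels)     # row total
        precision = true_pos / predicted_c if predicted_c else 0.0
        recall = true_pos / actual_c if actual_c else 0.0
        metrics[c] = (precision, recall)
    return accuracy, metrics

# Invented, imbalanced example: the model predicts "Other" every time.
actual = ["Other"] * 7 + ["Complaint"] * 2 + ["Praise"]
predicted = ["Other"] * 10

accuracy, metrics = per_class_metrics(actual, predicted)
print(f"accuracy = {accuracy:.2f}")
for label, (precision, recall) in metrics.items():
    print(f"{label}: precision = {precision:.2f}, recall = {recall:.2f}")
```

    Accuracy comes out at 0.70 here even though recall for "Complaint" and "Praise" is 0, which is the kind of result the per-class numbers expose.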

  • mavi16abmavi16ab Member Posts: 13 Contributor I

    Thanks again @Thomas_Ott.

     

    I did follow your instructions and added more examples to each of the categories, as you can see in my attached data set. Still, it's not moving my accuracy by much, so I assume I must be doing something fundamentally wrong. I've attached my data set, and I'm hoping you could add some more insight.

     

    Thanks again!

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Like I said before, look at your classes again and work backwards from there. Do you really need that many classes? It appears that the Spam class gets lumped in with the Other class a lot. Is there anything different about them? Plus, have you tried an Ensemble model? You could do a combination of Voting, Bagging, or Boosting using different algos.
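
    As a rough idea of what Voting does, here is a plain-Python sketch of majority voting over the predictions of three models. The three prediction lists are invented, and this is only the general technique, not how RapidMiner implements its ensemble operators.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote."""
    combined = []
    # zip(*...) groups the predictions of all models for the same example.
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Invented predictions from three hypothetical models for four comments.
model_a = ["Spam", "Other", "Praise", "Other"]
model_b = ["Spam", "Praise", "Praise", "Other"]
model_c = ["Other", "Praise", "Complaint", "Other"]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

    Each comment gets the label that most models agree on, so a single model's mistake can be outvoted by the other two.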

  • mavi16abmavi16ab Member Posts: 13 Contributor I

    @Thomas_Ott technically, I guess I don't need the spam category. Also, I was wondering if it would help to remove the "Other" category?

     

    You said that you reached about 60% accuracy after some cleaning. What did you do?

     

    Thanks for your time and help.
