Due to recent updates, all users are required to create an Altair One account to login to the RapidMiner community. Click the Register button to create your account using the same email that you have previously used to login to the RapidMiner community. This will ensure that any previously created content will be synced to your Altair One account. Once you login, you will be asked to provide a username that identifies you to other Community users. Email us at Community with questions.

I need help for improve my Confusion Matrix

luistopsluistops Member Posts: 4 Contributor I
edited November 2018 in Help

Hi im new in Rapidminer, and i start a project to clasificate news in 2 types, 1= gender violence and 2= generic news.

This is my transforms etc that i used for :

https://gyazo.com/aae6e06bc6f1f7623881876b6ed06d23

This is my procesing text transformations:

https://gyazo.com/ed3d19158038d8c52f02db98ae13ba67

My cross validation:

https://gyazo.com/af702c35350dcd735ae61bf2f2322586

My result:

https://gyazo.com/a7752521e886c68be0490d3150b78635

 

Myquestion is how can i improve my result, and also for example if i choose a animal violence news, rapidminer fail and select a animal violence news as gender violence, iit is possible to create a rule so that in case it finds for example the word Animal, Dog, etc classify it as generic news?

Thanks for help, sorry for my english

Tagged:

Answers

  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hello @luistops - welcome to the Community.  We'd be happy to help you with this.  Can you please post your process XML in this thread using the </> button?  Thanks.


    Scott

     

  • luistopsluistops Member Posts: 4 Contributor I
    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="true" class="text:process_document_from_file" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="112" y="34">
    <list key="text_directories">
    <parameter key="Genericas" value="C:\Noticias Genéricas"/>
    <parameter key="Violencia" value="C:\Noticias de Genero"/>
    </list>
    <parameter key="file_pattern" value="*txt"/>
    <parameter key="extract_text_only" value="true"/>
    <parameter key="use_file_extension_as_type" value="true"/>
    <parameter key="content_type" value="txt"/>
    <parameter key="encoding" value="UTF-8"/>
    <parameter key="create_word_vector" value="true"/>
    <parameter key="vector_creation" value="TF-IDF"/>
    <parameter key="add_meta_information" value="true"/>
    <parameter key="keep_text" value="false"/>
    <parameter key="prune_method" value="none"/>
    <parameter key="prune_below_percent" value="3.0"/>
    <parameter key="prune_above_percent" value="30.0"/>
    <parameter key="prune_below_rank" value="0.05"/>
    <parameter key="prune_above_rank" value="0.95"/>
    <parameter key="datamanagement" value="double_sparse_array"/>
    <parameter key="data_management" value="auto"/>
    <process expanded="true">
    <operator activated="true" class="text:replace_tokens" compatibility="7.5.000" expanded="true" height="68" name="Replace Tokens" width="90" x="112" y="136">
    <list key="replace_dictionary">
    <parameter key="titleHTML;desHTML;fechaHTML" value=" "/>
    <parameter key="TituloLimpio;Descripcion;FechaF" value=" "/>
    </list>
    </operator>
    <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="313" y="34">
    <parameter key="transform_to" value="lower case"/>
    </operator>
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="447" y="34">
    <parameter key="mode" value="non letters"/>
    <parameter key="characters" value=".:"/>
    <parameter key="language" value="English"/>
    <parameter key="max_token_length" value="3"/>
    </operator>
    <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.5.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="380" y="136">
    <parameter key="file" value="D:\stopWordsSpanish.txt"/>
    <parameter key="case_sensitive" value="false"/>
    <parameter key="encoding" value="UTF-8"/>
    </operator>
    <operator activated="true" class="text:stem_snowball" compatibility="7.5.000" expanded="true" height="68" name="Stem (Snowball)" width="90" x="380" y="238">
    <parameter key="language" value="Spanish"/>
    </operator>
    <connect from_port="document" to_op="Replace Tokens" to_port="document"/>
    <connect from_op="Replace Tokens" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
    <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
    <connect from_op="Stem (Snowball)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="112" y="380">Type your comment</description>
    </process>
    </operator>
    <operator activated="true" class="concurrency:cross_validation" compatibility="7.6.001" expanded="true" height="145" name="Cross Validation" width="90" x="447" y="34">
    <parameter key="split_on_batch_attribute" value="false"/>
    <parameter key="leave_one_out" value="false"/>
    <parameter key="number_of_folds" value="9"/>
    <parameter key="sampling_type" value="automatic"/>
    <parameter key="use_local_random_seed" value="false"/>
    <parameter key="local_random_seed" value="1992"/>
    <parameter key="enable_parallel_execution" value="true"/>
    <process expanded="true">
    <operator activated="true" class="k_nn" compatibility="7.6.001" expanded="true" height="82" name="k-NN" width="90" x="179" y="34">
    <parameter key="k" value="2"/>
    <parameter key="weighted_vote" value="false"/>
    <parameter key="measure_types" value="MixedMeasures"/>
    <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
    <parameter key="nominal_measure" value="NominalDistance"/>
    <parameter key="numerical_measure" value="EuclideanDistance"/>
    <parameter key="divergence" value="GeneralizedIDivergence"/>
    <parameter key="kernel_type" value="radial"/>
    <parameter key="kernel_gamma" value="1.0"/>
    <parameter key="kernel_sigma1" value="1.0"/>
    <parameter key="kernel_sigma2" value="0.0"/>
    <parameter key="kernel_sigma3" value="2.0"/>
    <parameter key="kernel_degree" value="3.0"/>
    <parameter key="kernel_shift" value="1.0"/>
    <parameter key="kernel_a" value="1.0"/>
    <parameter key="kernel_b" value="0.0"/>
    </operator>
    <connect from_port="training set" to_op="k-NN" to_port="training set"/>
    <connect from_op="k-NN" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="11" y="419">Diccionario de sinonimos&lt;br/&gt;Si aparece la palabra perro no es de violencia de genero=violencia animal&lt;br/&gt;</description>
    <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="137" y="403">Type your comment</description>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
    <list key="application_parameters"/>
    <parameter key="create_view" value="false"/>
    </operator>
    <operator activated="true" class="performance_binominal_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance (2)" width="90" x="179" y="187">
    <parameter key="main_criterion" value="accuracy"/>
    <parameter key="accuracy" value="true"/>
    <parameter key="classification_error" value="true"/>
    <parameter key="kappa" value="true"/>
    <parameter key="AUC (optimistic)" value="true"/>
    <parameter key="AUC" value="true"/>
    <parameter key="AUC (pessimistic)" value="true"/>
    <parameter key="precision" value="true"/>
    <parameter key="recall" value="true"/>
    <parameter key="lift" value="true"/>
    <parameter key="fallout" value="true"/>
    <parameter key="f_measure" value="true"/>
    <parameter key="false_positive" value="true"/>
    <parameter key="false_negative" value="true"/>
    <parameter key="true_positive" value="true"/>
    <parameter key="true_negative" value="true"/>
    <parameter key="sensitivity" value="true"/>
    <parameter key="specificity" value="true"/>
    <parameter key="youden" value="true"/>
    <parameter key="positive_predictive_value" value="true"/>
    <parameter key="negative_predictive_value" value="true"/>
    <parameter key="psep" value="true"/>
    <parameter key="skip_undefined_labels" value="true"/>
    <parameter key="use_example_weights" value="true"/>
    </operator>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
    <connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Process Documents from Files" from_port="example set" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Cross Validation" from_port="example set" to_port="result 1"/>
    <connect from_op="Cross Validation" from_port="performance 1" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>
  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hello @luistops - ok thanks for that.  So for starters to improve your model I would encourage you to optimize your parameters (see https://www.youtube.com/watch?v=nXjjAA2mDMY).  As for reducing/combining classes, I'd need to see your dataset to understand better what you're trying to do.

     

    Scott

  • luistopsluistops Member Posts: 4 Contributor I

    Ok thank you so much for help, i include my datashet. my proyect just try to diference and clasificate news in 2 categories:

    1. No gender violence

    2. Gender violence

  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hello @luistops - ok I think I get your idea.  I did not have your Spanish stopword dictionary so I just greyed that out.  As for animal, dog, etc.. I put some replace token lines in your Process Documents operator that perhaps (?) is what you're looking for.  Note that the "test set" I created is rather small so don't get too excited by your 100% accuracy result.  :)

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="text:process_document_from_file" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="112" y="34">
    <list key="text_directories">
    <parameter key="Genericas" value="/Users/GenzerConsulting/OneDrive - RapidMiner/OneDrive Repository/random community stuff/newsfromhtml1/No gender violence news"/>
    <parameter key="Violencia" value="/Users/GenzerConsulting/OneDrive - RapidMiner/OneDrive Repository/random community stuff/newsfromhtml1/Gender violence news"/>
    </list>
    <parameter key="file_pattern" value="*txt"/>
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true">
    <operator activated="true" class="text:replace_tokens" compatibility="7.5.000" expanded="true" height="68" name="Replace Tokens" width="90" x="45" y="34">
    <list key="replace_dictionary">
    <parameter key="titleHTML;desHTML;fechaHTML" value=" "/>
    <parameter key="TituloLimpio;Descripcion;FechaF" value=" "/>
    <parameter key="animal.*violence" value="generic news"/>
    <parameter key="dog.*violence" value="generic news"/>
    </list>
    </operator>
    <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="179" y="34"/>
    <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="313" y="34"/>
    <operator activated="false" class="text:filter_stopwords_dictionary" compatibility="7.5.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="380" y="136">
    <parameter key="file" value="D:\stopWordsSpanish.txt"/>
    <parameter key="encoding" value="UTF-8"/>
    </operator>
    <operator activated="true" class="text:stem_snowball" compatibility="7.5.000" expanded="true" height="68" name="Stem (Snowball)" width="90" x="447" y="34">
    <parameter key="language" value="Spanish"/>
    </operator>
    <connect from_port="document" to_op="Replace Tokens" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
    <connect from_op="Stem (Snowball)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="112" y="380">Type your comment</description>
    </process>
    </operator>
    <operator activated="true" class="split_data" compatibility="7.6.001" expanded="true" height="103" name="Split Data" width="90" x="246" y="289">
    <enumeration key="partitions">
    <parameter key="ratio" value="0.9"/>
    <parameter key="ratio" value="0.1"/>
    </enumeration>
    </operator>
    <operator activated="true" class="optimize_parameters_grid" compatibility="7.6.001" expanded="true" height="145" name="Optimize Parameters (Grid)" width="90" x="514" y="34">
    <list key="parameters">
    <parameter key="k-NN.k" value="[2;10;10;linear]"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="concurrency:cross_validation" compatibility="7.6.001" expanded="true" height="145" name="Cross Validation" width="90" x="45" y="34">
    <parameter key="number_of_folds" value="9"/>
    <process expanded="true">
    <operator activated="true" class="k_nn" compatibility="7.6.001" expanded="true" height="82" name="k-NN" width="90" x="112" y="34">
    <parameter key="k" value="10"/>
    </operator>
    <connect from_port="training set" to_op="k-NN" to_port="training set"/>
    <connect from_op="k-NN" from_port="model" to_port="model"/>
    <portSpacing port="source_training set" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
    <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="11" y="419">Diccionario de sinonimos&lt;br/&gt;Si aparece la palabra perro no es de violencia de genero=violencia animal&lt;br/&gt;</description>
    <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="137" y="403">Type your comment</description>
    </process>
    <process expanded="true">
    <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_binominal_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance (train)" width="90" x="179" y="34">
    <parameter key="main_criterion" value="accuracy"/>
    <parameter key="classification_error" value="true"/>
    <parameter key="kappa" value="true"/>
    <parameter key="AUC (optimistic)" value="true"/>
    <parameter key="AUC" value="true"/>
    <parameter key="AUC (pessimistic)" value="true"/>
    <parameter key="precision" value="true"/>
    <parameter key="recall" value="true"/>
    <parameter key="lift" value="true"/>
    <parameter key="fallout" value="true"/>
    <parameter key="f_measure" value="true"/>
    <parameter key="false_positive" value="true"/>
    <parameter key="false_negative" value="true"/>
    <parameter key="true_positive" value="true"/>
    <parameter key="true_negative" value="true"/>
    <parameter key="sensitivity" value="true"/>
    <parameter key="specificity" value="true"/>
    <parameter key="youden" value="true"/>
    <parameter key="positive_predictive_value" value="true"/>
    <parameter key="negative_predictive_value" value="true"/>
    <parameter key="psep" value="true"/>
    </operator>
    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance (train)" to_port="labelled data"/>
    <connect from_op="Performance (train)" from_port="performance" to_port="performance 1"/>
    <portSpacing port="source_model" spacing="0"/>
    <portSpacing port="source_test set" spacing="0"/>
    <portSpacing port="source_through 1" spacing="0"/>
    <portSpacing port="sink_test set results" spacing="0"/>
    <portSpacing port="sink_performance 1" spacing="0"/>
    <portSpacing port="sink_performance 2" spacing="0"/>
    </process>
    </operator>
    <connect from_port="input 1" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Cross Validation" from_port="model" to_port="result 1"/>
    <connect from_op="Cross Validation" from_port="example set" to_port="result 2"/>
    <connect from_op="Cross Validation" from_port="performance 1" to_port="performance"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_performance" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="apply_model" compatibility="7.6.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="715" y="289">
    <list key="application_parameters"/>
    </operator>
    <operator activated="true" class="performance_classification" compatibility="7.6.001" expanded="true" height="82" name="Performance (test)" width="90" x="916" y="289">
    <list key="class_weights"/>
    </operator>
    <connect from_op="Process Documents from Files" from_port="example set" to_op="Split Data" to_port="example set"/>
    <connect from_op="Split Data" from_port="partition 1" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
    <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 2"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 3"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_op="Apply Model (2)" to_port="model"/>
    <connect from_op="Optimize Parameters (Grid)" from_port="result 2" to_port="result 1"/>
    <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (test)" to_port="labelled data"/>
    <connect from_op="Performance (test)" from_port="performance" to_port="result 4"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    </process>
    </operator>
    </process>

    Scott

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Just looking through your initial confusion matrix you had some decent results off the bat. What k value were you using? 

     

    In conjuction with what @sgenzer said, optimization is a must step. I would also look at using the Deep Learning algo as well as a LinearSVM too. 

  • luistopsluistops Member Posts: 4 Contributor I

    Thank you for the help i used k=2, now i have one doubt i have 2 result, but i dont understand where its come from, i have this result https://gyazo.com/00d0fc3aa5ff57cc7eb20b8d6bc45bca thats its great.

    And i have this one https://gyazo.com/79891aae0d4958d6d9b790d7bc934557

    Whats is the diferent of this result because i think they use the same algorithm to calculate the result

    Again thanks :)

  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    So there's another tab in the Results view that you need to look at: ParameterSet (Optimize Parameters (Grid)).  This will tell you the value of k (assuming you used my process) that was optimal.

     

    Scott

     

Sign In or Register to comment.