RapidMiner

Binary text classification - Help in process needed.

Contributor II

Hey guys,

 

We want to do binary classification on a text data set with a class distribution of 80% negative and 20% positive. To get the most statistically reliable estimate, we want to use 10-fold cross-validation.

 

If we model this in RapidMiner, it fails: the process doesn't output any performance metrics (precision, recall, etc.):

 

[Screenshots: Bildschirmfoto 2016-12-01 um 12.14.37.png, Bildschirmfoto 2016-12-01 um 12.15.34.png]

 

 

We found a workaround that runs, but it doesn't make sense from an ML perspective: if we first split the data into training and test sets and then run 10-fold cross-validation, it works. But the training/test split should be part of the cross-validation itself (9 training folds, 1 test fold, 10 iterations). So right now the only way to get this working is to FIRST split into test and training sets and THEN use X-Validation. Did we model it the right way, or did we miss something?

 

[Screenshots: Bildschirmfoto 2016-12-01 um 12.14.37.png, Bildschirmfoto 2016-12-01 um 12.15.01.png, Bildschirmfoto 2016-12-01 um 12.15.34.png]
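
For reference, the fold structure described above (9 training folds, 1 test fold, 10 iterations, no separate split beforehand) can be sketched in plain Python. This is only an illustration of the idea, not RapidMiner's implementation; the function names `stratified_folds` and `cross_validation_splits` and the toy labels are assumptions made up for this example.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign each example index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) once per fold."""
    folds = stratified_folds(labels, k, seed)
    for i in range(k):
        test = folds[i]
        # The other k-1 folds together form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Toy labels with the 80/20 distribution from the question (hypothetical data).
labels = ["negative"] * 80 + ["positive"] * 20
splits = list(cross_validation_splits(labels, k=10))
for train, test in splits:
    positives = sum(labels[i] == "positive" for i in test)
    print(len(train), len(test), positives)
```

Every example lands in a test fold exactly once, so the whole data set is used for testing without any split in front of the cross-validation.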

 

 

If you need any more information for helping us, just comment.

Thank you very much in advance.

 

Best regards!

15 REPLIES
Moderator

Re: Binary text classification - Help in process needed.

OK, silly question, but did you set a label role in your data set?

Elite III

Re: Binary text classification - Help in process needed.

This sounds like a strange problem, but it's very hard to troubleshoot from a screenshot of a process--can you post the process itself for review?  You can export it from the file menu and attach it as a file.  

Thanks,

 

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
Contributor II

Re: Binary text classification - Help in process needed.

Hey T-Bone,

 

Yes, I set a label role ;)

 

Regards,

 

Contributor II

Re: Binary text classification - Help in process needed.

Hey Brian,

 

Thank you for your answer.

Here is the process, which gives me results but makes no sense ;)

It would be great if you could help me. If you need any more information, I am happy to provide it ;)

 

Best regards,
Thiemo

Attachments

Moderator

Re: Binary text classification - Help in process needed.

I would double-check your process. Something doesn't appear to be correct, because I can easily extract precision/recall values and a confusion matrix.

 

See the sample XML below. This process takes tweets, does a bit of preprocessing up front, and generates a random label. The Process Documents from Data operator then converts them to TF-IDF vectors (you can select Binary Term Occurrences instead), and the cross-validation that follows outputs the confusion matrix.

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.3.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.3.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter" width="90" x="112" y="34">
        <parameter key="connection" value="Twitter Connection"/>
        <parameter key="query" value="rapidminer"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.3.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Id|Text"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.3.000" expanded="true" height="82" name="Nominal to Text" width="90" x="380" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="7.3.000" expanded="true" height="82" name="Generate Attributes" width="90" x="514" y="34">
        <list key="function_descriptions">
          <parameter key="label" value="if(rand()&lt;0.5,&quot;good&quot;,&quot;bad&quot;)"/>
        </list>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.3.000" expanded="true" height="82" name="Set Role" width="90" x="648" y="34">
        <parameter key="attribute_name" value="label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.2.001" expanded="true" height="82" name="Process Documents from Data" width="90" x="782" y="34">
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="7.2.001" expanded="true" height="68" name="Tokenize" width="90" x="246" y="34"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="7.3.000" expanded="true" height="145" name="Validation" width="90" x="916" y="34">
        <parameter key="sampling_type" value="stratified sampling"/>
        <process expanded="true">
          <operator activated="true" class="parallel_decision_tree" compatibility="7.3.000" expanded="true" height="82" name="Decision Tree" width="90" x="45" y="34"/>
          <connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
          <description align="left" color="green" colored="true" height="80" resized="true" width="248" x="37" y="137">In the training phase, a model is built on the current training data set. (90 % of data by default, 10 times)</description>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="7.3.000" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="7.3.000" expanded="true" height="82" name="Performance" width="90" x="179" y="34"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
          <connect from_op="Performance" from_port="example set" to_port="test set results"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
          <description align="left" color="blue" colored="true" height="103" resized="true" width="315" x="38" y="137">The model created in the Training step is applied to the current test set (10 %).&lt;br/&gt;The performance is evaluated and sent to the operator results.</description>
        </process>
        <description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a decision tree model.</description>
      </operator>
      <connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Validation" to_port="example set"/>
      <connect from_op="Validation" from_port="performance 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Elite III

Re: Binary text classification - Help in process needed.

Hi @thiemo,

I took your original process, and modified it only by inputting a simple toy example set using the identical Excel format (since I don't have your original dataset).  Then I removed your outer split validation, and ran it again only using the cross-validation that you had as an inner operator.  And it works fine!  Here's the modified process.  So if you are having problems, I suspect it must be something strange related to your original dataset.  There's nothing that appears to be wrong with the process or with the cross-validation operator. Sorry I couldn't be more definitive.

 


Attachments

Elite III

Re: Binary text classification - Help in process needed.

And here's the Excel file I used as input in case you are interested.


Attachments

Contributor II

Re: Binary text classification - Help in process needed.

Hey Brian,

 

Thank you very much for your solution. I downloaded the process and the Excel file and tried it, and it works perfectly, but I do not get the performance metrics such as accuracy, recall, precision, and AUC.

 

How can I use this process and get those four metrics?

 

Regards,


Thiemo

Elite III

Re: Binary text classification - Help in process needed.

Hi @thiemo,

I'm not sure what you mean--those performance metrics are all available in the performance tab output from the process when it runs.  See the attached screenshot.  This is part of the output for the process I supplied with no changes.  Of course, the values are useless with my test examples since there are only 10 of them, but you can see that AUC, accuracy, precision, and recall are all available.  If you run it on a larger dataset then they should all be there.

 

[Screenshot: performance output.PNG]
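
For readers who want to see how the four metrics discussed here relate to the confusion matrix, below is a minimal plain-Python sketch. It is not how RapidMiner computes them internally; the helper names `binary_metrics` and `auc`, the toy labels, the confidence scores, and the 0.5 threshold are all assumptions chosen for illustration. The AUC helper uses the pairwise ranking definition and assumes both classes are present.

```python
def binary_metrics(actual, predicted, positive="positive"):
    """Accuracy, precision, and recall from the confusion matrix counts."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Returning 0.0 when the denominator is zero is a convention chosen here.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

def auc(actual, scores, positive="positive"):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 2 positive and 8 negative examples.
actual = ["positive"] * 2 + ["negative"] * 8
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1, 0.1, 0.2, 0.3, 0.05]  # confidence(positive)
predicted = ["positive" if s >= 0.5 else "negative" for s in scores]

metrics = binary_metrics(actual, predicted)
auc_value = auc(actual, scores)
print(metrics, auc_value)
```

With an 80/20 class distribution, accuracy alone can look good even when few positives are found, which is why precision, recall, and AUC are worth reading alongside it.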

 

 
