Predicting whether a product is a beverage or not using a csv

luiz_vidal · January 2018

Hi guys,

I am quite new to RapidMiner, and here is my "problem".

I have a CSV file with 2 columns (Desc, the description, and Bebida, 0 or 1), and I want to build a process that predicts from the description whether a product is a beverage ("bebida" in Portuguese). This is what I have so far:

[Image: My process]

After this transformation I feed the result into a Random Forest, but somehow I'm not able to tell it which column is the one to predict. I also tried Naive Bayes. The choice of algorithm itself isn't the issue; what I need is a way to turn the processed documents back into data so I can use them for the prediction. Can someone help me do this the right way? I'm kind of stuck. Thanks in advance.
The XML of my process follows below.

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Bebidas_100" width="90" x="45" y="34">
        <parameter key="repository_entry" value="../../Workbooks/Bebidas_100"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="8.0.001" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="85">
        <list key="function_descriptions">
          <parameter key="Description" value="lower(Desc)"/>
          <parameter key="É Bebida" value="if(Bebida==0,&quot;Não&quot;,&quot;Sim&quot;)"/>
        </list>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="8.0.001" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="136">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="Bebida.is_not_missing."/>
        </list>
      </operator>
      <operator activated="true" class="replace" compatibility="8.0.001" expanded="true" height="82" name="Replace" width="90" x="447" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Description"/>
        <parameter key="attributes" value="Description|É Bebida"/>
        <parameter key="regular_expression" value="[a-z]"/>
        <parameter key="replace_what" value="[-!0-9&quot;#$%&amp;'()*+,./:;&lt;=&gt;?@\[\\\]_`{|}~]"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="8.0.001" expanded="true" height="82" name="Nominal to Text" width="90" x="581" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Description"/>
        <parameter key="attributes" value="Description|É Bebida"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents" width="90" x="715" y="136">
        <list key="specify_weights"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="7.5.000" expanded="true" height="103" name="Process Documents" width="90" x="916" y="34">
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34"/>
          <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.5.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="514" y="34">
            <parameter key="file" value="C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\stopwords.txt"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
          <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Retrieve Bebidas_100" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Replace" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Thomas_Ott · January 2018

Your process is not quite what I'm used to when building text processing in RapidMiner. I don't understand what the Replace operator is doing. Is it supposed to help the tokenization? If so, you can select 'specify parameters' in the Tokenize operator and paste the characters in there.

Rearranging it, I would do something like this.

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="7.6.003" expanded="true" height="68" name="Retrieve Bebidas_100" width="90" x="45" y="34">
        <parameter key="repository_entry" value="../../Workbooks/Bebidas_100"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.6.003" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="34">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="Bebida.is_not_missing."/>
        </list>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="7.6.003" expanded="true" height="82" name="Generate Attributes" width="90" x="313" y="34">
        <list key="function_descriptions">
          <parameter key="Description" value="lower(Desc)"/>
          <parameter key="É Bebida" value="if(Bebida==0,&quot;Não&quot;,&quot;Sim&quot;)"/>
        </list>
      </operator>
      <operator activated="true" class="replace" compatibility="7.6.003" expanded="true" height="82" name="Replace" width="90" x="581" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Description"/>
        <parameter key="attributes" value="Description|É Bebida"/>
        <parameter key="regular_expression" value="[a-z]"/>
        <parameter key="replace_what" value="[-!0-9&quot;#$%&amp;'()*+,./:;&lt;=&gt;?@\[\\\]_`{|}~]"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.6.003" expanded="true" height="82" name="Nominal to Text" width="90" x="715" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Description"/>
        <parameter key="attributes" value="Description|É Bebida"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="849" y="34">
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34"/>
          <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.5.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="514" y="34">
            <parameter key="file" value="C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\stopwords.txt"/>
          </operator>
          <operator activated="false" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="313" y="85">
            <description align="center" color="transparent" colored="false" width="126">You can save yourself one Generate Attributes entry by using this operator to lower the case of your text</description>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
          <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_role" compatibility="7.6.003" expanded="true" height="82" name="Set Role" width="90" x="447" y="34">
        <parameter key="attribute_name" value="É Bebida"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="concurrency:cross_validation" compatibility="7.6.003" expanded="true" height="145" name="Validation" width="90" x="983" y="34">
        <parameter key="sampling_type" value="stratified sampling"/>
        <process expanded="true">
          <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.003" expanded="true" height="82" name="Decision Tree" width="90" x="45" y="34"/>
          <connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
          <description align="left" color="green" colored="true" height="80" resized="true" width="248" x="37" y="137">In the training phase, a model is built on the current training data set. (90 % of data by default, 10 times)</description>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model" compatibility="7.6.003" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance" compatibility="7.6.003" expanded="true" height="82" name="Performance" width="90" x="179" y="34"/>
          <connect from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
          <connect from_op="Performance" from_port="example set" to_port="test set results"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
          <description align="left" color="blue" colored="true" height="103" resized="true" width="315" x="38" y="137">The model created in the Training step is applied to the current test set (10 %).&lt;br/&gt;The performance is evaluated and sent to the operator results.</description>
        </process>
        <description align="center" color="transparent" colored="false" width="126">A cross-validation evaluating a decision tree model.</description>
      </operator>
      <connect from_op="Retrieve Bebidas_100" from_port="output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
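      <connect from_op="Generate Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Replace" to_port="example set input"/>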
      <connect from_op="Replace" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Validation" to_port="example set"/>
      <connect from_op="Validation" from_port="model" to_port="result 2"/>
      <connect from_op="Validation" from_port="performance 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Of course, swap out the Decision Tree for whichever algorithm you want. This process passes the label (built from the '0' and '1' classes) to the text processing and then trains on it inside a Cross Validation. You might get horrible accuracy on the first pass, but adjusting the pruning, trying other algorithms, and running parameter optimization all help.
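On the tokenization tip earlier in this reply, here is a rough sketch of a Tokenize operator that splits on an explicit character list, so a separate Replace step may not be needed. The `mode` and `characters` keys are written from memory of the Text Processing extension, and the character list is purely illustrative, so check both in Studio before relying on them. As far as I know, the default mode (non letters) already splits on digits and punctuation, which may be enough for descriptions like "coke300ml".

```xml
<!-- Sketch only: parameter keys are assumed, the character list is a made-up example. -->
<operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
  <parameter key="mode" value="specify characters"/>
  <parameter key="characters" value=" .:,;-_@/0123456789"/>
</operator>
```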

kayman · January 2018

Did you set a label? The attribute you want to predict needs to be given the label role with the Set Role operator. Otherwise the system has no clue which attribute to predict.
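For reference, a minimal sketch of such a Set Role operator, using the attribute name from the processes posted in this thread. It assumes the column to predict is `É Bebida`; layout attributes are omitted, and the operator would sit before the text-processing step, as in Thomas_Ott's process.

```xml
<!-- Marks "É Bebida" as the label, i.e. the attribute the learner should predict. -->
<operator activated="true" class="set_role" compatibility="8.0.001" expanded="true" name="Set Role">
  <parameter key="attribute_name" value="É Bebida"/>
  <parameter key="target_role" value="label"/>
  <list key="set_additional_roles"/>
</operator>
```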

luiz_vidal · January 2018

Yes, I did set a label; the process even suggests it as a "fix". The problem is that I want to predict the field "Bebida", and this field doesn't come along after the Process Documents operator. I have the description field (which can be 'AAAAA BBBBB CCCCC'), and I run some cleansing that turns it into 'AAAA BBBB', for example. Then I transform it into documents, tokenize it, and pass it through a stop-word filter. Once it comes out of Process Documents, I want to predict whether 'AAAA BBBB' is a yes or a no. That's it.

luiz_vidal · January 2018

Thomas, thanks for your reply.

Well, the Replace operator cleans up the description column a bit. The problem is that users are usually allowed to register anything in this field. For a Coke (Coca-Cola here in Brazil), for example, they might enter coke 300ml, coca-cola, coke1l, coke pack, coke@, coke.fanta and so on. This first Replace just removes the special characters to make things easier for the tokenizer. After the tokenizer I also remove stop words (ml for milliliter, a, g, l, etc.), so when Process Documents finishes I have a cleaner product description from which to classify whether it is a beverage or not (0 = no, 1 = yes).

In your experience, should I generate a binominal field with yes and no instead of 0 and 1?

luiz_vidal · January 2018

Thomas,

I imported your XML, and it is exactly what I was trying to do but couldn't.

The funny thing now is that the classifier reaches 100% accuracy, which I suspect is not a good sign. Am I right?

My data is a simple product description column (as in my examples: coke, broomstick, water, and so on) that needs to be classified into a Yes or No category. Which algorithms would be best for this? And if I provided a subset of my data, could you help me find out what I'm doing wrong in my process, or help define the right sequence of operators to classify correctly? As I said, I imagine 100% doesn't sound good.

Thomas_Ott · January 2018

My preference is to use 'yes' and 'no' instead of 1 and 0, but that's me. You can change that in your Generate Attributes operator by putting in "yes" and "no".
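A minimal sketch of that change, based on the Generate Attributes operator already in the posted process. Only the two string literals differ from the original expression; layout attributes are omitted.

```xml
<!-- Same expression as in the original process, with "no"/"yes" in place of "Não"/"Sim". -->
<operator activated="true" class="generate_attributes" compatibility="8.0.001" expanded="true" name="Generate Attributes">
  <list key="function_descriptions">
    <parameter key="Description" value="lower(Desc)"/>
    <parameter key="É Bebida" value="if(Bebida==0,&quot;no&quot;,&quot;yes&quot;)"/>
  </list>
</operator>
```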

luiz_vidal · January 2018

Thomas,

I have another question, now regarding my data.

The algorithm runs fine, and the accuracy is 100% if I run a decision tree. The data is unbalanced, but that is the nature of the data: across a wide range of products, around 5-10% will be beverages, while the others might be clothes, food, etc.

What would be the best way to split the data to get a more reliable accuracy estimate?

Thomas_Ott · January 2018

Well, 100% means you're overfitting if you're using a Decision Tree. I just slapped that in there as an example.

What you need to do is balance the data better. You can either try the SMOTE operator from the Operator Toolbox, or use some macros to extract the class counts and pass those macros to a Sample operator. Note: the right thing to do is to put the Sample operator (or SMOTE) inside the training side of the Cross Validation, not outside, so that the test folds keep the original class distribution.
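A rough sketch of what the training side of the Cross Validation could look like with a Sample operator placed in front of the learner. The per-class sizes are hard-coded placeholders rather than values taken from macros, and the class names `Sim` and `Não` assume the label from the process above. As far as I know, absolute sampling cannot draw more examples than a class contains, so the sizes need to fit the smaller class in each training fold.

```xml
<!-- Training subprocess only; the testing subprocess stays unchanged. -->
<!-- Assumption: the sizes (5 per class) are placeholders, not tuned values. -->
<process expanded="true">
  <operator activated="true" class="sample" compatibility="7.6.003" expanded="true" height="82" name="Sample" width="90" x="45" y="34">
    <parameter key="sample" value="absolute"/>
    <parameter key="balance_data" value="true"/>
    <list key="sample_size_per_class">
      <parameter key="Sim" value="5"/>
      <parameter key="Não" value="5"/>
    </list>
    <list key="sample_ratio_per_class"/>
    <list key="sample_probability_per_class"/>
  </operator>
  <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="7.6.003" expanded="true" height="82" name="Decision Tree" width="90" x="179" y="34"/>
  <connect from_port="training set" to_op="Sample" to_port="example set input"/>
  <connect from_op="Sample" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
  <connect from_op="Decision Tree" from_port="model" to_port="model"/>
  <portSpacing port="source_training set" spacing="0"/>
  <portSpacing port="sink_model" spacing="0"/>
  <portSpacing port="sink_through 1" spacing="0"/>
</process>
```

If the label values are changed to "yes" and "no", the keys in `sample_size_per_class` would need to change to match.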
