"WordList - ExampleSet conversion problem when using Bayes Classifier - text cat."

m1cros · November 2011

Hey Guys,

I am working on a small project, and basically want to categorize text using a Naive Bayes classifier.

What I have done so far: I created a process that trains the Naive Bayes classifier (on processed data) and serializes the classifier using Write Model.
I have also stored the word list generated by Process Documents from Data (by using WordList to Data and storing the result as an ARFF file).

The process I am working on, and having problems with, is the one that applies the model to new data. I have a file containing a single line of text (the document to be categorized). I feed it into Process Documents from Data, which also needs the aforementioned word list. So I load the word list with Read ARFF and connect it to the word list input port of Process Documents from Data. However, an error is thrown saying 'Expected WordList but received ExampleSet'.

How do I generate a WordList from the data? Or is the way I am storing it incorrect?

Thanks for any help! I really appreciate it!

The following is my XML:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.011">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.011" expanded="true" name="Process">
    <process expanded="true" height="679" width="1079">
      <operator activated="true" class="read_model" compatibility="5.1.011" expanded="true" height="60" name="Read Classifier" width="90" x="447" y="30">
        <parameter key="model_file" value="/Users/Funky/Desktop/Imperial/Third Year/Group Project/adcat/training/bayes_classifier_adcat"/>
      </operator>
      <operator activated="true" class="read_csv" compatibility="5.1.011" expanded="true" height="60" name="Read Data" width="90" x="179" y="255">
        <parameter key="csv_file" value="/Users/Funky/Desktop/Imperial/Third Year/Group Project/adcat/training/bayes_input.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations"/>
        <parameter key="encoding" value="MacRoman"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="document.true.text.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="read_arff" compatibility="5.1.011" expanded="true" height="60" name="Read ARFF" width="90" x="179" y="120">
        <parameter key="data_file" value="/Users/Funky/Desktop/Imperial/Third Year/Group Project/adcat/training/process_documents_word_list.arff"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.1.003" expanded="true" height="76" name="Process Documents from Data" width="90" x="380" y="165">
        <parameter key="vector_creation" value="Term Frequency"/>
        <list key="specify_weights"/>
        <process expanded="true" height="679" width="1079">
          <operator activated="true" class="text:transform_cases" compatibility="5.1.003" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="30"/>
          <operator activated="true" class="text:tokenize" compatibility="5.1.003" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.1.003" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="380" y="30">
            <parameter key="min_chars" value="2"/>
            <parameter key="max_chars" value="80"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.003" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="514" y="30"/>
          <operator activated="true" class="text:stem_snowball" compatibility="5.1.003" expanded="true" height="60" name="Stem (Snowball)" width="90" x="648" y="30"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.1.003" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="782" y="30">
            <parameter key="max_length" value="4"/>
          </operator>
          <connect from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
          <connect from_op="Stem (Snowball)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="apply_model" compatibility="5.1.011" expanded="true" height="76" name="Apply Model" width="90" x="715" y="75">
        <list key="application_parameters"/>
      </operator>
      <connect from_op="Read Classifier" from_port="output" to_op="Apply Model" to_port="model"/>
      <connect from_op="Read Data" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Read ARFF" from_port="output" to_op="Process Documents from Data" to_port="word list"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

MariusHelf · December 2011

Hi m1cros,

In general, your process setup looks fine. However, you should strongly consider storing all objects you create in the repository, using the Store operator or the process context. That way you can retrieve them later with the Retrieve operator for use in other processes. You can store basically anything in the repository: ExampleSets, Models, Performances, WordLists, etc. I don't know how you stored your word list in a file, but storing it in the repository also guarantees that the type of the object is remembered correctly.
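As a rough sketch of what that could look like in the process XML: the first fragment would go into the training process and stores the word list output of Process Documents from Data, and the second would replace Read ARFF in the application process. The operator names "Store Word List" and "Retrieve Word List" and the repository path `//MyRepository/adcat/word_list` are placeholders I made up for illustration, so adjust them to your own repository.

```xml
<!-- Training process (fragment): store the word list as a WordList object. -->
<!-- The repository path is a hypothetical placeholder. -->
<operator activated="true" class="store" compatibility="5.1.011" expanded="true" name="Store Word List">
  <parameter key="repository_entry" value="//MyRepository/adcat/word_list"/>
</operator>
<connect from_op="Process Documents from Data" from_port="word list" to_op="Store Word List" to_port="input"/>

<!-- Application process (fragment): retrieve the stored WordList instead of reading the ARFF file. -->
<operator activated="true" class="retrieve" compatibility="5.1.011" expanded="true" name="Retrieve Word List">
  <parameter key="repository_entry" value="//MyRepository/adcat/word_list"/>
</operator>
<connect from_op="Retrieve Word List" from_port="output" to_op="Process Documents from Data" to_port="word list"/>
```

Because Retrieve hands back the object with its original type, the word list port should receive a WordList rather than the ExampleSet that Read ARFF produces.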

You may also consider storing your data there, because repository access is usually faster than Read CSV. In addition, you can store metadata such as attribute roles in the repository, so you don't have to preprocess your data on each access.

Cheers, Marius
