[SOLVED] Join wordlists

kasper2304 · March 2013

Hi

I read a paper yesterday saying that it can sometimes be wise to do feature extraction separately on each class when doing text analysis. I did this by using two Process Documents from Files nodes with the same setup on both, and then merging the resulting example sets. The results were very good!

So now my problem is that I want to merge the two wordlists in order to apply the wordlist on the entire corpus, but I simply cannot figure out how to do it... Any suggestions?

I can see that the same question was posted in another thread in January without any answer...

Best
Kasper

MariusHelf · March 2013

Hi Kasper,

there is currently no way to combine the actual wordlist outputs (wor) of Process Documents. But you are probably trying to combine the example set outputs (exa), right? You have probably tried Append, which does not work because the two sets contain different attributes. Try a combination of Union and Replace Missing Values instead!
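
To illustrate the idea behind Union followed by Replace Missing Values, here is a rough Python sketch. It is not RapidMiner code: the function name `union_fill_zero`, the sample words, and the weights are made up for illustration, and each example is modelled as a plain dict of word attributes only (no label or metadata).

```python
def union_fill_zero(set_a, set_b):
    """Stack two lists of {attribute: value} rows; absent attributes become 0."""
    rows = set_a + set_b
    # Collect every attribute seen in either set, keeping first-seen order.
    attributes = []
    for row in rows:
        for name in row:
            if name not in attributes:
                attributes.append(name)
    # Rebuild each row over the full attribute list, defaulting to 0,
    # which plays the role of replacing missing values with zero.
    return [{name: row.get(name, 0) for name in attributes} for row in rows]


# Hypothetical word weights for one positive and one negative document.
positives = [{"great": 0.7, "service": 0.3}]
negatives = [{"slow": 0.9, "service": 0.1}]

merged = union_fill_zero(positives, negatives)
for row in merged:
    print(row)
```

Every row ends up with the same set of attributes, which is what a learner needs, and a word that never occurred in one class simply gets weight zero there.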

By the way, when referencing other posts, a link would be helpful.

Best regards,
Marius

kasper2304 · March 2013

Hi Marius.

Well, I am actually trying to find a way to combine the wordlists themselves, because I need the combined list later to build the corpus I want to apply my model to. I did combine the example sets as you suggest and trained a model on the result, with very good results on my test set: precision and recall went from 60% to around 90% with a linear SVM, tf-idf, and a downsampled training set of 286 positives and 286 negatives. But if I cannot extract the exact same word vector from my entire corpus, then my new method is of no use... :/

But... thinking about it, what I might actually want to do is to also create two example sets from my corpus, and then merge them in the same manner as I do with my training set... Am I right?

The link to the other post is below, as well as my setup for creating one training set from two Process Documents from Files nodes.
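
The requirement behind both paragraphs above is that the corpus must be turned into vectors over exactly the same words as the training set. A small Python sketch of that idea follows. It is only an illustration and not a RapidMiner process: `vectorize`, `training_vocabulary`, and the sample texts are hypothetical, and a lowercase whitespace split stands in for the Tokenize and Transform Cases operators.

```python
from collections import Counter


def vectorize(text, vocabulary):
    """Count occurrences of each vocabulary word in text; other words are ignored."""
    counts = Counter(text.lower().split())
    # A Counter returns 0 for words it has never seen, so vocabulary words
    # missing from the text get a zero entry automatically.
    return [counts[word] for word in vocabulary]


# Hypothetical word list obtained from the training set.
training_vocabulary = ["fox", "lazy", "quick"]

# Hypothetical unseen documents from the full corpus.
corpus = ["The quick quick fox", "A lazy cat"]

vectors = [vectorize(doc, training_vocabulary) for doc in corpus]
print(vectors)
```

Words that occur only in the corpus ("cat" here) are dropped, and the column order always follows the training word list, so a model trained on the training vectors can be applied directly.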

http://rapid-i.com/rapidforum/index.php/topic,6086.0.html

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.000" expanded="true" name="Process">
    <process expanded="true" height="741" width="1016">
      <operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" height="76" name="Process dataset" width="90" x="45" y="30">
        <list key="text_directories">
          <parameter key="1" value="/Users/kasper2304/Desktop/Msc. Marketing Thesis/Modeling/Trainingset/positive_cases"/>
        </list>
        <parameter key="prune_method" value="percentual"/>
        <parameter key="prunde_below_percent" value="1.0"/>
        <parameter key="prune_above_percent" value="99.0"/>
        <process expanded="true" height="759" width="765">
          <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.2.004" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="120"/>
          <operator activated="true" class="text:extract_token_number" compatibility="5.2.004" expanded="true" height="60" name="Extract Token Number" width="90" x="45" y="210"/>
          <operator activated="true" class="text:extract_length" compatibility="5.2.004" expanded="true" height="60" name="Extract Length" width="90" x="45" y="300"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.2.004" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="179" y="300">
            <parameter key="max_length" value="3"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="179" y="210"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Extract Token Number" to_port="document"/>
          <connect from_op="Extract Token Number" from_port="document" to_op="Extract Length" to_port="document"/>
          <connect from_op="Extract Length" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="5.3.000" expanded="true" height="76" name="Filter .DS_Store" width="90" x="179" y="30">
        <parameter key="condition_class" value="attribute_value_filter"/>
        <parameter key="parameter_string" value="metadata_file = .DS_Store"/>
        <parameter key="invert_filter" value="true"/>
      </operator>
      <operator activated="true" class="text:process_document_from_file" compatibility="5.2.004" expanded="true" height="76" name="Process dataset (2)" width="90" x="45" y="120">
        <list key="text_directories">
          <parameter key="0" value="/Users/kasper2304/Desktop/Msc. Marketing Thesis/Modeling/Trainingset/negative_cases"/>
        </list>
        <parameter key="prune_method" value="percentual"/>
        <parameter key="prunde_below_percent" value="1.0"/>
        <parameter key="prune_above_percent" value="99.0"/>
        <process expanded="true" height="789" width="805">
          <operator activated="true" class="text:tokenize" compatibility="5.2.004" expanded="true" height="60" name="Tokenize (2)" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.2.004" expanded="true" height="60" name="Transform Cases (2)" width="90" x="45" y="120"/>
          <operator activated="true" class="text:extract_token_number" compatibility="5.2.004" expanded="true" height="60" name="Extract Token Number (2)" width="90" x="45" y="210"/>
          <operator activated="true" class="text:extract_length" compatibility="5.2.004" expanded="true" height="60" name="Extract Length (2)" width="90" x="45" y="300"/>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.2.004" expanded="true" height="60" name="Generate n-Grams (2)" width="90" x="179" y="300">
            <parameter key="max_length" value="3"/>
          </operator>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.004" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="179" y="210"/>
          <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Extract Token Number (2)" to_port="document"/>
          <connect from_op="Extract Token Number (2)" from_port="document" to_op="Extract Length (2)" to_port="document"/>
          <connect from_op="Extract Length (2)" from_port="document" to_op="Generate n-Grams (2)" to_port="document"/>
          <connect from_op="Generate n-Grams (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="5.3.000" expanded="true" height="76" name="Filter .DS_Store (2)" width="90" x="179" y="120">
        <parameter key="condition_class" value="attribute_value_filter"/>
        <parameter key="parameter_string" value="metadata_file = .DS_Store"/>
        <parameter key="invert_filter" value="true"/>
      </operator>
      <operator activated="true" class="union" compatibility="5.3.000" expanded="true" height="76" name="Union (2)" width="90" x="313" y="75"/>
      <operator activated="true" class="replace_missing_values" compatibility="5.3.000" expanded="true" height="94" name="Replace Missing Values" width="90" x="447" y="75">
        <parameter key="numeric_condition" value="&quot;?&quot;"/>
        <parameter key="default" value="zero"/>
        <list key="columns"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.3.000" expanded="true" height="76" name="Set Role" width="90" x="581" y="75">
        <parameter key="name" value="label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="store" compatibility="5.3.000" expanded="true" height="60" name="Store BoW" width="90" x="715" y="75">
        <parameter key="repository_entry" value="Pre-processing/Misc/FullDataset_TEST"/>
      </operator>
      <connect from_op="Process dataset" from_port="example set" to_op="Filter .DS_Store" to_port="example set input"/>
      <connect from_op="Filter .DS_Store" from_port="example set output" to_op="Union (2)" to_port="example set 1"/>
      <connect from_op="Process dataset (2)" from_port="example set" to_op="Filter .DS_Store (2)" to_port="example set input"/>
      <connect from_op="Filter .DS_Store (2)" from_port="example set output" to_op="Union (2)" to_port="example set 2"/>
      <connect from_op="Union (2)" from_port="union" to_op="Replace Missing Values" to_port="example set input"/>
      <connect from_op="Replace Missing Values" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Store BoW" to_port="input"/>
      <connect from_op="Store BoW" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

awchisholm · March 2013

Hello

I have merged word lists by doing some gymnastics as follows:

1. Convert the word lists to example sets
2. Keep only the word attribute in the example sets
3. Append the example sets
4. Remove duplicates
5. Convert the word attribute to type text
6. Create a word vector from this using Process Documents from Data

Here's a simple example - hope it helps

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.005">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.005" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
        <parameter key="text" value="the quick brown fox jumped over the lazy dog&#10;"/>
        <parameter key="add label" value="true"/>
        <parameter key="label_value" value="positive"/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="300">
        <parameter key="text" value="the lazy dog was jumped over by the quick fox"/>
        <parameter key="add label" value="true"/>
        <parameter key="label_value" value="negative"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="94" name="Process Documents" width="90" x="45" y="120">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize (3)" width="90" x="179" y="120"/>
          <connect from_port="document" to_op="Tokenize (3)" to_port="document"/>
          <connect from_op="Tokenize (3)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:wordlist_to_data" compatibility="5.3.000" expanded="true" height="76" name="WordList to Data" width="90" x="179" y="30"/>
      <operator activated="true" class="select_attributes" compatibility="5.3.005" expanded="true" height="76" name="Select Attributes" width="90" x="179" y="120">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|word"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="94" name="Process Documents (2)" width="90" x="45" y="390">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" name="Tokenize (4)"/>
          <connect from_port="document" to_op="Tokenize (4)" to_port="document"/>
          <connect from_op="Tokenize (4)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:wordlist_to_data" compatibility="5.3.000" expanded="true" height="76" name="WordList to Data (2)" width="90" x="179" y="300"/>
      <operator activated="true" class="select_attributes" compatibility="5.3.005" expanded="true" height="76" name="Select Attributes (2)" width="90" x="179" y="390">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|word"/>
      </operator>
      <operator activated="true" class="append" compatibility="5.3.005" expanded="true" height="94" name="Append" width="90" x="313" y="210"/>
      <operator activated="true" class="remove_duplicates" compatibility="5.3.005" expanded="true" height="76" name="Remove Duplicates" width="90" x="447" y="210"/>
      <operator activated="true" class="nominal_to_text" compatibility="5.3.005" expanded="true" height="76" name="Nominal to Text" width="90" x="581" y="210"/>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.000" expanded="true" height="76" name="Process Documents from Data" width="90" x="715" y="210">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize (5)" width="90" x="648" y="435"/>
          <connect from_port="document" to_op="Tokenize (5)" to_port="document"/>
          <connect from_op="Tokenize (5)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Create Document (2)" from_port="output" to_op="Process Documents (2)" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="word list" to_op="WordList to Data" to_port="word list"/>
      <connect from_op="WordList to Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Append" to_port="example set 1"/>
      <connect from_op="Process Documents (2)" from_port="word list" to_op="WordList to Data (2)" to_port="word list"/>
      <connect from_op="WordList to Data (2)" from_port="example set" to_op="Select Attributes (2)" to_port="example set input"/>
      <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Append" to_port="example set 2"/>
      <connect from_op="Append" from_port="merged set" to_op="Remove Duplicates" to_port="example set input"/>
      <connect from_op="Remove Duplicates" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
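
For readers who want to see the same merge logic outside RapidMiner, here is a rough Python sketch of the append and remove-duplicates steps, using the two sentences from the process above. The function name `merge_wordlists` is made up for illustration, and a plain whitespace split stands in for the Tokenize operator.

```python
def merge_wordlists(*wordlists):
    """Append word lists and drop duplicates, keeping first-seen order."""
    seen = set()
    merged = []
    for wordlist in wordlists:
        for word in wordlist:
            if word not in seen:
                seen.add(word)
                merged.append(word)
    return merged


# Same sample texts as in the two Create Document operators.
positive_words = "the quick brown fox jumped over the lazy dog".split()
negative_words = "the lazy dog was jumped over by the quick fox".split()

merged = merge_wordlists(positive_words, negative_words)
print(merged)
```

The result contains each word once, whichever list it came from. Only the words themselves are carried over here, which mirrors keeping only the word attribute in the process.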

regards

Andrew

MariusHelf · March 2013

Thank you! Creativity often pays off, even when doing seemingly straightforward analysis stuff.

kasper2304 · March 2013

I consider this question answered.

Thanks a lot for the help guys!

Altair RapidMiner Community