generate a subset of wordlist based on a given weight table

winecoding · September 2016

I have generated a wordlist file by processing a document corpus. The following is a screenshot of part of the wordlist file. There are around 15,000 rows (15,000 different tokenized words). Based on a feature selection method, I already have a list of words that should be kept. This list contains only 500 words and is saved in the weight object. How can I join these two items, the wordlist and the weight table, to generate a short wordlist with only 500 rows?
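
Outside RapidMiner, the join being asked for amounts to keeping only the wordlist rows whose word appears in the weight table. Below is a minimal Python sketch of that idea, assuming both objects have been exported to row form. The column names `word`, `total`, and `weight` and all sample values are hypothetical, not taken from the actual files.

```python
# Hypothetical exported wordlist: one dict per tokenized word.
wordlist = [
    {"word": "wine", "total": 120},
    {"word": "the", "total": 9800},
    {"word": "tannin", "total": 45},
    {"word": "of", "total": 7200},
]

# Hypothetical weight table: only the words kept by feature selection.
weights = {"wine": 0.91, "tannin": 0.77}


def filter_wordlist(wordlist, weights):
    """Keep the wordlist rows whose word has an entry in the weight table."""
    return [row for row in wordlist if row["word"] in weights]


short_wordlist = filter_wordlist(wordlist, weights)
print([row["word"] for row in short_wordlist])
```

The result is still a plain table, not a RapidMiner wordlist object, which is the gap the replies below discuss.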

MartinLiebig · September 2016

Hi,

It's relatively easy to filter the wordlist and get an example set containing only the words that fulfill a weight requirement. However, I don't know a way, other than Execute Script, to turn this into a wordlist again.

Could you explain why you need to do this, and why using Select by Weights on the resulting table is not an option?

~Martin

winecoding · September 2016

Hi Mschmitz,

Thanks for the reply.

The following is the current prediction script. The Retrieve operator (circled in red) retrieves the original wordlist, which has, for instance, about 15,000 words. Combining it with the stored weights reduces the size of the example set passed to the Apply Model operator. However, I would like to reduce the original wordlist offline instead, that is, derive the reduced wordlist from the stored weight table before launching this prediction script. The wordlist passed in would then already be filtered to about 500 words based on the top weights, and I would not need the part circled in yellow at all.

awchisholm · September 2016

It's possible to make a wordlist from an example set containing 500 examples, each representing a word, as follows:

Convert the attribute containing the word to text using `Nominal to Text`
Use `Process Documents from Data` on this (no need to tokenize inside this operator)

Here's an example

<?xml version="1.0" encoding="UTF-8"?><process version="7.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_nominal_data" compatibility="7.2.000" expanded="true" height="68" name="Generate Nominal Data" width="90" x="179" y="85"/>
      <operator activated="true" class="nominal_to_text" compatibility="7.2.000" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="85">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="att1"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.2.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="85">
        <list key="specify_weights"/>
        <process expanded="true">
          <connect from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Generate Nominal Data" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents from Data" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Thomas_Ott · September 2016

Heh, I hadn't thought about doing it this way, but I think that works. You can then pass the weighted words back into a Process Documents from Data operator and output the word list for scoring. Sweet!

MartinLiebig · September 2016

Hi Andrew,

I do not think that this works as it should, since you would need to process the whole data set again. If you only pass in the list of attributes as an example set, you will not get proper normalization factors for TF-IDF.
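
To illustrate the concern with a toy calculation: IDF depends on how many documents contain a term, so a "corpus" made of one-word rows gives every word the same IDF. The sketch below uses the textbook formula `idf = log(N / df)`, which is an assumption here and not necessarily the exact normalization RapidMiner applies. The documents are made up.

```python
import math


def idf(term, docs):
    """Textbook inverse document frequency: log(N / df)."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)


# Made-up original corpus of tokenized documents.
corpus = [
    ["wine", "tannin", "red"],
    ["wine", "white"],
    ["wine", "red", "oak"],
    ["beer", "hops"],
]

# The reduced word list fed back in as one-word "documents".
word_rows = [["wine"], ["tannin"], ["red"]]

kept = ["wine", "tannin", "red"]
corpus_idf = {w: idf(w, corpus) for w in kept}
rows_idf = {w: idf(w, word_rows) for w in kept}

print(corpus_idf)  # values differ per word
print(rows_idf)    # every word gets the same value
```

In the one-word-per-row case the IDF carries no information about how common each word was in the original documents.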

~Martin

Thomas_Ott · September 2016

I don't know what the OP's use case is, but maybe they don't need TF-IDF; maybe they can use binary occurrences?

winecoding · September 2016

Thank you for your response, let me try your suggestions. I use binary occurrence.
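
With binary occurrences, each vector entry only records whether a word from the list appears in the document, so no corpus-wide statistics are involved. A rough Python sketch of that idea follows, using a hypothetical three-word list and whitespace splitting as a stand-in for whatever tokenization the real process uses:

```python
def binary_occurrences(tokens, vocabulary):
    """Return 1 for each vocabulary word present in tokens, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical reduced word list and a made-up document.
vocabulary = ["wine", "tannin", "oak"]
doc = "this red wine has soft tannin and more wine".split()

vector = binary_occurrences(doc, vocabulary)
print(vector)
```

Repeated words ("wine" appears twice) still map to 1, which is why the reduced list alone is enough to build the vector.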

awchisholm · September 2016

Hello Martin

It's not exactly clear why the OP wants to do this, but the technique definitely works if you want to create a word list from an example set that was originally derived from a word list but has been reduced in some way. The last book chapter I wrote did this extensively.

regards

Andrew

winecoding · September 2016

Hi awchisholm,

Thank you for the reply. I just have one question regarding this approach.

I am saving the generated weight object to a CSV file, so I can keep the top 500 words and turn them into a text data file (each row represents a file) for RapidMiner to process. However, generating the wordlist object requires the example set to have class information, and the weight file itself has no label information. The original training process was built for a six-class categorization task. How can I resolve this discrepancy?
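
For reference, the export step described here could look roughly like the following outside RapidMiner. This is only a sketch: the CSV layout (columns `attribute` and `weight`) is an assumption about how the weight object was saved, the sample rows are made up, and it does not address the label question.

```python
import csv
import io


def top_words(csv_text, n):
    """Return the n highest-weighted words from a weights CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: float(r["weight"]), reverse=True)
    return [r["attribute"] for r in rows[:n]]


# Made-up sample; a real file would be read from disk instead.
sample = "attribute,weight\nwine,0.91\nthe,0.02\ntannin,0.77\noak,0.40\n"

words = top_words(sample, 2)   # use 500 for the real weight file
text_rows = "\n".join(words)   # one word per row
print(text_rows)
```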
