generate a subset of wordlist based on a given weight table

winecoding Member Posts: 6 Contributor I
edited November 2018 in Help

I have generated a wordlist file by processing a document corpus. The following is a screenshot of part of the wordlist file. There are around 15000 rows (15000 different tokenized words). Based on a feature selection method, I already have a list of words that should be kept. This list contains only 500 words and is saved in the weight object. How can I join these two items, a wordlist and a weight table, to generate a short wordlist that has only 500 rows?

 

Capture.JPG


Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

     

    It's relatively easy to filter the wordlist and get an example set with only those words fulfilling a weight requirement. However, I don't know a way, other than Execute Script, to turn this into a wordlist again.

    Could you explain why you need to do this, and why it is not an option to use Select by Weights on the resulting table?
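
To illustrate the filtering step outside of RapidMiner, here is a minimal Python sketch. It assumes the wordlist and the weights have already been exported to plain data; the sample rows, the `min_weight` threshold, and the `filter_wordlist` helper are hypothetical and not part of any RapidMiner API. It only shows the join logic and does not produce a RapidMiner wordlist object.

```python
# Hypothetical export of a few word-list rows (word, total and document occurrences).
wordlist = [
    {"word": "grape", "total": 40, "documents": 25},
    {"word": "oak", "total": 15, "documents": 12},
    {"word": "the", "total": 900, "documents": 300},
    {"word": "wine", "total": 120, "documents": 80},
]

# Hypothetical weight table from a feature selection step.
weights = {"wine": 0.91, "grape": 0.75, "the": 0.01, "oak": 0.40}


def filter_wordlist(wordlist, weights, min_weight):
    """Keep only the word-list rows whose weight meets the requirement.

    Words that are missing from the weight table are treated as weight 0.
    """
    return [
        row for row in wordlist
        if weights.get(row["word"], 0.0) >= min_weight
    ]


reduced = filter_wordlist(wordlist, weights, min_weight=0.5)
print([row["word"] for row in reduced])
```

The result keeps the original word-list order, so the reduced table can still be compared row by row with the full one.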

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • winecoding Member Posts: 6 Contributor I

    Hi Mschmitz,

     

    Thanks for the reply.

     

    The following is the current prediction script. The Retrieve operator (circled in red) retrieves the original wordlist, which has, for instance, about 15000 words. After combining it with the stored weights, the example set passed to the Apply Model operator has a reduced size. However, I would like to reduce the original wordlist offline instead: for instance, build the reduced wordlist from the stored weight table before launching this prediction script. The wordlist passed in would then be a filtered one, with about 500 words based on the top weights, and I would not need to include the part circled in yellow at all.

    Capture.JPG

     

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn

    It's possible to make a wordlist from an example set containing 500 examples, each representing a word, as follows:

     

    1. Convert the attribute containing the word to text using `Nominal to Text`
    2. Use `Process Documents from Data` on this (no need to tokenize inside this operator)

     

    Here's an example 

    <?xml version="1.0" encoding="UTF-8"?><process version="7.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_nominal_data" compatibility="7.2.000" expanded="true" height="68" name="Generate Nominal Data" width="90" x="179" y="85"/>
    <operator activated="true" class="nominal_to_text" compatibility="7.2.000" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="85">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="att1"/>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.2.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="85">
    <list key="specify_weights"/>
    <process expanded="true">
    <connect from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Generate Nominal Data" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
    <connect from_op="Process Documents from Data" from_port="word list" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>
  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Heh, I hadn't thought about doing it this way, but I think that works. You can then pass the weighted words back into a Process Documents from Data operator and output the WordList for scoring. Sweet!

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi Andrew,

     

    I do not think this works as it should, since you would need to process the whole data again. If you only throw in the list of attributes as an example set, you would not get proper normalization factors for TF/IDF.
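
The concern can be illustrated with a small Python sketch. It uses the textbook definition idf = log(N / df), which is an assumption and may differ from the exact normalization RapidMiner applies; the toy corpus and the helper names are made up for illustration.

```python
import math


def idf(documents):
    """Textbook inverse document frequency, log(N / df), per term."""
    n = len(documents)
    df = {}
    for doc in documents:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n / count) for term, count in df.items()}


def binary_vector(doc, vocabulary):
    """Binary occurrence vector: 1 if the term appears in the document, else 0."""
    return [1 if term in doc else 0 for term in vocabulary]


# Toy "original corpus" of tokenized documents (hypothetical).
corpus = [
    ["wine", "grape", "oak"],
    ["wine", "grape"],
    ["wine", "oak"],
    ["wine"],
]

# The same vocabulary fed in as one word per row, as in the workaround.
word_rows = [["wine"], ["grape"], ["oak"]]

corpus_idf = idf(corpus)   # reflects how common each word is in the corpus
rows_idf = idf(word_rows)  # every word appears in exactly one "document"

vocabulary = sorted(rows_idf)
print(corpus_idf)
print(rows_idf)
print(binary_vector(["wine", "oak"], vocabulary))
```

With one word per row, every term ends up with the same IDF, so the document frequencies of the original corpus are lost. A binary occurrence vector depends only on the vocabulary, so it is not affected by this.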

     

    ~Martin

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    I don't know what the OP's use case is, but maybe they don't need TF/IDF; maybe they can use Binary Occurrences?

  • winecoding Member Posts: 6 Contributor I

    Thank you for your response, let me try your suggestions. I use binary occurrence. 

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn

    Hello Martin

     

    It's not exactly clear why the OP wants to do this, but the technique definitely works if you want to create a word list from an example set that was originally derived from a word list but has been reduced in some way. The last book chapter I wrote did this extensively.

     

    regards

     

    Andrew

  • winecoding Member Posts: 6 Contributor I

    Hi awchisholm,

     

    Thank you for the reply. I just have one question regarding this approach.

     

    I am saving the generated weight object to a CSV file, so I can keep the top 500 words and turn them into a text data file (each row represents a file) for RapidMiner to process. However, generating the wordlist object requires the example set to have class information, and the weight file itself has no label information. The original training process was built for a six-class categorization task. How can I resolve this discrepancy?
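
For reference, the offline step described here (keep the top-weighted words from the CSV and write one word per row) could look roughly like the Python sketch below. The column names `Attribute` and `Weight`, the output column `text`, and both helper functions are assumptions for illustration; adjust them to the actual export. The sketch does not address the label question.

```python
import csv
import io

# Stand-in for the exported weight file; the column names are assumed.
weights_csv = io.StringIO(
    "Attribute,Weight\n"
    "wine,0.91\n"
    "the,0.01\n"
    "grape,0.75\n"
    "oak,0.40\n"
)


def top_words(csv_file, k):
    """Return the k words with the highest weight, best first."""
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: float(row["Weight"]), reverse=True)
    return [row["Attribute"] for row in rows[:k]]


def write_word_rows(words, out_file):
    """Write one word per row under a single 'text' column."""
    writer = csv.writer(out_file)
    writer.writerow(["text"])
    for word in words:
        writer.writerow([word])


kept = top_words(weights_csv, k=2)  # use k=500 for the real weight table
output = io.StringIO()
write_word_rows(kept, output)
print(output.getvalue())
```

With real files, the `io.StringIO` objects would be replaced by `open(...)` handles (opened with `newline=""`, as the `csv` module recommends).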
