Scoring text against a semantic dictionary

robinrobin Member Posts: 100 Guru
edited November 2018 in Help

Hi

 

I am attempting to build a process that scores text against a semantic dictionary (I have attached an example). I can pull out the word counts and load the dictionary, but I am getting myself in a complete knot over how to calculate the result.

 

The process should be as follows:

  1. Load the text
  2. Select the text
  3. Load the dictionary
  4. Perform a word frequency count
  5. Multiply the word frequency by the dictionary result and divide by the total word count

Semantic dictionaries differ from sentiment dictionaries in that they measure the volume of each word type used in the text. Also, in a semantic dictionary a word can belong to multiple segments rather than being simply positive or negative.
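
To make the calculation in step 5 concrete, here is a minimal sketch in Python, outside RapidMiner. The dictionary contents, the category names and the `score_text` helper are all assumptions made for illustration; they are not taken from the attached files. Each word's frequency is multiplied by its dictionary value for every category it belongs to, and the per-category sums are divided by the total word count.

```python
import re
from collections import Counter

# Hypothetical semantic dictionary: word -> {category: weight}.
# A word may belong to several categories at once.
DICTIONARY = {
    "i": {"pronoun": 1, "personal_pronoun": 1},
    "run": {"verb": 1},
    "dog": {"noun": 1},
}


def score_text(text, dictionary):
    """Return {category: share of total words} for one text."""
    # Steps 1-2: take the text and split it into lower-case word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {}
    # Step 4: word frequency count.
    freq = Counter(tokens)
    # Step 5: frequency x dictionary value, summed per category.
    sums = Counter()
    for word, count in freq.items():
        for category, weight in dictionary.get(word, {}).items():
            sums[category] += count * weight
    # Divide each category sum by the total word count.
    return {category: value / total for category, value in sums.items()}


scores = score_text("I run and I see the dog run", DICTIONARY)
```

Because a word can count towards more than one category, the shares computed this way do not necessarily add up to 1.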

 

I have also attached the process as far as I have managed to get so far. 

Best Answer

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    Solution Accepted

    Take a look at the sample process below. You'll need to modify it to use your dictionary and make sure your pronouns, verbs, etc. all add up to one. The process was developed by my colleague, who also annotated it to help you along.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="true" class="subprocess" compatibility="7.4.000" expanded="true" height="82" name="Subprocess" width="90" x="45" y="136">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
    <list key="attribute_values">
    <parameter key="Text" value="&quot;good&quot;"/>
    <parameter key="Weight" value="1"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="45" y="136">
    <list key="attribute_values">
    <parameter key="Text" value="&quot;bad&quot;"/>
    <parameter key="Weight" value="-1.5"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="append" compatibility="7.4.000" expanded="true" height="103" name="Append" width="90" x="179" y="34"/>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
    <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
    <connect from_op="Append" from_port="merged set" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">Create a dummy dictionary</description>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.4.000" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="179" y="136">
    <list key="function_descriptions">
    <parameter key="Weight" value="1/Weight"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">Invert all Weights for the Linear Regression</description>
    </operator>
    <operator activated="true" class="generate_id" compatibility="7.4.000" expanded="true" height="82" name="Generate ID" width="90" x="313" y="136"/>
    <operator activated="true" class="pivot" compatibility="7.4.000" expanded="true" height="82" name="Pivot" width="90" x="447" y="136">
    <parameter key="group_attribute" value="id"/>
    <parameter key="index_attribute" value="Text"/>
    <parameter key="skip_constant_attributes" value="false"/>
    </operator>
    <operator activated="true" class="rename_by_replacing" compatibility="7.4.000" expanded="true" height="82" name="Rename by Replacing" width="90" x="581" y="136">
    <parameter key="replace_what" value="Weight_(.+)"/>
    <parameter key="replace_by" value="$1"/>
    </operator>
    <operator activated="true" class="replace_missing_values" compatibility="7.4.000" expanded="true" height="103" name="Replace Missing Values" width="90" x="715" y="136">
    <parameter key="default" value="zero"/>
    <list key="columns"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.4.000" expanded="true" height="82" name="Generate Attributes" width="90" x="849" y="136">
    <list key="function_descriptions">
    <parameter key="label" value="1"/>
    </list>
    </operator>
    <operator activated="true" class="set_role" compatibility="7.4.000" expanded="true" height="82" name="Set Role" width="90" x="983" y="136">
    <parameter key="attribute_name" value="label"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="vector_linear_regression" compatibility="7.4.000" expanded="true" height="82" name="Vector Linear Regression" width="90" x="1117" y="136">
    <parameter key="use_bias" value="false"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification (3)" width="90" x="1050" y="340">
    <list key="attribute_values">
    <parameter key="Text" value="&quot;This is a good Text&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification (4)" width="90" x="1050" y="442">
    <list key="attribute_values">
    <parameter key="Text" value="&quot;This is a bad Text&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="append" compatibility="7.4.000" expanded="true" height="103" name="Append (2)" width="90" x="1184" y="391"/>
    <operator activated="true" class="nominal_to_text" compatibility="7.4.000" expanded="true" height="82" name="Nominal to Text" width="90" x="1318" y="391">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Text"/>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.4.001" expanded="true" height="82" name="Process Documents from Data" width="90" x="1452" y="391">
    <parameter key="vector_creation" value="Term Occurrences"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.4.001" expanded="true" height="68" name="Tokenize" width="90" x="179" y="85"/>
    <operator activated="true" class="text:transform_cases" compatibility="7.4.001" expanded="true" height="68" name="Transform Cases" width="90" x="380" y="85"/>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="apply_model" compatibility="7.4.000" expanded="true" height="82" name="Apply Model" width="90" x="1586" y="136">
    <list key="application_parameters"/>
    </operator>
    <connect from_op="Subprocess" from_port="out 1" to_op="Generate Attributes (2)" to_port="example set input"/>
    <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Generate ID" to_port="example set input"/>
    <connect from_op="Generate ID" from_port="example set output" to_op="Pivot" to_port="example set input"/>
    <connect from_op="Pivot" from_port="example set output" to_op="Rename by Replacing" to_port="example set input"/>
    <connect from_op="Rename by Replacing" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
    <connect from_op="Replace Missing Values" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Vector Linear Regression" to_port="training set"/>
    <connect from_op="Vector Linear Regression" from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Append (2)" to_port="example set 1"/>
    <connect from_op="Generate Data by User Specification (4)" from_port="output" to_op="Append (2)" to_port="example set 2"/>
    <connect from_op="Append (2)" from_port="merged set" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <description align="left" color="yellow" colored="false" height="248" resized="true" width="543" x="281" y="32">Build a table like&lt;br&gt;&lt;br&gt;good ................. bad&lt;br&gt;1/1 ..................... 0&lt;br&gt;0 ......................... 1/-1.5</description>
    <description align="center" color="yellow" colored="false" height="247" resized="true" width="265" x="845" y="12">Generate a constant label of 1</description>
    <description align="center" color="yellow" colored="false" height="271" resized="true" width="578" x="1017" y="287">Build and process test data</description>
    <description align="center" color="red" colored="true" height="140" resized="true" width="706" x="55" y="423">This process creates a scoring model from an annotated dictionary.&lt;br/&gt;This is a technique used e.g. for sentiment analysis where you assign a value for each word. In this case we have a dummy data set with &amp;quot;good&amp;quot; and &amp;quot;bad&amp;quot; annotated with 1 and -1.5 respectively.</description>
    </process>
    </operator>
    </process>
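
As a reading aid, the effect of the process above can be sketched in plain Python. This is an interpretation of the annotations, not the operator's actual implementation: it assumes the regression fits each one-word training row exactly, so that a coefficient times the inverted weight equals the constant label of 1. Under that assumption each coefficient equals the original dictionary weight, and applying the model to term occurrences gives a weighted sum. The `score` helper is hypothetical.

```python
# Dummy dictionary from the sample process: word -> weight.
weights = {"good": 1.0, "bad": -1.5}

# Training rows after Pivot: each word holds 1/weight in its own column,
# 0 elsewhere, with a constant label of 1 and no bias term.
inverted = {word: 1.0 / w for word, w in weights.items()}

# Solving coefficient * (1/weight) = 1 for each row recovers the weight.
coefficients = {word: 1.0 / x for word, x in inverted.items()}


def score(text):
    """Weighted sum of term occurrences, lower-cased like Transform Cases."""
    tokens = text.lower().split()
    return sum(coefficients.get(token, 0.0) for token in tokens)


good_score = score("This is a good Text")  # close to 1.0
bad_score = score("This is a bad Text")    # close to -1.5
```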

Answers

  • robinrobin Member Posts: 100 Guru

    Forgot the text.

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    I'm a tad unclear as to where you want to process the dictionary.csv file. Do you want to do it while tokenization is happening (i.e. during word vector creation), or do you want to do it outside? As I understand it, you wish to do it after you have created a WordList. Check out the attached process. I used a Search Twitter operator just to get some random text.

     

    Also check out this KB article: http://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/How-to-Build-a-Dictionary-Based-Sentiment-Model-in-RapidMiner/ta-p/36067

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
    <operator activated="true" class="open_file" compatibility="7.4.000" expanded="true" height="68" name="Open File" width="90" x="246" y="136">
    <parameter key="resource_type" value="file"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
    <operator activated="true" class="read_csv" compatibility="7.4.000" expanded="true" height="68" name="Read CSV" width="90" x="380" y="136">
    <parameter key="csv_file" value="C:\Users\ThomasOtt\Downloads\dictionary.csv"/>
    <parameter key="column_separators" value=";"/>
    <parameter key="trim_lines" value="false"/>
    <parameter key="use_quotes" value="true"/>
    <parameter key="quotes_character" value="&quot;"/>
    <parameter key="escape_character" value="\"/>
    <parameter key="skip_comments" value="false"/>
    <parameter key="comment_characters" value="#"/>
    <parameter key="parse_numbers" value="true"/>
    <parameter key="decimal_character" value="."/>
    <parameter key="grouped_digits" value="false"/>
    <parameter key="grouping_character" value=","/>
    <parameter key="date_format" value=""/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <parameter key="time_zone" value="SYSTEM"/>
    <parameter key="locale" value="English (United States)"/>
    <parameter key="encoding" value="UTF-8"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="word.true.polynominal.attribute"/>
    <parameter key="1" value="noun.true.integer.attribute"/>
    <parameter key="2" value="verb.true.integer.attribute"/>
    <parameter key="3" value="preposition.true.integer.attribute"/>
    </list>
    <parameter key="read_not_matching_values_as_missings" value="true"/>
    <parameter key="datamanagement" value="double_array"/>
    <parameter key="data_management" value="auto"/>
    </operator>
    </process>
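
For comparison, the same dictionary layout can be read in Python with the standard `csv` module. This is only a sketch: the sample rows are invented, since the real dictionary.csv was not posted in full, and `load_dictionary` is a hypothetical helper. The semicolon separator, quote character, escape character and the column names (word, noun, verb, preposition) mirror the Read CSV settings above.

```python
import csv
import io

# Invented sample content with the column names from the Read CSV metadata.
SAMPLE = "word;noun;verb;preposition\nrun;1;1;0\nunder;0;0;1\n"


def load_dictionary(handle):
    """Return {word: {category: integer value}} from a semicolon-separated file."""
    reader = csv.DictReader(handle, delimiter=";", quotechar='"', escapechar="\\")
    dictionary = {}
    for row in reader:
        word = row["word"].lower()
        # Every column except "word" is treated as a category value.
        dictionary[word] = {
            column: int(value) for column, value in row.items() if column != "word"
        }
    return dictionary


dictionary = load_dictionary(io.StringIO(SAMPLE))
```

To read a real file, pass `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.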

     

     

  • robinrobin Member Posts: 100 Guru

    Hi Thomas

     

    I would look to do the scoring outside of the tokenisation. 

     

    If you look at the start of my process, it is very similar to the Twitter process you attached. The difference between a sentiment dictionary and a semantic dictionary would be something like this:

     

    Sentiment

    Word    Weight
    good       1.0
    bad       -1.5

     

    Semantic

    Text Category   Verb   Noun   Article   Conjunction   Other
    Text 1            10     15        30             5      40
    Text 2            20     10        14             1      55
    Text 3            15      5        10             5      65

     

    The semantic dictionary should always total 100%, as it classifies all of the words used in the text under their various parts. It does not decide whether the text is good or bad; it classifies every word in the text into taxonomies. One word can belong to multiple taxonomies. For example, "I" is a pronoun as well as a personal pronoun and would need to be allocated to both categories in the analysis. Using the "I" example, if this were the only text present, it would look something like this:

     

    Text Category   Function Word   Pronoun   Personal Pronoun   Other
    Text 1                     25        25                 25      25

     

    I have not supplied a full dictionary but only a partial one as I am still busy building the scorings for each of the words.  

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Oh OK, so then the process is nearly there. Do your tokenization and then a WordList to Data. From there you would need to build a subprocess to do the last bit.

     

    Let me see if I can whip something up.
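
Until a RapidMiner version is posted, here is a rough Python sketch of what that last bit might compute. It is not Thomas's process. The word-list rows, the dictionary and the `category_percentages` helper are invented for illustration, and sending unmatched words to "other" and dividing by the total number of category hits is just one assumed way of making each text total 100%.

```python
# Invented word-list rows: one row per word with its total occurrences.
word_list = [
    {"word": "i", "total": 2},
    {"word": "run", "total": 1},
    {"word": "quickly", "total": 1},
]

# Invented dictionary: word -> every category it belongs to.
dictionary = {"i": ["pronoun", "personal pronoun"], "run": ["verb"]}


def category_percentages(word_list, dictionary):
    """Join the word list with the dictionary and normalise to 100%."""
    hits = {}
    for row in word_list:
        # Words missing from the dictionary are counted as "other".
        for category in dictionary.get(row["word"], ["other"]):
            hits[category] = hits.get(category, 0) + row["total"]
    grand_total = sum(hits.values())
    if grand_total == 0:
        return {}
    return {category: 100.0 * n / grand_total for category, n in hits.items()}


percentages = category_percentages(word_list, dictionary)
```

A word that belongs to two categories contributes to both, so the denominator is the number of category hits rather than the number of words.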

  • robinrobin Member Posts: 100 Guru

    Hi Thomas, I installed Java across my keyboard and lost my Mac for two weeks. Quick follow-up on this post.

     

    I don't think I am proficient enough with subprocesses yet, so I would like to see how you tackle it.