Due to recent updates, all users are required to create an Altair One account to login to the RapidMiner community. Click the Register button to create your account using the same email that you have previously used to login to the RapidMiner community. This will ensure that any previously created content will be synced to your Altair One account. Once you login, you will be asked to provide a username that identifies you to other Community users. Email us at Community with questions.
How to generate a wordcloud for each class attribute
Hello, I am working on a project for university and was wondering if anyone could share their expertise as i am a beginner to rapidminer.
It is a pretty straightforward project, I got a chatGPT sentiment dataset from kaggle which labels tweets as either good, bad, or neutral, and want to perform classification analysis. The only thing that I can't seem to figure out is how to generate a wordcloud for the most occuring terms. For example, I want to see what the top 20 terms that make up each sentiment label.
I would greatly appreciate any input!
Thanks
Attached is a sample from the data, and this is the XML:
It is a pretty straightforward project, I got a chatGPT sentiment dataset from kaggle which labels tweets as either good, bad, or neutral, and want to perform classification analysis. The only thing that I can't seem to figure out is how to generate a wordcloud for the most occuring terms. For example, I want to see what the top 20 terms that make up each sentiment label.
I would greatly appreciate any input!
Thanks
Attached is a sample from the data, and this is the XML:
<?xml version="1.0" encoding="UTF-8"?><process version="10.3.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="10.3.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="open_file" compatibility="10.3.001" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
<parameter key="resource_type" value="file"/>
<parameter key="filename" value="C:/Users/Admin/Downloads/batch2.xlsx"/>
</operator>
<operator activated="true" class="read_excel" compatibility="10.3.001" expanded="true" height="68" name="Read Excel" width="90" x="179" y="34">
<parameter key="excel_file" value="C:\Users\Admin\Downloads\batch2.xlsx"/>
<parameter key="sheet_selection" value="sheet number"/>
<parameter key="sheet_number" value="1"/>
<parameter key="imported_cell_range" value="A1"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="use_header_row" value="true"/>
<parameter key="header_row" value="1"/>
<parameter key="first_row_as_names" value="true"/>
<list key="annotations"/>
<parameter key="date_format" value=""/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="English (United States)"/>
<parameter key="read_all_values_as_polynominal" value="false"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="A.true.integer.attribute"/>
<parameter key="1" value="tweets.true.polynominal.attribute"/>
<parameter key="2" value="labels.true.polynominal.attribute"/>
</list>
<parameter key="read_not_matching_values_as_missings" value="false"/>
</operator>
<operator activated="true" class="blending:set_role" compatibility="10.3.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
<list key="set_roles">
<parameter key="labels" value="label"/>
</list>
</operator>
<operator activated="true" class="filter_examples" compatibility="10.3.001" expanded="true" height="103" name="Filter Examples" width="90" x="447" y="34">
<parameter key="parameter_string" value="labels=good"/>
<parameter key="parameter_expression" value=""/>
<parameter key="condition_class" value="custom_filters"/>
<parameter key="invert_filter" value="false"/>
<list key="filters_list">
<parameter key="filters_entry_key" value="labels.equals.bad"/>
</list>
<parameter key="filters_logic_and" value="true"/>
<parameter key="filters_check_metadata" value="true"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="10.3.001" expanded="true" height="82" name="Nominal to Text" width="90" x="581" y="34">
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value="labels"/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="10.0.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="715" y="34">
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="add_meta_information" value="true"/>
<parameter key="keep_text" value="false"/>
<parameter key="prune_method" value="none"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="select_attributes_and_weights" value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="10.0.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
<parameter key="mode" value="non letters"/>
<parameter key="characters" value=","/>
<parameter key="expression" value="."/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true" class="text:filter_by_length" compatibility="10.0.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="246" y="34">
<parameter key="min_chars" value="4"/>
<parameter key="max_chars" value="25"/>
</operator>
<operator activated="true" class="text:transform_cases" compatibility="10.0.000" expanded="true" height="68" name="Transform Cases" width="90" x="380" y="34">
<parameter key="transform_to" value="lower case"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="10.0.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="514" y="34"/>
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="10.0.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="648" y="34">
<parameter key="file" value="C:/Users/Admin/Downloads/stopwords.txt"/>
<parameter key="case_sensitive" value="false"/>
<parameter key="encoding" value="UTF-8"/>
</operator>
<operator activated="true" class="text:stem_porter" compatibility="10.0.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="782" y="34"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
<connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
<connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="aggregate" compatibility="10.3.001" expanded="true" height="82" name="Aggregate" width="90" x="45" y="136">
<parameter key="use_default_aggregation" value="false"/>
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="default_aggregation_function" value="average"/>
<list key="aggregation_attributes">
<parameter key="labels" value="count"/>
</list>
<parameter key="group_by_attributes" value=""/>
<parameter key="count_all_combinations" value="false"/>
<parameter key="only_distinct" value="false"/>
<parameter key="ignore_missings" value="true"/>
</operator>
<operator activated="true" class="blending:sort" compatibility="10.3.001" expanded="true" height="82" name="Sort" width="90" x="179" y="136">
<list key="sort_by">
<parameter key="count(labels)" value="ascending"/>
</list>
</operator>
<operator activated="true" class="filter_example_range" compatibility="10.3.001" expanded="true" height="82" name="Filter Example Range" width="90" x="313" y="136">
<parameter key="first_example" value="-1"/>
<parameter key="last_example" value="20"/>
<parameter key="invert_filter" value="false"/>
</operator>
<connect from_op="Open File" from_port="file" to_op="Read Excel" to_port="file"/>
<connect from_op="Read Excel" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Sort" to_port="example set input"/>
<connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
<connect from_op="Filter Example Range" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
0
Best Answer
-
rjones13 Member Posts: 199 UnicornHi @tarekq,
The weight can be added on the visualisation in the plot controls.
To overcome the limitation of up to 500 examples, I've added in a Sort operator to organise the list into descending by "Total" and then Filter Example Range to only take the first 500. It might need some more filtering, just as it seems quite dominated by https, chatgpt, and openai. Although I realise you might have already filtered these out as I took out the custom stopwords operator in my version.
Any questions let me know.
Best,
Roland<?xml version="1.0" encoding="UTF-8"?><process version="10.3.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="10.3.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="open_file" compatibility="10.3.001" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
<parameter key="resource_type" value="file"/>
<parameter key="filename" value="C:/Users/rjones/Downloads/Book1.xlsx"/>
</operator>
<operator activated="true" class="read_excel" compatibility="10.3.001" expanded="true" height="68" name="Read Excel" width="90" x="179" y="34">
<parameter key="excel_file" value="C:\Users\Admin\Downloads\batch2.xlsx"/>
<parameter key="sheet_selection" value="sheet number"/>
<parameter key="sheet_number" value="1"/>
<parameter key="imported_cell_range" value="A1"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="use_header_row" value="true"/>
<parameter key="header_row" value="1"/>
<parameter key="first_row_as_names" value="true"/>
<list key="annotations"/>
<parameter key="date_format" value=""/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="English (United States)"/>
<parameter key="read_all_values_as_polynominal" value="false"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="A.true.integer.attribute"/>
<parameter key="1" value="tweets.true.polynominal.attribute"/>
<parameter key="2" value="labels.true.polynominal.attribute"/>
</list>
<parameter key="read_not_matching_values_as_missings" value="false"/>
</operator>
<operator activated="true" class="blending:set_role" compatibility="10.3.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
<list key="set_roles">
<parameter key="labels" value="label"/>
</list>
</operator>
<operator activated="true" class="filter_examples" compatibility="10.3.001" expanded="true" height="103" name="Filter Examples" width="90" x="447" y="34">
<parameter key="parameter_string" value="labels=good"/>
<parameter key="parameter_expression" value=""/>
<parameter key="condition_class" value="custom_filters"/>
<parameter key="invert_filter" value="false"/>
<list key="filters_list">
<parameter key="filters_entry_key" value="labels.equals.bad"/>
</list>
<parameter key="filters_logic_and" value="true"/>
<parameter key="filters_check_metadata" value="true"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="10.3.001" expanded="true" height="82" name="Nominal to Text" width="90" x="581" y="34">
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value="labels"/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="10.0.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="715" y="34">
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="add_meta_information" value="true"/>
<parameter key="keep_text" value="false"/>
<parameter key="prune_method" value="none"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="select_attributes_and_weights" value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="10.0.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
<parameter key="mode" value="non letters"/>
<parameter key="characters" value=","/>
<parameter key="expression" value="."/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true" class="text:filter_by_length" compatibility="10.0.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="246" y="34">
<parameter key="min_chars" value="4"/>
<parameter key="max_chars" value="25"/>
</operator>
<operator activated="true" class="text:transform_cases" compatibility="10.0.000" expanded="true" height="68" name="Transform Cases" width="90" x="380" y="34">
<parameter key="transform_to" value="lower case"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="10.0.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="514" y="34"/>
<operator activated="false" class="text:filter_stopwords_dictionary" compatibility="10.0.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="648" y="34">
<parameter key="file" value="C:/Users/Admin/Downloads/stopwords.txt"/>
<parameter key="case_sensitive" value="false"/>
<parameter key="encoding" value="UTF-8"/>
</operator>
<operator activated="false" class="text:stem_porter" compatibility="10.0.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="782" y="34"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:wordlist_to_data" compatibility="10.0.000" expanded="true" height="82" name="WordList to Data" width="90" x="782" y="136"/>
<operator activated="true" class="blending:sort" compatibility="10.3.001" expanded="true" height="82" name="Sort (2)" width="90" x="916" y="136">
<list key="sort_by">
<parameter key="total" value="descending"/>
</list>
</operator>
<operator activated="true" class="filter_example_range" compatibility="10.3.001" expanded="true" height="82" name="Filter Example Range (2)" width="90" x="1050" y="136">
<parameter key="first_example" value="1"/>
<parameter key="last_example" value="500"/>
<parameter key="invert_filter" value="false"/>
</operator>
<operator activated="true" class="aggregate" compatibility="10.3.001" expanded="true" height="82" name="Aggregate" width="90" x="45" y="136">
<parameter key="use_default_aggregation" value="false"/>
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="default_aggregation_function" value="average"/>
<list key="aggregation_attributes">
<parameter key="labels" value="count"/>
</list>
<parameter key="group_by_attributes" value=""/>
<parameter key="count_all_combinations" value="false"/>
<parameter key="only_distinct" value="false"/>
<parameter key="ignore_missings" value="true"/>
</operator>
<operator activated="true" class="blending:sort" compatibility="10.3.001" expanded="true" height="82" name="Sort" width="90" x="179" y="136">
<list key="sort_by">
<parameter key="count(labels)" value="ascending"/>
</list>
</operator>
<operator activated="true" class="filter_example_range" compatibility="10.3.001" expanded="true" height="82" name="Filter Example Range" width="90" x="313" y="136">
<parameter key="first_example" value="-1"/>
<parameter key="last_example" value="20"/>
<parameter key="invert_filter" value="false"/>
</operator>
<connect from_op="Open File" from_port="file" to_op="Read Excel" to_port="file"/>
<connect from_op="Read Excel" from_port="output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Aggregate" to_port="example set input"/>
<connect from_op="Process Documents from Data" from_port="word list" to_op="WordList to Data" to_port="word list"/>
<connect from_op="WordList to Data" from_port="example set" to_op="Sort (2)" to_port="example set input"/>
<connect from_op="Sort (2)" from_port="example set output" to_op="Filter Example Range (2)" to_port="example set input"/>
<connect from_op="Filter Example Range (2)" from_port="example set output" to_port="result 2"/>
<connect from_op="Aggregate" from_port="example set output" to_op="Sort" to_port="example set input"/>
<connect from_op="Sort" from_port="example set output" to_op="Filter Example Range" to_port="example set input"/>
<connect from_op="Filter Example Range" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>1
Answers
You seem to be along the right lines. The only thing you'll need to do is add a WordList to Data operator to convert the word list output of Process Documents into an ExampleSet. You can then use this to plot a word cloud:
There are further options if you'd like to add weight, and if you only want the top 20 then you might need to add a filter. Any questions let me know. I've attached the modified XML below (disabled custom stopwords and stemming, and also changed file paths).
Best,
Roland
Thanks so much for this, I was really struggling!
Thing is, I would love to know how to add weight, so I can show which terms are the most occurring in the dataset that I have since it is massive. I tried the Filter Example Range Operator, but it does not seem to be working for me, and the wordcloud only supports 500 examples.
How would I do it for the attached dataset? It is only a sample of my dataset, but it is a little bit larger than the original one.
Book1.xlsx
Again, thanks a bunch Roland, I appreciate it so much!
Tarek.
Roland, you are truly a lifesaver. You just singlehandedly saved my university project
Thanks so much man, wishing you all the best.
Best,
Tarek.