How to turn a colored histogram into a stacked bar chart?
apoloduvalis
Member Posts: 3 Learner I
I want to create a visualization from data with 24,984 examples and the columns SEXO (sex, with 1 for male and 2 for female) and INGLABO (income, which is a range of four bins). My goal is to show one column for male and another for female, where each segment of the bar is the count of examples of that sex within the corresponding income range.
My first choice was a histogram, with SEXO (1, 2) as the value column and INGLABO as the color (the ranges are in the legend at the bottom of the chart). Although 13,545 records for male (1) and 11,439 for female (2) should be displayed in two single columns, each color (the count of values for one income range of a particular sex) is shown as a separate column, so you only get to see the most frequent group instead of seeing the groups of the same sex stacked on top of each other.
Is this a bug? Or am I doing something wrong? The histogram seems to work fine without color.
I got closer to the visualization I need with a TurboPrep "Histogram color" Chart, but it downsized the sample from 24,984 records to 5,000, so it's kind of useless:
Any ideas? Thanks in advance,
Andrés
Best Answer
sgenzer
Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
Hi @apoloduvalis, OK, thanks. Yes, as @Marco_Boeck said, the visualization engine in Turbo Prep is still the old one. I think (?) this is sort of what you're looking for?
To do this, I used a simple process to aggregate the data:<pre class="CodeBlock"><code>
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="9.2.001" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
        <parameter key="csv_file" value="occupied_men_women.csv"/>
        <parameter key="column_separators" value=";"/>
        <parameter key="trim_lines" value="false"/>
        <parameter key="use_quotes" value="true"/>
        <parameter key="quotes_character" value="&quot;"/>
        <parameter key="escape_character" value="\"/>
        <parameter key="skip_comments" value="true"/>
        <parameter key="comment_characters" value="#"/>
        <parameter key="starting_row" value="1"/>
        <parameter key="parse_numbers" value="true"/>
        <parameter key="decimal_character" value="."/>
        <parameter key="grouped_digits" value="false"/>
        <parameter key="grouping_character" value=","/>
        <parameter key="infinity_representation" value=""/>
        <parameter key="date_format" value=""/>
        <parameter key="first_row_as_names" value="true"/>
        <list key="annotations"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="encoding" value="UTF-8"/>
        <parameter key="read_all_values_as_polynominal" value="false"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="INGLABO.true.polynominal.attribute"/>
          <parameter key="1" value="DIRECTORIO.true.integer.attribute"/>
          <parameter key="2" value="SECUENCIA_P.true.integer.attribute"/>
          <parameter key="3" value="ORDEN.true.integer.attribute"/>
          <parameter key="4" value="HOGAR.true.integer.attribute"/>
          <parameter key="5" value="SEXO.true.integer.attribute"/>
          <parameter key="6" value="P6030S3.true.integer.attribute"/>
          <parameter key="7" value="EDAD.true.integer.attribute"/>
          <parameter key="8" value="NIVEL _EDUCATIVO_PRE.true.integer.attribute"/>
          <parameter key="9" value="P6210S1.true.integer.attribute"/>
          <parameter key="10" value="NIVEL _EDUCATIVO_GRAD.true.integer.attribute"/>
          <parameter key="11" value="CLASE.true.integer.attribute"/>
          <parameter key="12" value="AREA.true.integer.attribute"/>
          <parameter key="13" value="SUELDO.true.integer.attribute"/>
          <parameter key="14" value="DONDE_TRABAJA.true.integer.attribute"/>
          <parameter key="15" value="PENSION.true.integer.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.2.001" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="34">
        <parameter key="parameter_expression" value=""/>
        <parameter key="condition_class" value="custom_filters"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="INGLABO.is_not_missing."/>
          <parameter key="filters_entry_key" value="SEXO.is_not_missing."/>
        </list>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
      </operator>
      <operator activated="true" class="aggregate" compatibility="9.2.001" expanded="true" height="82" name="Aggregate" width="90" x="313" y="34">
        <parameter key="use_default_aggregation" value="false"/>
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="default_aggregation_function" value="average"/>
        <list key="aggregation_attributes">
          <parameter key="INGLABO" value="count"/>
        </list>
        <parameter key="group_by_attributes" value="SEXO|INGLABO"/>
        <parameter key="count_all_combinations" value="false"/>
        <parameter key="only_distinct" value="false"/>
        <parameter key="ignore_missings" value="true"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
</code></pre>
You can now import my chart config by using this feature once you run the process:
The config is attached.
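For anyone who wants to check the numbers outside RapidMiner, here is a minimal Python sketch of what the Filter Examples and Aggregate steps compute: drop rows where SEXO or INGLABO is missing, then count rows per (SEXO, INGLABO) pair. The sample rows and the income labels "low" and "high" are made up for illustration, and `aggregate_counts` is a hypothetical helper name, not part of any RapidMiner API.

```python
from collections import Counter

# Hypothetical rows standing in for the CSV: (SEXO, INGLABO) pairs.
# None marks a missing value; the income labels are invented for this sketch.
rows = [
    (1, "low"), (1, "low"), (1, "high"),
    (2, "low"), (2, "high"), (2, "high"),
    (2, None), (None, "low"),
]


def aggregate_counts(rows):
    """Count examples per (SEXO, INGLABO) pair, skipping rows with missing values."""
    # Mirrors Filter Examples: keep rows where both attributes are present.
    complete = [
        (sexo, inglabo)
        for sexo, inglabo in rows
        if sexo is not None and inglabo is not None
    ]
    # Mirrors Aggregate: group by SEXO and INGLABO and count rows per group.
    return Counter(complete)


counts = aggregate_counts(rows)
for (sexo, inglabo), n in sorted(counts.items()):
    print(sexo, inglabo, n)
```

The result is one row per sex and income range, which is the long format that a stacked chart can be built from.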
Scott
Answers
The colored histogram I am getting shows each segment in a different color, but one behind the other instead of stacked on top of each other. When I hover the pointer over the legend entry of the last segment (0, the black one) as if I were going to hide it, I can see the other segments behind it because they turn semi-transparent. I would like to attach an image, but since I am still a newbie the system does not allow me to do it.
I started to think that the histogram not stacking the different colored segments was not a bug but a feature of RM. However, I got closer to the visualization I need with a TurboPrep "Histogram color" chart, which managed to show a histogram with stacked segments from the same data. Sadly, it downsampled the data from 24,984 records to 5,000 (perhaps a limitation of the free version?), so it does not have enough examples to include data for the red color (0) or to get the right proportions of the distribution.
I have attached the data I am using as source for the histogram as a .csv file. I hope you can reproduce this behavior and figure out what is going on.
Kind regards,
Andrés
And the TurboPrep histogram color chart looks like this:
Currently you would need to pivot the data manually first, and then use a "Column" chart with stacking. Histograms in the new visualizations do not stack the data on purpose: they work on numerical data, and the bins of different colors usually do not come anywhere close to a 100% match.
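To illustrate why mismatched bins make stacking unreliable, here is a small Python sketch. It assumes, purely for illustration, that each color group gets equal-width bins over its own value range; this is not confirmed to be how the RapidMiner histogram computes its bins. The values and the `bin_edges` helper are hypothetical.

```python
def bin_edges(values, n_bins):
    """Return equal-width bin edges spanning the min and max of values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


# Hypothetical numeric values for two color groups (not the real data).
group_a = [0, 2, 4, 6, 8]
group_b = [1, 3, 5, 7, 11]

edges_a = bin_edges(group_a, 4)
edges_b = bin_edges(group_b, 4)

print(edges_a)
print(edges_b)
# Bars can only be stacked where both groups share the same bin.
print(edges_a == edges_b)
```

Under that assumption the two groups end up with different bin boundaries, so there is no common x position at which their bars could be placed on top of each other.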
We will have a "Color Group" option for Line/Bar/Column/Area charts coming at some point in the future, where you can basically split the data into different groups (per category value), and at that point you can very easily do what you are currently trying to achieve.
Turbo Prep still uses the old charts, which is why it's a bit of a different story there.
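The manual pivot mentioned above can be sketched in plain Python as follows. It takes long-format counts (one row per SEXO and INGLABO pair), pivots them to one entry per SEXO, and then computes where each stacked segment starts and how tall it is. The counts, the labels "low" and "high", and the helper names `pivot` and `stack` are assumptions for illustration only, not RapidMiner operators or the real data.

```python
from collections import defaultdict

# Hypothetical long-format counts: (SEXO, INGLABO, count).
long_counts = [
    (1, "low", 2), (1, "high", 1),
    (2, "low", 1), (2, "high", 2),
]


def pivot(long_counts):
    """Turn (category, series, count) rows into {category: {series: count}}."""
    wide = defaultdict(dict)
    for sexo, inglabo, n in long_counts:
        wide[sexo][inglabo] = n
    return dict(wide)


def stack(wide, series_order):
    """For each column, return (series, bottom, height) segments in stacking order."""
    stacked = {}
    for category, by_series in wide.items():
        bottom = 0
        segments = []
        for series in series_order:
            height = by_series.get(series, 0)  # a missing combination counts as 0
            segments.append((series, bottom, height))
            bottom += height
        stacked[category] = segments
    return stacked


wide = pivot(long_counts)
stacked = stack(wide, ["low", "high"])
for sexo, segments in sorted(stacked.items()):
    print(sexo, segments)
```

Each column's total height is the sum of its segment heights, so with the real data the two columns would add up to the per-sex record counts.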
Regards,
Marco