How to turn a colored histogram into a stacked bar chart?

apoloduvalis · March 2019

I want to create a visualization from data with columns SEXO (sex: 1 for male, 2 for female) and INGLABO (income, binned into four ranges), with 24,984 examples. My goal is to show one column for male and another for female, where each segment of the bar is the count of examples of that sex within the corresponding income range.
My first choice was a histogram, with SEXO (1, 2) as the value column and INGLABO as the color (the ranges are in the legend at the bottom of the chart). Although 13,545 records for male (1) and 11,439 for female (2) should be displayed in two single columns, each color (the count of values for one income range within one sex) is shown as a separate column, so you only get to see the most frequent group instead of seeing the groups of each sex stacked on top of each other.

Is this a bug? Or am I doing something wrong? The histogram seems to work fine without color.

I got closer to the visualization I need with a TurboPrep "Histogram color" Chart, but it downsized the sample from 24,984 records to 5,000, so it's kind of useless:

Any ideas? Thanks in advance,
Andrés

sgenzer · March 2019

Hi @apoloduvalis, OK, thanks. Yes, as @Marco_Boeck said, the visualization engine in Turbo Prep is still the old one. I think (?) this is sort of what you're looking for?

Image: https://us.v-cdn.net/6030995/uploads/editor/6z/9pycl3fg3bfl.png

To do this, I used a simple process to aggregate the data:

<pre class="CodeBlock"><code><?xml version="1.0" encoding="UTF-8"?>
<process version="9.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="9.2.001" expanded="true" height="68" name="Read CSV" width="90" x="45" y="34">
        <parameter key="csv_file" value="occupied_men_women.csv"/>
        <parameter key="column_separators" value=";"/>
        <parameter key="trim_lines" value="false"/>
        <parameter key="use_quotes" value="true"/>
        <parameter key="quotes_character" value="&quot;"/>
        <parameter key="escape_character" value="\"/>
        <parameter key="skip_comments" value="true"/>
        <parameter key="comment_characters" value="#"/>
        <parameter key="starting_row" value="1"/>
        <parameter key="parse_numbers" value="true"/>
        <parameter key="decimal_character" value="."/>
        <parameter key="grouped_digits" value="false"/>
        <parameter key="grouping_character" value=","/>
        <parameter key="infinity_representation" value=""/>
        <parameter key="date_format" value=""/>
        <parameter key="first_row_as_names" value="true"/>
        <list key="annotations"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="encoding" value="UTF-8"/>
        <parameter key="read_all_values_as_polynominal" value="false"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="INGLABO.true.polynominal.attribute"/>
          <parameter key="1" value="DIRECTORIO.true.integer.attribute"/>
          <parameter key="2" value="SECUENCIA_P.true.integer.attribute"/>
          <parameter key="3" value="ORDEN.true.integer.attribute"/>
          <parameter key="4" value="HOGAR.true.integer.attribute"/>
          <parameter key="5" value="SEXO.true.integer.attribute"/>
          <parameter key="6" value="P6030S3.true.integer.attribute"/>
          <parameter key="7" value="EDAD.true.integer.attribute"/>
          <parameter key="8" value="NIVEL _EDUCATIVO_PRE.true.integer.attribute"/>
          <parameter key="9" value="P6210S1.true.integer.attribute"/>
          <parameter key="10" value="NIVEL _EDUCATIVO_GRAD.true.integer.attribute"/>
          <parameter key="11" value="CLASE.true.integer.attribute"/>
          <parameter key="12" value="AREA.true.integer.attribute"/>
          <parameter key="13" value="SUELDO.true.integer.attribute"/>
          <parameter key="14" value="DONDE_TRABAJA.true.integer.attribute"/>
          <parameter key="15" value="PENSION.true.integer.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
        <parameter key="datamanagement" value="double_array"/>
        <parameter key="data_management" value="auto"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="9.2.001" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="34">
        <parameter key="parameter_expression" value=""/>
        <parameter key="condition_class" value="custom_filters"/>
        <parameter key="invert_filter" value="false"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="INGLABO.is_not_missing."/>
          <parameter key="filters_entry_key" value="SEXO.is_not_missing."/>
        </list>
        <parameter key="filters_logic_and" value="true"/>
        <parameter key="filters_check_metadata" value="true"/>
      </operator>
      <operator activated="true" class="aggregate" compatibility="9.2.001" expanded="true" height="82" name="Aggregate" width="90" x="313" y="34">
        <parameter key="use_default_aggregation" value="false"/>
        <parameter key="attribute_filter_type" value="all"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="default_aggregation_function" value="average"/>
        <list key="aggregation_attributes">
          <parameter key="INGLABO" value="count"/>
        </list>
        <parameter key="group_by_attributes" value="SEXO|INGLABO"/>
        <parameter key="count_all_combinations" value="false"/>
        <parameter key="only_distinct" value="false"/>
        <parameter key="ignore_missings" value="true"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process></code></pre>
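
For anyone who wants to see what the Filter Examples and Aggregate steps compute without opening RapidMiner, here is a rough plain-Python sketch of the same idea: drop rows where SEXO or INGLABO is missing, then count rows per (SEXO, INGLABO) pair. The sample rows and the INGLABO labels ("low", "high") are made up for illustration and are not taken from the attached CSV; `count_by_group` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical rows standing in for the CSV; None marks a missing value.
# The INGLABO labels are invented placeholders, not the real bin names.
rows = [
    {"SEXO": 1, "INGLABO": "low"},
    {"SEXO": 1, "INGLABO": "low"},
    {"SEXO": 1, "INGLABO": "high"},
    {"SEXO": 2, "INGLABO": "low"},
    {"SEXO": 2, "INGLABO": None},  # dropped by the filter step
    {"SEXO": 2, "INGLABO": "high"},
]


def count_by_group(rows):
    """Drop rows with a missing SEXO or INGLABO, then count rows per (SEXO, INGLABO)."""
    kept = [r for r in rows if r["SEXO"] is not None and r["INGLABO"] is not None]
    return Counter((r["SEXO"], r["INGLABO"]) for r in kept)


counts = count_by_group(rows)
for (sexo, inglabo), n in sorted(counts.items()):
    print(sexo, inglabo, n)
```

Each (SEXO, INGLABO) count corresponds to one colored segment of the stacked column for that sex.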

You can now import my chart config by using this feature once you have run the process:

Image: https://us.v-cdn.net/6030995/uploads/editor/v1/w0g9r6ogu6ec.png

The config is attached.

Scott

sgenzer · March 2019

hi @apoloduvalis I'm sorry no one has chimed in here. Can you post some data so I can try to see how a chart would look?

apoloduvalis · March 2019

Hi @sgenzer , thank you for showing up!
The colored histogram I am getting shows the different segments in different colors, but one behind the other instead of stacked on top of each other. When I hover the pointer over the legend entry of the last segment (0, the black one), as if I were going to hide it, I can see the other segments behind it because they turn semi-transparent. I would like to attach an image, but since I am still a newbie the system does not allow me to do it.

I started to think that the histogram not stacking the different colored segments was not a bug but a feature of RM. However, I got closer to the visualization I need with a TurboPrep "Histogram color" Chart, which managed to show a histogram with stacked segments from the same data. Sadly, it downsized the sample from 24,984 records to 5,000 (perhaps a limitation of the free version?), so it does not have enough samples to include data for the red color (0) or to get the right proportions of the distribution.

I have attached the data I am using as the source for the histogram as a .csv file. I hope you can reproduce this behavior and figure out what is going on.
Kind regards,
Andrés

apoloduvalis · March 2019

Well, it seems my second post upgraded me from newbie to learner, so now I can attach images. My histogram looks like this:

Image: https://us.v-cdn.net/6030995/uploads/editor/af/wbobz4o8wt7i.png

And the TurboPrep histogram color chart looks like this:

Image: https://us.v-cdn.net/6030995/uploads/editor/na/4wltf2lbni1p.png

Marco_Boeck · March 2019

Hi,

Currently you would need to pivot the data manually first, and then use a "Column" chart with stacking. Histograms in the new visualizations are not stacking the data on purpose, as they work on numerical data, and usually the bins between different colors do not have anywhere close to a 100% match.
We will have a "Color Group" option for Line/Bar/Column/Area charts at some point in the future, where you can basically split the data into different groups (per category value); at that point you can very easily do what you are currently trying to achieve.
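
As a rough illustration of the manual pivot described above, here is a minimal plain-Python sketch (not RapidMiner code) that turns long-form counts into one row per SEXO value with one entry per INGLABO range, which is the shape a stacked "Column" chart needs. The counts and the labels "low" and "high" are invented for the example, and `pivot_counts` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical aggregated counts in long form: (SEXO, INGLABO, count).
long_rows = [
    (1, "low", 2),
    (1, "high", 1),
    (2, "low", 1),
    (2, "high", 3),
]


def pivot_counts(long_rows):
    """Turn (category, series, count) rows into {category: {series: count}}, filling gaps with 0."""
    series_names = sorted({series for _, series, _ in long_rows})
    wide = defaultdict(lambda: dict.fromkeys(series_names, 0))
    for category, series, count in long_rows:
        wide[category][series] += count
    return dict(wide)


wide = pivot_counts(long_rows)
for sexo, segments in sorted(wide.items()):
    # Each dict entry is one stacked segment; their sum is the column height.
    print(sexo, segments, "total:", sum(segments.values()))
```

After the pivot, each income range becomes its own series, so stacking the series gives one column per sex whose height is the total count for that sex.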

Turbo Prep still uses the old charts, which is why it's a bit of a different story there.

Regards,
Marco
