Help with Categorical Data EDA

WI_Noble · January 2011

Hi,
Assume I have two columns of data, both binomial, such as “Cell Plan” (values: True, False) and “Churn” (values: Yes, No). I want to calculate the four-way matrix of values (Cell Plan=True, Churn=Yes; Cell Plan=False, Churn=Yes; Cell Plan=True, Churn=No; Cell Plan=False, Churn=No), as both counts and proportions. What series of operators would do this easily? I want to do Exploratory Data Analysis on categorical data in RM.
Thanks,

IngoRM · January 2011

Hi,

a quite generic solution would probably use the operators Aggregate and Pivot. I have generated a sample process for you named "Count co-occurences in matrix with Aggregate and Pivot" and uploaded it to myExperiment.org with our Community Extension.
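For readers without access to the uploaded process, the count-then-pivot idea can be sketched outside RapidMiner. This is a minimal Python illustration, not the uploaded process itself: the sample rows and the helper name `crosstab` are made up for the example, and only the standard library is used.

```python
from collections import Counter

# Hypothetical sample rows (cell_plan, churn); not data from this thread.
rows = [
    (True, "Yes"), (True, "No"), (True, "No"),
    (False, "Yes"), (False, "Yes"), (False, "No"),
    (True, "No"), (False, "No"),
]

def crosstab(pairs):
    """Return (counts, proportions), both keyed by (first value, second value)."""
    counts = Counter(pairs)  # aggregate step: one count per value combination
    total = sum(counts.values())
    proportions = {key: n / total for key, n in counts.items()}
    return counts, proportions

counts, proportions = crosstab(rows)

# Pivot step: one row per Cell Plan value, one column per Churn value.
churn_values = sorted({churn for _, churn in rows})
plan_values = sorted({plan for plan, _ in rows}, reverse=True)

print("Cell Plan".ljust(10) + "".join(f"Churn={c}".rjust(16) for c in churn_values))
for plan in plan_values:
    cells = []
    for c in churn_values:
        n = counts[(plan, c)]                  # Counter gives 0 for unseen pairs
        p = proportions.get((plan, c), 0.0)
        cells.append(f"{n} ({p:.3f})".rjust(16))
    print(str(plan).ljust(10) + "".join(cells))
```

Each cell shows the count followed by its share of all rows, so the four cells of the proportions always sum to 1.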

You can learn more about using the Community Extension and accessing uploaded processes at

http://rapid-i.com/component/option,com_myblog/show,Video-on-RapidMiner-Community-Extension-myExperiment-.html/Itemid,172/

Cheers,
Ingo

WI_Noble · January 2011

Ingo,
Thanks much for your reply and process. I've looked at it, very helpful.

I was working with a data set with an attribute named "Churn?". After some analysis, I realized that the "?" in the attribute name was a problem for RM. Could you give a list of the attribute naming rules that RapidMiner accepts, or point to where such documentation exists?
Thanks,
Jonathan

IngoRM · January 2011

Hi Jonathan,

thanks for your kind words.

In principle, there is no restriction on attribute names, i.e. RapidMiner can basically use all possible characters. However, this might not be true for all operators. For example, operators which use regular expressions might be easily fooled by characters like ".", "*", or "?". The same is true for the built-in expression parser, which interprets characters like "(" or ")" as function indicators. If such an operator has to work on attributes whose names contain characters that are problematic for it, you will have to rename those attributes first with the "Rename" operator. The error message should give you a hint as to which characters might have caused the problem.
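The regular-expression pitfall can be reproduced in a few lines. This sketch uses Python's `re` module purely as a stand-in for illustration (it is an assumption that the affected operator treats "?" the same way, which is the usual regex meaning): used as a pattern, the name "Churn?" does not match itself.

```python
import re

name = "Churn?"  # attribute name from the question above

# As a pattern, "?" makes the preceding "n" optional, so the pattern
# matches "Chur" and "Churn" but not the literal string "Churn?".
unescaped_match = re.fullmatch(name, name)
matches_chur = re.fullmatch(name, "Chur")

# re.escape turns every metacharacter into a literal character.
escaped_match = re.fullmatch(re.escape(name), name)

print("unescaped pattern matches its own name:", unescaped_match is not None)
print("unescaped pattern matches 'Chur':", matches_chur is not None)
print("escaped pattern matches its own name:", escaped_match is not None)
```

Escaping the name works where a regular expression is required; renaming the attribute avoids the problem altogether.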

Since there are too many combinations of operators to list, and the answer depends entirely on the actual process, here is my recommendation: stick to letters, numbers, spaces, hyphens and underscores and you should be fine in all cases. But please note that some operators unfortunately introduce "special" characters like "(", which might cause a problem for later operators, so consider renaming the affected attributes in that case.
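If the data is prepared outside RapidMiner before import, the recommendation above could be applied with a small helper like the following. This is only a sketch: `safe_name` and `rename_map` are hypothetical names, "letters" is narrowed to ASCII letters as an assumption, and replacing unsafe characters with an underscore is one choice among several.

```python
import re

# Anything that is NOT an ASCII letter, digit, space, underscore or hyphen.
UNSAFE = re.compile(r"[^A-Za-z0-9 _-]")

def safe_name(name, replacement="_"):
    """Replace every character outside the recommended set."""
    return UNSAFE.sub(replacement, name).strip()

def rename_map(names):
    """Map each original name to a safe name, keeping the results unique."""
    mapping = {}
    used = set()
    for name in names:
        candidate = safe_name(name)
        base, n = candidate, 2
        while candidate in used:  # avoid two attributes ending up with one name
            candidate = f"{base}_{n}"
            n += 1
        used.add(candidate)
        mapping[name] = candidate
    return mapping

for old, new in rename_map(["Churn?", "sum(Minutes)", "Cell Plan"]).items():
    print(f"{old!r} -> {new!r}")
```

The uniqueness check matters because two different names, such as "Churn?" and "Churn_", would otherwise collapse into the same cleaned name.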

Hope that helps,
Ingo

WI_Noble · January 2011

Thank you, again, for the response.

I did want to ask about plots for doing Exploratory Data Analysis (EDA) with categorical variables. Obviously, there are many different plotting options in RM. My goals, with a categorical label and attribute variables, are to show two things: the count of each value of an attribute variable (to see which value occurs most/least frequently), and the label counts overlaid on the attribute (to get a qualitative idea of how predictive the attribute will be for the label).

The only two plots that come close to meeting these two goals are the "Bars Stacked" and "Distribution" plots. The "Bars Stacked" plot is especially helpful. I took the "Generate Churn Data" operator and used this plot on the results. With "Group-By Column" = "Year 1" (or any other attribute), "Stack Column" = "label", and "Aggregation" = "Count", I've come closest to my goals.
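The numbers behind such a stacked bar chart are just counts per attribute value, split by label. Here is a rough Python sketch of that aggregation with a text rendering; the rows and the function name `stacked_counts` are invented for illustration and are not the output of "Generate Churn Data".

```python
from collections import Counter, defaultdict

# Hypothetical rows (attribute value, label); not taken from RapidMiner.
rows = [
    ("low", "loyal"), ("low", "loyal"), ("low", "churn"),
    ("high", "churn"), ("high", "churn"), ("high", "loyal"),
    ("high", "churn"), ("medium", "loyal"),
]

def stacked_counts(pairs):
    """Group by attribute value, then count each label inside the group."""
    stacks = defaultdict(Counter)
    for value, label in pairs:
        stacks[value][label] += 1
    return stacks

stacks = stacked_counts(rows)

# One text bar per attribute value, most frequent value first.
# Each label contributes its initial letter once per occurrence.
by_total = sorted(stacks.items(), key=lambda item: -sum(item[1].values()))
for value, per_label in by_total:
    bar = "".join(label[0].upper() * n for label, n in sorted(per_label.items()))
    print(f"{value:<8}{bar:<6}{dict(sorted(per_label.items()))}")
```

The bar length answers the first question (which value is most frequent), and the mix of letters inside each bar gives the qualitative view of how the label splits within that value.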

However, I'm wondering what experts have done when doing visual EDA on categorical variables.

Thanks much,
Jonathan

land · February 2011

Hi,
you can preprocess the data with a simple process and include an Aggregate operator to count things.

Greetings,
Sebastian
