Help with Categorical Data EDA

WI_Noble Member Posts: 5 Contributor II
edited November 2018 in Help
Hi,
Assume I have two columns of data, both binomial, such as “Cell Plan” (values: True, False) and “Churn” (values: Yes, No). I want to calculate the 4-way matrix of values (Cell Plan=True, Churn=Yes; Cell Plan=False, Churn=Yes; Cell Plan=True, Churn=No; Cell Plan=False, Churn=No), as both counts and proportions. What series of operators would do this easily? I want to do Exploratory Data Analysis on categorical data in RM.
Thanks,

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    A fairly generic solution would use the operators Aggregate and Pivot: Aggregate counts the rows for each combination of values, and Pivot arranges those counts into a matrix. I have created a sample process for you named "Count co-occurences in matrix with Aggregate and Pivot" and uploaded it to myExperiment.org with our Community Extension.

    You can learn more about using the Community Extension and accessing uploaded processes at

    http://rapid-i.com/component/option,com_myblog/show,Video-on-RapidMiner-Community-Extension-myExperiment-.html/Itemid,172/
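For readers who want to see the idea outside RapidMiner, here is a minimal sketch of the same count-then-pivot logic in plain Python. It is not the uploaded process: the sample rows and the names `rows`, `count_matrix` and `prop_matrix` are made up for illustration, and the proportions are assumed to be relative to the total number of rows.

```python
from collections import Counter

# Hypothetical sample rows: (Cell Plan, Churn). Values are invented for illustration.
rows = [
    (True, "Yes"), (True, "No"), (False, "No"),
    (False, "No"), (True, "No"), (False, "Yes"),
]

# "Aggregate" step: count the rows for each (Cell Plan, Churn) combination.
counts = Counter(rows)

# "Pivot" step: one row per Cell Plan value, one column per Churn value.
plan_values = [True, False]
churn_values = ["Yes", "No"]
count_matrix = {p: {c: counts[(p, c)] for c in churn_values} for p in plan_values}

# Proportions relative to the total number of rows (an assumption; row-wise
# or column-wise proportions would divide by the matching marginal instead).
total = len(rows)
prop_matrix = {
    p: {c: count_matrix[p][c] / total for c in churn_values} for p in plan_values
}

for p in plan_values:
    print(p, count_matrix[p], prop_matrix[p])
```

Dividing each cell by its row total instead of `total` would give the churn rate within each Cell Plan group, which is often the more useful number for EDA.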

    Cheers,
    Ingo
  • WI_Noble Member Posts: 5 Contributor II
    Ingo,
    Thanks very much for your reply and the process. I've looked at it, and it is very helpful.

    I was working with a data set that has an attribute named "Churn?". After some analysis, I realized that the "?" in the attribute name was causing a problem in RM. Could you give a list of the attribute naming rules that RapidMiner accepts, or point to where such documentation exists?
    Thanks,
    Jonathan
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi Jonathan,

    Thanks for your kind words.

    In principle, there is no restriction on attribute names; RapidMiner can use essentially any character. However, this may not hold for every operator. For example, operators that use regular expressions can easily be fooled by characters like ".", "*", or "?". The same is true for the built-in expression parser, which interprets characters like "(" or ")" as function indicators. If such an operator has to work on attributes whose names contain characters that are problematic for it, you will have to rename those attributes first with the "Rename" operator. The error message should give you a hint as to which characters caused the problem.
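To make the regular-expression pitfall concrete, here is a small sketch using Python's `re` module. RapidMiner itself uses Java regular expressions, so this only illustrates the general behavior of the `?` quantifier, not RapidMiner's internals; the candidate names are invented.

```python
import re

name = "Churn?"
candidates = ["Chur", "Churn", "Churn?"]

# Used directly as a pattern, "?" makes the preceding "n" optional, so the
# pattern matches "Chur" and "Churn" but NOT the literal name "Churn?".
unescaped_hits = [s for s in candidates if re.fullmatch(name, s)]

# Escaping the name turns "?" into a literal character again.
escaped_hits = [s for s in candidates if re.fullmatch(re.escape(name), s)]

print(unescaped_hits)
print(escaped_hits)
```

So an operator that treats the attribute name as a pattern can silently select the wrong attributes, or none at all, unless the special character is escaped or the attribute is renamed.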

    Because there are too many combinations of operators, and the outcome depends entirely on the actual process, here is my recommendation: stick to letters, numbers, spaces, hyphens and underscores, and you should be fine in all cases. Please note, however, that some operators unfortunately introduce "special" characters like "(" themselves, which might cause a problem for later operators; consider renaming the affected attributes in that case.
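If the column names are cleaned before the data is loaded, the recommendation above can be sketched as a small helper. The function name `safe_name` is hypothetical, and the sketch assumes that "letters" means ASCII letters and that replacing each other character with an underscore is acceptable.

```python
import re

def safe_name(name: str) -> str:
    """Keep ASCII letters, digits, spaces, hyphens and underscores.

    Every other character is replaced with "_", and leading or trailing
    underscores and spaces are then trimmed.
    """
    cleaned = re.sub(r"[^A-Za-z0-9 _-]", "_", name)
    return cleaned.strip("_ ")

for original in ["Churn?", "count(label)", "Year 1"]:
    print(repr(original), "->", repr(safe_name(original)))
```

Two different names can collapse to the same cleaned name (for example "Churn?" and "Churn"), so it is worth checking the result for duplicates before renaming.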

    Hope that helps,
    Ingo

  • WI_Noble Member Posts: 5 Contributor II
    Thank you, again, for the response.

    I also wanted to ask about plots for Exploratory Data Analysis (EDA) with categorical variables. There are many different plotting options in RM. My goals, with a categorical label and categorical attributes, are to show two things: the count of each value of an attribute (to see which value occurs most or least frequently), and the label counts overlaid on the attribute (to get a qualitative idea of how predictive the attribute will be for the label).

    The only two plots that come close to meeting these two goals are the "Bars Stacked" and "Distribution" plots. The "Bars Stacked" plot is especially helpful. I took the "Generate Churn Data" operator and used this plot on the results. With "Group-By Column" = "Year 1" (or any other attribute), "Stack Column" = "label", and "Aggregation" = "Count", I've come closest to my goals.

    However, I'm wondering what experts have done when doing visual EDA on categorical variables.
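The numbers behind such a stacked bar plot (group by an attribute, stack by the label, aggregate by count) can be sketched in plain Python as follows. The rows, attribute values and label values are invented and are not the output of "Generate Churn Data".

```python
from collections import Counter, defaultdict

# Hypothetical rows: (attribute value, label).
rows = [
    ("A", "loyal"), ("A", "loyal"), ("A", "churn"),
    ("B", "churn"), ("B", "churn"),
    ("C", "loyal"),
]

# Group by the attribute value, then count the labels within each group.
stacks = defaultdict(Counter)
for value, label in rows:
    stacks[value][label] += 1

# One text "bar" per attribute value: the total length is the value's count,
# and each segment shows how many rows carry each label.
symbols = {"loyal": "#", "churn": "."}  # one character per label
for value in sorted(stacks):
    bar = "".join(symbols[label] * stacks[value][label] for label in sorted(symbols))
    print(f"{value} | {bar} ({sum(stacks[value].values())})")
```

The bar length answers the first question (which value is most frequent), and the mix of symbols within each bar gives the qualitative hint about how the label depends on the attribute.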

    Thanks much,
    Jonathan
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    You can preprocess the data with a simple process that includes an Aggregate operator to count the value combinations.

    Greetings,
    Sebastian