Question regarding generating a decision tree using the RapidMiner tool

toure123 Member Posts: 8 Contributor II
edited November 2018 in Help
Hello everyone,

I have a question about how to properly generate a decision tree in RapidMiner. It concerns picking the right label attribute and generating a tree that actually makes sense. I load a specific data set with the "Read Excel" operator, pick the label attribute I want, and connect the output to the "Decision Tree" operator. This is how it looks in the end:

image

But the resulting decision tree is either too small, too big, or does not show what I wanted it to represent... Is there any way to "force" the algorithm to branch on a specific column that I choose each time? Something like this:

image

If the outlook is overcast, person X will play golf. If the outlook is rain and it's windy, person X won't play golf; otherwise person X will play golf.

I'm quite new to data mining, so I'd appreciate any explanation of how to generate a proper decision tree that is actually readable...

Thanks a lot!

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,

    I think you need to get more into the data mining way of thinking. As a data scientist you are not necessarily interested in what the tree looks like (sometimes you are, but only at a rather high level).
    So what you would do is run a validation, calculate a performance measure, and optimize the parameters of the tree to get the best results.

    I would recommend our getting started tutorials to you: http://docs.rapidminer.com/studio/getting-started/

    ~Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
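As a rough illustration of "run a validation, calculate a performance measure", here is a minimal sketch in plain Python rather than RapidMiner. It assumes the classic 14-row "play golf" weather data (not the poster's data set), and the helper names `fit_majority`, `fit_one_rule`, and `cross_validated_accuracy` are made up for this example. The candidate models are simple stand-ins for different tree settings; the point is only that each candidate is scored on rows it did not see during training.

```python
from collections import Counter

# Classic "play golf" rows: (outlook, temperature, humidity, windy, play)
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rain", "mild", "high", False, "yes"),
    ("rain", "cool", "normal", False, "yes"),
    ("rain", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rain", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rain", "mild", "high", True, "no"),
]


def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]


def fit_majority(train):
    """Baseline model: always predict the most frequent training label."""
    default = majority([row[-1] for row in train])
    return lambda row: default


def fit_one_rule(train, attr_index):
    """One-level tree: predict the majority label seen for each attribute value."""
    default = majority([row[-1] for row in train])
    by_value = {}
    for row in train:
        by_value.setdefault(row[attr_index], []).append(row[-1])
    rule = {value: majority(labels) for value, labels in by_value.items()}
    return lambda row: rule.get(row[attr_index], default)


def cross_validated_accuracy(rows, fit, k):
    """Fraction of held-out rows predicted correctly; fold i holds out rows i, i+k, ..."""
    correct = 0
    for i in range(k):
        test = rows[i::k]
        train = [row for j, row in enumerate(rows) if j % k != i]
        predict = fit(train)
        correct += sum(predict(row) == row[-1] for row in test)
    return correct / len(rows)


# Hypothetical candidates standing in for different parameter settings.
CANDIDATES = {
    "majority baseline": fit_majority,
    "rule on outlook": lambda train: fit_one_rule(train, 0),
    "rule on humidity": lambda train: fit_one_rule(train, 2),
}

# k = number of rows means leave-one-out validation.
results = {
    name: cross_validated_accuracy(ROWS, fit, k=len(ROWS))
    for name, fit in CANDIDATES.items()
}
for name, accuracy in results.items():
    print(f"{name:18s} {accuracy:.3f}")
```

In RapidMiner Studio this corresponds roughly to placing the Decision Tree inside a Cross Validation operator and measuring the result with a Performance operator.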
  • toure123 Member Posts: 8 Contributor II
    Thank you for the reply, Martin. I can't agree that I'm not interested in what the tree looks like: whenever I get a resulting decision tree, the results need to be explained properly, or at least be readable to the person looking at the tree. I've also noticed that besides the label type of attribute there are several more types, like id and base_value. What are those for?

    Best regards

    P.S. Does the "Decision Tree" operator include several algorithms? I'd like to use the CART algorithm to generate the tree, but I'm not exactly sure how to do that...
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,

    About the look: well, as I said, as a data scientist you are interested in performance, not necessarily in understandability. Most advanced algorithms (SVM, Neural Net, Random Forest) are hard to represent visually at all. Your approach is a rather explorative one, which is fine, but it is a somewhat different way of thinking. The explorative phase might be something you do before you start the actual modelling.

    On the types: these are called roles. Every column can have one role. An id is used e.g. in joins; a cluster is the result of a clustering algorithm. By the way, you can type any word into that field. The result is a special attribute. Those are useful because special attributes are ignored by operators unless you either explicitly tell the operator to use them (use special attributes) or the operator needs them to do its job (the label for learners).

    On CART: the standard RapidMiner Decision Tree is its own implementation. I think it is close to CART if you only use numerical values or something like that. If you want a "real" CART, you need to use the Weka extension. There is W-SimpleCart as well as W-J48, which is the C4.5 implementation.

    ~Martin
  • toure123 Member Posts: 8 Contributor II
    Okay, thank you very much for the explanation, Martin. I have one final question about RapidMiner itself. Which of these options should I pick to tell the decision tree "this is the attribute from which I want to start branching"? The options listed in my RapidMiner are:

    - attribute
    - label
    - id
    - weight
    - batch
    - cluster
    - prediction
    - outlier
    - cost
    - base_value

    I understand that the label column is the way of telling RapidMiner which column I want to build the model for and understand. But how do I tell RapidMiner which attribute I want to start branching from? This is what confuses me most...
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    You cannot. The decision tree picks its splits on its own. That's the learning.
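To make "the tree picks its splits on its own" concrete, here is a small sketch in plain Python, not RapidMiner's actual implementation. It assumes the classic 14-row "play golf" weather data and uses information gain as the criterion; the function names are invented for the example. The learner scores every attribute and takes the best one as the root, which is why there is no role that means "start branching here".

```python
from collections import Counter
from math import log2

# Classic "play golf" rows: (outlook, temperature, humidity, windy, play)
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rain", "mild", "high", False, "yes"),
    ("rain", "cool", "normal", False, "yes"),
    ("rain", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rain", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rain", "mild", "high", True, "no"),
]
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the label minus the weighted entropy after splitting on one attribute."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Score every attribute; the learner, not the user, picks the winner.
gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)

for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name:12s} {gain:.3f}")
print("root split:", best_attribute)
```

On this data the highest-scoring attribute is outlook, which matches the root of the golf tree described in the question.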
  • toure123 Member Posts: 8 Contributor II
    Okay, that's quite new to me... From your experience, what is the best attribute to pick as the one the model is built on? Is it the attribute with the largest number of distinct values, like grades or countries? Is an attribute a better choice the more values it has? I have an attribute that represents the person's sex, male or female, and when I choose that attribute as the label, the resulting decision tree is:

    image

    This isn't a tree, it's a rectangle with a label "M" inside it lol...

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    In this case the tree did not find any suitable split. Try lowering the minimal gain. This is how good a split needs to be in order to be taken. The default RapidMiner setting is rather restrictive. Try 0.001.

    ~Martin
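A sketch of what a minimal gain threshold does, in plain Python and under stated assumptions: the ten rows, the attribute name `group`, and the function `choose_split` are hypothetical, information gain stands in for whichever criterion is selected, and RapidMiner's exact internal computation may differ. When the best available split scores below the threshold, the learner stops and returns a single leaf with the majority label, which is the rectangle with "M" inside it.

```python
from collections import Counter
from math import log2

# Hypothetical data: the label is the person's sex, "group" is a weak predictor.
ROWS = (
    [{"group": "a", "label": "M"}] * 4
    + [{"group": "a", "label": "F"}] * 2
    + [{"group": "b", "label": "M"}] * 2
    + [{"group": "b", "label": "F"}] * 2
)


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def choose_split(rows, attributes, minimal_gain):
    """Return (attribute, gain); attribute is None if no split reaches minimal_gain."""
    labels = [row["label"] for row in rows]
    best_attr, best_gain = None, 0.0
    for attr in attributes:
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row["label"])
        remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
        gain = entropy(labels) - remainder
        if gain > best_gain:
            best_attr, best_gain = attr, gain
    if best_gain < minimal_gain:
        return None, best_gain  # stop here: the node becomes a leaf
    return best_attr, best_gain


majority_label = Counter(row["label"] for row in ROWS).most_common(1)[0][0]

# Compare a strict threshold with the lower value suggested above.
for threshold in (0.1, 0.001):
    attr, gain = choose_split(ROWS, ["group"], threshold)
    outcome = f"split on {attr}" if attr else f"single leaf '{majority_label}'"
    print(f"minimal gain {threshold}: best gain {gain:.3f} -> {outcome}")
```

With the strict threshold the only candidate split is rejected and the whole tree collapses to one leaf; with 0.001 the same split is accepted.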
  • toure123 Member Posts: 8 Contributor II
    Yes, now I've got something... although I'm not sure what it is... But is the decision tree I get now a valid one? I mean, does the algorithm represent the data correctly?
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    It depends on what you call correct.

    As a data scientist I would measure how accurate my model is. But keep in mind: there is a big difference between describing data well and predicting future data.

    ~Martin
  • toure123 Member Posts: 8 Contributor II
    I'm actually trying to describe the data as well as I can and then represent it visually. That's what I'd like my model to do. Which parameters do I need to take into consideration for that?

    There's the criterion: information gain, gini index, gain ratio, accuracy. I guess I'd need to pick accuracy in my case then?
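For reference, here is a hedged sketch of how the four criteria could score a candidate split, written in plain Python. The formulas for information gain, gain ratio, and Gini are the textbook ones; the `accuracy` score here (share of rows that a majority vote per branch gets right) is only an assumption about what that option means, and the 16-row data set with its `row_id` column is invented. It also touches the earlier question about attributes with many distinct values.

```python
from collections import Counter
from math import log2

# Invented data: "windy" is a real predictor, "row_id" is unique per row.
LABELS_BY_WINDY = {True: ["yes"] + ["no"] * 7, False: ["yes"] * 7 + ["no"]}
ROWS = []
for windy, labels in LABELS_BY_WINDY.items():
    for label in labels:
        ROWS.append({"row_id": len(ROWS), "windy": windy, "label": label})


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())


def scores(rows, attr):
    """Score a split on one attribute under four criteria (higher is better)."""
    labels = [row["label"] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row["label"])
    branches = list(groups.values())
    weights = [len(b) / len(rows) for b in branches]
    info_gain = entropy(labels) - sum(w * entropy(b) for w, b in zip(weights, branches))
    # Split information grows with the number of branches and penalizes them.
    split_info = -sum(w * log2(w) for w in weights)
    return {
        "information_gain": info_gain,
        "gain_ratio": info_gain / split_info if split_info > 0 else 0.0,
        "gini_reduction": gini(labels) - sum(w * gini(b) for w, b in zip(weights, branches)),
        "accuracy": sum(Counter(b).most_common(1)[0][1] for b in branches) / len(rows),
    }


for attr in ("row_id", "windy"):
    line = ", ".join(f"{name}={value:.3f}" for name, value in scores(ROWS, attr).items())
    print(f"{attr:7s} {line}")
```

Information gain rates the unique `row_id` above `windy`, although a split on `row_id` only memorizes the rows. Gain ratio reverses the order because it divides by the split information, so more distinct values does not automatically mean a better attribute.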
  • haddock Member Posts: 849 Maven
    Hi,
    "I'm actually trying to describe data as best as I can and represent it visually then"

    Sounds to me like you actually want a drawing tool. Have you tried Gliffy diagrams?

    H
  • toure123 Member Posts: 8 Contributor II
    I will check it out, ty :)...

    PS: Just one more question, guys... Which algorithm does the regular "Decision Tree" operator in RapidMiner use? I've read somewhere that it's a combination of the CHAID and ID3 algorithms?

    Thanks a lot once again! :)