RM Decision Trees, Adaboost

Legacy User Member Posts: 0 Newbie
edited November 2018 in Help
Random question on how decision trees work in RapidMiner. I'm running a decision tree for a predictive model and at the moment I'm just splitting my dataset into 80% train / 20% test. It's a polynominal (multiclass) classification problem with numerical and nominal attributes. Two questions:

1) When I run a single decision tree with the Split Validation operator, why does it run the decision tree training twice? Looking at the log, it runs once, then I see the validation still running and a [2] Decision Tree in the log.

2) When I use AdaBoost to boost the decision trees, the run time and memory usage increase exponentially with each iteration, e.g. 30 minutes for the first, then 1 hour, then 2 hours, and so on. Obviously I can't run a model with this kind of resource usage, but why is this the case? I've tried boosting methods in other programs and have not run into exponentially increasing run times. Do I have a parameter set wrong?

Thanks!

Answers

  • kovacs_balazs_k Member Posts: 2 Contributor I
    edited July 2020
    Same issue here in 2020, with the compatibility level of the Split Validation operator set to 9.7.001.
    I noticed this when I analyzed the execution times in the logs. I also checked the process status bar and saw that there is indeed a modeling operator (Neural Net or SVM) with an index of [2], so the training phase runs twice.
    Edit: I investigated the issue using breakpoints after the Neural Net operator. The first time, it trains the network on only 70% of the examples; the second time, it trains on the entire dataset.

    Edit 2: After investigating further, I think I figured out why the Split Validation operator behaves like this. Its main steps are:
    1) It runs the training subprocess on the training data set, which is 70% of the entire sample by default, and stores the resulting model (let's call it model1) for later use in the testing subprocess. If a performance is measured for this model, it is delivered later on one of the corresponding ave output ports of the Split Validation operator.
    2) It runs the training subprocess again on the entire sample (100%). The resulting model (let's call it model2) is delivered later on the mod output port of the Split Validation operator.
    3) It runs the testing subprocess on the remaining portion of the sample (30% by default). The inner mod input port of the testing subprocess delivers model1 for testing. If a performance is measured for this model, it is delivered later on one of the corresponding ave output ports of the Split Validation operator.

    So this behavior is intentional, but it would be better if a parameter let me turn off training on the entire data set while I am searching for the best parameter combination. That could roughly halve the search time.
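
The three steps described above can be sketched in plain Python. This is a minimal illustration and not RapidMiner's implementation: the toy majority-class learner, the function names, and the `build_final_model` flag (standing in for the wished-for parameter) are all assumptions made for this example.

```python
import random
from collections import Counter


def train_majority(rows, training_log):
    """Toy learner (hypothetical): the 'model' is just the most frequent label.

    Every call appends the number of training rows to training_log, so the
    log shows how often, and on how much data, training was executed.
    """
    training_log.append(len(rows))
    return Counter(label for _, label in rows).most_common(1)[0][0]


def split_validation(rows, split_ratio=0.7, build_final_model=True, seed=0):
    """Sketch of the described steps; build_final_model is an assumed flag."""
    training_log = []
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * split_ratio)
    train, test = shuffled[:cut], shuffled[cut:]

    # Step 1: train model1 on the training portion only.
    model1 = train_majority(train, training_log)

    # Step 2: train model2 on the entire sample, unless this is switched off.
    model2 = train_majority(shuffled, training_log) if build_final_model else None

    # Step 3: measure the performance of model1 on the held-out portion.
    accuracy = sum(model1 == label for _, label in test) / len(test)
    return model2, accuracy, training_log


# 100 toy rows: one numeric feature and a label, 75 "yes" and 25 "no".
rows = [((i,), "no" if i % 4 == 0 else "yes") for i in range(100)]

_, _, log_both = split_validation(rows)
print(log_both)  # [70, 100]: two training runs, the second on all rows

_, _, log_single = split_validation(rows, build_final_model=False)
print(log_single)  # [70]: only the validation model is trained
```

With a learner that takes hours per run, skipping the second training is where the time saving during a parameter search would come from.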
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi,

    if the Model output of the validation is not connected, it shouldn't run the model building twice.

    Regards,
    Balázs
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    If you think of Split Validation as a kind of Cross-Validation then it makes sense. First the model is trained on the training fold and performance statistics are collected on the held-out fold, and then a model is created using all of the available data for later deployment.
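
The same pattern can be sketched for k-fold cross-validation: k training runs for the performance estimate, plus one more on all data for the model that gets deployed. This is a minimal plain-Python illustration under stated assumptions, not RapidMiner's implementation; the toy majority-class learner and all names are hypothetical.

```python
from collections import Counter


def train_majority(rows, training_log):
    """Toy learner (hypothetical): the 'model' is the most frequent label."""
    training_log.append(len(rows))  # record each training run and its size
    return Counter(label for _, label in rows).most_common(1)[0][0]


def cross_validation(rows, k=5):
    """k-fold validation followed by a final model trained on all rows."""
    training_log = []
    fold_accuracies = []
    for fold in range(k):
        # Row i belongs to test fold i % k; all other rows are used for training.
        test = rows[fold::k]
        train = [row for i, row in enumerate(rows) if i % k != fold]
        model = train_majority(train, training_log)
        hits = sum(model == label for _, label in test)
        fold_accuracies.append(hits / len(test))

    # The performance estimate comes from the fold models above;
    # the model handed on for deployment is trained once more on all rows.
    final_model = train_majority(rows, training_log)
    return final_model, sum(fold_accuracies) / k, training_log


# 100 toy rows: one numeric feature and a label, 75 "yes" and 25 "no".
rows = [((i,), "no" if i % 4 == 0 else "yes") for i in range(100)]
final_model, mean_accuracy, log = cross_validation(rows, k=5)
print(log)  # [80, 80, 80, 80, 80, 100]: k fold models plus one final model
```

Split validation is the special case with a single fold model, which is why the log there shows two training runs.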
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Why not simply use cross-validation and avoid all these pitfalls associated with split validation in the first place?
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts