"Is clustering and Decision Tree supposed to take hours to process?"

GViasuRaeisaene Member Posts: 1 Contributor I
edited June 2019 in Help

Hi, 

 

I'm on a tight schedule and using RapidMiner for the first time. At the moment I have been running Agglomerative Clustering for over 5 hours, and I'm not sure whether I should keep letting it run or whether something is wrong and I'm just wasting my time. My example set has 241,762 examples and 25 attributes, most of which are polynominal. I ran into the same problem when trying to create a Decision Tree, but I killed that process after 5 hours. 

 

Thanks,

Geta

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    It's hard to tell without seeing your process and data. Are the polynominals transformed into numbers via dummy coding?  Normally Decision Trees are fast, so there must be a problem somewhere. 
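    In case dummy coding is unfamiliar: it turns each value of a nominal attribute into its own 0/1 indicator column. A minimal sketch in Python with pandas (the attribute name and values here are made up for illustration; in RapidMiner itself you would use the equivalent operator rather than pandas):

```python
import pandas as pd

# Toy data standing in for one polynominal attribute (hypothetical values).
df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

# Dummy coding: one 0/1 indicator column per distinct value.
dummies = pd.get_dummies(df["color"], prefix="color")
print(dummies.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
```

    Note that with 25 mostly polynominal attributes, dummy coding can blow up the column count quickly if some attributes have many distinct values.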

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

    Agglomerative clustering is always very slow for many examples (rows).  The same is true for decision trees with nominal attributes that have massive numbers of possible values.  I would suggest using the following web site to find out which algorithms are feasible:

     

    http://mod.rapidminer.com/

     

    For clustering, I would try "k-Means (fast)" and even that might easily take some time.  For classification, I would start with Naive Bayes or k-NN which in general are pretty fast algorithms.

     

    Hope this helps,

    Ingo
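    A back-of-envelope calculation illustrates why agglomerative clustering stalls at this data size (a rough sketch, assuming the usual hierarchical-clustering approach of materializing the full pairwise distance matrix, which needs O(n²) memory and at least O(n²) time):

```python
# Sizing estimate for hierarchical clustering on the question's data.
n = 241_762                 # examples in the original post
pairs = n * (n - 1) // 2    # distinct pairwise distances to compute/store
gb = pairs * 8 / 1e9        # at 8 bytes per double-precision distance
print(pairs)                # 29224311441 pairs
print(round(gb))            # ~234 GB just for the distance matrix
```

    By contrast, one k-Means iteration only touches n × k distances, which is why it remains practical at this scale.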

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    In general I would be wary of using nominal attributes that have a high number of possible values in a predictive model.  Usually these types of attributes do not generalize very well because the patterns that are in the training data are too specific and simply overfit to the training sample.  You might want to consider some kind of feature engineering to reduce the number of possible values by aggregating or combining values in some sensible ways (e.g., 5-digit zip code to region, IP address to country, name to gender, etc.).  

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
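    To illustrate the kind of value aggregation suggested above, here is a hypothetical sketch that collapses a high-cardinality nominal attribute (5-digit US ZIP codes, up to ~100,000 values) into a handful of coarse regions (the region mapping below is illustrative, not an official classification):

```python
# Hypothetical feature-engineering helper: reduce ZIP codes to regions.
def zip_to_region(zip_code: str) -> str:
    # The first digit of a US ZIP code roughly tracks geography,
    # so bucketing on it shrinks ~100,000 values down to a few.
    regions = {
        "0": "Northeast", "1": "Northeast",
        "2": "Southeast", "3": "Southeast",
        "4": "Midwest",   "5": "Midwest",  "6": "Midwest",
        "7": "South",     "8": "West",     "9": "West",
    }
    return regions.get(zip_code[:1], "Unknown")

print(zip_to_region("02139"))  # Northeast
print(zip_to_region("94105"))  # West
```

    After a transformation like this, both the decision tree (far fewer candidate splits per attribute) and any dummy-coded representation (far fewer indicator columns) become much cheaper.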