Replicating RapidMiner RandomForest Results in Python

B00100719 Member Posts: 11 Contributor II
edited January 11 in Help
Hi, I have a Random Forest binary classification model with 13 variables left after dimensionality reduction. Most are numeric, but I also have a date and a couple of polynominal attributes (e.g. SIC code). I am getting accuracy of almost 75%, which I am reasonably pleased with given the complexity of the problem.

However, I would now like to replicate the RapidMiner results in Python. To do so, I need to understand better how RapidMiner handles string data. For example, one of my string attributes is a SIC code (Standard Industrial Classification). These codes look numeric, but I treat them as polynominal so that the algorithm does not assign them an order, which wouldn't make sense.

When it comes to attributes like these, I don't know how RapidMiner uses them. Python libraries like sklearn require all Random Forest inputs to be numeric and suggest techniques such as one-hot encoding for converting non-numeric data. However, there are over 800 unique SIC codes in my data, so one-hot encoding is not practical. The SIC code also appears to be a very important attribute, so I cannot just remove it.

Is RapidMiner performing one-hot encoding in the background here?
What Python library should I use to behave most like RapidMiner, allowing polynominal attributes and dates?
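
On the date attribute: sklearn estimators accept only numeric arrays, so a date has to be turned into numbers before training. A minimal sketch using only the standard library follows. The function name `date_features` and the choice of features are illustrative assumptions, not a description of what RapidMiner does internally.

```python
from datetime import date

def date_features(d: date) -> dict:
    """Turn a date into numeric features a tree model can split on."""
    return {
        "ordinal": d.toordinal(),  # day count, preserves chronological order
        "year": d.year,
        "month": d.month,
        "weekday": d.weekday(),    # Monday = 0 ... Sunday = 6
    }

# Hypothetical sample date
features = date_features(date(2019, 1, 11))
print(features)
```

The ordinal keeps the ordering of dates, while month and weekday let a tree pick up seasonal patterns if there are any.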

Answers

  • yyhuang Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 195  RM Data Scientist
    Hi @B00100719,

    Machine learning models differ in their support for numerical and nominal attributes and labels. The Random Forest algorithm handles both numerical and nominal attributes. If you need to encode the SIC code for an SVM, which cannot handle nominal attributes, try the dummy coding or unique integers methods in the "Nominal to Numerical" operator.
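
To make the two options concrete, here is a minimal plain-Python sketch of what dummy coding and unique-integer coding produce for a nominal column. The helper names and the sample SIC codes are made up for illustration. In practice pandas `get_dummies` or sklearn's `OneHotEncoder` and `OrdinalEncoder` cover the same ground.

```python
def unique_integer_coding(values):
    """Map each distinct category to an arbitrary integer ID."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def dummy_coding(values):
    """Create one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

sic = ["7372", "0111", "7372", "6021"]  # hypothetical sample values
int_codes, mapping = unique_integer_coding(sic)
dummies, columns = dummy_coding(sic)
```

Unique integers keep a single column but impose an arbitrary order on the codes. Dummy coding avoids the order at the cost of one column per distinct code.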

     

    In other cases, e.g. ZIP codes in the United States, the attribute looks numeric (10003, 02184), but we want to make it nominal to keep the leading zeros. We use "Numerical to Polynominal" to convert the ZIP codes.
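
The same leading-zero issue shows up on the Python side. A small sketch, assuming five-digit US ZIP codes (the variable names are illustrative):

```python
zip_as_number = int("02184")  # numeric parsing drops the leading zero
zip_as_text = "02184"         # a string (nominal) value keeps it

# If the zeros were already lost, pad back to five digits
restored = str(zip_as_number).zfill(5)
```

When loading with pandas, passing `dtype=str` for that column to `read_csv` avoids the numeric parsing in the first place.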

    HTH!
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 272   Unicorn
    Decision trees and random forests have well-defined algorithms for handling categorical data.
    You should look for a function in your environment that implements this functionality.
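
As an illustration of what native handling of a nominal attribute can look like, the sketch below scores a multiway split (one branch per distinct value) by Gini impurity reduction. This is one textbook approach. RapidMiner's exact criterion and split strategy are not described in this thread, so the function names and toy data are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def nominal_split_gain(values, labels):
    """Impurity reduction from one branch per distinct nominal value."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    weighted = sum(len(g) / n * gini(g) for g in groups.values())
    return gini(labels) - weighted

# Hypothetical toy data: SIC code vs. binary label
sic = ["7372", "7372", "0111", "0111", "6021", "6021"]
label = [1, 1, 0, 0, 1, 0]
gain = nominal_split_gain(sic, label)
```

No numeric encoding is needed here: the values are only compared for equality, so no artificial order is introduced.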

    However, dummy coding is of course functionally equivalent - it just creates hundreds of new 0/1 attributes. 
    As an optimization, you could look at your trees in RapidMiner and check whether only a few values of the nominal attributes are relevant. You would then keep only these and change the rest to a constant value.
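
A minimal sketch of that optimization in plain Python. The list of relevant codes would come from inspecting the trees in RapidMiner. Here it is a made-up example, as are the function name and the `"OTHER"` placeholder.

```python
def keep_relevant(values, relevant, other="OTHER"):
    """Keep only the listed nominal values; map the rest to one constant."""
    relevant = set(relevant)
    return [v if v in relevant else other for v in values]

sic = ["7372", "0111", "6021", "7372", "5812"]  # hypothetical sample
relevant_codes = ["7372", "6021"]               # e.g. values seen in tree splits
reduced = keep_relevant(sic, relevant_codes)
```

Dummy coding the reduced column then yields at most one column per relevant value plus one for the constant, instead of hundreds.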
  • SGolbert RapidMiner Certified Analyst, Member Posts: 326   Unicorn
    If you leave the variable as nominal, RM will perform dummy coding. I don't think that is recommendable.

    You have to clean the data somehow. Perhaps the codes are not relevant to the prediction and you can drop the variable. Perhaps you can recategorize the codes into a variable with a low number of bins.
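
One way to recategorize is to roll each code up to its top-level division. The sketch below assumes the 1987 US SIC scheme, where the first two digits (the major group) determine the division. If the data uses a different SIC variant (for example UK SIC), the ranges would need to be replaced. The function name is illustrative.

```python
# Assumed mapping: 1987 US SIC divisions by two-digit major group.
SIC_DIVISIONS = [
    (1, 9, "Agriculture, Forestry, Fishing"),
    (10, 14, "Mining"),
    (15, 17, "Construction"),
    (20, 39, "Manufacturing"),
    (40, 49, "Transportation and Utilities"),
    (50, 51, "Wholesale Trade"),
    (52, 59, "Retail Trade"),
    (60, 67, "Finance, Insurance, Real Estate"),
    (70, 89, "Services"),
    (91, 99, "Public Administration"),
]

def sic_division(code: str) -> str:
    """Bucket a four-digit SIC code string into its division."""
    major = int(str(code).zfill(4)[:2])  # zfill restores a lost leading zero
    for low, high, name in SIC_DIVISIONS:
        if low <= major <= high:
            return name
    return "Unknown"

divisions = [sic_division(c) for c in ["7372", "0111", "6021", "1800"]]
```

This reduces hundreds of codes to about ten bins, which is small enough to dummy code. Whether the coarser variable stays predictive has to be checked on the data.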

    Regards,
    Sebastian
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,064  RM Data Scientist
    SGolbert wrote: "If you leave the variable as nominal, RM will perform dummy coding. I don't think that is recommendable."
    This is not true for Random Forests. We just work on the nominal values directly. I think sklearn does not have this capability.
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • SGolbert RapidMiner Certified Analyst, Member Posts: 326   Unicorn
    Hi Martin,

    I think we are saying the same thing. Just to clarify, it is not good to do dummy coding on a variable that can have thousands of possible values. I was not talking about the Random Forest operator itself.

    In any case, one should know what the operator does, to avoid unwanted behavior.

    Regards,
    Sebastian
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,064  RM Data Scientist
    @SGolbert,
    Okay, sorry. That's what I meant too!

    BR,
    martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • B00100719 Member Posts: 11 Contributor II
    Thanks for all of your responses.

    Converting polynominal attributes to numerical seems like a really fundamental requirement that RapidMiner should do a better job of assisting with.

    Regarding the responses above:

    @yyhuang - thank you but I think you missed the content of my question

    @SGolbert - thank you - I agree that leaving variables as polynominal isn't desirable. Unfortunately, as indicated in my OP, the SIC codes are apparently predictive, and I can't confirm that for certain until I clean them properly, so I cannot just omit them. Secondly, SIC codes are not my only polynominal variable. There are also occupation codes, which may be highly predictive too, and there are many of those as well.

    @[email protected] - thank you also. Regarding your comment about handling polynominal data, what in your experience is a good way to do it? The options as I see them are as follows:

    1. One-hot encoding (impractical unless I can at least bucket the data into a higher level; still working that out)

    2. Feature hashing - I know next to nothing about this

    3. Word2Vec - I have doubts about how effective this would be if I trained my own Word2Vec model. Would it work with a pretrained one, for example the one created by Google?

    Other options?
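
For option 2, feature hashing maps each nominal value to one of a fixed number of buckets with a hash function, so the output width no longer depends on how many distinct codes exist. A minimal standard-library sketch follows. The function name and bucket count are illustrative assumptions. sklearn ships a `FeatureHasher` class for production use.

```python
import hashlib

def hash_feature(value: str, n_buckets: int = 32) -> list:
    """Hash a nominal value into a fixed-width 0/1 indicator vector."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    index = int(digest, 16) % n_buckets  # stable across runs, unlike hash()
    vector = [0] * n_buckets
    vector[index] = 1
    return vector

vec_a = hash_feature("7372")  # hypothetical SIC codes
vec_b = hash_feature("7372")
vec_c = hash_feature("0111")
```

The trade-off is collisions: with 800 distinct codes and 32 buckets, several codes necessarily share a bucket, so the bucket count is a tuning choice between width and information loss.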