Replicating RapidMiner RandomForest Results in Python

B00100719 Member Posts: 11 Contributor II
edited January 2019 in Help
Hi, I have a Random Forest binary classification model with 13 variables after dimensionality reduction. Most are numeric, but I also have a date and a couple of polynominal attributes (e.g. SIC code). I am getting accuracy of almost 75% which, given the complexity of the problem, I am reasonably pleased with.

However, I would now like to try to replicate the RapidMiner results in Python. To do so, I would like to understand a little better how RapidMiner handles string data in its calculations. For example, one of my string attributes is a SIC code (Standard Industrial Classification). These codes appear numeric, but I am treating them as polynominal so that the algorithm does not try to assign an order to them, which wouldn't make sense.

When it comes to attributes like these, I don't know how RapidMiner is using them. Python libraries like sklearn require all Random Forest inputs to be numeric and suggest techniques such as one-hot encoding for converting non-numeric data to numeric. However, there are over 800 unique SIC codes in my data, so one-hot encoding is not practical. The SIC code also appears to be a very important attribute, so I cannot simply remove it.

Is RapidMiner performing one-hot encoding in the background here?
What Python library should I use to behave most like RapidMiner, allowing polynominal attributes and dates?

Answers

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
    Hi @B00100719,

    Machine learning models differ in whether they can handle numerical and nominal attributes or labels. The Random Forest algorithm handles both numerical and nominal attributes. If you need to encode the SIC code for an SVM, which cannot handle nominal attributes, try the dummy coding or unique integers methods in the "Nominal to Numerical" operator.
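
For anyone reproducing these two encodings outside RapidMiner, here is a minimal pure-Python sketch of what dummy coding and unique-integer coding produce. The sample codes and variable names are made up for illustration, and this is not RapidMiner's implementation. In pandas, `get_dummies` and `factorize` perform comparable encodings.

```python
# Hypothetical sample of SIC codes, kept as strings so they stay nominal.
sic_codes = ["2834", "7372", "6021", "7372", "2834"]

# The distinct values, in a fixed order so the encoding is reproducible.
categories = sorted(set(sic_codes))

# Unique integers: each category is replaced by its index.
# This implies an order (0 < 1 < 2) that a nominal code does not have.
to_int = {cat: i for i, cat in enumerate(categories)}
integer_coded = [to_int[code] for code in sic_codes]

# Dummy coding: one 0/1 column per category, so no order is implied,
# but 800 distinct codes would produce 800 columns.
dummy_coded = [
    [1 if code == cat else 0 for cat in categories]
    for code in sic_codes
]

print(integer_coded)
print(dummy_coded[0])
```

The integer version stays compact but lets a tree split on "code < k", which treats unrelated industries as neighbours. The dummy version avoids that at the cost of width.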

     

    In other cases, e.g. zip codes in the United States, the attribute looks numeric (10003, 02184), but we want to make it nominal to keep the leading zeros. There we use "Numerical to Polynominal" to convert the zip codes.

    HTH!
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Decision trees and random forests have well-defined algorithms for handling categorical data.
    You should find a function in your environment that implements this functionality.

    However, dummy coding is of course functionally equivalent; it just creates hundreds of new 0/1 attributes.
    As an optimization, you could look at your trees in RapidMiner and check whether only a few values of the nominal attributes are relevant. You would then keep only these and change the rest to a constant value.
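
The optimization described above could look roughly like this in Python. It is only a sketch: the `relevant` set stands in for whatever values you read off the RapidMiner trees, and the helper name and the "OTHER" placeholder are made up.

```python
def collapse_categories(values, keep, other="OTHER"):
    """Keep the values listed in `keep`; replace everything else with `other`."""
    keep = set(keep)
    return [v if v in keep else other for v in values]


# Hypothetical data: six SIC codes, of which two were seen in tree splits.
sic_codes = ["2834", "7372", "6021", "0111", "7372", "5812"]
relevant = {"2834", "7372"}  # assumption: values identified from the trees

collapsed = collapse_categories(sic_codes, relevant)
print(collapsed)
```

After collapsing, dummy coding yields one column per kept value plus one for the constant, instead of hundreds.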
  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn
    If you leave the variable as nominal, RM will perform dummy coding. I don't think that is recommendable.

    You have to clean the data somehow. Perhaps the codes are not relevant to the prediction and you can drop the variable. Perhaps you can recategorize the codes into a variable with a low number of bins.

    Regards,
    Sebastian
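
One way to recategorize SIC codes into fewer bins, sketched in Python: the first two digits of a four-digit SIC code identify its major group, so truncating gives a much smaller set of categories. The function name and sample values are hypothetical, and whether the coarser grouping keeps the predictive signal would have to be checked on the real data.

```python
def sic_major_group(code):
    """Return the two-digit major group of a four-digit SIC code."""
    # Restore leading zeros that are lost if the code was read as a number.
    text = str(code).strip().zfill(4)
    return text[:2]


# Hypothetical sample; 111 imitates the code "0111" loaded as an integer.
sic_codes = ["2834", "2836", "7372", 111, "7371"]
groups = [sic_major_group(c) for c in sic_codes]
print(groups)
```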
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,525 RM Data Scientist
    SGolbert wrote: "If you leave the variable as nominal, RM will perform dummy coding. I don't think that is recommendable."
    This is not true for Random Forests; we work directly on the nominal values. I think sklearn does not have this capability.
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn
    Hi Martin,

    I think we are saying the same thing. Just to clarify, it is not good to apply dummy coding to a variable that can have thousands of possible values. I was not talking about the Random Forest operator itself.

    In any case, one should know what the operator does, to avoid unwanted behavior.

    Regards,
    Sebastian
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,525 RM Data Scientist
    @SGolbert,
    Okay, sorry. That's what I meant too!

    BR,
    martin
  • B00100719 Member Posts: 11 Contributor II
    Thanks for all of your responses.

    Converting polynominal to numerical seems like a really fundamental requirement that RapidMiner should do a better job of assisting with.

    Regarding the responses above:

    @yyhuang - thank you but I think you missed the content of my question

    @SGolbert - thank you - I agree that leaving variables as polynominal isn't desirable. Unfortunately, as indicated in my OP, the SIC codes are apparently predictive, and I can't confirm that with certainty until I clean them properly, so I cannot just omit them. Secondly, SIC codes are not my only polynominal variable: there are also occupation codes, which may be highly predictive too, and there are many of those as well.

    @BalazsBarany - thank you also. Regarding your comment about handling polynominal data, what in your experience is a good way to do it? The options as I see them are as follows:

    1. One-hot encoding (impractical unless I can at least bucket the data into a higher level; still working that out)

    2. Feature hashing - I know next to nothing about this

    3. Word2Vec - I doubt how effective this would be if I trained my own Word2Vec model. Would it be possible to use a pretrained one, for example the one created by Google?

    Other options?
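
For option 2, here is a minimal sketch of the hashing trick in plain Python: each code is hashed into one of a fixed number of buckets, so the width of the encoding no longer depends on how many distinct codes exist. The function name and bucket count are made up for illustration. scikit-learn ships a ready-made version as `sklearn.feature_extraction.FeatureHasher`, which differs in detail (for example, it uses a signed hash).

```python
import hashlib


def hash_feature(value, n_buckets=16):
    """Map a nominal value to a 0/1 vector of length n_buckets via a stable hash."""
    # md5 is used only because it is deterministic across runs and machines,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % n_buckets
    vector = [0] * n_buckets
    vector[bucket] = 1
    return vector


# Hypothetical sample of SIC codes.
sic_codes = ["2834", "7372", "6021", "7372"]
hashed = [hash_feature(code) for code in sic_codes]

print(len(hashed[0]))          # width is fixed by n_buckets
print(hashed[1] == hashed[3])  # the same code always lands in the same bucket
```

The trade-off is collisions: with 800 codes and far fewer buckets, unrelated codes will share a column, so the bucket count is something to tune against model accuracy.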