Replicating RapidMiner RandomForest Results in Python

B00100719 · January 2019

Hi, I have a Random Forest binary classification model which, after dimensionality reduction, uses 13 variables. Most are numeric. However, I also have a date and a couple of polynominal attributes (e.g. SIC code). I am getting an accuracy of almost 75% which, given the complexity of the problem, I am reasonably pleased with.

However, I would now like to try to replicate the RapidMiner results in Python. To do so, I would like to understand a little better how RapidMiner performs calculations on string data. For example, one of my string attributes is a SIC code (Standard Industrial Classification). These codes appear numeric, but I am treating them as polynominal to stop the algorithm from assigning them an order of importance, which wouldn't make sense.

When it comes to attributes like these, I don't know how RapidMiner is using them. Python libraries like sklearn require all Random Forest inputs to be numeric and suggest things like one-hot encoding for converting non-numeric data to numeric. However, there are over 800 unique SIC codes in my data, so one-hot encoding is not practical. The SIC code also appears to be an attribute of very high importance, so I cannot just remove it.

Is RapidMiner performing one-hot encoding in the background here?
What Python library should I use to behave most like RapidMiner, allowing polynominals and dates?
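
On the date attribute: sklearn-style estimators expect numbers, so one common approach is to expand each date into numeric parts. A minimal sketch using only the standard library; the function name `date_features` and the choice of parts are illustrative assumptions, not a description of what RapidMiner does internally.

```python
from datetime import date

def date_features(d: date) -> dict:
    """Turn a date into numeric features a tree model can split on (illustrative choice)."""
    return {
        "year": d.year,
        "month": d.month,
        "day_of_week": d.weekday(),  # Monday = 0 ... Sunday = 6
        "ordinal": d.toordinal(),    # day count that preserves chronological order
    }

features = date_features(date(2019, 1, 15))
print(features)
```

Which parts are useful depends on the problem; the ordinal keeps ordering, while month and weekday can expose seasonal patterns.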

yyhuang · January 2019

Hi @B00100719,

Machine learning models differ in which attribute and label types they can handle (numerical or nominal). The random forest algorithm handles both numerical and nominal attributes. If you need to encode the SIC code for an SVM, which cannot handle nominal attributes, try the dummy coding or unique integers methods in the "Nominal to Numerical" operator.

Image: https://us.v-cdn.net/6030995/uploads/editor/3h/l1tkzt2ls5uc.jpg
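
For readers replicating this in Python, here is a rough plain-Python sketch of the two encodings named above. The sample codes are made up, and this is only an illustration of the idea, not the operator's actual implementation. Libraries such as pandas offer equivalents (`get_dummies`, `factorize`).

```python
codes = ["2834", "7372", "2834", "6021"]  # made-up SIC codes kept as strings

# Unique integers: each distinct code gets an arbitrary integer id.
# Note that this imposes an artificial order on the codes.
categories = sorted(set(codes))
index_of = {c: i for i, c in enumerate(categories)}
integer_coded = [index_of[c] for c in codes]

# Dummy coding: one 0/1 column per distinct code.
dummy_coded = [[1 if c == cat else 0 for cat in categories] for c in codes]

print(integer_coded)
print(dummy_coded)
```

With 800+ distinct codes, dummy coding yields 800+ columns, which is the practicality concern raised in the question.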

In other specific cases, e.g. zip codes in the United States, the attribute looks numeric (10003, 02184), but we want to make it nominal to keep the leading zeros. We use "Numerical to Polynominal" to convert the zip codes.
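
The same concern shows up when loading the data in Python. A small sketch with the standard `csv` module, using an inline two-row sample as a stand-in for a real file:

```python
import csv
import io

raw = "zip\n02184\n10003\n"  # inline stand-in for a CSV file

rows = list(csv.DictReader(io.StringIO(raw)))
as_text = [row["zip"] for row in rows]         # csv keeps every field as a string
as_number = [int(row["zip"]) for row in rows]  # numeric parsing drops the leading zero

print(as_text)
print(as_number)
```

If you load with pandas instead, passing `dtype=str` for that column to `read_csv` should have the same effect of keeping the codes as text.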

HTH!

BalazsBarany · January 2019

Decision trees and random forests have well defined algorithms for handling categorical data.
You should find a function in your environment that implements this functionality.

However, dummy coding is of course functionally equivalent - it just creates hundreds of new 0/1 attributes.
As an optimization, you could look at your trees in RapidMiner and check whether only a few values of the nominal attributes are relevant. You would then keep only these and change the rest to a constant value.
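
A minimal sketch of that optimization in Python. The helper name `collapse_values`, the `"OTHER"` placeholder, and the list of relevant codes are all hypothetical; in practice the relevant values would come from inspecting the trees.

```python
def collapse_values(values, keep, other="OTHER"):
    """Keep only the listed nominal values; map everything else to one constant."""
    keep = set(keep)
    return [v if v in keep else other for v in values]

sic = ["2834", "7372", "0111", "6021", "7372"]  # made-up example codes
relevant = ["7372", "2834"]                     # hypothetical: values seen in tree splits
collapsed = collapse_values(sic, relevant)
print(collapsed)
```

After collapsing, dummy coding produces one column per kept value plus one for the constant, instead of hundreds.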

SGolbert · January 2019

If you leave the variable as nominal, RM will perform dummy coding. I don't think that is recommendable.

You have to clean the data somehow. Perhaps the codes are not relevant to the prediction and you can drop the variable. Perhaps you can recategorize the codes into a variable with a low number of bins.
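
One way to recategorize SIC codes is to use their built-in hierarchy: the first two digits of a four-digit SIC code identify the major group. A small sketch, with made-up codes and a hypothetical helper name; whether this granularity keeps enough predictive signal would need to be checked on the real data.

```python
from collections import Counter

def sic_major_group(code: str) -> str:
    """Return the two-digit major group of a four-digit SIC code given as text."""
    return code.zfill(4)[:2]  # zfill restores a leading zero lost in numeric parsing

sic = ["2834", "2836", "7372", "7371", "111"]  # made-up example codes
groups = [sic_major_group(c) for c in sic]
print(Counter(groups))
```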

Regards,

Sebastian

MartinLiebig · January 2019

SGolbert wrote: "If you leave the variable as nominal, RM will perform dummy coding. I don't think that is recommendable."

This is not true for Random Forests. We just do it on nominal measures. I think sklearn does not have this capability.
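
To illustrate what splitting directly on a nominal attribute can look like, here is a textbook-style sketch: one branch per nominal value, scored by the reduction in Gini impurity. This is an assumption chosen for illustration, not a claim about RapidMiner's exact implementation, and the data is made up.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def nominal_split_gain(values, labels):
    """Impurity reduction from a multiway split with one branch per nominal value."""
    branches = defaultdict(list)
    for v, y in zip(values, labels):
        branches[v].append(y)
    weighted = sum(len(b) / len(labels) * gini(b) for b in branches.values())
    return gini(labels) - weighted

values = ["A", "A", "B", "B", "C", "C"]  # made-up nominal attribute
labels = [1, 1, 0, 0, 1, 0]              # made-up binary labels
gain = nominal_split_gain(values, labels)
print(round(gain, 4))
```

No numeric encoding is needed here, since the split only tests which value a row has.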

SGolbert · January 2019

Hi Martin,

I think we are saying the same thing. Just to clarify, it is not good to do dummy coding on a variable that can have thousands of possible values. I was not talking about the Random Forest operator itself.

In any case, one should know what the operator does, to avoid unwanted behavior.

Regards,

Sebastian

MartinLiebig · January 2019

@SGolbert,
okay, sorry. That's what I also meant!

BR,
martin

B00100719 · January 2019

Thanks for all of your responses.

Converting Polynominal to Numerical seems like a really fundamental requirement that RapidMiner should do a better job of assisting with.

Regarding the responses above:

@yyhuang - thank you but I think you missed the content of my question

@SGolbert - thank you - I agree that leaving variables as polynominal isn't desirable. Unfortunately, as indicated in my OP, the SIC codes are apparently predictive, and I can't confirm that for certain until I clean them properly, so I cannot just omit them. Secondly, SIC codes are not my only polynominal variable - there are also occupation codes, which may be highly predictive too. And there are many of those as well.

@BalazsBarany - thank you also. Regarding your comment about handling polynominal data, what in your experience is a good way to do it? The options as I see them are as follows:

1. One-Hot encoding (impractical unless I can at least bucket data into a higher level - still working that out)

2. Feature Hashing - I know next to nothing about this

3. Word2Vec - I have doubts about how effective this would be if I trained my own Word2Vec model. Would it be possible using a pretrained one, for example the one created by Google?
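
On option 2, feature hashing maps each nominal value into a fixed number of buckets with a hash function, so the column count no longer grows with the number of distinct codes. A minimal sketch using only the standard library; the function name `hash_encode` and the bucket count are illustrative assumptions. sklearn ships a `FeatureHasher` class that implements a more refined version of the same idea.

```python
import hashlib

def hash_encode(value: str, n_buckets: int = 16) -> list:
    """Map a nominal value to a fixed-length 0/1 vector via a stable hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % n_buckets
    vector = [0] * n_buckets
    vector[bucket] = 1
    return vector

codes = ["2834", "7372", "2834"]  # made-up example codes
encoded = [hash_encode(c) for c in codes]
print(encoded[0] == encoded[2])  # the same code always lands in the same bucket
```

The trade-off is collisions: with fewer buckets than distinct codes, unrelated codes share a column, so the bucket count is a tuning choice.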

Other options?


Altair RapidMiner Community
