Replicating RapidMiner RandomForest Results in Python
However, I would now like to try to replicate the RapidMiner results in Python. To do so, I need to understand better how RapidMiner handles string data in its calculations. For example, one of my string attributes is a SIC code (Standard Industrial Classification). These codes look numeric, but I am treating them as polynominal so that the algorithm does not try to impose an ordering on them, which would not make sense.
When it comes to attributes like these, I don't know how RapidMiner uses them. Python libraries like sklearn require all Random Forest inputs to be numeric and suggest techniques such as one-hot encoding for converting non-numeric data to numeric. However, there are over 800 unique SIC codes in my data, so one-hot encoding does not seem practical, and the SIC code appears to be a highly important attribute that I cannot simply remove.
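For reference, the sklearn route I am describing would look roughly like the sketch below. It runs on made-up data: the column names `sic_code` and `revenue`, the 800 synthetic codes, and the random labels are all assumptions for illustration, and it says nothing about what RapidMiner does internally. It only shows that `OneHotEncoder` returns a sparse matrix by default, so 800 indicator columns can at least be fed to a Random Forest.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Hypothetical data: 800 distinct SIC-like codes stored as strings,
# each appearing 3 times, plus one numeric column.
codes = np.array([str(1000 + i) for i in range(800)])
sic = rng.permutation(np.tile(codes, 3))
n = len(sic)  # 2400 rows
df = pd.DataFrame({"sic_code": sic, "revenue": rng.normal(size=n)})
y = rng.integers(0, 2, size=n)  # random labels, for illustration only

# One-hot encode the code column; pass the numeric column through unchanged.
pre = ColumnTransformer(
    [("sic", OneHotEncoder(handle_unknown="ignore"), ["sic_code"])],
    remainder="passthrough",
)
model = Pipeline([
    ("pre", pre),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
])
model.fit(df, y)

# Inspect the encoded matrix the forest was trained on.
X_encoded = model.named_steps["pre"].transform(df)
print(X_encoded.shape, sparse.issparse(X_encoded))
preds = model.predict(df)
```

The trees then split on individual indicator columns (one code versus all the others), which may or may not match how RapidMiner treats a polynominal attribute.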
Is RapidMiner performing one-hot encoding in the background here?
Which Python library behaves most like RapidMiner, allowing polynominal attributes and dates?