Logistic Regression on large datasets: RapidMiner vs. SAS

cpmysore · March 2016

Hi!
Background: I am a consultant working with a customer to replace SAS with RapidMiner Studio (not Server). Most of the analysts develop marketing scorecards (logistic regression/decision trees).
I have read most of the informative blog posts, but please excuse me for re-posting some niggling questions:
1. SAS vs. RapidMiner: Predictions from the two tools will not match because of differences in the underlying technique. How does the customer validate historical predictions going forward (model development in SAS but validation in RapidMiner)?
2. Prediction vs. explanation: My customer uses the beta coefficients and odds ratios to derive insights. In RapidMiner, how will they read and interpret the weights of the explanatory variables?
3. Small vs. large data sets: The customer currently has 1 million records and 3,000 attributes, which are analysed on a Dell Inspiron 5000 series laptop with 8 GB of RAM. The customer is not keen on the sampling/extrapolation route of analysis, nor do they want to upgrade to the server version at this stage of the transition (SAS to RapidMiner). What are the alternatives?
a. Pre-processing: What would the loop/macro design be to run stepwise logistic regression?
b. Radoop/Stream Database: Is this an option they can adopt to run logistic regression?

earmijo · March 2016

Let me start the discussion by clarifying that the Logistic Regression operator in RapidMiner does not run a classical logistic regression. It runs what's called a Kernel Logistic Regression. The Weka operator W-Logistic Regression will run the classical LR.

1) My comment above might explain the differences between SAS and RM.
2) See the Weka operator. You get both beta coefficients and odds ratios.
3) I don't see why this might be a problem.
4) You can do forward/backward variable selection.
5) No clue.
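On point 2, the link between the two quantities is simply odds ratio = exp(beta). A minimal sketch in plain Python, outside RapidMiner, in case it helps when reading the output; the attribute names and coefficient values are made up for illustration and do not come from any real model:

```python
import math

def odds_ratios(coefficients):
    """Map each attribute's logistic-regression coefficient (beta) to its odds ratio, exp(beta)."""
    return {name: math.exp(beta) for name, beta in coefficients.items()}

# Hypothetical coefficients, invented for this example only.
betas = {
    "tenure_years": 0.0,               # no effect on the odds
    "has_credit_card": math.log(2.0),  # doubles the odds
    "missed_payments": -math.log(4.0), # cuts the odds to a quarter
}

ratios = odds_ratios(betas)
for name, ratio in ratios.items():
    print(f"{name}: beta={betas[name]:+.3f}, odds ratio={ratio:.3f}")
```

An odds ratio above 1 means a one-unit increase in the attribute raises the odds of the positive class, below 1 means it lowers them, and exactly 1 means no effect.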

JEdward · March 2016

earmijo has answered the first 4 questions, so I'll jump onto the last one.

5) Radoop IMO wouldn't be necessary for such a small sample of data. 1,000,000 records is really laptop size these days. Radoop is really useful once your client has invested in Hadoop infrastructure for storing the data across multiple servers, and it sounds like they aren't at this point yet. If they don't have a cluster, they don't need Radoop.

I'll also add a few extra comments on the first few:
1) You can bring scored data into RapidMiner from other tools and mark a label attribute and a prediction attribute. This means that all of RapidMiner's evaluation methods can be used (for example T-Tests, etc).
3) I recently used w-logistic on a 2 million record set on my 16GB laptop, so you should be fine. If you do run into problems, let us know, because there are always ways.
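To make point 1 concrete: once a scored file carries both the true label and the other tool's predicted probability, standard metrics can be computed from those two columns alone. The sketch below does this in plain Python rather than with RapidMiner operators; the function names, the 0.5 threshold, and the five sample rows are assumptions for illustration only.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count half).

    Quadratic in the number of rows, so suitable only for small illustrations.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.5):
    """Share of rows where thresholding the score reproduces the label."""
    hits = sum(1 for y, s in zip(labels, scores) if (s >= threshold) == (y == 1))
    return hits / len(labels)

# Made-up rows: true label and the score exported from the other tool.
labels = [1, 0, 1, 0, 1]
sas_scores = [0.9, 0.2, 0.6, 0.7, 0.8]

print("AUC:", auc(labels, sas_scores))
print("Accuracy:", accuracy(labels, sas_scores))
```

Running the same two functions on the scores from each tool gives a like-for-like comparison, even when the individual predictions differ.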

cpmysore · March 2016

Thanks - the existence of Weka logistic regression is an eye-opener for me. I plan to run a simulation using the data generator.
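The kind of simulation I have in mind can be sketched as follows: draw attributes at random, generate the label from a logistic model with known coefficients, and then check whether each tool recovers those coefficients. This is plain Python rather than the RapidMiner data generator, and the coefficient values, row count, and seed are arbitrary assumptions.

```python
import math
import random

def simulate_logistic(n_rows, intercept, betas, seed=0):
    """Draw rows (x, y) with x ~ N(0, 1) per attribute and
    P(y = 1 | x) = 1 / (1 + exp(-(intercept + beta . x)))."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        x = [rng.gauss(0.0, 1.0) for _ in betas]
        z = intercept + sum(b * xi for b, xi in zip(betas, x))
        p = 1.0 / (1.0 + math.exp(-z))
        y = 1 if rng.random() < p else 0
        rows.append((x, y))
    return rows

# Hypothetical "ground truth" coefficients; the third attribute is pure noise.
true_betas = [0.8, -0.5, 0.0]
data = simulate_logistic(10_000, intercept=0.0, betas=true_betas, seed=42)

positive_rate = sum(y for _, y in data) / len(data)
print("rows:", len(data), "positive rate:", round(positive_rate, 3))
```

Because the true coefficients are known, any gap between them and a tool's fitted values can be attributed to the method rather than to the data.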

JEdward · March 2016

Not only do you have Weka as an option, but also R, Python, Octave (MATLAB), Java... (& C/C++ via JNI, if you're feeling brave).
RapidMiner is deceptively flexible.

MartinLiebig · March 2016

Hi,

It is wonderful to see that you are trying to replace SAS! If you need any help, you can also contact me directly at mschmitz at rapidminer dot com. I think this is something that our professional services should support.

Best,
Martin
