"Non parametric regression"

Senecio · June 2009

Hi Everybody,

I have a question about whether it is feasible to get something like partial derivatives for certain data quantiles with RapidMiner. Before I explain my idea and problem, I will give you a brief background on my data.

I am currently doing an analysis of aggregated data on structural change in German agriculture. The use of aggregate data implies some drawbacks, e.g. the cause-effect relations actually exist only at the individual level. Therefore, on the aggregate level a closed theoretical model (especially regarding the functional form of the relation) between the dependent and independent variables is not available. Furthermore, some information on the aggregate level can at best be conceived as a rough dummy for the factor influencing the individual decision.
The linear and nested linear regressions show that for some variables the relation between the independent and dependent variable is clearly non-linear and may even show some breaks.

My idea for the analysis is the following:
a) Take the data set and remove the outliers, or at least the most dramatic ones, based on an indicator such as Cook's D.
b) Conduct a non-parametric regression using either an SVM or a nearest-neighbor approach (which looks to me like the closest equivalent to what is generally referred to as kernel-based regression, in the sense of http://en.wikipedia.org/wiki/Kernel_regression).
c) Get the information on the partial derivatives (first order would be sufficient) across the range of the variable.
d) Investigate these derivatives for marked non-linearities.

a) and b) are quite straightforward, but is it possible to do c) and d) in RapidMiner, and if so, how?
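For readers who want to see what steps b) to d) amount to numerically, here is a minimal sketch outside RapidMiner. It assumes a single explanatory variable, a Gaussian-kernel (Nadaraya-Watson) regression, and a central finite difference evaluated at a few quantiles; the function names, bandwidth, step size, and synthetic data are all illustrative assumptions, not something taken from the thread.

```python
import math
import random


def kernel_regression(xs, ys, x0, bandwidth):
    """Nadaraya-Watson estimate of E[y | x = x0] with a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total


def numerical_derivative(xs, ys, x0, bandwidth, step):
    """Central finite difference of the fitted curve at x0."""
    upper = kernel_regression(xs, ys, x0 + step, bandwidth)
    lower = kernel_regression(xs, ys, x0 - step, bandwidth)
    return (upper - lower) / (2.0 * step)


# Synthetic data with a break in the slope at x = 0 (illustration only).
rng = random.Random(0)
xs = [rng.uniform(-3.0, 3.0) for _ in range(2000)]
ys = [(0.5 * x if x < 0 else 2.0 * x) + rng.gauss(0.0, 0.2) for x in xs]

# Evaluate the derivative at a few quantiles of x.
sorted_xs = sorted(xs)
quantiles = (0.2, 0.35, 0.5, 0.65, 0.8)
grid = [sorted_xs[int(q * (len(xs) - 1))] for q in quantiles]
derivs = [numerical_derivative(xs, ys, g, bandwidth=0.3, step=0.1) for g in grid]

for q, g, d in zip(quantiles, grid, derivs):
    print(f"quantile {q:.2f}  x = {g:6.2f}  dy/dx ~ {d:5.2f}")
```

A marked change in the estimated derivative between neighbouring quantiles is the kind of non-linearity step d) looks for. The bandwidth controls how sharply a break shows up, so it would need tuning on real data.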

Best Norbert

land · June 2009

Hi Norbert,
this is one of the few interesting questions of the past few weeks.

Thank you very much for stimulating my brain.

But as you can see from my introduction, your question is very unusual, and so is the need for the requested feature. Hence RapidMiner does not provide derivatives out of the box. I assume that the information you want to gain in c) is the symbolic form of the derivatives, because otherwise you will have difficulty representing the data. But I wonder whether you can make use of the symbolic form, because it will probably be lengthy and always non-linear if the learner is non-linear (like the SVM with kernels).

How many attributes and examples does your dataset have?

Greetings,
Sebastian

Senecio · June 2009

Hi Sebastian,

I have roughly 10,000 observations and basically 20-30 variables (not counting the transformed forms and interaction terms).
Getting the result in symbolic form would definitely be nice; however, I have my doubts whether this form would be "easily" interpretable for both me and the audience.

Actually, having just the numerical approximation and presenting the output in something like Fig. 2 of http://purl.umn.edu/51063 would be more than sufficient, perhaps with a confidence interval around the partial derivative.
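One possible way to get such an interval, sketched here as an assumption rather than as anything RapidMiner provides, is a percentile bootstrap: resample the observations with replacement, re-estimate the local slope each time, and take the empirical quantiles. The local slope below is an ordinary least-squares fit on the k nearest observations; `local_slope`, `bootstrap_interval`, and the synthetic data are hypothetical names and values chosen for illustration.

```python
import random


def local_slope(xs, ys, x0, k):
    """OLS slope fitted to the k observations nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    mean_x = sum(xs[i] for i in nearest) / k
    mean_y = sum(ys[i] for i in nearest) / k
    sxx = sum((xs[i] - mean_x) ** 2 for i in nearest)
    sxy = sum((xs[i] - mean_x) * (ys[i] - mean_y) for i in nearest)
    return sxy / sxx


def bootstrap_interval(xs, ys, x0, k, n_boot, level, rng):
    """Percentile bootstrap interval for the local slope at x0."""
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        # Resample observation indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(
            local_slope([xs[i] for i in idx], [ys[i] for i in idx], x0, k)
        )
    slopes.sort()
    lower = slopes[int((1.0 - level) / 2.0 * (n_boot - 1))]
    upper = slopes[int((1.0 + level) / 2.0 * (n_boot - 1))]
    return lower, upper


# Synthetic data: y = x^2 + noise, so the true derivative at x = 1 is 2.
rng = random.Random(1)
xs = [rng.uniform(0.0, 2.0) for _ in range(500)]
ys = [x * x + rng.gauss(0.0, 0.1) for x in xs]

lower, upper = bootstrap_interval(
    xs, ys, x0=1.0, k=100, n_boot=200, level=0.95, rng=rng
)
print(f"95% bootstrap interval for dy/dx at x = 1: [{lower:.2f}, {upper:.2f}]")
```

Repeating this at each quantile point would give a band around the derivative curve similar in spirit to the figure mentioned above.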

Best Norbert

land · June 2009

Hi Norbert,
If you have 20 or more attributes, I doubt you could store the derivatives anyway. The symbolic form probably won't fit into a human's brain, and calculating a 20-dimensional lattice of numerical values of the derivatives could either be hard for humans to comprehend, too, or simply exceed the memory. If you only use 100 points on each dimension's range, this would be 100^20 values, or, put differently, a little less than 2^133 values. Even modern 64-bit machines would struggle here.
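The size of such a full lattice is easy to check. The short calculation below assumes 100 grid points per attribute, 20 attributes, and 8 bytes per stored value (the last figure is an assumption for illustration):

```python
import math

points_per_dimension = 100
dimensions = 20

# Number of lattice points: 100^20 = 10^40 (exact integer arithmetic).
grid_values = points_per_dimension ** dimensions
bits = math.log2(grid_values)

# Storage if every value were kept as an 8-byte float (assumption).
storage_bytes = grid_values * 8

print(f"lattice points: 10^{len(str(grid_values)) - 1}")
print(f"log2 of lattice points: {bits:.1f}")
print(f"storage in bytes: {storage_bytes:.1e}")
```

Since a 64-bit address space covers at most 2^64 bytes, a lattice of this size is far out of reach, which is why a one-variable-at-a-time approach is the only practical route.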

Or did I misunderstand something about your setting?

Greetings,
Sebastian

Senecio · June 2009

Hi Sebastian,

Actually, I'm not half as ambitious as you assume. Currently, I would be glad to have just the main effects, i.e. assuming all interaction terms between the variables are zero. Perhaps one could (or should) extend the setting to some simple interactions between two variables. As a result, one would change only one variable at a time, while for the remaining variables the calculation of the dependent variable would be based on the real values and an interpolation (average) over several observations. So the data demand would not be quite as challenging.
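The one-variable-at-a-time idea can be sketched as follows. This is only an illustration under assumptions: a plain k-nearest-neighbour regressor stands in for whatever learner is used, the data are synthetic, and `knn_predict` and `main_effect` are hypothetical helper names.

```python
import random


def knn_predict(train_x, train_y, point, k):
    """Average response of the k training rows nearest to `point`."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, point)), y)
        for row, y in zip(train_x, train_y)
    )
    return sum(y for _, y in dists[:k]) / k


def main_effect(train_x, train_y, var, grid, sample_rows, k):
    """Mean prediction when column `var` is set to each grid value in turn,
    while all other columns keep their observed values."""
    curve = []
    for value in grid:
        preds = []
        for row in sample_rows:
            modified = list(row)
            modified[var] = value
            preds.append(knn_predict(train_x, train_y, modified, k))
        curve.append(sum(preds) / len(preds))
    return curve


# Synthetic data: y depends on x0 (non-linearly) and x1, but not on x2.
rng = random.Random(2)
train_x = [[rng.random() for _ in range(3)] for _ in range(500)]
train_y = [2.0 * r[0] ** 2 + 0.5 * r[1] + rng.gauss(0.0, 0.05) for r in train_x]

grid = [0.1 * i for i in range(1, 10)]
sample_rows = rng.sample(train_x, 50)

effect_x0 = main_effect(train_x, train_y, 0, grid, sample_rows, k=10)
effect_x2 = main_effect(train_x, train_y, 2, grid, sample_rows, k=10)

# First differences of the averaged curve approximate the partial effect.
slopes_x0 = [(b - a) / 0.1 for a, b in zip(effect_x0, effect_x0[1:])]
print("main effect of x0:", [round(v, 2) for v in effect_x0])
print("slopes for x0:    ", [round(s, 2) for s in slopes_x0])
```

Averaging over the sampled rows is what keeps the cost linear in the number of variables instead of exponential.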
To get back to your setting:
For each of the 20 variables, one would take 100 measurements (each separated by a 1% quantile of the respective range), and each measurement would be based on a sample of 100 observations.
This results in about 10 million differences to calculate (20 * 100 * 100 * 50; OK, it should be 49.5). Personally, I think basing the estimation of the partial effect at each quantile point on 5000 measurements is really not necessary. A few points fewer should suffice.
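The count can be reproduced in a few lines, assuming the differences meant here are all pairwise differences among the 100 sampled observations at each quantile point (that reading is an assumption, though it matches the factor 100 * 49.5):

```python
import math

variables = 20
quantile_points = 100
sample_size = 100

# Unordered pairs among 100 observations: 100 * 99 / 2 = 100 * 49.5.
pairs_per_point = math.comb(sample_size, 2)
total_differences = variables * quantile_points * pairs_per_point

print(f"pairs per quantile point: {pairs_per_point}")
print(f"total differences: {total_differences:,}")
```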

So I think a modern computer should be able to handle the problem.

Best

Norbert

land · June 2009

Hi Norbert,
again the day starts with an interesting issue.

If I understood everything correctly (but it's a complex problem, so I might have gotten something wrong), this should be feasible with RapidMiner. But it will need a complex process with nested ExampleIteration, AttributeConstruction, MacroDefinition and ParameterIteration operators and several learners...

Unfortunately, the design of this process exceeds the scope of this forum, as the whole topic already did in some way. I would love to, but I cannot spend a few hours of my working time on this for free. Since our software is open source, we make our living from consulting... If you are interested in consulting or another of our services, please email or phone us.

But now enough cheap advertising

Greetings,
Sebastian
