
"Insufficient results with M5P regression tree"

michaelhecht Member Posts: 89  Guru
edited May 23 in Help
Hello,

If anyone is interested, please try the following:

produce a file containing two columns

x = 0, 0.1, 0.2, ..., 12.6;
y = sin(x)

Then apply M5P (with or without normalization).
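For anyone who wants to reproduce the data, here is a minimal sketch (Python with NumPy/pandas rather than RapidMiner; the file name xy.csv is just a placeholder) that writes the two-column file:

```python
import numpy as np
import pandas as pd

# x from 0 to 12.6 in steps of 0.1, y = sin(x)
x = np.round(np.arange(0, 12.6 + 1e-9, 0.1), 1)
df = pd.DataFrame({"x": x, "y": np.sin(x)})
df.to_csv("xy.csv", index=False)
```

The resulting CSV can then be read with a CSVExampleSource operator, with y set as the label.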

The result is quite disappointing. Does anyone know how to get an acceptable result?
I expected something like a piecewise linear approximation of the sin function,
but got something far away from that.

Thank you.

Answers

  • keith Member Posts: 157  Guru
    michaelhecht wrote:

    > Hello,
    >
    > if anyone is interested please try the following:
    >
    > produce a file containing two columns
    >
    > x = 0, 0.1, 0.2, ..., 12.6;
    > y = sin(x)
    >
    > Then apply M5P (with or without normalization).
    >
    > The result is quite disappointing. Does anyone know how to get an acceptable result?
    > I expected to get something like a piecewise linear approximation of the sin function,
    > but got something far away from this.

    It would actually be more helpful (and more likely to generate a response) if you included the XML for the process you are running in your forum post.

    There's a fair amount of ambiguity in your question. What constitutes an acceptable result? Is there a reason you believe M5P is a good learner in this situation? Did you experiment with any of the options for the M5P learner?

    I just tried the following process; the only changes from the default settings are to tick the check boxes for parameters N, U, and R:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="CSVExampleSource" class="CSVExampleSource">
            <parameter key="filename" value="c:\temp\xy.csv"/>
            <parameter key="label_name" value="y"/>
        </operator>
        <operator name="W-M5P" class="W-M5P">
            <parameter key="keep_example_set" value="true"/>
            <parameter key="N" value="true"/>
            <parameter key="U" value="true"/>
            <parameter key="R" value="true"/>
        </operator>
        <operator name="ModelApplier" class="ModelApplier">
            <list key="application_parameters">
            </list>
        </operator>
    </operator>
    And the plot of x vs. prediction(y) looks, to my eyes, much more sin-like.  But I don't know if using an unpruned, unsmoothed learner makes sense for your problem.

    Keith
  • michaelhecht Member Posts: 89  Guru
    Hi,

    sorry, here is the XML

    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="C:\Programme\Rapid-I\RapidMiner-4.4\sinus"/>
        </operator>
        <operator name="Normalization" class="Normalization">
        </operator>
        <operator name="W-M5P" class="W-M5P">
            <parameter key="keep_example_set" value="true"/>
            <parameter key="U" value="true"/>
            <parameter key="M" value="10.0"/>
        </operator>
        <operator name="ModelApplier" class="ModelApplier">
            <list key="application_parameters">
            </list>
        </operator>
    </operator>

    What I get is a piecewise constant result, i.e. the leaves of the tree are y = const.
    Only the last leaf gives a linear model: y = 3.2196 * x - 4.5545

    If I had such a "really" linear model at all leaves of the tree, it would be OK, i.e. what
    I would expect.
    There is no setting that improves it, even though the tree could produce y = a*x + b
    in each leaf, which should give a better prediction. So why doesn't M5P behave like
    this?
    If I select the smoothed tree, the results are even worse.

    I hope this makes my "problem" clearer.

    P.S.:
    If you google for "stepwise regression tree HUANG" or go directly to
    http://www.landcover.org/pdf/ijrs24_p75.pdf
    and look at page 77 (i.e. page 3 of the 16-page document), you will see what I
    mean. If this SRT algorithm became part of RapidMiner I would
    appreciate it  ;) , even though I don't understand why M5P doesn't behave comparably.
  • keith Member Posts: 157  Guru
    I think you're getting into trouble because of the value of M (minimum number of values per leaf) and the cyclical nature of the data. If I take your process flow and change M from 10 to 5, I get linear models for nodes 1, 5, 6, 7, 10, 11, 14, 15, 16, and 20, and constant values elsewhere.
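    This is not M5P itself, but a quick numerical illustration of the leaf-size point on cyclical data: the wider the x-range a leaf covers relative to the sine's period, the closer its best-fit slope gets to zero, so a "linear" leaf degenerates toward a constant.

```python
import numpy as np

x = np.arange(0, 12.6 + 1e-9, 0.1)
y = np.sin(x)

# Slope of the least-squares line over a short, monotone stretch (x in [0, 1.5]):
m_short = np.polyfit(x[:16], y[:16], 1)[0]

# Slope over the whole range (two full periods): the rises and falls of
# the sine largely cancel, so the fitted slope is pushed toward zero.
m_full = np.polyfit(x, y, 1)[0]

print(m_short, m_full)
```

    With a large M each leaf is forced to cover a wide stretch of the cycle, so its best linear model is barely better than a constant.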
  • michaelhecht Member Posts: 89  Guru
    Ok, I see the difference.

    Nevertheless, I cannot understand why the fraction of constant leaves, i.e. y = const, increases if I change M from 5 to 6.
    I get 10 more constant leaves at positions where y = a*x + b would be better. Isn't a constant regression
    in the leaves worse than a non-constant one?

    It's clear to me that the algorithm is from Weka and not RapidMiner, so you cannot know in detail what happens.
    Nevertheless, I just want to understand why increasing M increases the number of constant leaves even
    though it worsens the result.

    By the way, if you are an expert ;) , would it be possible to post a workflow for optimizing the parameters automatically?
    Up to now I haven't got the right feeling for applying meta methods like grid search or cross-validation in the right way.

    Thanks in advance. (At least I need an answer to my question; the workflow would be nice.)

  • keith Member Posts: 157  Guru
    I'm far from an expert with RM.  :-)  And I had never used M5P before now, so what little I know came from a little experimentation and Googling yesterday. The paper describing the method seems to be available at http://www.cs.waikato.ac.nz/pubs/wp/1996/uow-cs-wp-1996-23.pdf, and it may answer that question. My guess is that there's some kind of rule that if the slope of the regression model is too close to zero, it gets rounded off to zero. Maybe it's an interaction between the number of observations in each node and the regressed slope. Beyond that, you'd probably have more luck getting an answer from a Weka list or forum.

    As for the parameter optimization, take a look at 07_Meta/01_ParameterOptimization.xml in the RM samples directory.  The GridParameterOptimization node is where you'd specify what parameters you want to tinker with.
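    For comparison outside RapidMiner, the same grid-search-plus-cross-validation idea can be sketched in Python with scikit-learn (an assumption that you have it available; sklearn has no M5P, so a plain regression tree stands in, with min_samples_leaf playing the role of M):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Same sin data as in the original post
x = np.arange(0, 12.6 + 1e-9, 0.1).reshape(-1, 1)
y = np.sin(x).ravel()

# GridSearchCV evaluates each candidate leaf size with 5-fold
# cross-validation -- the analogue of a GridParameterOptimization
# operator wrapping an XValidation in RapidMiner.
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"min_samples_leaf": [2, 5, 10, 20]},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(x, y)
print(grid.best_params_, -grid.best_score_)
```

    The best leaf size and its cross-validated error come out of best_params_ and best_score_; the same pattern extends to any other parameter of the learner.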
  • michaelhecht Member Posts: 89  Guru
    Thank you again; I'll try to find an appropriate solution ;)

    The problem where I tested M5P was originally just a way for me to get an idea of how M5P works.
    In the end I'm really hesitant to apply this method to other data that I'm not familiar with.
