Trying to understand MLP output

herbert12345 Member Posts: 3 Contributor I
edited November 2018 in Help

I am currently trying to understand the output of the W-MultilayerPerceptron operator. Let us consider a toy model without hidden layers. The output might look like this:

Linear Node 0
    Inputs    Weights
    Threshold    0.4052907755005098
    Attrib O3    -0.2617907901506467
    Attrib NO2    -0.05083306647141619
    Attrib Altitude    -0.14881316186685326
    Attrib z    0.35660878655615114
    Attrib sza_rad    -0.44846864905805994
    Node 0
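For illustration, here is a minimal sketch of evaluating such a linear node, assuming it computes the threshold plus the weighted sum of its (possibly normalised) inputs; the normalisation itself is discussed further down the thread:

```python
# Sketch only: assumes the linear node computes
# threshold + sum(weight_i * input_i) on its inputs.
weights = {
    "O3": -0.2617907901506467,
    "NO2": -0.05083306647141619,
    "Altitude": -0.14881316186685326,
    "z": 0.35660878655615114,
    "sza_rad": -0.44846864905805994,
}
threshold = 0.4052907755005098

def linear_node(inputs):
    """inputs: dict mapping attribute name -> input value."""
    return threshold + sum(weights[k] * v for k, v in inputs.items())

# With all inputs at zero the node simply outputs its threshold:
print(linear_node({k: 0.0 for k in weights}))  # 0.4052907755005098
```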
From my understanding this should be equivalent to a linear regression. So I train a LinearRegression model with the same input data, using the results from the above "MLP" as label (in order to rule out differences in the fitting algorithm). Results show that the model indeed reproduces the results from the "MLP" perfectly. The coefficients, however, are completely different:

- 0.0000070221 * O3
- 0.0000717637 * NO2
- 0.0004435178 * Altitude
+ 0.0003188475 * z
- 0.0040543204 * SZA*pi/180.
+ 0.0145570907
I assume that this is because of the normalization done in the MLP operator. So here's the question: assume I want to implement the above "MLP" in my own code: how must I process my data and the results?

Thanks for your reply


wessel Member Posts: 537 Maven
From my understanding, Linear Regression and a single-layer perceptron should produce different weight values.

A single-layer perceptron starts with random weights.
It takes a single data point,
propagates the input forward through the network,
calculates the error,
finds the weight gradient that minimises the error,
and moves the weights in the direction of the gradient according to the learning rate.

    Linear regression calculates the optimal weights in closed form.
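The contrast between the two fitting procedures can be sketched on hypothetical toy data (numpy assumed available; the learning rate and epoch count are illustrative, not what the operator uses):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # toy attributes
true_w = np.array([0.5, -1.0, 2.0])
y = X @ true_w + 0.1                       # linear target with intercept 0.1
Xb = np.hstack([X, np.ones((len(X), 1))])  # append bias column

# Linear regression: optimal weights in closed form (least squares).
w_closed, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Single-layer perceptron: start random, move along the error gradient
# one data point at a time, scaled by the learning rate.
w_sgd = rng.normal(size=4)
lr = 0.01
for _ in range(50):                        # epochs
    for xi, yi in zip(Xb, y):
        err = xi @ w_sgd - yi
        w_sgd -= lr * err * xi

print(np.round(w_closed, 3))  # ≈ [ 0.5 -1.   2.   0.1]
print(np.round(w_sgd, 3))     # converges to about the same weights
```

On a problem the model can represent exactly, both routes end up at essentially the same weights; the difference the thread observes comes from normalisation, not from the fitting procedure.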

    On data normalisation:
    the Neural Net operator has an option to turn off the data normalisation.

    I think you could also normalise your data yourself, so that nothing changes,
    using: (value - min) / (max - min)
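That min-max formula as a one-liner (a trivial sketch, for reference):

```python
def min_max(v, vmin, vmax):
    # (value - min) / (max - min) maps [vmin, vmax] onto [0, 1]
    return (v - vmin) / (vmax - vmin)

print(min_max(25.0, 0.0, 100.0))  # 0.25
```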
    herbert12345 Member Posts: 3 Contributor I
    Thank you for your reply.

    I understand that they might go different ways to obtain their weights. But assuming a fair amount of convergence, the weights should end up being about the same. Up to normalization that is. Indeed I manage to make them the same by turning on the "I" and "C"-options in the W-MLP operator.

    I think I have managed to understand how things work by now. The problem was in part caused by a misunderstanding of mine as to how things work. Still it troubles me that the W-MLP output is not complete in the sense that the normalization employed is not documented. (I believe now that it normalizes both attributes and labels to the interval [-1,1] using 2*(value-min)/(max-min)-1).
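Under that assumption, here is a minimal sketch of the pre- and post-processing needed to reuse the weights outside the operator (function names are mine, and the [-1, 1] scaling is the assumption stated above, not documented behaviour):

```python
# Assumed W-MLP convention: attributes AND label are scaled to [-1, 1]
# via 2*(v - min)/(max - min) - 1; predictions must be scaled back.
def to_unit(v, vmin, vmax):
    return 2.0 * (v - vmin) / (vmax - vmin) - 1.0

def from_unit(u, vmin, vmax):
    return (u + 1.0) / 2.0 * (vmax - vmin) + vmin

# Round trip: denormalising a normalised value recovers the original.
print(from_unit(to_unit(7.0, 0.0, 10.0), 0.0, 10.0))  # ≈ 7.0
print(to_unit(5.0, 0.0, 10.0))  # 0.0 (the midpoint maps to the centre of [-1, 1])
```

So to implement the model, you would apply `to_unit` (with the training-set min/max of each attribute) to the inputs, evaluate the network, and apply `from_unit` (with the label's min/max) to the output.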

    What bothers me though is that my final model (i.e. with hidden layers) appears to have a certain bias. Well, I guess I can fix that.

    Thanks for helping
    wessel Member Posts: 537 Maven
    I believe this is standard when the tanh sigmoid function is used:
    2*(value-min)/(max-min)-1     [-1, 1]

    When the normal sigmoid, which is 1 / (1 + exp(-x)), is used, it is normalised to
    (value-min)/(max-min)    [0, 1]
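The pairing makes sense because each target range matches the output range of the corresponding activation, as a quick standard-library sketch shows:

```python
import math

def logistic(x):
    # "normal" sigmoid: 1 / (1 + exp(-x)), output range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

# tanh output range is (-1, 1), matching the [-1, 1] normalisation.
print(logistic(0.0))   # 0.5 (centre of (0, 1))
print(math.tanh(0.0))  # 0.0 (centre of (-1, 1))
```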

    This is indeed poorly documented.

    Should I take a look in WEKA's source code? Or the RM source code?

    What do you mean, that your final model has a certain bias?
    Don't all learners have a certain bias?

    this link briefly mentions normalisation:
    herbert12345 Member Posts: 3 Contributor I
    This kind of makes sense. Although through experimentation I found that the only way to get things right is to normalise to [-1,1] and use standard sigmoid nodes as in 1/(1+exp(-x)). Maybe a look into the source code might help to clear things up.

    About the bias: looking closer, I see that for some reason the prediction is actually wrong by a linear map; that is, I get good correlations (as in 0.999...) but scatter plots show that the model is rather off. This could easily be fixed by applying a linear model in post, of course, but I think it is strange.

    Edit: My fault. Shouldn't wonder about offsets if training data and validation data are processed in different ways ...  :-[