remove uncorrelated attributes with respect to label attribute

wessel Member Posts: 537 Maven
edited November 2018 in Help
Hello,

How do I remove uncorrelated attributes with respect to my label attribute?
RemoveCorrelatedFeatures seems to remove intercorrelated features, instead of features related to the label attribute.

Also, when I make a CorrelationMatrix, the label attribute doesn't show up.
I guess I don't want a full matrix, just a single row containing the pairwise correlation of each attribute with my label attribute.
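
Something like this is what I'm after (a pandas sketch just to illustrate the idea, nothing to do with RapidMiner; the attribute names and values are made up):

import pandas as pd

# Hypothetical example set: numeric attributes plus a numeric label column.
df = pd.DataFrame({
    "wind-23": [3.1, 2.8, 3.5, 4.0, 3.9],
    "wind-47": [1.2, 1.5, 1.1, 1.8, 1.6],
    "label":   [3.0, 2.9, 3.4, 4.1, 3.8],
})

# One row: the pairwise Pearson correlation of every attribute with the label.
label_corr = df.drop(columns="label").corrwith(df["label"])
print(label_corr)

# Keep only attributes whose absolute correlation with the label exceeds a threshold.
print(label_corr[label_corr.abs() >= 0.5].index.tolist())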

Regards,

Wessel

Answers

  • haddock Member Posts: 849 Maven
    Hi Wessel,

    Is this the sort of thing?
    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="target_function" value="random"/>
        </operator>
        <operator name="CorrelationMatrix" class="CorrelationMatrix">
            <parameter key="create_weights" value="true"/>
        </operator>
        <operator name="AttributeWeightSelection" class="AttributeWeightSelection">
        </operator>
    </operator>
  • wessel Member Posts: 537 Maven
    Maybe.

    But how can I see if it's working properly?
    How is CorrelationMatrix ranking attributes?
    I think it ranks by inter-correlation, but I might be wrong. (A lot of redundancy is bad.)

    I guess what I want is correlation with respect to the label. (Predictive power is good.)


    Because I'm working with weather data, I have some expectations about the outcome.
    I expect wind-23, wind-47, wind-71, and wind-94 to have the highest autocorrelation.
    But 47 and 71 are not in the top 10!

    So I think it's calculating inter-correlation,
    because it's returning attributes that lie at the ends of my attribute interval 23-95.
    Obviously those have less inter-correlation (redundancy) than attributes in the middle.
    (A sketch for checking both rankings follows the numbers below.)


    wind-23 0.8861533285392513
    wind-95 0.8828218809552825
    wind-24 0.8738616064506365
    wind-94 0.8707726262127967
    wind-25 0.8634980953805299
    wind-93 0.8607096030931946
    wind-26 0.855195678341873
    wind-92 0.8526740358567213
    wind-27 0.8488826077290765
    wind-91 0.8465315199805834
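
    To check which ranking the operator actually produces, I could compute both outside RapidMiner; a rough sketch (plain pandas, assuming the windowed data sits in a DataFrame with the label in a column called "label"):

    import pandas as pd

    def rank_by_label_correlation(df: pd.DataFrame, label: str = "label") -> pd.Series:
        # Absolute correlation of each attribute with the label (predictive power).
        return df.drop(columns=label).corrwith(df[label]).abs().sort_values(ascending=False)

    def rank_by_intercorrelation(df: pd.DataFrame, label: str = "label") -> pd.Series:
        # Mean absolute correlation of each attribute with all other attributes
        # (redundancy); the diagonal of 1.0 is excluded from the mean.
        corr = df.drop(columns=label).corr().abs()
        return ((corr.sum() - 1.0) / (len(corr) - 1)).sort_values(ascending=False)

    Comparing the operator's top 10 against both rankings should show which one it is actually computing.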

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    unfortunately CorrelationMatrix does not incorporate the label column. I think we will add a correlation-based weighting in the next version.

    From a data miner's perspective, choosing correlation for removing attributes is not always suitable. Take a look at the image at http://en.wikipedia.org/wiki/Correlation to get an impression of why correlation can be a bad criterion. Most of those clear dependencies can be discovered and used by a learner, even though the correlation is 0!
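
    For illustration, a small sketch outside RapidMiner (the quadratic dependence is just one of the textbook examples of this effect):

    import numpy as np

    # y is a perfect function of x, yet the Pearson correlation is (numerically) zero,
    # because the dependence is symmetric around x = 0.
    x = np.linspace(-1.0, 1.0, 1001)
    y = x ** 2

    r = np.corrcoef(x, y)[0, 1]
    print(f"Pearson correlation of x and x^2: {r:.3f}")  # approximately 0

    A correlation-based filter would throw x away, although a tree or polynomial learner could predict y from it perfectly.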

    Greetings,
      Sebastian
  • wessel Member Posts: 537 Maven
    Yes, but ...

    I want to use correlation on one single attribute.
    And this single attribute gets multiplied 100 times when I use a history of 100:
    att1-0, att1-1, ..., att1-100

    Now correlation is a good measure to find out which att1-.* attributes have high autocorrelation, i.e. predictive power.
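
    As a rough sketch of what I mean (plain pandas, the series is made up to stand in for the wind measurements):

    import numpy as np
    import pandas as pd

    # Made-up periodic series standing in for the wind measurements.
    rng = np.random.default_rng(0)
    t = np.arange(2000)
    series = pd.Series(np.sin(2 * np.pi * t / 24) + 0.3 * rng.standard_normal(t.size))

    # Correlation of the current value with each lagged copy (att1-1 ... att1-100).
    autocorr = pd.Series({lag: series.autocorr(lag) for lag in range(1, 101)})

    # Lags with the highest absolute autocorrelation are the most promising att1-* attributes.
    print(autocorr.abs().sort_values(ascending=False).head(10))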
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    sorry if I misunderstood you, but then a constant attribute would be the best attribute? I mean, if it's the label attribute, then data mining becomes really easy :) If not, this attribute doesn't say anything about the label?

    Curious,
      Sebastian
  • wessel Member Posts: 537 Maven
    Yes, in time series data this is a bit confusing.
    If you have a better suggestion for names, please let me know.
    You have multiple things you measure, multiple attributes.
    So let's say you measure 2 things, x and y:
    x    y
    ------
    x0 y0
    x1 y1
    x2 y2
    x3 y3
    x4 y4

    But then you convert this time series data into windowed examples.
    Now you have 4 regular attributes and 1 label attribute, "x-0":
    x-0  x-2  y-2  x-3  y-3
    -----------------------
    x3   x1   y1   x0   y0
    x4   x2   y2   x1   y1

    now you can learn the function:
    x-0 = f(x-2, y-2, x-3, y-3)
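
    To make the windowing concrete, the same transformation as a small pandas sketch (column names as in the table above; this is not the RapidMiner windowing operator):

    import pandas as pd

    # Raw multivariate series: two measured quantities x and y.
    raw = pd.DataFrame({"x": [10, 11, 12, 13, 14], "y": [20, 21, 22, 23, 24]})

    # Windowed examples: the label x-0 is the current x, the other columns are lagged values.
    windowed = pd.DataFrame({
        "x-0": raw["x"],            # label
        "x-2": raw["x"].shift(2),
        "y-2": raw["y"].shift(2),
        "x-3": raw["x"].shift(3),
        "y-3": raw["y"].shift(3),
    }).dropna()

    print(windowed)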


    But when you take a really big window,
    you get a lot more attributes,
    and it becomes infeasible to do CFS attribute selection on them all.
    So then I want to grab all x-.* attributes and
    learn some kind of autoregression function, preferably also using moving average smoothing:
    x-0 = f(x-2, x-3, ..., x-10000)
    Then I use this autoregression function, which has probably found a trend and a seasonal component, to construct a new attribute in my database.

    So then I can learn:
    x-0 = f(x-2, y-2, x-3, y-3, seasonal)
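
    Roughly like this, as a sketch (plain linear regression stands in for "some kind of autoregression function"; scikit-learn is used only for illustration):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    def add_seasonal_attribute(windowed: pd.DataFrame, label: str = "x-0") -> pd.DataFrame:
        # Stage 1: fit an autoregression on the x-* lags only and store its
        # prediction as a new attribute "seasonal" for stage 2.
        x_lags = [c for c in windowed.columns if c.startswith("x-") and c != label]
        ar = LinearRegression().fit(windowed[x_lags], windowed[label])
        out = windowed.copy()
        out["seasonal"] = ar.predict(windowed[x_lags])
        return out

    # Stage 2 can then learn x-0 = f(x-2, y-2, x-3, y-3, seasonal) with any learner.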
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi again,
    I think I have understood now what you are aiming at. But then I don't see where you need the correlation...

    The approach you are proposing seems to me to be equivalent to an additive regression with the first model learned only on the past label values. Although the AdditiveRegression in RM would not cope with that, you could easily simulate it using an AttributeConstruction.
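
    Such a simulation could look roughly like this (a sketch with scikit-learn learners standing in for the RapidMiner ones; the residual column plays the role of the AttributeConstruction):

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    def additive_regression(windowed: pd.DataFrame, label: str = "x-0"):
        # First model: learned only on the past label values (the x-* lags).
        x_lags = [c for c in windowed.columns if c.startswith("x-") and c != label]
        base = LinearRegression().fit(windowed[x_lags], windowed[label])

        # Second model: fits the residual of the first model using all attributes.
        residual = windowed[label] - base.predict(windowed[x_lags])
        other = [c for c in windowed.columns if c != label]
        correction = DecisionTreeRegressor(max_depth=3).fit(windowed[other], residual)
        return base, correction

    # The final prediction is the sum of both models' outputs.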

    Greetings,
      Sebastian
  • wessel Member Posts: 537 Maven
    I found this nice picture here:

    http://upload.wikimedia.org/wikipedia/commons/8/84/Acf.svg


    Sine with noise signal on top,
    autocorrelation on the bottom.

    And in R, the functions acf and pacf can be used to produce such a plot.
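
    A Python near-equivalent, assuming statsmodels and matplotlib are available:

    import numpy as np
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    # Noisy sine signal, similar to the linked figure.
    rng = np.random.default_rng(1)
    t = np.arange(500)
    signal = np.sin(2 * np.pi * t / 50) + 0.5 * rng.standard_normal(t.size)

    fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(8, 9))
    ax1.plot(t, signal)
    ax1.set_title("Sine with noise")
    plot_acf(signal, lags=100, ax=ax2)   # autocorrelation, like R's acf()
    plot_pacf(signal, lags=100, ax=ax3)  # partial autocorrelation, like R's pacf()
    plt.tight_layout()
    plt.show()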