Can I log-transform attributes with different scales?

Fred12 Member Posts: 344 Unicorn
edited November 2018 in Help

hi,

is it ok to log-transform attributes that are skewed but have different scales? e.g. salary data and age for a certain income group or so...

like two attributes that are both skewed, but where the scales are very different... or should I first normalize them and then log-transform?

does the transformation affect the scales in some way? and after log-transformation, should I still normalize, or is it not necessary anymore?

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    From my perspective, these are two separate transformations. A log transformation changes the shape of the underlying distribution, whereas normalization does not. Normalization is used to bring all attributes onto the same absolute value scale, e.g. when you are using algorithms that are sensitive to the numerical scale of the attributes, such as k-NN or PCA. Log transformations are typically done to make distributions tighter or more "normal" rather than skewed. You can do one or the other or both, depending on what you are trying to accomplish.
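
    To see both effects side by side outside of RapidMiner, here is a minimal Python/numpy sketch (the salary and age figures are made up for illustration):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Two right-skewed attributes on very different scales (made-up data)
    salary = rng.lognormal(mean=10.0, sigma=1.0, size=1_000)  # tens of thousands
    age = rng.lognormal(mean=3.5, sigma=0.3, size=1_000)      # tens

    # Log transform: reduces the skew, but the attributes still sit on
    # different (log-)scales
    log_salary = np.log(salary)  # centered near 10
    log_age = np.log(age)        # centered near 3.5

    # Z-transformation afterwards puts both on the same scale: mean 0, std 1
    def z_transform(x):
        return (x - x.mean()) / x.std()

    z_salary = z_transform(log_salary)
    z_age = z_transform(log_age)
    ```

    Note that the log transform changes the shape, while the z-transformation only rescales.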

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Fred12 Member Posts: 344 Unicorn

    ok, but does the log transformation also affect the scales, or just the skew? will the scales still end up in roughly the same ranges?

    and if k-NN works better without normalization, as I found with my dataset -- should I stick with no normalization or still normalize the data?

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Naturally the log transformation alters the scale. And depending on the orders of magnitude involved, it will not necessarily put the attributes into an identical range either (unlike normalization, which is used to put all attributes onto the same scale). Take a look at this quick sample process, which shows the impact of both normalization and log transforms on the labor negotiations sample dataset.

    I always normalize when using k-NN. If k-NN works better on your dataset without normalization, then implicitly you are giving more weight in the distance metric to the attributes with larger absolute values. That may by chance turn out to be a good thing, but it is typically not an intended consequence, nor is the resulting relative weighting of the attributes easy to understand. If you have any nominal attributes and are therefore using the mixed Euclidean distance measure, the asymmetric impact is typically even worse. It may really be an indication of model overfitting (even if you are doing cross-validation).
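
    The weighting effect is easy to see with two made-up points (Python/numpy; the attribute values and spreads are invented for illustration):

    ```python
    import numpy as np

    # Two hypothetical examples: [age, salary]
    a = np.array([30.0, 50_000.0])
    b = np.array([60.0, 51_000.0])

    # Without normalization the Euclidean distance is dominated by salary:
    # the 30-year age gap barely registers next to the 1000-unit salary gap
    d_raw = np.linalg.norm(a - b)  # ~1000.4

    # Dividing by (assumed) standard deviations mimics z-normalization;
    # now both attributes contribute on comparable terms
    spread = np.array([10.0, 20_000.0])  # illustrative std devs
    d_scaled = np.linalg.norm((a - b) / spread)  # ~3.0, driven mostly by age
    ```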

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    As a rule of thumb, I always normalize when working with k-NN. The z-transformation method rescales the data to a mean of 0 and a variance of 1, so the attributes' differing scales no longer come into play. (Note that this rescaling is linear, so it does not change the shape of the distribution.)
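
    This is easy to check numerically: the z-transformation standardizes the scale, but because it is linear, the shape of the distribution (including any skew) is preserved. A quick sketch with made-up lognormal data:

    ```python
    import numpy as np

    def z_transform(x):
        return (x - x.mean()) / x.std()

    def sample_skewness(x):
        z = (x - x.mean()) / x.std()
        return (z ** 3).mean()

    rng = np.random.default_rng(1)
    salary = rng.lognormal(mean=10.0, sigma=1.0, size=10_000)  # right-skewed

    z = z_transform(salary)
    # z.mean() ~ 0 and z.std() ~ 1 after the rescaling,
    # but sample_skewness(z) equals sample_skewness(salary)
    ```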

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Looks like someone is up early on a Saturday... :)
