Is it OK to log-transform attributes that are skewed but have different scales? E.g. salary and age for a certain income group or so...
Like two attributes that are both skewed, but where the scales are very different... or should I first normalize them and then log-transform?
Does the transformation affect the scales in some way? And after log-transforming, should I still normalize them, or is that no longer necessary?
From my perspective, these are just two separate transformations. Log transformation will change the shape of the underlying distribution, whereas normalization will not. Normalize is used to bring all attributes onto the same absolute value scale, which matters when you are using algorithms that are sensitive to the numerical scale of the attributes, such as k-NN or PCA. Log transformations are typically done to make distributions tighter or more "normal" rather than skewed. You can do one or the other or both, depending on what you are trying to accomplish.
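A minimal sketch of that distinction, using NumPy and synthetic lognormal "salary" data (all values here are made up for illustration): the log transform changes the shape of the distribution and removes the skew, while z-score normalization only shifts and rescales, leaving the skew exactly as it was.

```python
import numpy as np

rng = np.random.default_rng(0)

# A right-skewed attribute, e.g. salaries (synthetic, lognormal).
salaries = rng.lognormal(mean=10, sigma=1, size=1000)

# Log transform: changes the shape of the distribution (reduces skew).
log_salaries = np.log(salaries)

# Z-score normalization: shifts/rescales but keeps the shape (and the skew).
norm_salaries = (salaries - salaries.mean()) / salaries.std()

def skewness(x):
    """Sample skewness: third standardized moment."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

print(f"raw skew:        {skewness(salaries):.2f}")       # strongly positive
print(f"log skew:        {skewness(log_salaries):.2f}")   # close to 0
print(f"normalized skew: {skewness(norm_salaries):.2f}")  # identical to raw
```

Normalization is an affine map, which is why the skewness of the normalized attribute matches the raw one exactly.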
OK, but does the log transformation affect the scales as well, or just the skew? Will the scales still be in roughly the same ranges?
And if k-NN works better without normalization, as I found with my dataset -- should I stick with no normalization, or still normalize the data?
Naturally the log transformation alters the scale. And depending on the orders of magnitude involved, it will not necessarily put attributes into the same scale range either (unlike normalization, whose whole purpose is to put all attributes on the same scale). Take a look at this quick sample process, which shows the impact of both normalization and log transforms on the labor negotiations sample dataset.
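To see why a log transform alone doesn't equalize scales, here's a tiny sketch with hypothetical salary and age values: the log compresses each attribute's range, but the two attributes still end up in clearly separate ranges, so a normalization step would still be needed for scale-sensitive algorithms.

```python
import numpy as np

# Two skewed attributes on very different scales (hypothetical values).
salary = np.array([20_000.0, 35_000.0, 50_000.0, 120_000.0, 900_000.0])
age = np.array([22.0, 30.0, 38.0, 45.0, 60.0])

log_salary = np.log(salary)  # roughly 9.9 .. 13.7
log_age = np.log(age)        # roughly 3.1 .. 4.1

# The log compresses each range dramatically, but the two attributes
# still occupy disjoint scale ranges after the transform.
print(f"log salary range: [{log_salary.min():.2f}, {log_salary.max():.2f}]")
print(f"log age range:    [{log_age.min():.2f}, {log_age.max():.2f}]")
```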
I always normalize when using k-NN. If k-NN works better in your dataset without normalization, then implicitly what is happening is you are giving more weight in your distance metric to attributes that have larger absolute values. This may by chance turn out to be a good thing, but typically it is not an intended consequence, nor is the relative weighting of the different attributes necessarily easy to understand. If you have any nominal attributes and you are therefore using the mixed Euclidean measures distance metric, the asymmetrical impact is typically even worse. It may really be an indication of model overfitting (even if you are doing cross-validation).
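The implicit-weighting effect is easy to demonstrate. In this sketch (hypothetical customer values; the per-attribute standard deviations are assumed purely for illustration), two points differ by 1 year of age and $1,000 of salary, and without normalization the salary difference contributes almost the entire Euclidean distance.

```python
import numpy as np

# Two points differing by 1 year of age and $1,000 of salary
# (hypothetical values).
a = np.array([30.0, 50_000.0])   # [age, salary]
b = np.array([31.0, 51_000.0])

diff = a - b

# Unnormalized Euclidean distance: the salary difference dominates.
raw_dist = np.linalg.norm(diff)
salary_share = diff[1] ** 2 / raw_dist ** 2
print(f"raw distance: {raw_dist:.1f}, salary's share: {salary_share:.4%}")

# Dividing by per-attribute standard deviations (assumed values,
# i.e. z-scaling) lets both attributes contribute comparably.
stds = np.array([10.0, 20_000.0])
scaled_dist = np.linalg.norm(diff / stds)
print(f"scaled distance: {scaled_dist:.3f}")
```

Without normalization, salary carries over 99.99% of the squared distance here, which is exactly the unintended weighting described above.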
As a rule of thumb, I always normalize when working with k-NN. The z-transformation method rescales each attribute to a mean of 0 and a variance of 1, so differences in scale no longer dominate the distance calculation. (Note that it doesn't change the shape of the distribution — any skew remains; it only standardizes the scale.)
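A quick check of what the z-transformation does, on a small made-up skewed attribute: the result has mean 0 and standard deviation 1, while the shape of the distribution (and hence the skew) is untouched.

```python
import numpy as np

# z-transformation of a small, skewed attribute (hypothetical values).
x = np.array([1.0, 2.0, 2.0, 3.0, 50.0])
z = (x - x.mean()) / x.std()

print(np.isclose(z.mean(), 0.0))  # True: mean is 0
print(np.isclose(z.std(), 1.0))   # True: standard deviation is 1
# The relative spacing of the points is preserved, so the outlier
# at 50 is still an outlier after the transform.
```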