

Do I always need to execute a normalization/z-transformation to compare data within my data set and apply an ML model?

Mike0985 Member Posts: 9 Contributor I
edited May 23 in Help
Dear all,

First of all, I am a beginner with RapidMiner and data science techniques, so please be patient with me. I got the attached NBA data set from Kaggle, which I am using for a university project / exam.

In general, do I always need to execute a normalization (z-transformation) to compare data with each other within my data set, e.g. the NBA statistics in columns L - Q and W - AB, before applying a machine learning model such as Naive Bayes or linear/logistic regression?

Is outlier detection a real machine learning model, or more of a technique to filter out outliers? From what number of detected outliers is it advantageous to apply outlier detection, e.g. 10 or more?

I would be very grateful if someone could help me.

Regards,
Michael


Best Answer

  • Mike0985 Member Posts: 9 Contributor I
    Solution Accepted
    Hello Martin,

    Referring to your first comment: "Well, it depends on the algorithm you are using. In general, normalization never hurts, and it can help quite a bit. Some algorithms, like a decision tree, simply don't care. Then you lose interpretability but not predictive power."

    I still do not know exactly which ML model to use for my data set; I'm still working on that. I put the data set into the Auto Model function, and different ML models, like Naive Bayes or a regression, could be possible judging by e.g. the accuracy. Would you therefore say I should try both variants, with and without normalization, in Auto Model to see and compare which fits best?

    Referring to your second comment, "Outlier techniques can be used in several ways. They can be used to:"

    I applied outlier detection to my data set (more than 21,000 rows), but it reduced the data set by fewer than 10 outliers and took more than 30 minutes. Would you say outlier detection is still useful in this case, or is it better to leave it out and save the 30 minutes for the rest of the data science process?

    Thanks in advance.

    Regards,
    Michael



Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,053  RM Data Scientist
    Hi there,

    In general, do I always need to execute a normalization (z-transformation) to compare data with each other within my data set, e.g. the NBA statistics in columns L - Q and W - AB, before applying a machine learning model such as Naive Bayes or linear/logistic regression?


    Well, it depends on the algorithm you are using. In general, normalization never hurts, and it can help quite a bit. Some algorithms, like a decision tree, simply don't care. Then you lose interpretability but not predictive power.
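To make the idea concrete, here is a minimal sketch of what a z-transformation does (my own plain-Python illustration with a made-up points column, not something from this thread or RapidMiner's implementation): each value has the column mean subtracted and is divided by the column's standard deviation, so attributes on different scales become comparable.

```python
from statistics import mean, pstdev

def z_transform(values):
    """Z-transformation: subtract the column mean, divide by the
    column's (population) standard deviation. Afterwards the column
    has mean 0 and standard deviation 1, so differently scaled
    attributes (e.g. points vs. rebounds) sit on a common scale."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

# Hypothetical "points per game" column for illustration
points = [10, 20, 30, 40, 50]
print([round(z, 3) for z in z_transform(points)])
# → [-1.414, -0.707, 0.0, 0.707, 1.414]
```

In RapidMiner itself you would not write this by hand; the Normalize operator offers a z-transformation method that does the equivalent per column.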

    Is outlier detection a real machine learning model, or more of a technique to filter out outliers? From what number of detected outliers is it advantageous to apply outlier detection, e.g. 10 or more?

    Outlier techniques can be used in several ways. They can be used to:

    • Clean the data set to make it more interpretable
    • Get better models, since some models are affected by outliers (e.g. linear regression)
    • Gather information / act as an ML model in its own right, for example in predictive maintenance or fraud detection

    It all depends on how you use it.
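As a rough illustration of the first use (filtering), here is a sketch of my own: a simple z-score cutoff with an assumed threshold. RapidMiner's outlier operators use more sophisticated distance- or density-based methods, but the underlying idea of flagging points far from the bulk of the data is the same.

```python
from statistics import mean, pstdev

def flag_outliers(values, threshold=3.0):
    """Flag values whose z-score exceeds the threshold.

    A naive distance-from-the-mean rule with an assumed cutoff;
    shown only to illustrate the concept of outlier filtering."""
    mu = mean(values)
    sigma = pstdev(values)
    return [abs(x - mu) / sigma > threshold for x in values]

# Hypothetical "minutes played" column; 95 is an obvious outlier
minutes = [30, 31, 29, 32, 28, 30, 31, 95]
print(flag_outliers(minutes, threshold=2.0))
# → [False, False, False, False, False, False, False, True]
```

Rows flagged True could then be filtered out before training, or inspected on their own if the outliers themselves are what you are interested in (as in fraud detection).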


    Cheers,

    Martin



    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany