Due to recent updates, all users are required to create an Altair One account to login to the RapidMiner community. Click the Register button to create your account using the same email that you have previously used to login to the RapidMiner community. This will ensure that any previously created content will be synced to your Altair One account. Once you login, you will be asked to provide a username that identifies you to other Community users. Email us at Community with questions.
Handling Outliers
Hi All,
I have a question about outlier handling in Rapid Miner. Consider the following values of an attribute:
2
3
3
2
4
17
5
3
28
4
5
5
2
5
26
1
3
3
4
2
I have shortened the actual list of attribute values simply for the purposes of this question. As you can see, most of the observations are between 1 and 5. With a mean of just over 6, and a standard deviation of about 7.5, three values stand out as outliers, with two of them being more than two standard deviations away from the mean. With a much larger number of observations, these outliers become even more pronounced.
I've analyzed this variable using the Outlier operators in RapidMiner, and I'm confident that values above 8 are outliers, so my question is not about how to detect the outliers. My question is about *handling* them. I can use a filter to remove the entire observation associated with each outlier from my data set, but these observations have other attributes with accurate data that I want to keep and use in my model.--what I really want to do is set the outliers in this attribute to missing, and then use a missing value handler to set them to a normed value -- something like the mean, median or mode. This way I can keep the good attributes, "fix" the bad ones, and model without losing any observations. (I'm aware of the risks/issues associated with replacing data using descriptive statistics, and I'm choosing to accept those in this particular instance.)
I've tried doing this six ways from Sunday, but I'm not coming up with good ideas. I can add a series of "Declare Missing Values" operators in rapid succession and use each one to set each distinct outlier value to missing, but this is a real pain, especially if I have ten million observations with 25 distinct outlier values. I end up with 25 DMV operators in a row, cluttering my model and making it difficult to tell if I've treated all the outliers or not. I tried using a DMV numeric_value_filter with an expression of >8, but this parameter must not work the way I think it's supposed to, because I always get an 'attribute name undefined' error when I try that.
So, to boil it down: I would like to be able to set a range or expression on a numeric attribute that would set all values to missing in that attribute if they fall outside the range, or meet the expression parameters, and I'd like to accomplish it with a single operator. Any suggestions or solutions would be most appreciated.
Matt
I have a question about outlier handling in Rapid Miner. Consider the following values of an attribute:
2
3
3
2
4
17
5
3
28
4
5
5
2
5
26
1
3
3
4
2
I have shortened the actual list of attribute values simply for the purposes of this question. As you can see, most of the observations are between 1 and 5. With a mean of just over 6, and a standard deviation of about 7.5, three values stand out as outliers, with two of them being more than two standard deviations away from the mean. With a much larger number of observations, these outliers become even more pronounced.
I've analyzed this variable using the Outlier operators in RapidMiner, and I'm confident that values above 8 are outliers, so my question is not about how to detect the outliers. My question is about *handling* them. I can use a filter to remove the entire observation associated with each outlier from my data set, but these observations have other attributes with accurate data that I want to keep and use in my model.--what I really want to do is set the outliers in this attribute to missing, and then use a missing value handler to set them to a normed value -- something like the mean, median or mode. This way I can keep the good attributes, "fix" the bad ones, and model without losing any observations. (I'm aware of the risks/issues associated with replacing data using descriptive statistics, and I'm choosing to accept those in this particular instance.)
I've tried doing this six ways from Sunday, but I'm not coming up with good ideas. I can add a series of "Declare Missing Values" operators in rapid succession and use each one to set each distinct outlier value to missing, but this is a real pain, especially if I have ten million observations with 25 distinct outlier values. I end up with 25 DMV operators in a row, cluttering my model and making it difficult to tell if I've treated all the outliers or not. I tried using a DMV numeric_value_filter with an expression of >8, but this parameter must not work the way I think it's supposed to, because I always get an 'attribute name undefined' error when I try that.
So, to boil it down: I would like to be able to set a range or expression on a numeric attribute that would set all values to missing in that attribute if they fall outside the range, or meet the expression parameters, and I'd like to accomplish it with a single operator. Any suggestions or solutions would be most appreciated.
Matt
0
Best Answer
-
land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 UnicornHi Matt,
if the process is cool, just add it to MyExperiment using the Community Extension!
Greetings,
Sebastian0
Answers
what about the following approach:
1. Copy your dataset using Multiplier.
2. Filter the original examples so that you have only examples with the attribute values being outliers.
3. Generate a new Attribute containing missings
4. Filter the original examples so that you have the examples that are no outliers
5. Generate a new Attribute with the same name as a copy of the original one
6. Append both sets.
7. Remove the original attribute / Rename the new one to the old name.
Greetings,
Sebastian
Matt