How to remove outliers by ID or in general?
hi,
I tried out 3 different outlier detection algorithms on my dataset...
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.2.001" expanded="true" height="68" name="Retrieve Master3Klassen_nominal" width="90" x="45" y="30">
<parameter key="repository_entry" value="../../../data/Master3Klassen_nominal"/>
</operator>
<operator activated="true" class="generate_id" compatibility="7.2.001" expanded="true" height="82" name="Generate ID" width="90" x="112" y="165"/>
<operator activated="true" class="normalize" compatibility="7.1.001" expanded="true" height="103" name="Normalize" width="90" x="246" y="165">
<parameter key="method" value="range transformation"/>
<parameter key="min" value="-1.0"/>
</operator>
<operator activated="true" class="multiply" compatibility="7.2.001" expanded="true" height="124" name="Multiply" width="90" x="346" y="142"/>
<operator activated="true" class="anomalydetection:Connectivity-Based Outlier Factor (COF)" compatibility="2.3.002" expanded="true" height="103" name="Connectivity-Based Outlier Factor (COF)" width="90" x="581" y="442">
<parameter key="k" value="5"/>
<parameter key="parallelize evaluation process" value="true"/>
</operator>
<operator activated="true" class="anomalydetection:Local Outlier Factor (LOF)" compatibility="2.3.002" expanded="true" height="103" name="Local Outlier Factor (LOF)" width="90" x="581" y="187">
<parameter key="k_min (MinPtsLB)" value="3"/>
<parameter key="k_max (MinPtsUB)" value="30"/>
<parameter key="numerical_measure" value="CamberraDistance"/>
<parameter key="parallelize evaluation process" value="true"/>
</operator>
<operator activated="true" class="anomalydetection:k-NN Global Anomaly Score" compatibility="2.3.002" expanded="true" height="103" name="k-NN Global Anomaly Score (2)" width="90" x="514" y="75">
<parameter key="k" value="5"/>
<parameter key="use k-th neighbor distance only (no average)" value="true"/>
<parameter key="numerical_measure" value="CamberraDistance"/>
<parameter key="parallelize evaluation process" value="true"/>
</operator>
<connect from_op="Retrieve Master3Klassen_nominal" from_port="output" to_op="Generate ID" to_port="example set input"/>
<connect from_op="Generate ID" from_port="example set output" to_op="Normalize" to_port="example set input"/>
<connect from_op="Normalize" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="k-NN Global Anomaly Score (2)" to_port="example set"/>
<connect from_op="Multiply" from_port="output 2" to_op="Local Outlier Factor (LOF)" to_port="example set"/>
<connect from_op="Multiply" from_port="output 3" to_op="Connectivity-Based Outlier Factor (COF)" to_port="example set"/>
<connect from_op="Connectivity-Based Outlier Factor (COF)" from_port="example set" to_port="result 6"/>
<connect from_op="Connectivity-Based Outlier Factor (COF)" from_port="model" to_port="result 5"/>
<connect from_op="Local Outlier Factor (LOF)" from_port="example set" to_port="result 3"/>
<connect from_op="Local Outlier Factor (LOF)" from_port="model" to_port="result 4"/>
<connect from_op="k-NN Global Anomaly Score (2)" from_port="example set" to_port="result 1"/>
<connect from_op="k-NN Global Anomaly Score (2)" from_port="model" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
<portSpacing port="sink_result 5" spacing="0"/>
<portSpacing port="sink_result 6" spacing="0"/>
<portSpacing port="sink_result 7" spacing="0"/>
</process>
</operator>
</process>
Now I want to sort the outliers with the most significant ones on top (e.g. the top 100 outliers) for each of the 3 detection algorithms, and select those IDs that appear in the top 100 of every algorithm.
Then I want to remove those examples from the dataset, identified by their IDs. Is this somehow possible? I don't know how to do this...
Answers
You can use the Sort operator: select the outlier score as the attribute to sort by and choose descending order.
After that, use "Filter Example Range" with 1 as the first example and 100 as the last example. That will give you the top 100 outliers.
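The sort-then-cut logic described above can be sketched in plain Python, outside RapidMiner, purely as an illustration. The `rows` data and the `top_n_ids` helper are hypothetical stand-ins for the scored example set and are not part of the original process.

```python
def top_n_ids(rows, score_key, n):
    """Return the IDs of the n rows with the highest value under score_key."""
    # Sort by score, highest first (the role of Sort with descending order).
    ranked = sorted(rows, key=lambda r: r[score_key], reverse=True)
    # Keep only the first n rows (the role of Filter Example Range 1..n).
    return [r["id"] for r in ranked[:n]]


# Hypothetical scored rows: one outlier score per ID.
rows = [
    {"id": 1, "outlier": 0.2},
    {"id": 2, "outlier": 3.1},
    {"id": 3, "outlier": 1.7},
    {"id": 4, "outlier": 0.9},
]

top2 = top_n_ids(rows, "outlier", 2)
print(top2)  # [2, 3]
```

For the real dataset, `n` would be 100 and `score_key` would be the name of the outlier score attribute.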
OK thanks. Is there any way to do this in a combined fashion for more than one outlier algorithm? E.g. I have 4 columns from different outlier algorithms, I sort each of them descending and pick the IDs with the top 100 highest outlier scores for each of the 4 algorithms. Then I want to select those IDs that are present in the top 100 of all 4 algorithms,
like SVMLIB_Outlier_ID in COF_ID in LOF_ID... etc. How should I write a formula for that, or how should I build the process?
Hi,
Have you tried Generate Aggregation to get the max() or the avg() of the 4 scores and then filter on it?
~Martin
Dortmund, Germany
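As a rough illustration of the aggregation idea, the following plain-Python sketch combines several scores per ID with max or average and filters on the result. The score names, values, thresholds, and the `combine` and `flag` helpers are all hypothetical; scores from different algorithms may also be on different scales, so some rescaling might be needed before combining them.

```python
from statistics import mean

# Hypothetical scores per ID from three detectors.
scores = {
    1: {"knn": 0.5, "lof": 1.0, "cof": 1.5},
    2: {"knn": 1.0, "lof": 3.0, "cof": 2.0},
    3: {"knn": 0.25, "lof": 1.25, "cof": 3.0},
}


def combine(scores, how):
    """Collapse the per-detector scores of each ID into one value using `how`."""
    return {id_: how(list(vals.values())) for id_, vals in scores.items()}


def flag(combined, threshold):
    """Return the sorted IDs whose combined score reaches the threshold."""
    return sorted(id_ for id_, s in combined.items() if s >= threshold)


by_max = flag(combine(scores, max), 3.0)   # flagged if any detector scores high
by_avg = flag(combine(scores, mean), 2.0)  # flagged if detectors score high on average
print(by_max, by_avg)  # [2, 3] [2]
```

Using max flags an ID as soon as one detector rates it highly, while the average requires broader agreement between detectors.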
No, I didn't try avg or max yet, but I want a range, let's say 2 sigma of deviation for the outliers, which means 10% of my original dataset. Therefore I have to sort by highest outlier rank first and cut off the first 10% of the entries...
But I want to do that only for those IDs that are present in the top 10% of all 4 outlier algorithm scores. Is it possible to check IDs for presence in the top 10% of a column or something like that?
Hi Fred,
my idea would be:
Loop Attributes over the 4 outlier scores, take the top k of each with Filter Example Range, and use Join afterwards to find the IDs that were present in all of them.
~Martin
Dortmund, Germany
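The loop, top-k, and join idea can be sketched in plain Python as a set intersection over IDs, followed by removing the agreed-upon outliers. This is only an illustration of the logic, not a RapidMiner process; the data and the helper names `top_k_ids` and `consensus_outliers` are hypothetical. Inside RapidMiner, the final removal step might be done with something like Set Minus on the ID attribute, but that mapping is an assumption and not something stated in the thread.

```python
def top_k_ids(rows, score_key, k):
    """Return the set of IDs of the k highest-scoring rows for one score column."""
    ranked = sorted(rows, key=lambda r: r[score_key], reverse=True)
    return {r["id"] for r in ranked[:k]}


def consensus_outliers(rows, score_keys, k):
    """Return the IDs that are in the top k of every score column."""
    id_sets = [top_k_ids(rows, key, k) for key in score_keys]
    # Intersecting the ID sets plays the role of an inner join on ID.
    return set.intersection(*id_sets)


# Hypothetical example set with an ID and three outlier scores per row.
rows = [
    {"id": 1, "knn": 0.9, "lof": 5.0, "cof": 4.0},
    {"id": 2, "knn": 0.8, "lof": 4.0, "cof": 1.0},
    {"id": 3, "knn": 0.1, "lof": 1.0, "cof": 3.5},
    {"id": 4, "knn": 0.7, "lof": 3.0, "cof": 3.0},
    {"id": 5, "knn": 0.2, "lof": 1.1, "cof": 0.9},
    {"id": 6, "knn": 0.3, "lof": 0.9, "cof": 1.1},
]

outlier_ids = consensus_outliers(rows, ["knn", "lof", "cof"], k=3)
# Remove the consensus outliers from the dataset by ID.
cleaned = [r for r in rows if r["id"] not in outlier_ids]

print(sorted(outlier_ids))         # [1, 4]
print([r["id"] for r in cleaned])  # [2, 3, 5, 6]
```

To cut at the top 10% instead of a fixed count, `k` could be derived from the dataset size, for example `k = len(rows) // 10`.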
OK thanks, I will try that.