Used the template to learn about outliers for credit card fraud detection

tonyboy9 · September 2020

With a few changes to the template, this is my process.

I used X-Means and Detect Outlier (LOF) to detect possible fraud. The original data set contains over 284,000 rows. I selected the first 3,000 rows for my first try.
These are the results, shown as a left half and a right half, sorted by Outlier score from high to low.

In the right half, I see Class = 1 only in rows 2 and 5. I would guess those are outliers.

Row 2 has Outlier = 12.559 and row 5 has Outlier = 8.030. There are rows with higher outlier scores nearby. Since both of these have Class = 1, should I assume they are probably instances of fraud?

To compare, I selected 5,000 rows for a bigger data set. Detect Outlier (LOF) took longer to run, but I got results. The process remained the same; only the retrieved data set changed, which now has 5,000 rows.

This time Class = 1 occurs twice, with Outlier scores of 16.921 and 10.364. Neither is near the top of the list when it is sorted from high to low.

Where Class = 1 (fraud?), shouldn't the Outlier scores be higher?

Image: https://us.v-cdn.net/6030995/uploads/editor/sg/8w2prmye16ab.png

What am I possibly missing here?

Thanks for your time.

Tony

Telcontar120 · September 2020

It is hard to say for sure because I am not familiar with the details of your dataset. Technically speaking, though, it means that these four are the least like the other observations in their respective clusters. So that probably does mean these are the most likely to be fraudulent, but you should review the details of those individual cases to confirm it.

Telcontar120 · September 2020

Not exactly, because LOF finds outliers with respect to each point's neighbors. So to the extent that you have two clusters defined, and one of them contains many cases of fraud, the cases in the midst of that cluster may not have high outlier scores. The outlier scoring and the clustering are not really doing the same things.
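
To illustrate the point about neighbors, here is a minimal sketch in Python (outside RapidMiner, so this is not the Detect Outlier (LOF) operator itself). It uses a plain textbook LOF on synthetic 2-D points. The data, the group sizes, the `lof_scores` helper, and `k=5` are all assumptions made up for the illustration. A tight group of 20 look-alike "fraud" points sits far from the main cloud, next to a single isolated point.

```python
import math
import random
import statistics


def lof_scores(points, k=5):
    """Textbook Local Outlier Factor in plain Python.

    Assumes len(points) > k and no duplicate points (a duplicate-heavy
    neighborhood would make a reachability sum zero).
    """
    n = len(points)
    neighbors = []  # per point: its k nearest (distance, index) pairs
    k_dist = []     # per point: distance to its k-th nearest neighbor
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbors.append(dists[:k])
        k_dist.append(dists[k - 1][0])

    # Local reachability density: inverse of the mean reachability distance
    # from a point to its k nearest neighbors.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        lrd.append(k / sum(reach))

    # LOF: average neighbor density divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (k * lrd[i]) for i in range(n)
    ]


random.seed(0)
# Synthetic data, invented for this example only.
normal = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
fraud = [(random.gauss(8, 0.3), random.gauss(8, 0.3)) for _ in range(20)]
isolated = [(15.0, -15.0)]

scores = lof_scores(normal + fraud + isolated, k=5)
fraud_scores = scores[200:220]
isolated_score = scores[220]

print("median LOF in the fraud group:", round(statistics.median(fraud_scores), 2))
print("max LOF in the fraud group:", round(max(fraud_scores), 2))
print("LOF of the isolated point:", round(isolated_score, 2))
```

The fraud group is far from the normal data, but each of its points has close neighbors of the same kind, so its local density matches theirs and its score stays modest. The isolated point has no such neighbors and scores much higher. In this toy setup, a high score means "unlike its neighbors", not "fraud".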

tonyboy9 · September 2020

Thank you, Brian, your point is well taken: "The outlier scoring and the clustering are not really doing the same things."

Please see my screenshot, where the Outlier scores are sorted from high to low. Four of the highest are in clusters 0 and 1. Does this not mean that the higher the Outlier score, the farther out the outlier is, and therefore that fraud is more likely in those first four rows?

Thanks once more.

Tony
