
Used the template to learn about outliers for credit card fraud detection

tonyboy9 Member Posts: 113 Contributor II
edited September 2020 in Help
With a few changes to the template, this is my process.



I used X-Means and Detect Outlier (LOF) to flag possible fraud. The original data set contains over 284,000 rows; for my first try I kept only the first 3,000 rows.
These are the results, left half and right half, sorted by Outlier score from high to low.



In the right half, Class = 1 appears only in rows 2 and 5. I would guess those are outliers.

Row 2 has Outlier = 12.559 and row 5 has Outlier = 8.030. There are higher outlier scores nearby. Since both rows have Class = 1, can I assume they are probably instances of fraud?





To compare, I selected 5,000 rows for a bigger data set. Detect Outlier (LOF) took longer to run, but I got results. The process is unchanged; only the retrieved data set now has 5,000 rows.

This time Class = 1 again occurs twice, with Outlier scores of 16.921 and 10.364, neither of which is high on the list when sorted from high to low.

Where Class = 1 (fraud?), should the Outlier scores not be higher?
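
One way to make this question concrete outside RapidMiner is to check where the Class = 1 rows land once all rows are sorted by outlier score. The following is a minimal Python sketch, assuming the results have been exported as (outlier score, Class) pairs. The helper name `ranks_of_flagged` and the values in `rows` are made-up placeholders for illustration, not the actual results of this process.

```python
def ranks_of_flagged(rows):
    """Return the 1-based rank of every Class = 1 row, highest score first."""
    # Sort by outlier score, largest first, as in the results view.
    ordered = sorted(rows, key=lambda r: r[0], reverse=True)
    return [rank for rank, (score, cls) in enumerate(ordered, start=1) if cls == 1]


# Hypothetical (outlier score, Class) pairs, invented for this example.
rows = [(9.0, 0), (7.5, 0), (6.0, 1), (4.0, 0), (3.0, 1), (1.2, 0), (1.0, 0)]

flagged_ranks = ranks_of_flagged(rows)
print(flagged_ranks)
```

If the Class = 1 rows consistently sit far from rank 1, the outlier score alone is probably not separating fraud from normal transactions in that sample.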





What am I possibly missing here?

Thanks for your time.

Tony







Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Not exactly, because LOF scores each point relative to its nearest neighbors. If you have two clusters defined and one of them contains many cases of fraud, the cases in the midst of that cluster may not have high outlier scores. The outlier scoring and the clustering are not really doing the same things.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • tonyboy9 Member Posts: 113 Contributor II
    Thank you, Brian, your point is well taken: "The outlier scoring and the clustering are not really doing the same things."

    Please see my screenshot, where the Outlier scores are sorted high to low. Four of the highest are in clusters 0 and 1. Does this not mean that the higher the Outlier score, the farther out the point is, and therefore that fraud is more likely in those first four rows?

    Thanks once more.

    Tony
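
Brian's point that LOF measures isolation relative to a point's neighbors can be illustrated with a small sketch. This is a plain-Python version of the standard LOF definition, not RapidMiner's Detect Outlier (LOF) operator, whose settings and score scaling may differ. The coordinates, the "fraud" labels, and k = 3 are all invented for the example, and the function assumes the data has no pile of duplicate points.

```python
import math


def lof_scores(points, k):
    """Local Outlier Factor per point (O(n^2) sketch, assumes no heavy duplicates)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    k_dist, neighbors = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]  # distance to the k-th nearest neighbor
        k_dist.append(kd)
        # The k-neighborhood keeps ties at the k-distance.
        neighbors.append([j for j in order if dist[i][j] <= kd])
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: average density of the neighbors divided by the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Invented toy data: a grid of "normal" points, a tight group of "fraud"
# points far from the grid, and one isolated "normal" point.
normal = [(float(x), float(y)) for x in range(5) for y in range(5)]
fraud = [(20.0, 20.0), (20.1, 20.0), (20.0, 20.1), (20.1, 20.1), (20.05, 20.05)]
isolated = [(10.0, 10.0)]
points = normal + fraud + isolated

scores = lof_scores(points, k=3)
fraud_scores = scores[len(normal):len(normal) + len(fraud)]
isolated_score = scores[-1]

print("fraud group LOF:", [round(s, 2) for s in fraud_scores])
print("isolated point LOF:", round(isolated_score, 2))
```

In this toy setup the tightly grouped "fraud" points score close to 1 because their neighbors are just as dense as they are, while the single isolated point scores far higher even though it is not labeled as fraud. A high score therefore says that a row is unusual compared with its neighbors, which is not the same as saying it is likely fraud.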
     
