LOF on Text Data

tamberge · April 2019

Hello Team,

I am fairly new to RM and am currently doing some research on online text.
In particular, I am trying to detect outliers in a set of documents using the LOF operator.
I am running into trouble because the LOF for every document is very close to 1, no matter how I set MinPtsUB and MinPtsLB.
Before applying the LOF operator, I represented each document as a vector of term frequencies (TF) and as a vector of TF-IDF values.
So I have two ExampleSets representing the corpus, one a matrix of TF values and one a matrix of TF-IDF values, so that I can compare the results.
However, for both matrices I get LOF values that are equal or very close to 1, which does not make any sense to me.

Could you tell me whether I am doing something wrong and, if so, what?

Best

Please find my XML enclosed:

<?xml version="1.0" encoding="UTF-8" ?>
<process version="9.2.000">
  <context>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve PreppedTestData" width="90" x="112" y="34">
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
      </operator>
      <operator activated="true" class="detect_outlier_lof" compatibility="9.2.000" expanded="true" height="82" name="Detect Outlier (LOF)" width="90" x="447" y="34">
      </operator>
      <operator activated="false" class="anomalydetection:Local Outlier Factor (LOF)" compatibility="2.4.001" expanded="true" height="103" name="Local Outlier Factor (LOF)" width="90" x="380" y="340">
      </operator>
      <operator activated="true" class="store" compatibility="9.2.000" expanded="true" height="68" name="Store" width="90" x="648" y="34">
      </operator>
      <operator activated="false" class="write_excel" compatibility="9.2.000" expanded="true" height="82" name="Write Excel" width="90" x="581" y="442">
      </operator>
      <operator activated="true" class="retrieve" compatibility="9.2.000" expanded="true" height="68" name="Retrieve PreppedTestData (2)" width="90" x="112" y="187">
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.2.000" expanded="true" height="82" name="Select Attributes (2)" width="90" x="246" y="187">
      </operator>
      <operator activated="true" class="detect_outlier_lof" compatibility="9.2.000" expanded="true" height="82" name="Detect Outlier (2)" width="90" x="447" y="187">
      </operator>
      <operator activated="false" class="anomalydetection:Local Outlier Factor (LOF)" compatibility="2.4.001" expanded="true" height="103" name="Local Outlier Factor (2)" width="90" x="380" y="493">
      </operator>
      <operator activated="true" class="store" compatibility="9.2.000" expanded="true" height="68" name="Store (2)" width="90" x="648" y="187">
      </operator>
      <operator activated="false" class="write_excel" compatibility="9.2.000" expanded="true" height="82" name="Write Excel (2)" width="90" x="648" y="595">
      </operator>
    </process>
  </operator>
</process>

tamberge · April 2019

So I have been trying different methods in all possible combinations on a test set of 26 examples:

changing MinPts UB and LB (1-2, 2-3, 5-10),

choosing different vector types (TF, TF-IDF, Term Occurrences, Binary Term Occurrences),

pruning (filtering out frequent words and filtering out infrequent words).

However, I was not able to get any LOF values >> 1.

So does anyone have a theory about where this is coming from?

I can also share the data, if you want.

Telcontar120 · April 2019

I can't see your text data, but this is likely an artifact of the "curse of dimensionality": with a large TF-IDF vector, the multivariate differences between set members are simply not large enough to register under the LOF algorithm. This can easily happen if there are lots of terms in common and only a few differentiating terms. You might get better results by generating your TF-IDF matrix from a reduced wordlist that contains only words likely to be differentiating.
Or you could switch to a different outlier detection algorithm that is more inherently distance-based rather than density-based, such as the k-NN anomaly score, although you may still run into similar problems.
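
The distance-concentration effect behind this explanation can be illustrated with a small, self-contained Python sketch. It is not RapidMiner code and does not use the poster's data: it draws 26 uniformly random points (an assumption chosen only to match the test-set size mentioned above) and measures how far apart the largest and smallest pairwise distances are as the number of attributes grows. The function name `relative_contrast` is made up for this illustration.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min over all pairwise Euclidean distances
    of uniformly random points in the unit hypercube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    dists = [
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    ]
    return (max(dists) - min(dists)) / min(dists)


# 26 points, as in the test set described in the thread (data is synthetic).
contrast_by_dim = {d: relative_contrast(26, d) for d in (2, 20, 2000)}
for n_dims, contrast in contrast_by_dim.items():
    print(f"{n_dims:>5} dims: relative contrast = {contrast:.2f}")
```

As the dimensionality grows, the contrast shrinks, so every point looks about equally far from every other point. Density-based scores such as LOF then have little to work with, which is consistent with values that hover around 1. Real term vectors are sparse and not uniformly distributed, so this only shows the general tendency.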

tamberge · April 2019

Hi Brian, thank you for your quick reply. I guess I will just try to reduce the vector size by pruning more.
I will let you know if it has any positive impact on the outcome!
Thanks again!

tamberge · May 2019

I have found a solution to the challenge: skip pruning altogether and normalize the data before applying the LOF operator.
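
One plausible reason why normalizing helps is sketched below in plain Python. This is not the RapidMiner operator: `lof_scores` is a minimal LOF for a single neighbourhood size `k` (the operator works over a MinPts range), `zscore_columns` assumes a per-attribute z-transformation (the post does not say which normalization was used), and the ten two-attribute "documents" are invented for illustration. Attribute 0 stands in for a frequent term with a wide value range, attribute 1 for a rare term that only document 5 contains.

```python
import math


def lof_scores(points, k):
    """Minimal LOF for one neighbourhood size k.
    Assumes no duplicate points (they would cause a division by zero)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Indices of the k nearest neighbours of each point, excluding itself.
    neighbours = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    k_dist = [dist[i][neighbours[i][-1]] for i in range(n)]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbours relative to the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]


def zscore_columns(rows):
    """Scale every attribute (column) to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Invented data: attribute 0 is evenly spread, attribute 1 is zero
# everywhere except for document 5.
docs = [[10.0 * i, 0.0] for i in range(10)]
docs[5][1] = 3.0

raw = lof_scores(docs, k=2)
scaled = lof_scores(zscore_columns(docs), k=2)
print(f"document 5, raw attributes:        LOF = {raw[5]:.2f}")
print(f"document 5, normalized attributes: LOF = {scaled[5]:.2f}")
```

On the raw values the wide-ranging attribute dominates every distance, so document 5 scores close to 1. After normalization both attributes contribute on the same scale and document 5 gets the highest score in the set. Whether the same mechanism explains the result on the actual corpus cannot be confirmed from the thread.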
