What kind of algorithm should I use?

Antonios1 Member Posts: 9 Learner I
edited November 2020 in Help
Hi,
I have a dataset in which I would like to detect a cluster, like the red dots in the attached simplified picture. I tried cluster analysis and outlier analysis with several operators (LOF, k-means, x-means, decision tree, etc.) and even Auto Model, but I cannot tell whether I am on the right path and, above all, whether the operators I chose are the right ones. Can anybody help?

Comments

  • Options
    BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi @Antonios1,

    This looks like a textbook example of distance-based outlier detection. Check out the Anomaly Detection extension on the Marketplace, and Detect Outlier (LOF) included in Studio. Try to apply the appropriate algorithm to your data, play with the parameters, and visualize the results.

    If this fails: try the Cross Distances operator and analyze the numerical distances between the elements. Try to find thresholds like "X neighbors inside a distance of Y" that describe the clusters the way you need them.

    Regards,
    Balázs 
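
A rule like "X neighbors inside a distance of Y" can be sketched outside RapidMiner as follows. This is only a minimal Python illustration, not the Cross Distances operator: the data is synthetic, generated to resemble the one-column dataset described in this thread, and the function name `neighbors_within` and the thresholds `min_neighbors` and `radius` are assumptions chosen for the example.

```python
import bisect
import random


def neighbors_within(values, radius):
    """For each value, count the other values lying within `radius` of it."""
    ordered = sorted(values)
    counts = []
    for v in values:
        lo = bisect.bisect_left(ordered, v - radius)
        hi = bisect.bisect_right(ordered, v + radius)
        counts.append(hi - lo - 1)  # subtract 1 to exclude the point itself
    return counts


# Synthetic stand-in for the dataset: uniform noise plus one repeated value.
random.seed(0)
noise = [random.uniform(0, 50000) for _ in range(2781)]
cluster = [2900.0] * 266
data = noise + cluster

min_neighbors = 100  # the "X" in "X neighbors inside a distance of Y"
radius = 5.0         # the "Y"

counts = neighbors_within(data, radius)
dense = [v for v, c in zip(data, counts) if c >= min_neighbors]
print(len(dense), "points have at least", min_neighbors, "neighbors within", radius)
```

Note that a noise value that happens to fall within `radius` of the repeated value is flagged as well, so the thresholds need tuning on the real data.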
  • Options
    Antonios1 Member Posts: 9 Learner I
    Thank you for helping, Balázs.
    I tried Detect Outlier (LOF) and Detect Outlier (Distances) from Studio, and Local Outlier Probability (LOP) from the Marketplace, and I played with the parameters. Analyzing the results, I do not get anything significant, at least to me, except for the LOF operator, where the cluster of numbers I want to detect gets an outlier score of 0.
    If it helps, my dataset consists of 1 column with 3047 rows: 2781 rows contain numbers randomly ranging from 0 to 50000, and 266 rows contain a fixed number, which in my case is 2900. The 2900s are the ones I'd like to detect.




  • Options
    BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi,

    So can you use the LOF results? For example, with Generate Attributes to create the Cluster attribute (outlier == 0)?

    If the example set consists of just one attribute, you could aggregate by that attribute value and count the results. Then you could sort by the count in descending order and keep the top N classes, or remove classes having fewer than N examples.

    Regards,
    Balázs
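
The aggregate-and-count idea can be illustrated with a short Python sketch. It is not a RapidMiner process; the data is synthetic (integer noise is an assumption, since the thread does not say whether the values are integers), and `min_examples` is a hypothetical threshold.

```python
import random
from collections import Counter

# Synthetic stand-in for the one-column dataset described in the thread.
random.seed(0)
noise = [random.randint(0, 50000) for _ in range(2781)]
data = noise + [2900] * 266
random.shuffle(data)

# Aggregate by value and count the examples per value.
counts = Counter(data)

# Option 1: sort by count descending and keep the top N classes (N = 1 here).
top = counts.most_common(1)

# Option 2: remove classes having fewer than `min_examples` examples.
min_examples = 50
frequent = {value: n for value, n in counts.items() if n >= min_examples}

print(top)
print(frequent)
```

With only one attribute this is often the simplest route, because a value repeated hundreds of times stands out by its count alone.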
  • Options
    MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,509 RM Data Scientist
    I think the data set just contains no outliers :). It looks like an example of why LOF shows you no outliers, while a normal k-NN global anomaly score says every blue one is an outlier.

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Options
    Antonios1 Member Posts: 9 Learner I
    I think I have done it, but I don't know whether it is correct.
    Anyway, I am trying to understand: is my outlier score of 0 correct? I read that higher LOF values indicate outliers. Maybe 0 means the opposite, so in my case a lot of homogeneous values (2900)?

  • Options
    Antonios1 Member Posts: 9 Learner I
    edited November 2020
    Thanks for helping, @mschmitz.
    What kind of algorithm might I use to detect the red ones?
  • Options
    MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,509 RM Data Scientist
    You think the red ones are outliers? Common definitions of outliers would call either nothing an outlier or the blue ones.

    Anyway, I've reproduced your data set and used a k-NN global anomaly score on it. The outlier score separates the Gaussian cluster and the random noise very well:


    Attached is the process

    Best,
    Martin

    <?xml version="1.0" encoding="UTF-8"?><process version="9.8.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.8.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="generate_data" compatibility="9.8.000" expanded="true" height="68" name="Generate Data" width="90" x="45" y="34">
            <parameter key="target_function" value="random"/>
            <parameter key="number_examples" value="50"/>
            <parameter key="number_of_attributes" value="2"/>
            <parameter key="attributes_lower_bound" value="-10.0"/>
            <parameter key="attributes_upper_bound" value="10.0"/>
            <parameter key="gaussian_standard_deviation" value="10.0"/>
            <parameter key="largest_radius" value="10.0"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.8.000" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="34">
            <list key="function_descriptions">
              <parameter key="label" value="&quot;random&quot;"/>
            </list>
            <parameter key="keep_all" value="true"/>
          </operator>
          <operator activated="true" class="generate_data" compatibility="9.8.000" expanded="true" height="68" name="Generate Data (2)" width="90" x="45" y="136">
            <parameter key="target_function" value="single gaussian cluster"/>
            <parameter key="number_examples" value="1000"/>
            <parameter key="number_of_attributes" value="2"/>
            <parameter key="attributes_lower_bound" value="-10.0"/>
            <parameter key="attributes_upper_bound" value="10.0"/>
            <parameter key="gaussian_standard_deviation" value="0.5"/>
            <parameter key="largest_radius" value="10.0"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.8.000" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="179" y="136">
            <list key="function_descriptions">
              <parameter key="label" value="&quot;gaussian&quot;"/>
              <parameter key="att1" value="att1-5"/>
              <parameter key="att2" value="att2+3"/>
            </list>
            <parameter key="keep_all" value="true"/>
          </operator>
          <operator activated="true" class="append" compatibility="9.8.000" expanded="true" height="103" name="Append" width="90" x="380" y="34">
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
            <parameter key="merge_type" value="all"/>
          </operator>
          <operator activated="true" class="anomalydetection:k-NN Global Anomaly Score" compatibility="2.4.001" expanded="true" height="103" name="k-NN Global Anomaly Score" width="90" x="581" y="34">
            <parameter key="k" value="10"/>
            <parameter key="use k-th neighbor distance only (no average)" value="false"/>
            <parameter key="measure_types" value="MixedMeasures"/>
            <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
            <parameter key="nominal_measure" value="NominalDistance"/>
            <parameter key="numerical_measure" value="EuclideanDistance"/>
            <parameter key="divergence" value="GeneralizedIDivergence"/>
            <parameter key="kernel_type" value="radial"/>
            <parameter key="kernel_gamma" value="1.0"/>
            <parameter key="kernel_sigma1" value="1.0"/>
            <parameter key="kernel_sigma2" value="0.0"/>
            <parameter key="kernel_sigma3" value="2.0"/>
            <parameter key="kernel_degree" value="3.0"/>
            <parameter key="kernel_shift" value="1.0"/>
            <parameter key="kernel_a" value="1.0"/>
            <parameter key="kernel_b" value="0.0"/>
            <parameter key="parallelize evaluation process" value="false"/>
            <parameter key="number of threads" value="8"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Append" to_port="example set 1"/>
          <connect from_op="Generate Data (2)" from_port="output" to_op="Generate Attributes (2)" to_port="example set input"/>
          <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Append" to_port="example set 2"/>
          <connect from_op="Append" from_port="merged set" to_op="k-NN Global Anomaly Score" to_port="example set"/>
          <connect from_op="k-NN Global Anomaly Score" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>




    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
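
For readers without the Anomaly Detection extension, the idea behind a k-NN global anomaly score (the average distance to the k nearest neighbors) can be sketched in plain Python. This is a simplified illustration, not the extension's implementation: it handles only one numeric column, uses synthetic data resembling the asker's description rather than the two-attribute Gaussian cluster in the process above, and the name `knn_global_score` is made up for the example.

```python
import random


def knn_global_score(values, k=10):
    """Average distance from each value to its k nearest neighbors (1-D only)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ordered = [values[i] for i in order]
    scores = [0.0] * len(values)
    for pos, idx in enumerate(order):
        # In sorted 1-D data, the k nearest neighbors lie within k positions.
        lo = max(0, pos - k)
        hi = min(len(ordered), pos + k + 1)
        dists = sorted(
            abs(ordered[j] - ordered[pos]) for j in range(lo, hi) if j != pos
        )
        scores[idx] = sum(dists[:k]) / k
    return scores


# Synthetic stand-in: uniform noise plus 266 copies of one value.
random.seed(0)
noise = [random.uniform(0, 50000) for _ in range(2781)]
data = noise + [2900.0] * 266

scores = knn_global_score(data, k=10)

# The repeated value has 10 identical neighbors, so its score is exactly 0.
flagged = [v for v, s in zip(data, scores) if s == 0.0]
print(len(flagged), "points have a score of 0")
```

Here the hidden cluster is found by looking for the lowest scores, which is the reverse of the usual use of an anomaly score.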
  • Options
    Antonios1 Member Posts: 9 Learner I
    Thank you, @mschmitz. Your suggestion spots exactly my homogeneous cluster of numbers hidden among the noise: 266 out of 266. I also realized that I can import the process :-) and I'll study it to understand more. So, if I am right, the lower the outlier score, the higher the probability that a point belongs to the type of cluster I am looking for, right? If my assumption is correct, one of the operators suggested by @BalazsBarany (LOF) works correctly too, identifying my hidden cluster with an outlier score of 0. Thank you @BalazsBarany, thank you @mschmitz.
  • Options
    MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,509 RM Data Scientist
    Exactly. Keep in mind that outliers usually have a high score. In your case, the points you are looking for are what common definitions would call 'normal points'.

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany