How to determine thresholds (small, medium or large change) for the level of difference per year?

N_28 Member Posts: 9 Learner I

Hi all,

I would like to establish thresholds for the level of difference per year so that I can classify my numerical values as a small, medium, or large change. That way I can conclude what the influence of COVID-19 is, based on my attributes.

However, I have the following obstacles:

  1. I want to determine thresholds for my data to classify each value as a small, medium, or large change, but I do not know how to derive them from the dataset. Do I have to plot some kind of graph to do this?
  2. Furthermore, the annual difference can be either negative or positive. However, when I use the small-medium-large classification, I cannot tell whether a change is negative or positive. How can I resolve this? I was thinking about adding an extra attribute alongside the small-medium-large classification that is either positive or negative. Would this have a negative impact if I later build a prediction model? Or is there a better way to do this?
  3. Also, which operators do I have to use for the problems mentioned above?
Thank you so much in advance!

Answers

  • N_28 Member Posts: 9 Learner I
    Can anyone please help me out? I have been struggling with this for days. I thought of making a bell curve in RapidMiner and then taking 1/3 as small, 2/3 as medium, and so on. Unfortunately, this did not work out. Instead, I used the Aggregate operator with the percentile aggregation function on, e.g., the difference between 2014 and 2015 (positive values only), followed by the difference between 2015 and 2016, and so on, as I want to observe a trend. I then used the Generate Attributes operator to average the 33rd percentile across all those years to get the threshold for a small change, etc. Is this a good approach? And what about my negative values, given that I have only applied this method to my positive values so far?
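
    As a rough illustration outside RapidMiner, the percentile idea can be sketched in plain Python. It computes the 33rd and 66th percentile cut points on the absolute values of the differences, so negative and positive changes share one set of thresholds, and it keeps the sign as a separate direction label. The sample differences, the function names, and the choice of the "inclusive" quantile method are assumptions for illustration, not part of the process described above.

```python
from statistics import quantiles


def tercile_thresholds(diffs):
    """Return the two cut points that split the absolute changes into thirds."""
    magnitudes = [abs(d) for d in diffs]
    low, high = quantiles(magnitudes, n=3, method="inclusive")
    return low, high


def classify(diff, low, high):
    """Label one annual difference by size and, separately, by direction."""
    magnitude = abs(diff)
    if magnitude <= low:
        size = "small"
    elif magnitude <= high:
        size = "medium"
    else:
        size = "large"
    direction = "positive" if diff > 0 else "negative" if diff < 0 else "none"
    return size, direction


# Hypothetical annual differences, e.g. value(2015) - value(2014).
diffs = [-12.0, -3.0, 1.0, 2.0, 4.0, 7.0, 15.0, -30.0, 0.5]
low, high = tercile_thresholds(diffs)
labels = [classify(d, low, high) for d in diffs]
```

    Whether a separate direction attribute helps or hurts a later prediction model depends on the model and the data; this sketch only shows that size and sign can be kept apart.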
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Hi,
    have you checked the discretize operators? Sounds like a job for them.

    Best,
    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • N_28 Member Posts: 9 Learner I
    Thank you so much for responding, @mschmitz! I have been advised to plot my data, ideally if it fits a bell curve, in order to determine the size of my dataset (the range of its values), so that I can divide it (e.g. in a bell curve) into 3 sections. So, e.g., 1/3 of the dataset is a “small change”, 2/3 is a “medium change”, and the rest should be a “large change”.

    I have been trying the discretize operator as you advised to achieve this. I have made 3 bins so that I can determine this small, medium and large annual change (in years).
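
    For comparison, here is a minimal Python sketch of equal-width binning into 3 bins. It is assumed here to be roughly what a binning-style discretize operator does; the post does not say which discretize operator was used. The sample values and helper names are hypothetical. The point is that a single extreme value can leave most rows in the lowest bin, which is the main practical difference from percentile-based cut points.

```python
def equal_width_edges(values, bins=3):
    """Cut points that split [min, max] into `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]


def bin_index(value, edges):
    """Index of the interval a value falls into (0 = lowest interval)."""
    return sum(value > edge for edge in edges)


# Hypothetical absolute annual changes containing one extreme value.
changes = [0.5, 1.0, 2.0, 3.0, 4.0, 7.0, 12.0, 15.0, 90.0]
edges = equal_width_edges(changes)

# Count how many values land in each of the three bins.
counts = [0, 0, 0]
for c in changes:
    counts[bin_index(c, edges)] += 1
```

    With these made-up numbers, eight of the nine values fall into the lowest bin and the middle bin stays empty, whereas 33rd/66th percentile cut points would give three groups of similar size.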

    To give you an example this is what the results are before and after discretization.



    Do you think this is a good way to divide the annual differences in my dataset into small, medium, and large change buckets, instead of taking the 33rd and 66th percentiles? And what about my negative and infinite values: do they ruin my later conclusions about whether a change occurred?
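
    One possible way to deal with the infinite values, sketched in Python, is to set non-finite differences aside before any thresholds are computed and to report them separately. The post does not say where the infinite values come from, so the sample data and the helper name are purely illustrative.

```python
import math


def split_finite(diffs):
    """Separate usable differences from infinite or missing (NaN) ones."""
    finite = [d for d in diffs if math.isfinite(d)]
    excluded = [d for d in diffs if not math.isfinite(d)]
    return finite, excluded


# Hypothetical differences including infinities and a missing value.
diffs = [0.10, -0.25, float("inf"), 0.40, float("-inf"), float("nan"), -0.05]
finite, excluded = split_finite(diffs)

# Thresholds would then be derived from `finite` only; `excluded` rows can be
# given their own label instead of distorting the cut points.
```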
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Hi,
    there is no real easy way of doing it. In the end one needs to adapt these thresholds to the business problem at hand.
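
    If the cut points come from domain knowledge rather than from the data, applying them is straightforward. The Python sketch below uses hypothetical thresholds of 5 and 20 (percent change per year); these numbers and the function name are placeholders, not recommendations.

```python
from bisect import bisect_left

# Hypothetical cut points chosen for the business problem (percent per year).
THRESHOLDS = [5.0, 20.0]
LABELS = ["small", "medium", "large"]


def label_change(pct_change):
    """Map a signed percent change to a size label using fixed cut points."""
    # bisect_left treats a value equal to a cut point as the lower class.
    return LABELS[bisect_left(THRESHOLDS, abs(pct_change))]


sizes = [label_change(x) for x in (-3.0, 5.0, 12.0, -45.0)]
```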

    Best,
    Martin