How to select the right data for prediction?

User111113 Member Posts: 24 Maven
Hi All,

I have about 2 years of historical data that I can probably use to predict responses.

For example, if I have to predict my response rate for Jan 2020, how can I tell how much data is enough to come close to the actual rate?

- Should I look at how my data performed in Jan 2018 and Jan 2019, and maybe the last 4 months of 2019?

- Or should it be the last four months of 2019 plus Jan 2019?

- Or should I use everything I have? I am not comfortable with that because of the many outliers.

When I compared actual and predicted values for the past few months, they were not close at all, because the prediction was done manually (on a piece of paper).

How do I select the right data?

Thank you.


Best Answer


  • PaulMSimpson Member Posts: 8 Contributor II
    Since this data has date/time stamps, you are looking at it the right way. I suggest you begin by using, say, the first 18 months of existing data to train, then testing your model on the most recent 6 months. Then compare the accuracy of that model with one trained on the first 22 months and tested on the final 2 months. Whichever gives better accuracy is what I would then use to predict January 2020: that is, use either the 18 months or the 22 months preceding Jan 2020 as training data. The reason the 18 months "may" be more accurate is that things change: processes change, or something else changes that influences the data. Simply experiment with different training window lengths.
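
As a rough illustration of the split comparison described above, here is a minimal Python sketch (not a RapidMiner process). The monthly rates are made-up placeholder values, `mean_forecast` is a hypothetical stand-in for whatever model is actually used, and RMSE is just one possible error measure.

```python
from math import sqrt

# Hypothetical monthly response rates, oldest first (24 made-up values
# standing in for Jan 2018 .. Dec 2019; replace with real data).
rates = [0.050 + 0.001 * (i % 12) for i in range(24)]

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mean_forecast(train, horizon):
    """Placeholder model: predict the training mean for every future month."""
    avg = sum(train) / len(train)
    return [avg] * horizon

def evaluate_split(series, n_train, model=mean_forecast):
    """Train on the first n_train points, test on everything after them."""
    train, test = series[:n_train], series[n_train:]
    return rmse(test, model(train, len(test)))

# Compare several chronological split points: 18/6, 20/4 and 22/2 months.
results = {n: evaluate_split(rates, n) for n in (18, 20, 22)}
for n, err in results.items():
    print(f"train={n} months, test={len(rates) - n} months, RMSE={err:.4f}")
```

The key point is that the training rows always come before the test rows in time. Note that the test periods differ between splits, so the errors are only a rough guide to which window length works better.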
  • User111113 Member Posts: 24 Maven

    Thank you for your response. I will try both ways. Which method would be better for testing accuracy in this case?

    For validation I normally use cross validation or split validation. In this case I would use cross validation, but any other suggestions are welcome.

  • User111113 Member Posts: 24 Maven
    I ran my model on the first 18 months of data and predicted the next 4 months instead of 6, just to see if it is effective.

    To measure performance, I predicted the response rate for 4 months (July-Oct), for which I already have the actual values, and fed those actuals in to see how much the predictions deviate from them. I got a root mean squared error of 0.016,

    which isn't bad. What do you think?
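
Whether an RMSE of 0.016 is good depends on the scale of the response rate itself, so one way to judge it is to compare against a naive baseline on the same test months. The sketch below assumes made-up numbers (none of them come from this thread) and uses the training-period mean as a hypothetical baseline.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical values for four test months (Jul-Oct); not from the thread.
actual = [0.10, 0.12, 0.11, 0.13]
model_pred = [0.11, 0.11, 0.12, 0.12]

# Naive baseline: always predict the (hypothetical) training-period mean.
train_mean = 0.09
baseline_pred = [train_mean] * len(actual)

model_rmse = rmse(actual, model_pred)
baseline_rmse = rmse(actual, baseline_pred)

print(f"model RMSE:    {model_rmse:.4f}")
print(f"baseline RMSE: {baseline_rmse:.4f}")
print("model beats baseline" if model_rmse < baseline_rmse else "model does not beat baseline")
```

If the model's error is not clearly below the baseline's, the model is adding little beyond the historical average.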
  • PaulMSimpson Member Posts: 8 Contributor II
    To respond to your earlier post today, I would not recommend cross validation here, because we want to train the model on earlier data and test it on later data, and ordinary cross validation does not preserve that time order. Just split the data into 18 months oldest/6 months newest, or 20 months/4 months, or even 22 months/2 months, and build and test the model that way. Also, look at accuracy, the true positive rate, and the true negative rate. Sometimes an F1 score is the best metric for comparing models; it depends on how evenly distributed your labeled 1's and 0's are. Then go ahead and try it with a different split point in time.
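
For reference, the metrics mentioned above can be computed from the four confusion-matrix counts. This is a minimal Python sketch with made-up labels; the function name `classification_metrics` is just for illustration.

```python
def classification_metrics(actual, predicted):
    """Accuracy, true positive rate, true negative rate and F1 for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0        # recall / sensitivity
    tnr = tn / (tn + fp) if (tn + fp) else 0.0        # specificity
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if (precision + tpr) else 0.0
    return {"accuracy": accuracy, "tpr": tpr, "tnr": tnr, "f1": f1}

# Made-up imbalanced example: 2 responders out of 10, model predicts all 0.
actual = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
all_zero = [0] * 10
imbalanced = classification_metrics(actual, all_zero)
print(imbalanced)
```

In this example accuracy looks decent while the true positive rate and F1 are zero, which is why F1 can be the more useful metric when 1's are rare.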
  • User111113 Member Posts: 24 Maven
    I am not able to split my data by date. I have 2 separate columns, one for month and one for year, but no date column, so I couldn't figure that out.

    Another approach I tried was to add a status column before loading the data into RM, dividing the rows into old/new. However, the split operator only accepts standard settings such as a ratio and other default columns. How can I split using the status column from my data?

    Also, I left the RR column blank where status is new, because that will be my test data.

    Kindly help. Thank you.
  • User111113 Member Posts: 24 Maven
    I used a filter on the status column to split the data. Do you think that's the right approach? I couldn't do it with split validation; please see the attached picture below.
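
The logic of a chronological split from separate month and year columns can be sketched in Python as follows. This is not a RapidMiner process; the column names (`year`, `month`, `rr`, `status`), the sample rows, and the cutoff are all hypothetical, and the month is assumed to be a number from 1 to 12.

```python
from datetime import date

# Made-up rows with separate year and month columns (no date column).
rows = [
    {"year": 2019, "month": 5, "rr": 0.051},
    {"year": 2019, "month": 6, "rr": 0.048},
    {"year": 2019, "month": 7, "rr": 0.055},
    {"year": 2019, "month": 8, "rr": 0.050},
]

def period_start(row):
    """Combine year and month into a date (first day of that month)."""
    return date(row["year"], row["month"], 1)

# Hypothetical cutoff: the last month that belongs to the training data.
cutoff = date(2019, 6, 1)

# Filtering on the derived date gives the chronological split.
train = [r for r in rows if period_start(r) <= cutoff]
test = [r for r in rows if period_start(r) > cutoff]

# The same rule expressed as an old/new status label on each row.
for r in rows:
    r["status"] = "old" if period_start(r) <= cutoff else "new"

print(len(train), "training rows,", len(test), "test rows")
```

Filtering on a status column gives the same result as long as the status was assigned strictly by time, with every "old" row earlier than every "new" row.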
