Can more data be harmful to a prediction?
Hello everyone, as part of a university project I decided to experiment a bit with the data set I was given: I fed different aggregation levels of the data into Auto Model and compared the resulting solutions.
Even at that point I was a bit confused that the aggregated data often delivered better results than the more granular version.
Since the data spans 27 weeks and new regular attributes are added each week, I also built a model for every week to see when a model would theoretically be operational for a first deployment.
I expected accuracy and gain to increase slowly over the weeks, but instead I got an extreme peak in week 7 with very high accuracy and very good gain, which then declines drastically and is only surpassed by the best model in week 19. From week 19 on the performance decreases again but stays good until the predictions stop changing from week 23 to 27.
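To make the setup clearer (since I can't share the data), here is a minimal sketch of what I did, on purely synthetic stand-in data: one model is trained per "week", using only the attributes that have arrived up to that week, and evaluated with cross-validation. The feature layout and model choice here are illustrative assumptions, not my actual data or process.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_weeks = 300, 27

# Synthetic stand-in: a couple of informative columns plus noise columns,
# with one new column "arriving" each week.
X = rng.normal(size=(n_samples, n_weeks))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n_samples) > 0).astype(int)

scores = []
for week in range(1, n_weeks + 1):
    # Train only on the attributes available up to this week.
    acc = cross_val_score(
        LogisticRegression(max_iter=1000), X[:, :week], y, cv=5
    ).mean()
    scores.append(acc)

best_week = int(np.argmax(scores)) + 1
print(f"best week: {best_week}, accuracy: {max(scores):.3f}")
```

In my real experiment the per-week accuracies did not rise monotonically like one might expect; they peaked early (week 7) and again at week 19.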
My questions now are: is such behavior normal, and why does it happen? Looking at the problem, I cannot really think of a reason why more information would be harmful to a prediction, but that clearly seems to be the case. Furthermore, if the prediction were actually put to use, should I stop with the model from week 19 or still use the model from week 27?
Sadly, I am not allowed to share the data.
Thanks in advance for any help!