Can more data be harmful for a prediction?

Tyche, Member, Posts: 6, Newbie

Hello everyone, as part of a university project I decided to experiment a bit with the data set I was given and fed different aggregation levels of the data into Auto Model to compare the results.
At that point I was already a bit confused that the aggregated data often delivered better results than the disaggregated data.
Since the data spans 27 weeks and more regular attributes are added every week, I also built a model for every week to see when a model would theoretically be ready for a first deployment.
I expected a slow increase in accuracy and gain over the weeks, but instead I got an extreme peak in week 7 with very high accuracy and very good gain. Performance then declines drastically and is only surpassed by the best model in week 19. From week 19 on, performance decreases again but stays good until the predictions stop changing in weeks 23-27.

My questions are whether such behavior is normal and why it happens. Looking at the problem, I cannot think of a reason why more information would be harmful to a prediction, but that clearly seems to be the case. Furthermore, if the prediction were actually to be used, should I stop at the model from week 19 or still use the model from week 27?

Sadly, I am not allowed to share the data.

Thanks in advance for your help.


Best Answer


  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,505 RM Data Scientist
    Great post!

    Thus, it is not always the case that more information leads to better outcomes.  For example, considering too much information in the form of too many attributes can definitely lead to less robust models because it encourages overfitting, and in some cases can even make it difficult for the algorithm to identify the true signal amidst all the extra noise.  This is why "feature selection" is an approach in data science projects, to try to reduce the factors considered to those which have more consistent, stronger relationships with the target.
    I would like to counter with regularization here. If I regularize properly, then my models should not fit this "noise". For me, that is the whole argument for why the learning curve (sample size vs. performance) should saturate.
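
Both positions above can be tried out on a toy problem. The following is only a minimal sketch on synthetic data, not on the data set from the question: the target depends on one column, 25 further columns are pure noise, and a plain least-squares fit is compared with a ridge-penalised fit. All names (`make_data`, `fit_ridge`, `evaluate`), the sample sizes, and the penalty `lam=10` are assumptions chosen for illustration, and the model has no intercept to keep the code short.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_ridge(X, y, lam):
    """Minimise sum of squared errors + lam * ||w||^2 (no intercept).
    lam = 0 gives ordinary least squares."""
    p = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(p)]
    return solve(A, b)

def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    return sum((sum(a * c for a, c in zip(row, w)) - t) ** 2
               for row, t in zip(X, y)) / len(y)

def make_data(n, n_noise, rng):
    """Synthetic data: y depends only on column 0; other columns are noise."""
    X = [[rng.gauss(0, 1) for _ in range(1 + n_noise)] for _ in range(n)]
    y = [2.0 * row[0] + rng.gauss(0, 1) for row in X]
    return X, y

rng = random.Random(0)
X_train, y_train = make_data(30, 25, rng)    # few rows, many attributes
X_test, y_test = make_data(2000, 25, rng)    # large hold-out set

def evaluate(n_cols, lam):
    """Fit on the first n_cols columns; return (train MSE, test MSE)."""
    Xtr = [row[:n_cols] for row in X_train]
    Xte = [row[:n_cols] for row in X_test]
    w = fit_ridge(Xtr, y_train, lam)
    return mse(Xtr, y_train, w), mse(Xte, y_test, w)

results = {
    "signal only, no penalty": evaluate(1, 0.0),
    "all columns, no penalty": evaluate(26, 0.0),
    "all columns, ridge lam=10": evaluate(26, 10.0),
}
for name, (train, test) in results.items():
    print(f"{name:28s} train MSE {train:.3f}   test MSE {test:.3f}")
```

Adding columns can never raise the training error of an unpenalised least-squares fit, so the interesting number is the test MSE. With this few rows the unpenalised fit on all columns would typically do worse on the hold-out set than the signal-only fit, and the penalty would typically recover part of that gap. How much it recovers depends on the penalty strength, which in practice would be tuned by cross-validation.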


    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany