"X-Validation for dependent data"
I am looking for a way to validate a classification model whose input consists of partially dependent examples. I have read a couple of research papers that suggest h-block or hv-block cross-validation as a more robust way to validate a model in such a scenario. Although I believe I understand the concepts in general, I am pretty much clueless when it comes to implementing them in RapidMiner.
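To show what I understand by hv-block cross-validation, here is a minimal sketch in plain Python (not RapidMiner). It assumes a single time-ordered series of n examples, which is simpler than my multi-machine data. The function name and the parameters `n`, `h` and `v` are my own choices for illustration: each test block is the 2v+1 examples around a centre point, and a further h examples on each side are dropped from the training set.

```python
def hv_block_splits(n, h, v):
    """Yield (train_idx, test_idx) pairs for n time-ordered examples.

    Assumption: examples are indexed 0..n-1 in time order.
    test  = the 2*v + 1 examples centred on i
    train = everything except the test block and h examples on each side of it
    """
    for i in range(v, n - v):
        test = list(range(i - v, i + v + 1))
        lo = max(0, i - v - h)          # first index removed from training
        hi = min(n, i + v + h + 1)      # first index allowed in training again
        train = list(range(0, lo)) + list(range(hi, n))
        yield train, test


splits = list(hv_block_splits(n=10, h=1, v=1))
train, test = splits[2]   # centre point i = 3
print(test)    # [2, 3, 4]
print(train)   # [0, 6, 7, 8, 9]
```

With `v=0` the test block shrinks to a single example, which is how I read the plain h-block variant.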
To give a bit more color on my scenario, I am attaching a short CSV file with made-up data. I have a number of identical machines, each running independently of the others. All machines have the same attributes, and each example consists of those attribute values taken at one point in time during a production run (the time points usually differ from machine to machine and are irregularly spaced). The label indicates whether a machine needs maintenance during the current production run.
Ignoring the dependence between examples that belong to the same machine and just running a regular cross-validation across all data points leads to beautifully accurate models. However, applying those models to fresh, unseen data results in quite bad predictions, regardless of the chosen model type.
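One alternative I can imagine is splitting by machine instead of by example, so that all examples of a machine land in the same fold. Below is a minimal sketch of that idea in plain Python, not a RapidMiner process. The function name and the `machine_ids` list are hypothetical, and I am assuming each example carries a machine identifier.

```python
import random


def machine_wise_folds(machine_ids, k, seed=0):
    """Yield (train_idx, test_idx) so that no machine appears on both sides.

    machine_ids: one (hypothetical) machine identifier per example.
    """
    machines = sorted(set(machine_ids))
    random.Random(seed).shuffle(machines)
    # Assign whole machines to folds in round-robin order.
    fold_of = {m: j % k for j, m in enumerate(machines)}
    for fold in range(k):
        test = [i for i, m in enumerate(machine_ids) if fold_of[m] == fold]
        train = [i for i, m in enumerate(machine_ids) if fold_of[m] != fold]
        yield train, test


machine_ids = ["A", "A", "A", "B", "B", "C", "C", "C", "D", "D"]
folds = list(machine_wise_folds(machine_ids, k=2))
for train, test in folds:
    train_machines = {machine_ids[i] for i in train}
    test_machines = {machine_ids[i] for i in test}
    print(sorted(train_machines), sorted(test_machines))
```

The accuracy from such a split should be closer to what I see on fresh machines, since the model is always tested on machines it has never seen during training.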
I would like to know how others deal with such datasets. I also considered reshaping the data so that each measurement (at time x) becomes an attribute, leaving only one example per machine, but this would lead to a very wide and not necessarily useful dataset.
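To illustrate the reshaping I have in mind, here is a small sketch in plain Python. The column layout (machine, time, a single `temperature` attribute, label) and the values are made up, and I am assuming the label is constant within one machine's production run.

```python
from collections import defaultdict

# Made-up long-format rows: (machine_id, time, temperature, label)
rows = [
    ("A", 0.0, 70.1, "ok"),
    ("A", 2.5, 71.3, "ok"),
    ("B", 0.0, 69.8, "maintenance"),
    ("B", 1.0, 75.2, "maintenance"),
    ("B", 4.0, 80.4, "maintenance"),
]


def to_wide(rows):
    """Return one record per machine, with one attribute per measurement."""
    by_machine = defaultdict(list)
    for machine, t, temp, label in rows:
        by_machine[machine].append((t, temp, label))
    wide = {}
    for machine, obs in by_machine.items():
        obs.sort()                        # order measurements by time
        record = {"label": obs[-1][2]}    # assumption: one label per run
        for k, (_, temp, _) in enumerate(obs):
            record[f"temperature_{k}"] = temp
        wide[machine] = record
    return wide


wide = to_wide(rows)
print(wide["A"])
print(wide["B"])
```

Because the machines are measured at different, irregular time points, the records end up with different numbers of attributes (machine A has no `temperature_2`), and the k-th measurement does not mean the same point in time for every machine. That is part of why I doubt the wide format is useful.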
Also, and somewhat related: my dataset is unbalanced, with a ratio of about 0.65/0.35 between the two classes. How do I make sure that "useful" examples are chosen when I sample it down to a balanced dataset?
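For reference, the simplest approach I know is random undersampling of the larger class, sketched below in plain Python. The function name and the toy labels are hypothetical. Note that a random draw has no notion of which examples are "useful", which is exactly what I am asking about.

```python
import random
from collections import defaultdict


def downsample_to_balance(labels, seed=0):
    """Return sorted indices of a subset in which every class is equally frequent."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    n_min = min(len(idx) for idx in by_class.values())
    rng = random.Random(seed)
    kept = []
    for idx in by_class.values():
        # Randomly keep n_min examples of each class (all of the smallest class).
        kept.extend(rng.sample(idx, n_min))
    return sorted(kept)


labels = ["no"] * 13 + ["yes"] * 7    # toy data with a 0.65/0.35 ratio
kept = downsample_to_balance(labels)
print(len(kept), [labels[i] for i in kept].count("yes"))
```

I assume this should only be applied to the training part of each fold, so that the test part keeps the original class ratio.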
Thank you very much!