Data preparation
Hello, my problem with data preparation is:
The screenshot above shows the structure of my data. I want to predict the label column using the Windowing operator, based on inputs such as the date (for example 12.08.2018) and the attributes (1 to 10). My problem is that on some days the attributes have no value, yet the label column still has one. Here is the catch: in RapidMiner, to predict the label column I need every input to be filled in (for example, I can't predict the label for a given day if att3, or any other attribute, is empty in that row). The "Replace Missing Values" operator is not a solution, because replacing the gaps with the minimum, maximum or average of my previous data is not suitable for my problem.
I would like to hear some advice: maybe some kind of aggregation of the data, so that I don't need to use the Replace Missing Values operator?
Thanks in advance
Best Answer
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi @Bambo
Several suggestions:
- You can use the Replace Missing Values (Series) operator and select a replace type (in your case "linear interpolation" or "previous value" may be relevant).
- You can use the Impute Missing Values operator.
"This operator estimates values for the missing values of the selected attributes by applying a model learned for missing values."
The main difficulty is to find the best learner...
Warning: it seems to me that you need at least one valid attribute (which is not your case) to use this operator.
So you can combine the two strategies: use Replace Missing Values (Series) to obtain one or more valid attributes, and then apply Impute Missing Values.
- Did you think of the trivial solution? Sometimes in data science the simplest model is the best one. So I propose removing all your "att" attributes and performing a univariate time series analysis (using only the "Date" and "Label" attributes). Something to think about.
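To make the two replace types from the first point concrete, here is a minimal sketch in plain Python. It is not RapidMiner code and does not claim to reproduce the operator exactly; the function names and the sample values are made up for illustration, and missing values are represented as None.

```python
from typing import List, Optional

Series = List[Optional[float]]


def fill_previous(values: Series) -> Series:
    """Replace each missing value (None) with the last observed value."""
    filled: Series = []
    last: Optional[float] = None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None until a first value has been seen
    return filled


def fill_linear(values: Series) -> Series:
    """Fill interior gaps by interpolating between the known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled  # leading and trailing gaps are left as None


# Hypothetical daily values of one attribute, with gaps.
att3 = [0.25, None, None, 1.0, None]
print(fill_previous(att3))  # [0.25, 0.25, 0.25, 1.0, 1.0]
print(fill_linear(att3))    # [0.25, 0.5, 0.75, 1.0, None]
```

Note that "previous value" can fill a trailing gap while interpolation cannot, since interpolation needs a known value on both sides of the gap.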
I hope it helps,
Regards,
Lionel
Answers
Thanks lionelderkrikor, your suggestion to use "Impute Missing Values" is a great solution for me; I didn't know about this operator before. Combined with "Optimize Parameters (Evolutionary)", it works like magic. Thanks a lot for solving this problem.
Hi again,
You're welcome @Bambo.
Glad you solved your problem.
Your dataset piqued my curiosity: what is the topic of your study?
Regards,
Lionel
So basically each attribute (1 to 10) is a Windows program that ran on a given day; if it is empty, the user didn't run that program that day. The value in each row is how much time the user spent in the given application (seconds converted to a weight from 0 to 1). The labels are converted to numbers, but in general each one is the user (name, surname) who was using the computer that day. Long story short, the model predicts the label, i.e. which user is most likely to have used the computer on a given day, based on the input values.
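As a rough illustration of that row structure, here is a small Python sketch. The exact conversion from seconds to a 0 to 1 weight is not spelled out above, so dividing by the number of seconds in a day is only an assumption, and the function and constant names are hypothetical.

```python
from typing import Dict, List, Optional

SECONDS_PER_DAY = 24 * 60 * 60  # assumed normalisation constant, not confirmed
N_ATTRIBUTES = 10               # att1 .. att10


def encode_day(seconds_by_program: Dict[int, float]) -> List[Optional[float]]:
    """Build one row of weights; None marks a program that was not run."""
    row: List[Optional[float]] = []
    for att in range(1, N_ATTRIBUTES + 1):
        seconds = seconds_by_program.get(att)
        row.append(None if seconds is None else seconds / SECONDS_PER_DAY)
    return row


# Hypothetical day: program 1 used for 6 hours, program 3 for 12 hours.
row = encode_day({1: 21600, 3: 43200})
print(row[:3])  # [0.25, None, 0.5]
```

The None entries are exactly the missing values that have to be handled (for example with Impute Missing Values) before the label can be predicted.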