Data preparation

Bambo Member Posts: 6 Contributor I
edited November 2018 in Help

Hello, my problem with data preparation is shown in this screenshot: data_prep_problem.PNG

The screenshot above shows the structure of my data. My goal is to predict the label column using the Windowing operator, based on inputs such as the date (for example 12.08.2018) and the attributes (1 to 10). My problem is that on some days some attributes have no value, although the label column still has one. In RapidMiner I need every input filled in to predict the label: for example, I can't predict the label for a given day if att3 or any other attribute is missing in that row. The "Replace Missing Values" operator is not a solution, because replacing the gaps with the minimum, maximum, or average of my previous data is not suitable for my problem.


I would like to hear some advice. Is there maybe some type of aggregation of the data that would let me avoid the Replace Missing Values operator?


Thanks in advance


Best Answer

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Solution Accepted

    Hi @Bambo


    A few suggestions:


    - You can use the Replace Missing Values (Series) operator and select a replace type; in your case "linear interpolation" or "previous value" may be relevant.

    - You can use the Impute Missing Values operator.

      "This operator estimates values for the missing values of the selected attributes by applying a model learned for missing values."

      The main difficulty is finding the best learner.

      Warning: it seems to me that you need at least one attribute without missing values (which is not your case) to use this operator. So you can combine the two strategies: use Replace Missing Values (Series) to obtain one or more complete attributes, and then apply Impute Missing Values.

    - Have you considered the trivial solution? Sometimes in data science the simplest model is the best one. I suggest removing all your attributes (the "att" columns) and performing a univariate time series analysis using only the "Date" and "Label" attributes. Something to think about.
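To make the two replace types mentioned above concrete, here is a minimal plain-Python sketch of "previous value" and "linear interpolation" on one attribute column. It is only an illustration outside RapidMiner, not the operator's actual implementation; the function names, the use of `None` for a missing value, and the sample numbers are assumptions made for this example.

```python
from typing import List, Optional

Series = List[Optional[float]]


def fill_previous(values: Series) -> Series:
    """Carry the last observed value forward; leading gaps stay missing."""
    filled: Series = []
    last: Optional[float] = None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def fill_linear(values: Series) -> Series:
    """Fill gaps that lie between two observed values on a straight line.

    Gaps before the first or after the last observation are left missing.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Hypothetical daily values of one attribute, None = no value that day.
att1 = [0.2, None, 0.6, None, None, 0.3]
print(fill_previous(att1))
print(fill_linear(att1))
```

Both variants only look at the attribute's own history, which is why they can produce the complete attributes that a model-based imputation step would then build on.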


    I hope it helps,








  • Bambo Member Posts: 6 Contributor I

    Thanks lionelderkrikor, your suggestion to use "Impute Missing Values" is a great solution for me; I didn't know about this operator before. Combined with "Optimize Parameters (Evolutionary)" it works like magic. Thanks a lot for solving my problem.

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi again,


    You're welcome @Bambo.

    Glad you solved your problem.


    Your dataset piqued my curiosity: what is the topic of your study?






  • Bambo Member Posts: 6 Contributor I

    Basically, each attribute (1 to 10) is a Windows program that ran on a given day; if the value is empty, the user did not run that program that day. The values in each row show how much time the user spent in the given application (seconds converted to a weight from 0 to 1). The labels are converted to numbers, but in general each one is the user (name, surname) who was using the computer that day. Long story short, this model predicts the label, that is, which user is most likely to use the computer on a given day, based on the input values.
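The encoding described above could look roughly like the following sketch. The thread does not say how seconds are converted to a weight, so the rule used here (a program's share of that day's total usage) is purely an assumption, as are the function name, the three-program subset, and the sample numbers. Since an empty cell means the program was not run, the sketch writes such cells as 0.0 instead of leaving them missing.

```python
from typing import Dict, Sequence

# Hypothetical subset of the ten attributes (one per Windows program).
PROGRAMS = ("att1", "att2", "att3")


def to_weights(
    seconds: Dict[str, float], programs: Sequence[str] = PROGRAMS
) -> Dict[str, float]:
    """Turn one day's usage in seconds into weights in [0, 1].

    Assumption: a weight is the program's share of that day's total usage.
    Programs that were not run get 0.0 rather than a missing value.
    """
    total = sum(seconds.get(p, 0.0) for p in programs)
    if total == 0:
        return {p: 0.0 for p in programs}
    return {p: seconds.get(p, 0.0) / total for p in programs}


# att2 was not run on this hypothetical day.
day = {"att1": 1800.0, "att3": 5400.0}
print(to_weights(day))
```

Whether 0.0 is an appropriate stand-in for "not run" depends on the model; it is one option to compare against imputation, not a recommendation from the thread.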
