Options

Copying values iteratively

iesnaolaiesnaola Member Posts: 8 Contributor II
edited November 2018 in Help

Dear Community,

 

I am dealing with some data preprocessing in order to fill some Missing Values on my Dataset. The problem is that I have two attributes with different sampling (i.e. one of them has data hourly, and the other one just daily). So, I have this dataset:

 

date           atr1   atr2

d1_09:00      5        ?

d1_10:00      6        ?

d1_11:00      5        20

d1_12:00      5        ?

...

d2_09:00      7        ?

d2_10:00      6        ?

d2_11:00      5        13

d2_12:00      6        ?

 

I would like to be able to take a value of attr2 and use it to fill MV until the next value is found. So, this example would be end somehow like:

date           atr1   atr2

d1_09:00      5        ?

d1_10:00      6        ?

d1_11:00      5        20

d1_12:00      5        20

...

d2_09:00      7        20

d2_10:00      6        20

d2_11:00      5        13

d2_12:00      6        13

 

I guess I should be using some "Loop" operator, but so far I couldn't achieve what I am looking for.

 

Anyone dealing with similar issues?

 

Thanks in advance,

Iker

Best Answer

  • Options
    MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Solution Accepted

    Hi,

     

    i think replace Missing Values (Series) of Series extension should do the trick.

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany

Answers

  • Options
    SGolbertSGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    Hi Iker,

     

    If there is a single value for att2 every day, you could separate the date attribute into Day and Hour and then use the day as key for a left/right join. I attach a sample process.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="r_scripting:execute_r" compatibility="8.1.000" expanded="true" height="82" name="Execute R" width="90" x="112" y="34">
    <parameter key="script" value="# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;rm_main = function()&#10;{&#10; data = expand.grid(Day = 1:2, Hour = 0:23)&#10;&#10; data$att1 = sample(1:100, 48)&#10;&#10; data$att2 = NA&#10;&#10; data[data$Hour == 1, &quot;att2&quot;] &lt;- c(10, 20)&#10; &#10;&#10; # connect 2 output ports to see the results&#10; return(list(data))&#10;}&#10;"/>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="Filter Examples" width="90" x="313" y="34">
    <list key="filters_list">
    <parameter key="filters_entry_key" value="att2.is_not_missing."/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.1.000" expanded="true" height="82" name="Select Attributes" width="90" x="514" y="136">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="att2"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.1.000" expanded="true" height="82" name="Select Attributes (2)" width="90" x="514" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="att1"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <operator activated="true" class="concurrency:join" compatibility="8.1.000" expanded="true" height="82" name="Join" width="90" x="715" y="34">
    <parameter key="join_type" value="right"/>
    <parameter key="use_id_attribute_as_key" value="false"/>
    <list key="key_attributes">
    <parameter key="Day" value="Day"/>
    </list>
    </operator>
    <connect from_op="Execute R" from_port="output 1" to_op="Filter Examples" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
    <connect from_op="Filter Examples" from_port="original" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Join" to_port="right"/>
    <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Join" to_port="left"/>
    <connect from_op="Join" from_port="join" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Regards,

    Sebastian

  • Options
    iesnaolaiesnaola Member Posts: 8 Contributor II

    Hi @SGolbert

    Even though I have already installed R Scripting extension (v 7.0.0) on my Rapidminer Studio (v7.1) I cannot load your example. I get the idea of what you say, but I cannot see how to implement it...

     

    BR

     

     

  • Options
    SGolbertSGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    Hi,

     

    is it possible that you don't have the R path configured? (settings -> Preferences -> R Scripting)

     

    I attach a CSV file with the dataset just to be sure.

     

    Best,

    Sebastian

  • Options
    landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn

    Hi,

    using R of course solves the problem. However, using R solves any problem with an arbitrary overhead.

    The single operator @mschmitz quoted will do the trick. (easier AND with less computational overhead)

     

    If you additionally have the problem that the timestamps are not nicely distributed, eg some of them are missing or they are not equidistant, try the "Resample Multiple Series" operator from the Jackhammer extension of the marketplace. May also be worth a look if you have many of the attributes as the "Replace Missing Values (Series)" operator only works on a single attribute.

     

    Greetings,

     Sebastian

     

Sign In or Register to comment.