Design Question on join and Issue with replace missing value series operator

msacs09msacs09 Member Posts: 55 Contributor II
Experts,

I'm trying to show the result before and after replacing missing values, along with the aggregate AVG values, but I get completely wrong results the moment I bring in the original sample and the derived AVG field.

Further, the replace missing value series operator doesn't impute the missing values here. What could be the problem?

Thanks for your time. Please see the attached sample data and process.

Fixed and Released · Last Updated

Fixed in 9.2.1. TSE-78

Comments

  • Options
    msacs09msacs09 Member Posts: 55 Contributor II
    @tftemme Wow, terrific, sir. Excellent. Thank you for your valuable time.
  • Options
    msacs09msacs09 Member Posts: 55 Contributor II
    edited February 2019
    @tftemme A quick follow-up: is there a way to combine this with a forecast/prediction, specifically for the values that cannot be interpolated and need extrapolation? Further, is there a way to obtain performance metrics through some back testing, to evaluate whether interpolation or a forecast is better for a given data/time?

    Many Thanks
  • Options
    tftemmetftemme Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 164 RM Research
    Hi @msacs09

    This is a somewhat more advanced setup, but in general you can of course use forecasting to fill missing values at the end of the time series. In fact, in my improved example I am already doing this kind of forecasting. I fit the linear function on the non-missing values, but the 'fitted' output port gives me the values for all index values. So I can replace the missing values with the values from the fitted function (and as this fit is based on past values, it is basically a forecast).
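
As a rough illustration of this idea outside RapidMiner (a minimal Python sketch, not the attached process): fit a straight line to the non-missing values by ordinary least squares and use the fitted values at the missing positions. The function name `fill_with_linear_fit` and the sample numbers are made up for illustration.

```python
import math

def fill_with_linear_fit(values):
    """Replace NaNs with the value of a least-squares line fitted to the known points."""
    known = [(i, v) for i, v in enumerate(values) if not math.isnan(v)]
    if len(known) < 2:
        raise ValueError("need at least two non-missing values to fit a line")
    n = len(known)
    mean_x = sum(i for i, _ in known) / n
    mean_y = sum(v for _, v in known) / n
    sxx = sum((i - mean_x) ** 2 for i, _ in known)
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in known)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Keep known values; use the fitted line only where the value is missing.
    return [v if not math.isnan(v) else slope * i + intercept
            for i, v in enumerate(values)]

# Hypothetical series: one gap inside and two gaps at the end.
series = [10.0, 12.0, math.nan, 16.0, math.nan, math.nan]
filled = fill_with_linear_fit(series)
print(filled)
```

Because the line is defined for every index, the same call covers both the interior gap and the trailing gaps that plain interpolation cannot reach.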

    You could also, for example, train an ARIMA model (or any other forecast model) on the non-missing values and use the Apply Forecast operator to forecast the next values. With the parameter 'add combined time series' selected, you don't even need to join the forecast with the original time series. I updated (and attached) the process with such an option. I had to add some macros to determine the forecast horizon automatically.

    As for your second question: without knowing the true values, calculating performance (and hence deciding whether interpolation or forecasting is better) is always really hard. You can of course create test data by setting some known values to missing and then evaluating the regression performance of your interpolation and your forecast (both try to predict unknown numbers, thus regression).
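
A minimal sketch of such a back test in Python, under stated assumptions: the data are invented, `carry_forward` is only a naive stand-in for a real forecast model, and all function names are hypothetical. Known values are hidden, each method fills them in, and the RMSE is computed on the hidden positions only.

```python
import math

def interpolate_linear(values):
    """Fill interior NaNs on straight lines between the nearest known neighbours."""
    out = list(values)
    known = [i for i, v in enumerate(values) if not math.isnan(v)]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            out[i] = values[left] + frac * (values[right] - values[left])
    return out

def carry_forward(values):
    """Naive stand-in for a forecast: repeat the last known value."""
    out = list(values)
    last = math.nan
    for i, v in enumerate(out):
        if math.isnan(v):
            out[i] = last
        else:
            last = v
    return out

def backtest_rmse(truth, hidden, fill_fn):
    """Hide the given positions, fill them with fill_fn, and score against the truth."""
    masked = [math.nan if i in hidden else v for i, v in enumerate(truth)]
    filled = fill_fn(masked)
    squared = [(truth[i] - filled[i]) ** 2 for i in hidden]
    return math.sqrt(sum(squared) / len(squared))

truth = [10.0, 12.0, 15.0, 16.0, 18.0, 21.0, 22.0]  # invented, fully known series
hidden = [2, 4]                                     # positions we pretend are missing

rmse_interp = backtest_rmse(truth, hidden, interpolate_linear)
rmse_naive = backtest_rmse(truth, hidden, carry_forward)
print(f"interpolation RMSE: {rmse_interp:.3f}, carry-forward RMSE: {rmse_naive:.3f}")
```

The result only says which method did better on the positions that were hidden, so the hidden positions should resemble the real gaps (for example, hide values at the end of the series if the real gaps are at the end).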

    Hope this helps
    Best regards,
    Fabian
  • Options
    msacs09msacs09 Member Posts: 55 Contributor II
    edited February 2019

    Dear @tftemme  (Sir)

    I have a couple of questions, and first of all thank you for your time.

     (1) I'm kind of lost with ARIMA on "ystart (length of your time series)". Where do we define "ystart" in our process? I see that we are passing only non-missing net sales values (i.e. "NETSALES_interpolated = Not Missing") to the ARIMA model, so I'm confused about what it is complaining about. In fact, I removed all missing net sales values and I get the message "Exception: com.rapidminer.timeseriesanalysis.exception.ArgumentsEmptyException
    Message: Provided values array is empty"

    (2) On a side note, I need your expert advice on the best approach to impute monthly sales from annual sales. For some customers we only get annual revenues, and we want to impute the monthly sales from the available annual sales; we have annual sales for the past 4 years. Should we just average the annual sales values and divide by 12? Is that the right thing to do, or is there a better process out there?

    As always, thank you for your valuable input.
  • Options
    tftemmetftemme Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 164 RM Research
    Hi @msacs09 ,

    For everyone else reading this thread: @msacs09 sent me a PM with his data, which cannot be shared publicly. An error occurred saying:
    "The combination of the parameters p,d,q dependent to the length of the series is not allowed. p: 1, d: 0, q: 0, length: 5.
    The given parameters p,d,q ar not valid for the given series to apply HannanRissanen (HR) to estimate Start Parameters. the condition ystart < (length -d) is not fulfilled: length: 5, d:0, q:0, p:1 ystart = Math.max(maxOrderOfInitialARProcess + q, p) = 6, maxOrderOfInitialARProcess of HR = 6"

    I have to say, although the error message is not wrong, it is way too complex and confusing, and it will be replaced by a proper one in the future. Back to the concrete question:

    1) Several variables mentioned in this error message are used internally in the ARIMA fitting procedure, so the user cannot do anything about them. The error message basically says there are not enough Examples to train the ARIMA model. The correct condition for the user would be:
    length - d > max(q + 6, p), with length = length of the time series.
    So the error is not about missing values, but about having too few Examples overall. After filtering out the missing values (with the Filter Examples operator), fewer than 7 Examples are left, which is not enough to fit any ARIMA model.
    But even 7 Examples is such a small number that I would not train an ARIMA model on them.
    With such a small number of Examples, I would suggest using the fitted linear function as the forecasting/replacement method.
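
The condition above can be checked before training with a small helper. This is a Python sketch of the inequality only, not RapidMiner code; the function name and the `hr_order` parameter name are hypothetical, and the default of 6 is taken from the "maxOrderOfInitialARProcess of HR = 6" part of the error message.

```python
def enough_examples_for_arima(length, p, d, q, hr_order=6):
    """Return True if length - d > max(q + hr_order, p), the condition quoted in the thread."""
    return length - d > max(q + hr_order, p)

# The failing case from the error message: p=1, d=0, q=0, length=5.
print(enough_examples_for_arima(length=5, p=1, d=0, q=0))   # 5 > 6 does not hold
# With the same parameters, 7 Examples is the smallest length that passes.
print(enough_examples_for_arima(length=7, p=1, d=0, q=0))   # 7 > 6 holds
```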

    As for the error message which says: "Exception: com.rapidminer.timeseriesanalysis.exception.ArgumentsEmptyException
    Message: Provided values array is empty", could you send me the process and the data causing this again? This error should not occur; instead, a corresponding UserError (red bubble) should be thrown.

    2) Dividing the annual sales by 12 is a proper way to do this. Keep in mind that you are not adding any information, so the only assumption you can make about the monthly sales is that they are all the same throughout the year. So dividing by 12 is a reasonable way to go.
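
For completeness, the even split can be written as a short Python sketch. The function name and the figures are invented for illustration; each year's total is simply spread equally over its 12 months.

```python
def spread_annual_evenly(annual_sales):
    """Map {year: annual total} to {(year, month): annual total / 12}."""
    return {(year, month): total / 12
            for year, total in annual_sales.items()
            for month in range(1, 13)}

annual = {2017: 1200.0, 2018: 1500.0}   # hypothetical annual sales
monthly = spread_annual_evenly(annual)
print(monthly[(2017, 1)], monthly[(2018, 12)])
```

The monthly values of a year always add back up to the annual total, which is the only property this imputation guarantees.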

    Best regards,
    Fabian
  • Options
    msacs09msacs09 Member Posts: 55 Contributor II
    @tftemme  Sir,

    Thank you for the feedback. I have sent you the data privately. I filtered my data set down to only 1 missing interval and I get the error below. My confusion is this: out of 24 monthly intervals (2 years), quite a few IDs do not have any missing intervals, about 363 unique IDs with all 24 months, which is clearly more than 7 Examples. So I'm kind of confused here.


     



  • Options
    tftemmetftemme Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 164 RM Research
    Hi @msac

    This indeed seems to be a bug. I have to look into what is going wrong there. Thanks for finding it.

    Best regards,
    Fabian
  • Options
    sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Moving to Product Feedback. @tftemme, have you filed a ticket?