# Design Question on join and Issue with replace missing value series operator

msacs09
Member Posts:

**55**Contributor II
Experts,

I'm trying to show the result before and after replacing missing values and also show the aggregate AVG values and somehow i get completely wrong results the moment i bring in original sample and the avg derived field.

Further, the replace missing value series operator doesn't impute the missing values here.. what could be the problem.

Thx for your time . Please see attached sample data and process

I'm trying to show the result before and after replacing missing values and also show the aggregate AVG values and somehow i get completely wrong results the moment i bring in original sample and the avg derived field.

Further, the replace missing value series operator doesn't impute the missing values here.. what could be the problem.

Thx for your time . Please see attached sample data and process

Tagged:

0

## Comments

164RM ResearchI added also an improved version of your process. As the Replace Missing Values does not need to overwrite attributes, there is no need to keep the old values with the Generate Attributes operator (and thereby removing one join). I also added the Extract Coefficients (Polynomial Fit) operator with degree = 1 in the Loop Collection. Then I uses the Replace Missing Values operator without ensure finite values, join the fitted function to the result and generate a new attribute, containing either the interpolated values, or the ones from the fit, if the interpolation was not correct. (And while doing that, I realized that the index attribute in the Extract Coefficients operator is missing in the meta information, so I will fix this with the next patch release ;-))

I attached both versions to this thread. Have fun mining ;-)

Best regards,

Fabian

55Contributor II55Contributor IIMany Thanks

164RM ResearchSo this is a bit more advanced setup, but in general you can of course use forecasting to fill missing values for the end of the time series. In fact in my improved example, I am already doing such kind of forecasting. I fit the linear function on the non-missing values, but the 'fitted' output port give me the values on all index values. So I can replace the missing values with the values from the fitted function (and as this fit is based on past values, it is basically a forecast).

You could also for example train an ARIMA model (or any other forecast model) on the non-missing values and use Apply Forecast operator to forecast the next values. With the parameter 'add combined time series' selected you don't even need to join the forecast with the original time series. I updated (and attached) the process with such an option. Had to add some macros to automatically determine the forecast horizon.

For your second question. Without knowing the true values, calculating performance (and hence deciding if interpolation or forecasting is better) is always really hard. You can of course create test data, by setting some values to missing and then evaluate the regression performance of your interpolation and your forecast (both try to predict unknown numbers, thus regression).

Hopes this helps

Best regards,

Fabian

55Contributor II## Dear @tftemme (Sir)

I have couple of questions and first of all thank you for your time.(1) So i'm kind of lost here with ARIMA on "

ystart(length of your time series)". Where do we define "ystart" in our process? Actually I see that we are passing no missing netsales values (i.e) "NETSALES_interpolated = Not Missing" to ARIMA model, so i'm kind of confused on what is it complaining about? in fact i removed all missing net sales values and I get the message as "Exception: com.rapidminer.timeseriesanalysis.exception.ArgumentsEmptyException(2) On the side note I need your expert advise on what would be the best approach to impute monthly sales based on annual sales (i.e) for some customers we only get annual revenues and we wanted to impute the monthly sales based on the available annual sales, we have annual sales from past 4 years? should we just average and divide that annual sales values by 12? Is that the right thing, just wanted to see if there is any better process out there?

As Always thank you for your value input.

s

164RM ResearchFor everyone reading also this thread, @msacs09 send me a pm with his data, which cannot be shared publicly. But an Error occured saying:

"The combination of the parameters p,d,q dependent to the length of the series is not allowed. p: 1, d: 0, q: 0, length: 5.

The given parameters p,d,q ar not valid for the given series to apply HannanRissanen (HR) to estimate Start Parameters. the condition ystart < (length -d) is not fulfilled: length: 5, d:0, q:0, p:1 ystart = Math.max(maxOrderOfInitialARProcess + q, p) = 6, maxOrderOfInitialARProcess of HR = 6"

I have to say, although the error message is not wrong it is way to complex and confusing and will be replaced by a proper one in the future. Back to the concrete question:

1) Several variables mentioned in this error message are variables internally used in the ARIMA fitting procedure, so the User cannot do anything about them. The error message basically says there are not enough Examples to train the ARIMA model. The correct condition for the User would be:

length - d > max(q+6 , p) ; with length = length of the time series.

So the error is not about missing values, but about not enough Examples at all. After filtering the missing values (with the Filter Examples operator) there are less than 7 Examples left, which is not enough to fit any ARIMA model.

But even 7 Examples are also such a small number that I would not train an ARIMA model on them.

With a such small number of Examples I would suggest to use the fitted linear function as a forecasting/replacement method.

For the error message which says: "Exception: com.rapidminer.timeseriesanalysis.exception.ArgumentsEmptyException

2) Dividing the annual Sales by 12 is a proper way to do this. Keep in mind that you don't add any information, so the only assumption you can do about the monthly Sales are that they are all the same for the year. So dividing by 12 is a reasonable way to go.

Best regards,

Fabian

55Contributor IIThank you for the feedback. I have sent you the data privately. Actually I have filtered my data set to only 1 missing interval and i get the below error. So the confusion is, out of 24 hour intervals (2 years) i do have quite a few id's that do not have any missing intervals about 363 unique id's with 24 months of intervals,which is clearly more than 7 examples, so i'm kind of confused here

164RM ResearchThis seems to be indeed a bug. I have to look into this, what is going wrong there. Thanks for finding it.

Best regards,

Fabian

2,959Community Manager