**RapidMiner 9.7 is Now Available**

### Lots of amazing new improvements including true version control! Learn more about what's new here.

### CLICK HERE TO DOWNLOAD

# "Linear Regression, Trend Analysis and Prediction"

MacPhotoBiker
Member Posts:

**60**Contributor II
Hi,

I'd like to develop a model that predicts future sales quantities based on historic values.

Let's say for Article A I know the weekly sales quantities:

Week 1: 10 pcs.

Week 2: 20 pcs.

Week 3: 30 pcs.

Week 4: 40 pcs.

[...]

Week 20: 200 pcs.

I would like to calculate the trend line and predict the consumption for the next 4 weeks, which in this simple example would be:

Prediction week 21: 210 pcs.

[...]

Prediction week 24: 240 pcs.

In Excel, one would create a chart and insert a trend line, showing also the formula for the trendline. Based on the formula one can calculate the predicted values along the trend line.

I know how I could calculate it manually (as shown here:

http://www.wired.com/wiredscience/2011/01/linear-regression-by-hand/ in a "Generate Attribute" operator), but it looks cumbersome and I believe there has to be a simpler way.

I checked out the "Linear Regression" operator, but it does not seem to be what I'm looking for.

Simply put, I'm trying to calculate m and b in the famous formula "y = mx + b"

I'd be very thankful for any advice.

I'd like to develop a model that predicts future sales quantities based on historic values.

**Example**Let's say for Article A I know the weekly sales quantities:

Week 1: 10 pcs.

Week 2: 20 pcs.

Week 3: 30 pcs.

Week 4: 40 pcs.

[...]

Week 20: 200 pcs.

I would like to calculate the trend line and predict the consumption for the next 4 weeks, which in this simple example would be:

Prediction week 21: 210 pcs.

[...]

Prediction week 24: 240 pcs.

In Excel, one would create a chart and insert a trend line, showing also the formula for the trendline. Based on the formula one can calculate the predicted values along the trend line.

I know how I could calculate it manually (as shown here:

http://www.wired.com/wiredscience/2011/01/linear-regression-by-hand/ in a "Generate Attribute" operator), but it looks cumbersome and I believe there has to be a simpler way.

I checked out the "Linear Regression" operator, but it does not seem to be what I'm looking for.

Simply put, I'm trying to calculate m and b in the famous formula "y = mx + b"

I'd be very thankful for any advice.

Tagged:

0

## Answers

1,869UnicornBest regards,

Marius

60Contributor IIthank you very much for your reply. I did as you suggested, and indeed I got the formula. Is there a way to use the parameters for further processing? What I try to achieve is this:

I'd like to calculate the weekly sales for the last 52 weeks for each article, then calculate the linear regression. Then, I want to

a) predict sales for the next week (or 4 weeks)

b) find articles with strongest in- or decrease.

The generated table should look like this:

CustomerWeek 0Week 1Week 2Week nWeek 52Parameter mParameter bPredicted Week 53530Thanks again for your advice!

47Contributor IIDid you manage to solve this one?

Cheers,

60Contributor IIwell, I did not find an operator that would allow me to do this, I will now simply create an attribute myself, and write a formula manually. It's a bit cumbersome, but actually not too complicated. I've done this before in MS Access creating a VBA function, following this pattern:

http://www.google.ca/imgres?imgurl=&imgrefurl=http%3A%2F%2Fwww.docstoc.com%2Fdocs%2F46599202%2FSummary-formula-sheet-for-simple-linear-regression&h=0&w=0&sz=1&tbnid=9K_TXEI_eOUcKM&tbnh=255&tbnw=197&zoom=1&docid=bUSnrnOKeCjASM&ei=JdnNUZPXJMHXygGw-ICQBg&ved=0CAEQsCU

One of the problems I encountered so far though is this: I want to calculate the linear regression based on weekly sales. My invoicing information shows the invoicing date, and it's simple to calculate the corresponding calendar week.

However, for invoicing dates December 30 and 31, 2012, the result is week 1 in the following year 2013. It took me a while to understand that, but when one looks at a weekly calendar, this is how it's actually being shown. Just like the before mentioned dates: They do not belong to week 52, and there's no week 53, so they have to be part of week 1 of the following year.

However, it just feels wrong to me to say sales of 31st of december belongs to week 1. I didn't work it out yet, but probably what I will do is that I'll manually "force" all last days of december that "officially" belong to week 1 into the last week of december anyway. (Something like "If month = December and week = 1, then week = 52" and year = year-1)

If you are working on something similar, just let me know, maybe we can share some results - and some work

1Contributor II have a process laid out as follows:

Read Database (contains stock market data) -> Set Role (I choose my label for prediction) -> Windowing (This is the key) -> Neural Network -> Apply Model -> Both outputs of apply model are connected to the results ports of the process.

The windowing operator is a little tricky, however, the operator will allow the model to iterate through your dataset and sample a "window" of records. So I've setup the Windowing operator to look at 1 example, and increment by one example. This will effectively make it read the first example, then the second.. etc.. There is also a setting where you can specify the number of records in which you want to predict after the window.

I will provide the code of my process in case my explanation of the Windowing operator is not clear.

From the help:

Synopsis

Creates examples from a multivariate value series data set by windowing the input data.

Description

This operator transforms a given example set containing series data into a new example set containing single valued examples. For this purpose, windows with a specified window and step size are moved across the series and the attribute value lying horizon values after the window end is used as label which should be predicted. In contrast to the Series2WindowExamples operator, this operator can also handle multivariate series data. In order to specify the dimension which should be predicted, one must use the parameter "label_dimension" (counting starts at 0). If you want to predict all dimensions of your multivariate series you must setup several process definitions with different label dimensions, one for each dimension.

The series data must be given as ExampleSet. The parameter "series_representation" defines how the series data is represented by the ExampleSet:

encode_series_by_examples

the series index variable (e.g. time) is encoded by the examples, i.e. there is a set of attributes (one for each dimension of the multivariate series) and a set of examples. Each example encodes the value vector for a new time point, each attribute value represents another dimension of the multivariate series. * encode_series_by_attributes

the series index variable (e.g. time) is encoded by the attributes, i.e. there is a set of examples (one for each dimension of the multivariate series) and a set of attributes. The set of attribute values for all examples encodes the value vector for a new time point, each example represents another dimension of the multivariate series.

Please note that the encoding as examples is usually more efficient with respect to the memory usage.

Input

example set input: expects: ExampleSetMetaData: #examples: = 0; #attributes: 0

, expects: ExampleSet, expects: ExampleSet

Output

example set output:

original:

Parameters

series representation: This parameter defines how the series values will be represented. Range: encode_series_by_examples, encode_series_by_attributes; default: encode_series_by_examples

horizon: The prediction horizon, i.e. the distance between the last window value and the value to predict. Range: integer; 0-+?; default: 0

window size: The width of the used windows. Range: integer; 1-+?; default: 100

step size: The step size of the used windows, i.e. the distance between the first values Range: integer; 1-+?; default: 1

create single attributes: Indicates if the result example set should use single attributes instead of series attributes. Range: boolean; default: true

create label: If checked a label will be created Range: boolean; default: false

select label by dimension: If checked the label will be created using an index specifying the attribute Range: boolean; default: false

label attribute: The name of the attribute which should be used for creating the label values. Range: string

label dimension: The dimension which should be used for creating the label values (counting starts with 0). Range: integer; 0-+?

add incomplete windows: Create windows for all examples, which causes missings in the first windows. Range: boolean; default: false I hope this helps.

Please check out my blogs for more RapidMiner and data mining tutorials.

http://completebusinessanalytics.com/

http://refactorthis.net/

47Contributor II60Contributor IIfirst of all, please accept my apology for checking back so late! I somehow missed the "notify" function, and it also slipped a bit my mind to check back - sorry about that!

I do appreciate your information, and that you even put together a model.

I will check it out shortly, and provide some feedback here.

Again, sorry for my long absence!