turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Community Home
- :
- Product Help
- :
- RapidMiner Studio Forum
- :
- Linear Regression, Trend Analysis and Prediction

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

06-11-2013 07:12 PM

06-11-2013 07:12 PM

Hi,

I'd like to develop a model that predicts future sales quantities based on historic values.

**Example**

Let's say for Article A I know the weekly sales quantities:

Week 1: 10 pcs.

Week 2: 20 pcs.

Week 3: 30 pcs.

Week 4: 40 pcs.

[...]

Week 20: 200 pcs.

I would like to calculate the trend line and predict the consumption for the next 4 weeks, which in this simple example would be:

Prediction week 21: 210 pcs.

[...]

Prediction week 24: 240 pcs.

In Excel, one would create a chart and insert a trend line, showing also the formula for the trendline. Based on the formula one can calculate the predicted values along the trend line.

I know how I could calculate it manually (as shown here:

http://www.wired.com/wiredscience/2011/01/linear-regression-by-hand/ in a "Generate Attribute" operator), but it looks cumbersome and I believe there has to be a simpler way.

I checked out the "Linear Regression" operator, but it does not seem to be what I'm looking for.

Simply put, I'm trying to calculate m and b in the famous formula "y = mx + b"

I'd be very thankful for any advice.

I'd like to develop a model that predicts future sales quantities based on historic values.

Let's say for Article A I know the weekly sales quantities:

Week 1: 10 pcs.

Week 2: 20 pcs.

Week 3: 30 pcs.

Week 4: 40 pcs.

[...]

Week 20: 200 pcs.

I would like to calculate the trend line and predict the consumption for the next 4 weeks, which in this simple example would be:

Prediction week 21: 210 pcs.

[...]

Prediction week 24: 240 pcs.

In Excel, one would create a chart and insert a trend line, showing also the formula for the trendline. Based on the formula one can calculate the predicted values along the trend line.

I know how I could calculate it manually (as shown here:

http://www.wired.com/wiredscience/2011/01/linear-regression-by-hand/ in a "Generate Attribute" operator), but it looks cumbersome and I believe there has to be a simpler way.

I checked out the "Linear Regression" operator, but it does not seem to be what I'm looking for.

Simply put, I'm trying to calculate m and b in the famous formula "y = mx + b"

I'd be very thankful for any advice.

7 REPLIES

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

06-12-2013 06:05 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

06-13-2013 01:16 AM

06-13-2013 01:16 AM

Hi Marius,

thank you very much for your reply. I did as you suggested, and indeed I got the formula. Is there a way to use the parameters for further processing? What I try to achieve is this:

I'd like to calculate the weekly sales for the last 52 weeks for each article, then calculate the linear regression. Then, I want to

a) predict sales for the next week (or 4 weeks)

b) find articles with strongest in- or decrease.

The generated table should look like this:

I'd like to generate this table for all articles. Such table would help to both predict future sales, and to group articles by development (e. g. "increasing" and "decreasing".

Thanks again for your advice!

thank you very much for your reply. I did as you suggested, and indeed I got the formula. Is there a way to use the parameters for further processing? What I try to achieve is this:

I'd like to calculate the weekly sales for the last 52 weeks for each article, then calculate the linear regression. Then, I want to

a) predict sales for the next week (or 4 weeks)

b) find articles with strongest in- or decrease.

The generated table should look like this:

Customer | Week 0 | Week 1 | Week 2 | Week n | Week 52 | Parameter m | Parameter b | Predicted Week 53 |

Customer A | 0 | 100 | 200 | nnn | 520 | 10 | 0 | 530 |

I'd like to generate this table for all articles. Such table would help to both predict future sales, and to group articles by development (e. g. "increasing" and "decreasing".

Thanks again for your advice!

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

06-27-2013 08:19 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

06-28-2013 03:06 PM

06-28-2013 03:06 PM

Hi rowan.g,

well, I did not find an operator that would allow me to do this, I will now simply create an attribute myself, and write a formula manually. It's a bit cumbersome, but actually not too complicated. I've done this before in MS Access creating a VBA function, following this pattern:

http://www.google.ca/imgres?imgurl=&imgrefurl=http%3A%2F%2Fwww.docstoc.com%2Fdocs%2F46599202%2FSumma...

One of the problems I encountered so far though is this: I want to calculate the linear regression based on weekly sales. My invoicing information shows the invoicing date, and it's simple to calculate the corresponding calendar week.

However, for invoicing dates December 30 and 31, 2012, the result is week 1 in the following year 2013. It took me a while to understand that, but when one looks at a weekly calendar, this is how it's actually being shown. Just like the before mentioned dates: They do not belong to week 52, and there's no week 53, so they have to be part of week 1 of the following year.

However, it just feels wrong to me to say sales of 31st of december belongs to week 1. I didn't work it out yet, but probably what I will do is that I'll manually "force" all last days of december that "officially" belong to week 1 into the last week of december anyway. (Something like "If month = December and week = 1, then week = 52" and year = year-1)

If you are working on something similar, just let me know, maybe we can share some results - and some work

well, I did not find an operator that would allow me to do this, I will now simply create an attribute myself, and write a formula manually. It's a bit cumbersome, but actually not too complicated. I've done this before in MS Access creating a VBA function, following this pattern:

http://www.google.ca/imgres?imgurl=&imgrefurl=http%3A%2F%2Fwww.docstoc.com%2Fdocs%2F46599202%2FSumma...

One of the problems I encountered so far though is this: I want to calculate the linear regression based on weekly sales. My invoicing information shows the invoicing date, and it's simple to calculate the corresponding calendar week.

However, for invoicing dates December 30 and 31, 2012, the result is week 1 in the following year 2013. It took me a while to understand that, but when one looks at a weekly calendar, this is how it's actually being shown. Just like the before mentioned dates: They do not belong to week 52, and there's no week 53, so they have to be part of week 1 of the following year.

However, it just feels wrong to me to say sales of 31st of december belongs to week 1. I didn't work it out yet, but probably what I will do is that I'll manually "force" all last days of december that "officially" belong to week 1 into the last week of december anyway. (Something like "If month = December and week = 1, then week = 52" and year = year-1)

If you are working on something similar, just let me know, maybe we can share some results - and some work

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

07-15-2013 10:06 PM

07-15-2013 10:06 PM

The operator that you seek is called the "Windowing" operator.

I have a process laid out as follows:

Read Database (contains stock market data) -> Set Role (I choose my label for prediction) -> Windowing (This is the key) -> Neural Network -> Apply Model -> Both outputs of apply model are connected to the results ports of the process.

The windowing operator is a little tricky, however, the operator will allow the model to iterate through your dataset and sample a "window" of records. So I've setup the Windowing operator to look at 1 example, and increment by one example. This will effectively make it read the first example, then the second.. etc.. There is also a setting where you can specify the number of records in which you want to predict after the window.

I will provide the code of my process in case my explanation of the Windowing operator is not clear.

From the help:

Synopsis

Creates examples from a multivariate value series data set by windowing the input data.

Description

This operator transforms a given example set containing series data into a new example set containing single valued examples. For this purpose, windows with a specified window and step size are moved across the series and the attribute value lying horizon values after the window end is used as label which should be predicted. In contrast to the Series2WindowExamples operator, this operator can also handle multivariate series data. In order to specify the dimension which should be predicted, one must use the parameter "label_dimension" (counting starts at 0). If you want to predict all dimensions of your multivariate series you must setup several process definitions with different label dimensions, one for each dimension.

The series data must be given as ExampleSet. The parameter "series_representation" defines how the series data is represented by the ExampleSet:

encode_series_by_examples

the series index variable (e.g. time) is encoded by the examples, i.e. there is a set of attributes (one for each dimension of the multivariate series) and a set of examples. Each example encodes the value vector for a new time point, each attribute value represents another dimension of the multivariate series. * encode_series_by_attributes

the series index variable (e.g. time) is encoded by the attributes, i.e. there is a set of examples (one for each dimension of the multivariate series) and a set of attributes. The set of attribute values for all examples encodes the value vector for a new time point, each example represents another dimension of the multivariate series.

Please note that the encoding as examples is usually more efficient with respect to the memory usage.

Input

example set input: expects: ExampleSetMetaData: #examples: = 0; #attributes: 0

, expects: ExampleSet, expects: ExampleSet

Output

example set output:

original:

Parameters

series representation: This parameter defines how the series values will be represented. Range: encode_series_by_examples, encode_series_by_attributes; default: encode_series_by_examples

horizon: The prediction horizon, i.e. the distance between the last window value and the value to predict. Range: integer; 0-+?; default: 0

window size: The width of the used windows. Range: integer; 1-+?; default: 100

step size: The step size of the used windows, i.e. the distance between the first values Range: integer; 1-+?; default: 1

create single attributes: Indicates if the result example set should use single attributes instead of series attributes. Range: boolean; default: true

create label: If checked a label will be created Range: boolean; default: false

select label by dimension: If checked the label will be created using an index specifying the attribute Range: boolean; default: false

label attribute: The name of the attribute which should be used for creating the label values. Range: string

label dimension: The dimension which should be used for creating the label values (counting starts with 0). Range: integer; 0-+?

add incomplete windows: Create windows for all examples, which causes missings in the first windows. Range: boolean; default: false

I hope this helps.

Please check out my blogs for more RapidMiner and data mining tutorials.

http://completebusinessanalytics.com/

http://refactorthis.net/

I have a process laid out as follows:

Read Database (contains stock market data) -> Set Role (I choose my label for prediction) -> Windowing (This is the key) -> Neural Network -> Apply Model -> Both outputs of apply model are connected to the results ports of the process.

The windowing operator is a little tricky, however, the operator will allow the model to iterate through your dataset and sample a "window" of records. So I've setup the Windowing operator to look at 1 example, and increment by one example. This will effectively make it read the first example, then the second.. etc.. There is also a setting where you can specify the number of records in which you want to predict after the window.

I will provide the code of my process in case my explanation of the Windowing operator is not clear.

From the help:

Synopsis

Creates examples from a multivariate value series data set by windowing the input data.

Description

This operator transforms a given example set containing series data into a new example set containing single valued examples. For this purpose, windows with a specified window and step size are moved across the series and the attribute value lying horizon values after the window end is used as label which should be predicted. In contrast to the Series2WindowExamples operator, this operator can also handle multivariate series data. In order to specify the dimension which should be predicted, one must use the parameter "label_dimension" (counting starts at 0). If you want to predict all dimensions of your multivariate series you must setup several process definitions with different label dimensions, one for each dimension.

The series data must be given as ExampleSet. The parameter "series_representation" defines how the series data is represented by the ExampleSet:

encode_series_by_examples

the series index variable (e.g. time) is encoded by the examples, i.e. there is a set of attributes (one for each dimension of the multivariate series) and a set of examples. Each example encodes the value vector for a new time point, each attribute value represents another dimension of the multivariate series. * encode_series_by_attributes

the series index variable (e.g. time) is encoded by the attributes, i.e. there is a set of examples (one for each dimension of the multivariate series) and a set of attributes. The set of attribute values for all examples encodes the value vector for a new time point, each example represents another dimension of the multivariate series.

Please note that the encoding as examples is usually more efficient with respect to the memory usage.

Input

example set input: expects: ExampleSetMetaData: #examples: = 0; #attributes: 0

, expects: ExampleSet, expects: ExampleSet

Output

example set output:

original:

Parameters

series representation: This parameter defines how the series values will be represented. Range: encode_series_by_examples, encode_series_by_attributes; default: encode_series_by_examples

horizon: The prediction horizon, i.e. the distance between the last window value and the value to predict. Range: integer; 0-+?; default: 0

window size: The width of the used windows. Range: integer; 1-+?; default: 100

step size: The step size of the used windows, i.e. the distance between the first values Range: integer; 1-+?; default: 1

create single attributes: Indicates if the result example set should use single attributes instead of series attributes. Range: boolean; default: true

create label: If checked a label will be created Range: boolean; default: false

select label by dimension: If checked the label will be created using an index specifying the attribute Range: boolean; default: false

label attribute: The name of the attribute which should be used for creating the label values. Range: string

label dimension: The dimension which should be used for creating the label values (counting starts with 0). Range: integer; 0-+?

add incomplete windows: Create windows for all examples, which causes missings in the first windows. Range: boolean; default: false

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

<process version="5.3.008">

<context>

<input/>

<output/>

<macros/>

</context>

<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">

<parameter key="parallelize_main_process" value="true"/>

<process expanded="true">

<operator activated="true" class="retrieve" compatibility="5.3.008" expanded="true" height="60" name="Retrieve MSFT 10 year stock data" width="90" x="45" y="75">

<parameter key="repository_entry" value="//Local Repository/data/Stock Related/Microsoft Stock Data 10 years/MSFT 10 year stock data"/>

</operator>

<operator activated="true" class="set_role" compatibility="5.3.008" expanded="true" height="76" name="Set Role" width="90" x="179" y="75">

<parameter key="attribute_name" value="date"/>

<parameter key="target_role" value="id"/>

<list key="set_additional_roles"/>

</operator>

<operator activated="true" class="series:windowing" compatibility="5.3.000" expanded="true" height="76" name="Windowing" width="90" x="313" y="75">

<parameter key="window_size" value="1"/>

<parameter key="create_label" value="true"/>

<parameter key="label_attribute" value="close"/>

<parameter key="horizon" value="5"/>

</operator>

<operator activated="true" class="neural_net" compatibility="5.3.008" expanded="true" height="76" name="Neural Net" width="90" x="447" y="30">

<list key="hidden_layers"/>

</operator>

<operator activated="true" class="apply_model" compatibility="5.3.008" expanded="true" height="76" name="Apply Model" width="90" x="648" y="30">

<list key="application_parameters"/>

</operator>

<connect from_op="Retrieve MSFT 10 year stock data" from_port="output" to_op="Set Role" to_port="example set input"/>

<connect from_op="Set Role" from_port="example set output" to_op="Windowing" to_port="example set input"/>

<connect from_op="Windowing" from_port="example set output" to_op="Neural Net" to_port="training set"/>

<connect from_op="Neural Net" from_port="model" to_op="Apply Model" to_port="model"/>

<connect from_op="Neural Net" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>

<connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>

<connect from_op="Apply Model" from_port="model" to_port="result 2"/>

<portSpacing port="source_input 1" spacing="0"/>

<portSpacing port="sink_result 1" spacing="0"/>

<portSpacing port="sink_result 2" spacing="0"/>

<portSpacing port="sink_result 3" spacing="0"/>

</process>

</operator>

</process>

I hope this helps.

Please check out my blogs for more RapidMiner and data mining tutorials.

http://completebusinessanalytics.com/

http://refactorthis.net/

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

07-16-2013 06:37 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

07-31-2013 04:10 PM

07-31-2013 04:10 PM

Hi budbjames,

first of all, please accept my apology for checking back so late! I somehow missed the "notify" function, and it also slipped a bit my mind to check back - sorry about that!

I do appreciate your information, and that you even put together a model.

I will check it out shortly, and provide some feedback here.

Again, sorry for my long absence!

first of all, please accept my apology for checking back so late! I somehow missed the "notify" function, and it also slipped a bit my mind to check back - sorry about that!

I do appreciate your information, and that you even put together a model.

I will check it out shortly, and provide some feedback here.

Again, sorry for my long absence!