Dependency Analysis Exercise for Newbie

66SS396 Member Posts: 2 Contributor I
I am just getting started with data mining and RapidMiner. I have been through the tutorials, done some web reading, played with the tool, and am ready for an exercise.

So I found a web article where a guy did a dependency analysis on some specific stocks. He picked about 7 stocks he felt should have some dependency on each other. For this experiment he used only the open price, to determine whether the opening price of a given stock depended on the opening price of one or more of the other stocks. He said that he used M5P with Regression Tree and 10-fold cross validation. This seemed like an exercise that would be of interest to me, and he provided his results and the time period he used for the data.
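For readers unfamiliar with 10-fold cross validation, here is a minimal Python sketch of the idea, outside RapidMiner. It is not the article's method: a one-variable least-squares line stands in for M5P, the function names are hypothetical, and the prices are made-up numbers. The rows are split into 10 folds, and each fold is predicted by a model fitted on the other 9.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (stand-in for M5P)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def k_fold_rmse(xs, ys, k=10, seed=0):
    """Root mean squared error over k held-out folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)        # random assignment to folds
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    squared_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        squared_errors.extend(
            (ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out
        )
    return (sum(squared_errors) / len(squared_errors)) ** 0.5

# Made-up open prices: stock A follows stock B with a small wobble.
stock_b_open = [20.0 + 0.1 * i for i in range(60)]
stock_a_open = [0.5 * b + 1.0 + 0.05 * ((i % 3) - 1)
                for i, b in enumerate(stock_b_open)]
print(round(k_fold_rmse(stock_b_open, stock_a_open), 4))
```

A low error on the held-out folds suggests the label really can be predicted from the other column. Note that the label here is a number, which is what a regression learner expects.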

The problem is that I don't know how his input data was structured. It could be:

Stock Name, Date, Open Price

or

Date, stock A open price, stock B open price, stock C open price, etc.

In both cases I get the following error:

Error in: W-M5P (W-M5P) W-M5P caused an error: weka.classifiers.functions.LinearRegression: Cannot handle multi-valued nominal class! An external program or library has reported an error. Please see the documentation of this program or library for further information.

So it looks like I clearly didn't understand the format of his input data.

I then decided to try a decision tree on this data, and with the second data format it shows me a nice decision tree of some of the stocks that represents the data I input.

Any help understanding what I am doing wrong with the M5P exercise would be great, and other suggestions for how to accomplish the same exercise would also be appreciated.

Thanks in advance for your help.

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    obviously something has gone wrong with your data input. Stock prediction nearly always means predicting the price for the next day, week, or year. The price is a numerical value, and hence we have a regression task at hand. That's why the guy suggested using this algorithm: it's a regression method. Other algorithms like DecisionTrees are designed for classification tasks, where only a finite and discrete set of classes exists, represented by nominal values such as "red", "green", "blue".

    If this error message appears, then you tried to use a regression method on a data set with a nominal label. That's why DecisionTree worked fine.

    You should check whether your data is imported correctly (numerical values are stored in numeric attributes) and whether you set the label correctly (predicting the stock name is not very interesting).
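A minimal Python sketch of that import check, done outside RapidMiner and assuming the prices were exported as rows of strings. The function name, column names, and sample values are made up for illustration. It lists every column containing a value that cannot be parsed as a number, which is the kind of column a tool is likely to treat as nominal.

```python
def non_numeric_columns(rows):
    """Return names of columns holding at least one value that is not a number."""
    bad = []
    for name in rows[0]:
        for row in rows:
            try:
                float(row[name])
            except ValueError:
                bad.append(name)   # one bad value is enough to flag the column
                break
    return bad

# Hypothetical sample: the "n/a" entry makes stock_b_open non-numeric.
rows = [
    {"stock_a_open": "10.5", "stock_b_open": "20.1"},
    {"stock_a_open": "10.7", "stock_b_open": "n/a"},
]
print(non_numeric_columns(rows))
```

If the column chosen as the label shows up in this list, a regression learner will reject it in much the same way as the error above.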

    Greetings,
      Sebastian
  • 66SS396 Member Posts: 2 Contributor I
    Sebastian,

    Thanks so much for your response. That makes sense to me. So if I am trying to predict the opening price of a specific stock, then I would use the format:

    Date, stock A open price, stock B open price, stock C open price, etc., and set the label to stock A open price. Is this correct? So the thinking is that the regression method would compare the other stocks' open prices by date to predict the opening price of stock A.

    The article led me to believe he used the data as it was provided from the web, which is typically in the format:

    stock ticker, date, open price. But it isn't obvious to me how that format would produce the same result of predicting the opening price of a specific stock. Any thoughts on what the results from this format would tell me?

    Thanks for your patience. I realize this is probably very basic, but I am really just starting to get into data mining. My background has been primarily in Business Intelligence, which I am finding to be very different from this process.

    Again, thanks so much for your assistance.
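The two layouts hold the same information, so one can be converted into the other. In the long layout each row carries only one stock's price, so a learner has nothing on the same row to relate it to. A minimal Python sketch of the conversion follows, assuming records arrive as (ticker, date, open price) tuples; the function name, tickers, and prices are invented for illustration.

```python
from collections import defaultdict

def long_to_wide(records):
    """Turn (ticker, date, open_price) records into one row per date."""
    by_date = defaultdict(dict)
    for ticker, date, open_price in records:
        by_date[date][ticker] = float(open_price)
    tickers = sorted({ticker for ticker, _, _ in records})
    # Keep only dates on which every ticker has a price.
    return [
        {"date": date, **{t: prices[t] for t in tickers}}
        for date, prices in sorted(by_date.items())
        if all(t in prices for t in tickers)
    ]

# Made-up sample in the "stock ticker, date, open price" layout.
records = [
    ("A", "2009-01-02", "10.0"),
    ("B", "2009-01-02", "20.0"),
    ("A", "2009-01-05", "11.0"),
    ("B", "2009-01-05", "19.5"),
]
wide = long_to_wide(records)
for row in wide:
    print(row)
```

Each output row now has one numeric column per stock, so any one of them can be set as the label and the rest used as regular attributes.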

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    that's basically correct if you really want to predict the open price of stock A from the open prices of the other stocks on the SAME day. This way you will find dependencies, but you will not look into the future. To predict a future price instead, you would have to bring your data into a format like this:
    ID   | Label                     | regular attribute         | regular attribute
    Date | Stock Price A of next day | Stock Price B of this day | Stock Price C of this day
    And so on. The Preprocessing/Series branch of the operator tree in RapidMiner contains operators that transform the data format you have into this one.
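The same reshaping can be sketched in a few lines of Python, outside RapidMiner. This is only an illustration of the target format, not of the RapidMiner operators: the function name is hypothetical, the rows are assumed to be sorted by date with one row per trading day, and the prices are made up.

```python
def add_next_day_label(wide_rows, label_ticker):
    """Pair each day's other prices with label_ticker's open price on the next row."""
    shifted = []
    # zip stops one row early, so the last day (no next day known) is dropped.
    for today, tomorrow in zip(wide_rows, wide_rows[1:]):
        features = {k: v for k, v in today.items()
                    if k not in ("date", label_ticker)}
        shifted.append({"date": today["date"],
                        "label": tomorrow[label_ticker],
                        **features})
    return shifted

# Made-up prices, already one row per date and sorted by date.
wide_rows = [
    {"date": "2009-01-02", "A": 10.0, "B": 20.0, "C": 30.0},
    {"date": "2009-01-05", "A": 11.0, "B": 19.5, "C": 31.0},
    {"date": "2009-01-06", "A": 10.8, "B": 19.9, "C": 30.5},
]
for row in add_next_day_label(wide_rows, "A"):
    print(row)
```

Following the table above, the label stock's own price on the current day is left out of the regular attributes; whether to include it is a modelling choice.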

    Greetings,
      Sebastian