Building a predictive decision tree while excluding historical attributes

pyearick Member Posts: 3 Contributor I
edited November 2018 in Help
All,

I am trying to use RapidMiner to build a predictive decision tree. Currently, I have a process that imports historical shipment data for a number of products along with some additional attributes. These other attributes make the product more or less attractive to customers (age, color, size...).

Before my import I categorize the historical, numerical shipment data into 5 buckets called ShipCats - from "very low shipments" (<1000) to "very high shipments" (>10000) so I can use a decision tree in RapidMiner. In addition to the ShipCats attribute, each experiment that I import has a FYDate attribute, which is a date time field along with the shipment results showing the shipments of a product in that year (example: 2010->1549|2011->1722|2012->1999...). The resulting decision tree from RapidMiner, I'm sure, is correct but includes that FYDate attribute.
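As a rough illustration of the bucketing step described above, the sketch below maps a yearly shipment count to one of five ShipCats labels. Only the two outer thresholds (below 1000 and above 10000) come from the post. The inner boundaries (2500 and 5000) and the three middle label names are assumptions made up for this example.

```python
def ship_cat(shipments):
    """Map a yearly shipment count to a ShipCats label.

    Outer thresholds (<1000, >10000) follow the post; the inner
    boundaries 2500 and 5000 are hypothetical placeholders.
    """
    if shipments < 1000:
        return "very low shipments"
    if shipments < 2500:       # assumed boundary
        return "low shipments"
    if shipments < 5000:       # assumed boundary
        return "medium shipments"
    if shipments <= 10000:
        return "high shipments"
    return "very high shipments"


# Yearly values taken from the example in the post.
for year, shipments in [(2010, 1549), (2011, 1722), (2012, 1999)]:
    print(year, ship_cat(shipments))
```

With these assumed boundaries, all three example years fall into the same bucket, so the yearly growth in the example would be invisible to the tree.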

I am looking to predict ShipCats for new products from a user entry of most of the other attributes that were used to create the decision tree but not the one they can't affect, FYDate. The FYDate, of course, would be the current year.

Do I need to model the historical information first and somehow feed that input into a decision tree operator that only includes variables that can reasonably be chosen?

Thanks very much for this software and your help!

Pat

Answers

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    If FYDate just contains the year, then I wouldn't include it in your modelling, as it wouldn't be very useful for your future predictions.
    If it contains the full shipment date, then maybe use the Date to Numerical operator to convert it into quarter by year; that way you can see if seasonality affects your tree.

    Otherwise, yes, remove it.
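The quarter conversion suggested above can be sketched in plain Python. This is only an analogue of the idea, not the Date to Numerical operator's implementation; the function names and the `year * 10 + quarter` encoding are assumptions chosen for this example.

```python
from datetime import date


def quarter_of_year(d):
    """Return the quarter (1 to 4) that a date falls in."""
    return (d.month - 1) // 3 + 1


def quarter_by_year(d):
    """Encode a date as year * 10 + quarter (assumed encoding)."""
    return d.year * 10 + quarter_of_year(d)


shipment_date = date(2012, 5, 17)        # hypothetical shipment date
print(quarter_of_year(shipment_date))    # 2
print(quarter_by_year(shipment_date))    # 20122
```

For spotting seasonality, the plain quarter (1 to 4) is probably the more useful of the two, because it repeats every year and so still applies to future dates.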
  • pyearick Member Posts: 3 Contributor I
    So the historical information is by year (it could also be by quarter or by month), and I feel that it is important to include in the model. What we are observing is the increase or decrease of shipments over time. Without it, we don't know whether a product with particular attributes is more successful than one with other attributes. My goal is to build a decision tree that helps determine which products to build next, based on our past results for products with certain attributes and not others.

    My question is: how do I approach building a decision tree that incorporates that history without including FYDate in the resulting decision tree?

    Thank you for your quick reply!

    Pat
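One possible way to carry the history into the model without FYDate itself is to summarize each product's yearly series into a single trend attribute before training. The sketch below is only an illustration of that idea, not something proposed in this thread: the function names and the choice of "mean yearly change" as the summary are assumptions, and the input uses only the three yearly values quoted in the original question.

```python
def parse_history(text):
    """Parse a string like '2010->1549|2011->1722' into {year: shipments}."""
    history = {}
    for item in text.split("|"):
        year, shipments = item.split("->")
        history[int(year)] = int(shipments)
    return history


def mean_yearly_change(history):
    """Average change in shipments per year between first and last year."""
    years = sorted(history)
    if len(years) < 2:
        return 0.0  # no trend can be computed from a single year
    first, last = years[0], years[-1]
    return (history[last] - history[first]) / (last - first)


history = parse_history("2010->1549|2011->1722|2012->1999")
print(mean_yearly_change(history))  # 225.0
```

A derived attribute like this has one value per product, so it could sit alongside age, color and size as a regular input. Whether it is available for a brand-new product at prediction time is a separate question.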
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    I would recommend setting the role of FYDate to a special role. Simply use the Set Role operator for this and type anything into the role box.

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
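To make the effect of a special role concrete, here is a small Python analogue. As far as I understand, RapidMiner learners only use regular attributes, so an attribute with a special role stays in the example set but is not seen by the decision tree. The sketch mimics that split; the `roles` mapping, the role names, and the row values are hypothetical and only the attribute names come from the thread.

```python
def regular_attributes(row, roles):
    """Return only the attributes that have no special role assigned."""
    return {name: value for name, value in row.items() if name not in roles}


# Hypothetical example row; attribute names follow the thread.
row = {
    "FYDate": 2012,
    "age": 3,
    "color": "red",
    "size": "large",
    "ShipCats": "low shipments",
}

# Any attribute listed here is treated as special (role names are made up).
roles = {"FYDate": "ignored", "ShipCats": "label"}

features = regular_attributes(row, roles)
print(features)          # what a learner would be allowed to split on
print(row["FYDate"])     # the special attribute is still present in the data
```

In this analogue FYDate is kept for reference but contributes nothing to training, so any information it carried has to reach the model through some other attribute.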
  • pyearick Member Posts: 3 Contributor I
    Martin,

    Thank you, that took FYDate out of the decision tree result. Does marking FYDate with a special role adversely affect its use in the decision tree processing? In other words, is the historical sales data still considered?