Workflow to predict complex dataset - Best Practices

wirtcal Member Posts: 16 Maven
edited November 2018 in Help

Hello community,

 

What are the best practices for exploring a complex, unfamiliar dataset and accurately predicting a numeric value? By "complex" I mean that the dataset contains more than 100 columns, including integer attributes, real-valued attributes, and at least 10 polynominal columns.

 

>>> I have created a repository and loaded the trainning_data and test_data, setting the correct type for each column (integer, real, polynominal, and label).

>>> I am using the Sample operator to reduce the amount of data to process and save time while modeling. What other techniques can make me more productive when dealing with large datasets that take a long time to process?

>>> Then I started trying out the learners and realized that I don't know which one is most applicable. The choice is especially difficult because of the polynominal attributes. When I tried a polynominal-to-binominal conversion, the process ran out of memory.

>>> Knowing that converting the polynominal attributes to binominal runs out of memory, I have split the data (using Select Attributes), feeding part of it to learners that work with polynominal attributes and the rest to a different learner, which is definitely not the correct way!

 

My *dream* plan is:

--> Load database

------> Set variables type

---------->Run some kind of Correlation Matrix (but there are also polynominal fields) and weighting

---------------> Select the attributes most relevant and important for learning

------------------> Use the Sample operator to speed up modeling

---------------------> Include a Validation operator

------------------------->Use a Performance operator to tune parameters.

----------------------------->Predict
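
The sampling and validation steps of the plan above can be sketched outside RapidMiner as follows. This is a minimal illustration in plain Python, not the behavior of the Sample or Validation operators; the helper names `sample_rows` and `holdout_split`, the 10% sample fraction, the 70/30 ratio, and the fixed seed are all assumptions made for the example.

```python
import random

def sample_rows(rows, fraction, seed=42):
    """Draw a random subset of the rows without replacement."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)

def holdout_split(rows, train_ratio=0.7, seed=42):
    """Shuffle the rows and split them into a training and a test part."""
    rng = random.Random(seed)
    shuffled = rows[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data: 1000 row identifiers standing in for real examples.
rows = list(range(1000))
subset = sample_rows(rows, fraction=0.1)
train, test = holdout_split(subset)
```

Sampling first and validating on a held-out part of the sample keeps each modeling iteration fast, while the test part stays unseen by the learner.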

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Your dream plan looks good to me. 

     

    Have a look at the Weight by operators. Weight by Gini Index and Weight by Information Gain in particular might be helpful for your polynominal attributes.

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • wirtcal Member Posts: 16 Maven

    Thanks Martin!

     

    Good to know that my plan is ok!

     

    I have checked out the Weight by operators that you suggested, but neither can handle a numeric label. In my tests, only Weight by Relief seemed to work for weighting numerical and polynominal attributes against a numeric label.

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    You can also use Correlation for the numerical attributes and Gini Index for the polynominal ones. Use Select Attributes with the value_type option to split the data into numerical and polynominal attributes.
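
The idea of splitting attributes by value type and weighting the numerical ones by their correlation with the label can be sketched in plain Python. This is only an illustration of the concept, not the implementation behind Select Attributes or the Weight by operators; the function names and the toy data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def split_by_type(columns):
    """Separate columns into numeric and nominal ones by inspecting values."""
    numeric, nominal = {}, {}
    for name, values in columns.items():
        if all(isinstance(v, (int, float)) for v in values):
            numeric[name] = values
        else:
            nominal[name] = values
    return numeric, nominal

def correlation_weights(numeric, label):
    """Weight each numeric column by |correlation| with the numeric label."""
    return {name: abs(pearson(values, label)) for name, values in numeric.items()}

# Hypothetical toy data, not taken from the thread.
columns = {
    "size": [1.0, 2.0, 3.0, 4.0],
    "noise": [5.0, 1.0, 4.0, 2.0],
    "city": ["a", "b", "a", "c"],
}
label = [10.0, 20.0, 30.0, 40.0]

numeric, nominal = split_by_type(columns)
weights = correlation_weights(numeric, label)
ranked = sorted(weights, key=weights.get, reverse=True)  # strongest first
```

Attributes at the top of `ranked` would be the candidates to keep; the nominal columns collected in `nominal` need a different weighting scheme.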

     

    ~Martin

  • wirtcal Member Posts: 16 Maven

    I have followed the workflow I planned plus your suggestions, yet my predictions are still far from "acceptable" (judging by R²).
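
For reference, a minimal sketch of how R² is commonly computed, assuming that "r2" here means the coefficient of determination (the thread does not say which definition the performance criterion uses). The function name `r_squared` and the toy numbers are illustrative only.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical values, not results from the thread.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.5]
score = r_squared(actual, predicted)

# Always predicting the mean of the label gives a score of 0,
# which is the baseline any useful model has to beat.
baseline = r_squared(actual, [6.0] * len(actual))
```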

     

    The last thing I tried was:

    0) Sample the data

     

    1) Split the data into nominal and numeric attributes

    - for numerical --> Weight by Correlation -> Filter Numerical Attributes by Weight

    - for nominal --> Select 2 (of 8*) attributes --> Convert to binominal -> Convert to numerical -> Weight by Relief -> Filter Nominal Attributes by Weight

     

    2) Join the "most relevant" attributes in a new table

    - I manually tried different setups to define the "most relevant" attributes based on performance tests; I also tried different weighting operators

     

    3) Connect this new table to the Forward Selection operator

    - Inside it, I split the data into 70% for training and 30% for performance testing

     

    4) Change parameters and test different regression operators.

     

    5) Get bad predictions =( 

      

    * I selected the ones with fewer than 200 distinct values. There are 6 other polynominal attributes that I don't know how to use to predict a numeric label. They have hundreds of distinct values, and converting them to binominal demands memory and processing power that I don't have =/

     

    How could I take advantage of these 6 extra nominal attributes to predict a number, given the memory limitation? What improvements or changes should I make to the process? Should I start from scratch (again)?
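
To make the memory problem concrete: a full conversion to binominal creates one column per distinct value, so attributes with hundreds of values multiply the table width. One commonly used alternative, which nobody in this thread proposes and which is named here only as a possible option, is mean (target) encoding: each nominal value is replaced by a smoothed average of the label, giving one numeric column per attribute. The sketch below is plain Python with hypothetical names (`dummy_column_count`, `mean_encode`) and toy data; the smoothing scheme is an assumption.

```python
from collections import defaultdict

def dummy_column_count(columns):
    """Number of binominal columns a full dummy coding would create."""
    return sum(len(set(values)) for values in columns.values())

def mean_encode(values, label, smoothing=10.0):
    """Replace each nominal value by a smoothed mean of the label.

    Rare values are pulled toward the global mean to limit overfitting.
    """
    global_mean = sum(label) / len(label)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, y in zip(values, label):
        sums[value] += y
        counts[value] += 1
    encoding = {
        value: (sums[value] + smoothing * global_mean) / (counts[value] + smoothing)
        for value in counts
    }
    return [encoding[value] for value in values], encoding

# Hypothetical toy data, not taken from the thread.
city = ["a", "a", "b", "b", "c"]
label = [10.0, 20.0, 30.0, 50.0, 100.0]

n_dummies = dummy_column_count({"city": city})
encoded, mapping = mean_encode(city, label, smoothing=1.0)
```

The mapping should be learned on the training part only and then applied to the test part; computing it on all rows would leak label information into the validation.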

     

    Thanks

     

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

     

    Have you tried Gradient Boosted Trees as a learner? They are convenient because you do not need to apply Nominal to Numerical before doing regression.
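
Why tree-based boosting can work on nominal attributes directly can be illustrated with a toy example. The sketch below is not the Gradient Boosted Trees operator; it is a deliberately small pure-Python version that boosts one-split trees (stumps) with squared loss, and every name in it (`fit_stump`, `fit_boosting`, `predict`), the learning rate, and the data are assumptions made for the illustration.

```python
def goes_left(row, feature, value):
    """Nominal values split on equality, numeric values on a threshold."""
    if isinstance(value, str):
        return row[feature] == value
    return row[feature] <= value

def fit_stump(rows, residuals):
    """Find the single split that best fits the current residuals."""
    best = None
    for feature in range(len(rows[0])):
        for value in set(row[feature] for row in rows):
            mask = [goes_left(row, feature, value) for row in rows]
            left = [r for r, m in zip(residuals, mask) if m]
            right = [r for r, m in zip(residuals, mask) if not m]
            if not left or not right:
                continue  # a split must separate the data
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, feature, value, left_mean, right_mean)
    return None if best is None else best[1:]

def stump_predict(stump, row):
    feature, value, left_mean, right_mean = stump
    return left_mean if goes_left(row, feature, value) else right_mean

def fit_boosting(rows, label, n_rounds=50, learning_rate=0.3):
    """Start from the label mean and repeatedly fit stumps to the residuals."""
    base = sum(label) / len(label)
    predictions = [base] * len(rows)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(label, predictions)]
        stump = fit_stump(rows, residuals)
        if stump is None:
            break
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, row)
                       for p, row in zip(predictions, rows)]
    return base, stumps, learning_rate

def predict(model, row):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)

# Hypothetical toy data: one nominal and one numeric attribute.
rows = [("a", 1.0), ("a", 2.0), ("b", 1.0), ("b", 2.0), ("c", 1.0), ("c", 2.0)]
label = [10.0, 12.0, 20.0, 22.0, 30.0, 32.0]
model = fit_boosting(rows, label)
fitted = [predict(model, row) for row in rows]
```

The nominal attribute is never expanded into dummy columns: each split simply asks whether a row has a given category, so memory use does not grow with the number of distinct values in the way a binominal conversion does.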

     

    ~Martin

  • wirtcal Member Posts: 16 Maven

    @mschmitz wrote:

     

    have you tried the gradient boosted trees as learners? (...)

    Not really.

     

    Until now I have been modeling with RapidMiner 5.4 (my favorite) and 6.x (which often crashes on OS X), which have been installed on my computer for years.

     

    I am downloading the latest version of RapidMiner Studio right now to check it out. It looks like Gradient Boosted Trees were released in 7.x, right? I hope they (or another new learner) help me get "acceptable" predictions.

     

    Thank you Martin!

     

    Rafael

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder

    Hi,

     

    Yes, Gradient Boosted Trees were added in RapidMiner 7.2, along with some other nice new learners (including Deep Learning, a new Logistic Regression, and Generalized Linear Models). They all delivered very good results in the projects we have used them for.

     

    Best,

    Ingo
