Hi, I tried to implement a test case in RapidMiner.

nn_here Member Posts: 31 Contributor I
Hi,

I tried to implement a test case in RapidMiner.
1. I loaded the training data. Since this is a regression problem, I tried linear regression.
2. After preprocessing the data and removing unnecessary columns, I applied the model and measured performance; the result had decent accuracy.
3. Now I want to apply this model to the testing data and check the performance and related attributes.
Kindly refer to the attached doc, which shows the flow of operators used on both the training and testing datasets to reach the target values.
I retrieved the train and test data again, then used cross-validation and applied the model.
Can you please tell me whether the trained model can be saved somewhere and then invoked later with only the test data as input, without the training data? I have applied all the operators used on the training data to the testing data as well, which feels redundant.
Kindly help me clarify this.
Thanks in advance.

Best Answer

  • CKönig Administrator, Moderator, Employee, Member Posts: 70 RM Team Member
    Solution Accepted
    As a general rule, you should apply the same preprocessing steps to both the training dataset and the testing dataset. This can make a huge difference, e.g. if you normalize the training dataset so that the model expects values around 0, and then you feed it huge unnormalized numbers. It usually makes sense to put the preprocessing steps in a separate process that you can drag and drop into both the training process and the scoring process. This also makes maintaining them much easier, since you only have to make changes in one place.
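
To make the normalization point concrete, here is a minimal Python sketch (not RapidMiner code). The function names `fit_scaler` and `apply_scaler` and all data values are made up for illustration. The idea it shows is that the scaling parameters are learned from the training data once and then reused unchanged on the testing data.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn z-score parameters from the TRAINING column only."""
    mu = mean(column)
    sigma = pstdev(column) or 1.0  # fall back to 1.0 for a constant column
    return {"mean": mu, "std": sigma}


def apply_scaler(column, params):
    """Apply previously learned parameters to any dataset."""
    return [(x - params["mean"]) / params["std"] for x in column]


train = [100.0, 200.0, 300.0]  # hypothetical training values
test = [150.0, 400.0]          # hypothetical unseen values

params = fit_scaler(train)                 # learned once, from training data
train_scaled = apply_scaler(train, params)
test_scaled = apply_scaler(test, params)   # same parameters, no refitting
```

If the test column were scaled with its own mean and standard deviation instead, the same raw value would map to a different model input in training and in scoring, which is the mismatch the answer warns about.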

Answers

  • ceaperez Member Posts: 522 Unicorn
    Hi @nn_here

    After validating your model with the Cross Validation operator, you can use the Apply Model operator.
    The Apply Model operator has two input ports, mod and unl. Connect the mod output of the Cross Validation operator to the mod input of the Apply Model operator, and connect the validation dataset to the unl input port of the Apply Model operator.

    Best,

    Cesar


  • nn_here Member Posts: 31 Contributor I
    Thank you for the quick help. In this case we have also loaded deals and deals2, which I assume are the training and testing data respectively (kindly correct me if I am wrong). So does RapidMiner expect us to load both datasets every time to get predictions for the second dataset?
    Thanks and regards,
  • ceaperez Member Posts: 522 Unicorn
    Hi @nn_here
    You're welcome. Yes, you are right: the Deals dataset is for training and testing, and the Deals(2) dataset is for validation. Another option is to split your dataset, using 90% for training and testing and the other 10% for validation.

    Best,

    Cesar
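
The 90/10 split mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not what any RapidMiner operator does internally; `split_rows`, the seed, and the stand-in row list are all hypothetical.

```python
import random


def split_rows(rows, holdout_fraction=0.10, seed=42):
    """Shuffle a copy of the rows and set aside a fraction as a holdout set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


rows = list(range(100))  # stand-in for 100 data rows
modeling_rows, holdout_rows = split_rows(rows)
```

The 90% part is what goes into training and cross-validation; the 10% part is kept aside and only used for the final check.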
  • nn_here Member Posts: 31 Contributor I
    Hi,
    Thank you once again for the update. In a nutshell: in RapidMiner, we have to load both the training and testing datasets in the same process to validate performance on the testing data. There is no option to save a model trained on the training data and later pull out the model alone to get results for the testing data (without placing the training data in the same process). Kindly confirm whether my understanding is correct, or whether I am missing an operator that would do what I need.
    Thanks and regards.
  • ceaperez Member Posts: 522 Unicorn
    Hi again @nn_here

    After the testing process you can save your model and then import it into other processes, for example as part of a validation process.
    This is very easy: after you have saved the model in your repository, just drag and drop it into a new process and connect its output ports to other operators as needed.
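
As a rough analogy in Python (not RapidMiner's storage format), the "train once, score later" workflow looks like the sketch below. The toy model, the file name `stored_model.pkl`, and all data values are assumptions made for illustration; the point is that the scoring step touches only the stored model and the test data.

```python
import pickle
import tempfile
from pathlib import Path


def fit_line(xs, ys):
    """Ordinary least squares for one input column: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def predict(model, xs):
    """Score new rows with an already trained model."""
    return [model["slope"] * x + model["intercept"] for x in xs]


train_x = [1.0, 2.0, 3.0, 4.0]  # hypothetical training inputs
train_y = [3.0, 5.0, 7.0, 9.0]  # hypothetical labels (y = 2x + 1)
test_x = [5.0, 6.0]             # hypothetical unseen inputs

with tempfile.TemporaryDirectory() as repo_dir:  # stand-in for a repository
    model_path = Path(repo_dir) / "stored_model.pkl"

    # "Training process": runs once and needs the training data.
    model_path.write_bytes(pickle.dumps(fit_line(train_x, train_y)))

    # "Scoring process": needs only the stored model and the test data.
    loaded_model = pickle.loads(model_path.read_bytes())
    predictions = predict(loaded_model, test_x)
```

Any preprocessing parameters learned during training (for example normalization values) would need to be stored and reloaded in the same way, otherwise the scoring step feeds the model differently prepared data.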



  • nn_here Member Posts: 31 Contributor I
    Hi,
    Thank you for clarifying the doubt. I will try this out!
    Thanks and regards.
  • nn_here Member Posts: 31 Contributor I
    Hi,

    As suggested, I saved the training-set model as one process and the testing part as another process. Then, in a new process, I dragged in both processes and combined them with Apply Model. But the result is very different from the one we got when everything was built as a single process. Is that possible, or am I missing something here too? Kindly find the latest doc attached to this post; the original doc is already uploaded. Kindly help me clarify this. For your reference, I am uploading both files again: the result doc contains the latest changes, and the RapidMiner cross-validation doc contains the original process.
    Thanks and regards.
  • ceaperez Member Posts: 522 Unicorn
    Hi @nn_here,

    If you are using the same datasets in both cases, the results should be similar. Can you share your process and dataset?

    Best,

    Cesar
  • nn_here Member Posts: 31 Contributor I
    Hi, thank you for the update. I will cross-verify the process I have created.
  • nn_here Member Posts: 31 Contributor I
    Hi,

    I have a requirement.
    1. I need to build a random forest model on a training dataset and check its performance.
    2. I need to apply it to an unseen testing dataset and evaluate the performance.

    Please find the process I created in the attached doc and let me know whether I am using the correct approach.
    I tried the Cross Validation and Apply Model operators. With the training set alone, the squared correlation was 0.969%, and the actual and predicted values for the RUL column were very close to each other. But with the testing set, the predicted values are far from the actual values.

    One more doubt: if I use the Split Data operator (when I trained on the training dataset only), there is a large difference in performance. Do we always have to use this operator with models such as linear regression and random forest?

    Thanks and regards,
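
A gap like the one described above can have several causes that the thread never pins down: different preprocessing on the test set, a test set recorded under different conditions, or a score computed on rows the model had already seen. The Python sketch below illustrates only the last of these. The nearest-neighbour "model", the helper names, and every number are hypothetical, and the squared-correlation formula is the usual squared Pearson correlation, assumed to match the reported measure.

```python
def squared_correlation(actual, predicted):
    """Squared Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_a * var_p)


def nearest_neighbour_predict(train_x, train_y, xs):
    """Predict by copying the label of the closest training row (memorizes data)."""
    def closest(x):
        return min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return [train_y[closest(x)] for x in xs]


train_x = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical training inputs
train_y = [1.2, 1.9, 3.4, 3.8, 5.3]  # hypothetical training labels
test_x = [1.5, 2.6, 3.5, 4.4]        # hypothetical unseen inputs
test_y = [1.4, 2.7, 3.3, 4.6]        # hypothetical unseen labels

# Scoring on the rows the model was built from looks perfect...
train_score = squared_correlation(
    train_y, nearest_neighbour_predict(train_x, train_y, train_x))
# ...while scoring on rows it has never seen is noticeably lower.
test_score = squared_correlation(
    test_y, nearest_neighbour_predict(train_x, train_y, test_x))
```

This is why a performance figure is only comparable with the test-set figure when it was also measured on held-out rows, for example inside cross-validation or on a split that the model never trained on.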
  • nn_here Member Posts: 31 Contributor I
    Hi,
    I want to use the Optimize Parameters (Grid) operator for the models I have built. Can you please let me know whether we should optimize all of the model's parameters in one go or one parameter at a time? I ask because optimizing just one parameter is already taking a very long time.

    Thanks and regards,
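
The run-time side of this question is mostly arithmetic, which a short Python sketch can show. The parameter names, the value lists, and the assumption of 10-fold cross-validation inside the grid search are all hypothetical; substitute the real grid to estimate the real cost.

```python
from itertools import product

# Hypothetical grid: parameter name -> values to try.
grid = {
    "number_of_trees": [50, 100, 200],
    "maximal_depth": [5, 10, 20, 30],
    "min_leaf_size": [1, 5],
}
folds = 10  # assumed: each combination is scored with 10-fold cross-validation

# All parameters at once: every combination of values is trained and scored.
joint_combinations = list(product(*grid.values()))
joint_fits = len(joint_combinations) * folds

# One parameter at a time: only each parameter's own values are tried.
one_at_a_time_fits = sum(len(values) for values in grid.values()) * folds
```

Here the joint grid needs 3 * 4 * 2 = 24 combinations (240 model fits) against 9 settings (90 fits) for the one-by-one approach. The joint search grows multiplicatively with every added parameter; the one-by-one search is cheaper but cannot see how parameters interact, so it may settle on a worse combination.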

  • nn_here Member Posts: 31 Contributor I
    Hi,
    I have a scenario with 4 datasets, each with a different number of columns. I need to pick 2 columns from each dataset and create a new dataset from them. Can you please let me know whether there is a way to achieve this?
    Thanks and regards.
  • ceaperez Member Posts: 522 Unicorn
    Hi @nn_here
    You can use the Select Attributes operator to select the columns (attributes) from each dataset, and after that use the Superset operator to join them into a new dataset.

    Best,
    Cesar
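
For comparison, here is what "pick two columns from each of four datasets and build one new dataset" can look like in plain Python. The thread does not say whether the four datasets share a row key, so this sketch assumes the rows line up by position; the table contents and the helper names `select_columns` and `combine_side_by_side` are invented for illustration and do not reproduce any RapidMiner operator.

```python
def select_columns(rows, names):
    """Keep only the named columns of every row."""
    return [{name: row[name] for name in names} for row in rows]


def combine_side_by_side(*tables):
    """Merge row i of every table into one wide row (rows must align by position)."""
    if len({len(table) for table in tables}) != 1:
        raise ValueError("all tables need the same number of rows")
    combined = []
    for rows in zip(*tables):
        wide = {}
        for row in rows:
            wide.update(row)
        combined.append(wide)
    return combined


# Four hypothetical datasets with different numbers of columns.
table_a = [{"a1": 1, "a2": 2, "a3": 3}, {"a1": 4, "a2": 5, "a3": 6}]
table_b = [{"b1": 7, "b2": 8}, {"b1": 9, "b2": 10}]
table_c = [{"c1": 11, "c2": 12, "c3": 13, "c4": 14},
           {"c1": 15, "c2": 16, "c3": 17, "c4": 18}]
table_d = [{"d1": 19, "d2": 20, "d3": 21}, {"d1": 22, "d2": 23, "d3": 24}]

combined = combine_side_by_side(
    select_columns(table_a, ["a1", "a2"]),
    select_columns(table_b, ["b1", "b2"]),
    select_columns(table_c, ["c1", "c2"]),
    select_columns(table_d, ["d1", "d2"]),
)
```

The result has the same number of rows as each input and eight columns. If the datasets do not line up row by row, a shared ID column and a key-based join would be needed instead.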
  • nn_here Member Posts: 31 Contributor I
    Thank you for the help. Will try this out.
    Thanks and regards,
    nn_here
  • nn_here Member Posts: 31 Contributor I
    Hi,
    I tried the outlier check option in the Auto Model tab of RapidMiner. As the CSV has more than 2.5 lakh (250,000) rows, I decided to go with Auto Model. But it has been running for more than 1.5 hours and is still going. Can you please let me know whether we should use this option, or whether there is another operator that serves the same purpose?
    Thanks in advance.
  • nn_here Member Posts: 31 Contributor I
    ceaperez
    I tried the operators you suggested: 'you can use the Select Attributes operator to select the columns (attributes) from each dataset and after that use the Superset Operator, to joint them into a new dataset.' Can you tell me whether my doubt is a valid one? I have 264960 rows in each dataset, and some of the values are missing. When I apply Superset to 2 datasets, it still shows 264960 rows. Shouldn't it display 264960*2 rows? Kindly correct me if my understanding is wrong. Please also find the attached process.

    Thanks and regards,
    nn_here
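
The row-count question can be reasoned about with a small Python sketch. It rests on an assumption worth checking against the operator's help text: that Superset only gives its two inputs the same set of columns (filling the gaps with missing values) and returns them as two separate datasets, so neither row count changes. Stacking the rows would then be a separate append step. The `superset` function and the two tiny tables below are illustrative stand-ins, not RapidMiner code.

```python
def pad_rows(rows, columns):
    """Return rows that contain every listed column; absent values become None."""
    return [{column: row.get(column) for column in columns} for row in rows]


def superset(left, right):
    """Give both tables the union of their columns without adding any rows."""
    columns = list(dict.fromkeys(
        [name for row in left for name in row]
        + [name for row in right for name in row]
    ))
    return pad_rows(left, columns), pad_rows(right, columns)


left = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]   # hypothetical dataset 1
right = [{"c": 5, "d": 6}, {"c": 7, "d": 8}]  # hypothetical dataset 2

left_aligned, right_aligned = superset(left, right)  # still 2 rows each
appended = left_aligned + right_aligned              # 2 + 2 = 4 rows
```

Under that assumption, seeing 264960 rows after Superset is expected, and 264960*2 rows would only appear after the two aligned datasets are appended. The new columns on each side are entirely missing values, which may also explain some of the missing data being observed.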
