How to select the right data for prediction?
User111113
Member Posts: 24 Maven
in Help
Hi All,
I have about 2 years of historical data that I can probably use to predict responses.
For example, if I have to predict my response rate for Jan 2020, how can I tell how much data would be enough to come close to the actual rate?
- Should I look at how my data performed in Jan 2018 and Jan 2019, and maybe the last 4 months of 2019?
- Or should it be the last four months of 2019 plus Jan 2019?
- Or should I use everything I have? I am not comfortable with that because of the many outliers.
When I compared actual and predicted values for the past few months, they did not seem close at all, because the prediction was done manually (on a piece of paper).
How do I select the right data?
Thank you.
Best Answer
PaulMSimpson
Member Posts: 8 Contributor II
Let me help you split your data on a date, as many months back as you prefer. I'm fairly new to RapidMiner, having done most of my data science work in R previously. Therefore, I don't know if what I'm about to show you is the simplest or best way to split a dataset on a date, but it does work.
First, you need to create a third column that combines your month column, "/1/" and your year column, so that every record has an actual date value, such as 5/1/2018. I recommend using the Generate Attributes operator: click Edit List, add an attribute named "myDate", and in the function expressions field put this: date_parse([yourMonthCol] + "/1/" + [yourYearCol]), using the names of your own month and year columns, of course.
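Outside RapidMiner, the same idea (turning separate month and year columns into a real date pinned to the first of the month) can be sketched in plain Python. This is only an illustration: the `records` list, its column names and the helper `first_of_month` are made up for the example and are not part of the original data.

```python
from datetime import date


def first_of_month(month: int, year: int) -> date:
    """Return the first day of the given month as a real date value."""
    return date(year, month, 1)


# Hypothetical records with separate month and year columns.
records = [
    {"month": 5, "year": 2018, "response_rate": 0.12},
    {"month": 1, "year": 2019, "response_rate": 0.10},
]

# Add a "myDate" column, mirroring the generated attribute described above.
for row in records:
    row["myDate"] = first_of_month(row["month"], row["year"])
```

Once every record carries a proper date value, comparisons such as "before the split date" become simple date comparisons rather than string handling.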
Second, after your Retrieve operator, place only one Filter Examples operator. You only need one, because the "unm" port outputs all unmatched records, and you can pipe those to be your test data. I used the "expression" condition class with a parameter expression built on the date_before() function. The first parameter is the name of your date field, and the second is a date_parse() call that converts a string representing your chosen split date into a date data type.
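For readers who want to see the logic of the date split on its own, here is a minimal Python sketch. It assumes each row already has a date value under a key called `myDate`; the function name `split_on_date` and the sample rows are hypothetical. Rows dated before the split point play the role of the matched output (training data), and everything else plays the role of the "unm" output (test data).

```python
from datetime import date


def split_on_date(rows, date_key, split_point):
    """Rows dated before split_point go to train; all other rows go to test."""
    train = [r for r in rows if r[date_key] < split_point]
    test = [r for r in rows if r[date_key] >= split_point]
    return train, test


# Hypothetical monthly records.
rows = [
    {"myDate": date(2019, 8, 1), "response_rate": 0.11},
    {"myDate": date(2019, 9, 1), "response_rate": 0.12},
    {"myDate": date(2019, 10, 1), "response_rate": 0.10},
    {"myDate": date(2019, 11, 1), "response_rate": 0.13},
]

# Everything before 1 Oct 2019 is training data; the rest is held out.
train, test = split_on_date(rows, "myDate", date(2019, 10, 1))
```

Because the held-out rows are always later than the training rows, this kind of split avoids training on data from after the period being predicted.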
Answers
Thank you for your response. I will try both approaches. Which method would be better for testing accuracy in this case?
For validation I normally use cross validation or split validation. In this case I would use cross validation, but any other suggestions are welcome.
I ran a performance test against the original data: I predicted the response rate for four months (July to October), for which I already have the actual values, and fed those in to see how far the predictions deviate from the actuals. I got a root mean squared error of 0.016,
which isn't bad. What do you think?
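For reference, root mean squared error is the square root of the average squared difference between actual and predicted values. The sketch below shows the calculation in Python; the four actual and predicted response rates are invented for illustration and are not the poster's data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical monthly response rates (e.g. four held-out months).
actual = [0.10, 0.12, 0.11, 0.09]
predicted = [0.11, 0.10, 0.12, 0.10]

error = rmse(actual, predicted)
```

Whether a given RMSE is good depends on the scale of the response rate itself: an error of 0.016 is small if rates are around 0.5 but large if they are around 0.02, so it helps to compare it with the typical rate or with a simple baseline forecast.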
Another approach I thought of is to add a status column before loading the data into RM, which I did, dividing the records into old/new. But the split operator still takes only standard values such as a ratio and other default options. How can I split using the status column from my data?
I also left the RR column blank where status is new, because that will be my test data.
Kindly help. Thank you.
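The status-based split asked about here follows the same pattern as the date filter in the best answer: filter on one condition and treat the unmatched rows as test data. A minimal Python sketch of that logic is below. It assumes a `status` column with the values "old" and "new" and an `RR` column that is empty (`None`) for new rows, as described in the post; the function name and sample rows are hypothetical.

```python
def split_by_status(rows, status_key="status"):
    """Rows marked 'old' become training data; rows marked 'new' become test data."""
    train = [r for r in rows if r[status_key] == "old"]
    test = [r for r in rows if r[status_key] == "new"]
    return train, test


# Hypothetical rows: RR is known for old records and blank for new ones.
rows = [
    {"month": "2019-11", "status": "old", "RR": 0.12},
    {"month": "2019-12", "status": "old", "RR": 0.10},
    {"month": "2020-01", "status": "new", "RR": None},
]

train, test = split_by_status(rows)
```

Note that rows with a blank RR can be scored by a model but cannot be used to measure accuracy, since there is no actual value to compare against.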