"Training with multiple CSVs"

TimF · November 2012

Hi all!

Very sorry if this is a head-slappingly basic question. I have tried to find an answer in the manual, but I probably just don't know what I'm looking for!

I am using data from a series of races. I think I need to train my model on multiple races before I can try to predict a winner. But how do I set up my data so that RapidMiner knows that each race needs to be analysed as one event with one winner, rather than as a series of unrelated records containing winners and losers? Should I use one CSV with a label or ID on each row that belongs to the same race, or should I have a separate CSV for each race? And if so, how do I use multiple CSVs as input?

wessel · November 2012

You should join all your training data into one single table.

You need to encode your inputs in a way that makes learning as easy as possible.
As a domain expert, you know that an arbitrary choice like race-id is not predictive of the outcome of a race.
Therefore, race-id should not be included as a predictive variable.
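
As a rough illustration of the advice above, here is a minimal Python sketch (not RapidMiner itself) that stacks per-race CSVs into one table and keeps the race ID only as an identifier. The file names, the in-memory file contents, the helper `combine_races`, and the choice to also treat `Runner` as an identifier are all assumptions made for the example.

```python
import csv
import io

# Hypothetical per-race files, held in memory so the sketch runs as-is.
# In practice each text would be read from a CSV file on disk.
race_files = {
    "TOD315.csv": "Won,Runner,Rating\nNO,AMBER DREAM,54.3\nYES,THE PARK DANCER,55.7\n",
    "TOD350.csv": "Won,Runner,Rating\nYES,SAXON COAST,55\nNO,SALUTE THE SUN,53.8\n",
}


def combine_races(files):
    """Stack per-race CSVs into one list of rows, tagging each row with its race."""
    table = []
    for name, text in files.items():
        race_id = name.rsplit(".", 1)[0]  # file name without its extension
        for row in csv.DictReader(io.StringIO(text)):
            row["Race_ID"] = race_id
            table.append(row)
    return table


# Columns that identify a row but are not offered to the learner.
# Treating Runner as an identifier too is an assumption of this sketch.
ID_COLUMNS = {"Race_ID", "Runner"}
LABEL = "Won"

table = combine_races(race_files)
predictors = [c for c in table[0] if c not in ID_COLUMNS and c != LABEL]
print(len(table), predictors)
```

The race ID stays in the table so rows can still be grouped by race later; it is simply left out of the predictor list.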

Sometimes it can be quite a bit of work to wrangle your data into the exact format you need.
For example, you may want to include the outcome of the previous race as a predictive variable for the current race.
If you are not handy with manipulating data, this can take some effort.
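
The "outcome of the previous race" idea could be sketched roughly as follows in Python. This is only an illustration under assumptions: the rows, the `Date` column (a date is needed to order races), the runner names, the helper `add_previous_outcome`, and the `Prev_Won` column are all hypothetical.

```python
# Hypothetical rows; 'Date' is an assumed column used only to order the races.
rows = [
    {"Date": "2012-10-01", "Race_ID": "R1", "Runner": "A", "Won": "YES"},
    {"Date": "2012-10-01", "Race_ID": "R1", "Runner": "B", "Won": "NO"},
    {"Date": "2012-10-08", "Race_ID": "R2", "Runner": "A", "Won": "NO"},
    {"Date": "2012-10-08", "Race_ID": "R2", "Runner": "B", "Won": "YES"},
    {"Date": "2012-10-15", "Race_ID": "R3", "Runner": "A", "Won": "NO"},
]


def add_previous_outcome(rows):
    """Add 'Prev_Won': the runner's result in their previous race, or '?' if none."""
    last_result = {}
    out = []
    # ISO-formatted dates sort correctly as plain text; sorted() is stable.
    for row in sorted(rows, key=lambda r: r["Date"]):
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row["Prev_Won"] = last_result.get(row["Runner"], "?")
        last_result[row["Runner"]] = row["Won"]
        out.append(new_row)
    return out


lagged = add_previous_outcome(rows)
for r in lagged:
    print(r["Race_ID"], r["Runner"], r["Prev_Won"])
```

Each row only ever sees results from earlier races, so the new column does not leak the outcome it is meant to help predict.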

TimF · November 2012

Thank you for the reply!
If all of my training data is in a single table with one participant per row, I don't understand how the model can work. I thought that each race should be read as one data point, since the chance of one participant winning is affected by the strength of the other participants. Is this where I would use the 'batch' data role?

wessel · November 2012

No, probably not.

Give a few example data rows; maybe I can figure out how to manipulate your data.

TimF · November 2012

Thank you for taking a look. This is a bit of the data I pulled out to learn with, comma separated. The 'Won' column is what I am trying to train my model to predict, 'Race_ID' is a unique text string for each race that was run, and 'Runner' is a text string for the name of each runner in that race. The other columns are various performance history or demographic data for that runner in the race.

Won,Race_ID,Runner,FAV,STS,WIN%,API,AGE,RLEN,Rating,
NO,TOD315,AMBER DREAM,N,30,10,1.7,7,2.8,54.3,
YES,TOD315,THE PARK DANCER,N,6,16.7,2.5,4,0,55.7,
NO,TOD315,CULLEN'S SHADOW,N,49,12.2,1.4,7,2.5,44.8,
YES,TOD350,SAXON COAST,N,13,23.1,6.3,4,0,55,
NO,TOD350,SALUTE THE SUN,Y,21,19,1.3,4,4.1,53.8,
NO,TOD350,THYME FOR BUSINESS,N,8,25,2.2,5,2.5,49.7,
NO,TOD350,THE FACTOR,N,15,13.3,1.6,5,3.5,50.8,
NO,TOD425,ROMP TO FAME,Y,10,20,7.4,5,3.5,51.6,
NO,TOD425,DRIVE WEST,N,15,6.7,2.7,4,2.4,51.3,
NO,TOD425,DUGITE,N,48,12.5,1.1,7,2.6,51.9,
YES,TOD425,FINNEGANS GOLD,N,29,10.3,1.8,6,0,54.2,

And here is a CSV version of the same: https://dl.dropbox.com/u/17535287/october%202012%20results.csv
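
For readers who want to experiment with the sample rows outside RapidMiner, here is a small Python sketch that parses them, checks that each race has exactly one winner, and adds one possible "strength of the field" feature while keeping one runner per row. The column name `Rating_vs_field` and the idea of subtracting the race's mean rating are assumptions of this sketch, not something proposed in the thread.

```python
import csv
import io
from collections import Counter, defaultdict

# The sample rows from the post above, copied verbatim (including trailing commas).
SAMPLE = """\
Won,Race_ID,Runner,FAV,STS,WIN%,API,AGE,RLEN,Rating,
NO,TOD315,AMBER DREAM,N,30,10,1.7,7,2.8,54.3,
YES,TOD315,THE PARK DANCER,N,6,16.7,2.5,4,0,55.7,
NO,TOD315,CULLEN'S SHADOW,N,49,12.2,1.4,7,2.5,44.8,
YES,TOD350,SAXON COAST,N,13,23.1,6.3,4,0,55,
NO,TOD350,SALUTE THE SUN,Y,21,19,1.3,4,4.1,53.8,
NO,TOD350,THYME FOR BUSINESS,N,8,25,2.2,5,2.5,49.7,
NO,TOD350,THE FACTOR,N,15,13.3,1.6,5,3.5,50.8,
NO,TOD425,ROMP TO FAME,Y,10,20,7.4,5,3.5,51.6,
NO,TOD425,DRIVE WEST,N,15,6.7,2.7,4,2.4,51.3,
NO,TOD425,DUGITE,N,48,12.5,1.1,7,2.6,51.9,
YES,TOD425,FINNEGANS GOLD,N,29,10.3,1.8,6,0,54.2,
"""

rows = []
for row in csv.DictReader(io.StringIO(SAMPLE)):
    row.pop("", None)  # the trailing comma yields an empty, unnamed column
    row["Rating"] = float(row["Rating"])
    rows.append(row)

# Sanity check: count the winners recorded for each race.
winners = Counter(r["Race_ID"] for r in rows if r["Won"] == "YES")

# Group ratings by race so each runner can be compared with its own field.
ratings_by_race = defaultdict(list)
for r in rows:
    ratings_by_race[r["Race_ID"]].append(r["Rating"])

# Assumed feature: a runner's rating minus the mean rating of its race.
for r in rows:
    field = ratings_by_race[r["Race_ID"]]
    r["Rating_vs_field"] = r["Rating"] - sum(field) / len(field)

print(dict(winners))
for r in rows:
    print(r["Race_ID"], r["Runner"], round(r["Rating_vs_field"], 2))
```

A feature like this lets each row carry some information about the other runners in the same race, without using Race_ID itself as a predictor.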

Altair RapidMiner Community