"Training with multiple CSVs"

TimF Member Posts: 3 Contributor I
edited June 2019 in Help
Hi all!

Very sorry if this is a head-slappingly basic question. I have tried to find an answer in the manual, but I probably just don't know what I'm looking for!

I am using data from a series of races. I think I need to train my model with multiple races before I can try to predict a winner. But how do I set up my data so that RapidMiner knows that each race needs to be analysed as one event with one winner, rather than as a series of unrelated records containing winners and losers? Should I use one CSV with a label or ID on each row that belongs to the same race, or should I have a separate CSV for each race? And if so, how do I use multiple CSVs as input?


  • wessel Member Posts: 537 Maven
    You should join all your training data into one single table.

    You need to encode your inputs in a way that makes learning as easy as possible.
    As a domain expert, you know that an arbitrary label like a race-id is not predictive of the outcome of a race.
    Therefore, a race-id should not be included as a predictive variable.
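
Outside RapidMiner, the single-table idea can be sketched in plain Python. This is only an illustration under assumptions: the thread names just the 'Won', 'Race_ID' and 'Runner' columns, so the other column names (`flag`, `f1` to `f6`) are hypothetical, and the two in-memory "files" stand in for per-race CSVs. The race id and runner name are kept for grouping but left out of the predictive features.

```python
import csv
import io

# Hypothetical column names; the thread only names Won, Race_ID and Runner.
COLUMNS = ["Won", "Race_ID", "Runner", "flag", "f1", "f2", "f3", "f4", "f5", "f6"]
ID_COLUMNS = ("Race_ID", "Runner")  # kept for grouping, never fed to the learner
LABEL = "Won"

def read_races(sources):
    """Concatenate several CSV sources (e.g. one per race) into one list of rows."""
    rows = []
    for src in sources:
        for record in csv.reader(src):
            record = [field.strip() for field in record]
            if not any(record):
                continue  # skip blank lines
            # zip() stops at the shorter sequence, so the empty field left by
            # the trailing comma on each line is dropped.
            rows.append(dict(zip(COLUMNS, record)))
    return rows

def split_roles(row):
    """Return (ids, features, label) for one row."""
    ids = {k: row[k] for k in ID_COLUMNS}
    features = {k: v for k, v in row.items()
                if k not in ID_COLUMNS and k != LABEL}
    return ids, features, row[LABEL]

# Two in-memory "files" standing in for per-race CSVs.
race_a = io.StringIO("NO,TOD315,AMBER DREAM,N,30,10,1.7,7,2.8,54.3,\n"
                     "YES,TOD315,THE PARK DANCER,N,6,16.7,2.5,4,0,55.7,\n")
race_b = io.StringIO("YES,TOD350,SAXON COAST,N,13,23.1,6.3,4,0,55,\n"
                     "NO,TOD350,SALUTE THE SUN,Y,21,19,1.3,4,4.1,53.8,\n")

table = read_races([race_a, race_b])
ids, features, label = split_roles(table[0])
```

The race id still travels with every row, so rows can be grouped by race later; it is simply not offered to the learner as an input.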

    Sometimes it can be quite a bit of work to wrangle your data into the exact format you need.
    For example, you may want to include the outcome of the previous race as a predictive variable for the current race.
    If you are not handy with manipulating data, this can take some effort.
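
As a rough sketch of the "previous race" idea, the function below adds a hypothetical `Prev_Won` column holding each runner's result in their previous race. It assumes the rows are already sorted chronologically, which the thread's data does not guarantee, and the four rows are made up purely for illustration.

```python
def add_previous_result(rows):
    """Add 'Prev_Won' to each row: the runner's result in their previous race.

    Assumes rows are sorted in chronological order. The first time a runner
    appears there is no history, so the value is 'UNKNOWN'.
    """
    last_result = {}  # runner name -> result of their most recent race so far
    enriched_rows = []
    for row in rows:
        enriched = dict(row)
        enriched["Prev_Won"] = last_result.get(row["Runner"], "UNKNOWN")
        enriched_rows.append(enriched)
        # Update history only after the row is built, so a race never sees
        # its own outcome as a feature.
        last_result[row["Runner"]] = row["Won"]
    return enriched_rows

# Made-up rows for illustration only (not taken from the thread's data).
rows = [
    {"Race_ID": "R1", "Runner": "A", "Won": "YES"},
    {"Race_ID": "R1", "Runner": "B", "Won": "NO"},
    {"Race_ID": "R2", "Runner": "A", "Won": "NO"},
    {"Race_ID": "R2", "Runner": "B", "Won": "YES"},
]
enriched = add_previous_result(rows)
```

Updating the history after building each row matters: otherwise the current race's result would leak into its own input.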
  • TimF Member Posts: 3 Contributor I
    Thank you for the reply!
    If all of my training data is in a single table with one participant per row, I don't understand how the model can work. I thought that each race should be read as one data point, since the chance of one participant winning is affected by the strength of the other participants. Is this where I would use the 'batch' data role?
  • wessel Member Posts: 537 Maven
    No, probably not.

    Post a few example data rows and maybe I can figure out how to manipulate your data.

  • TimF Member Posts: 3 Contributor I
    Thank you for taking a look. This is a bit of the data I pulled out to learn with, comma separated. The 'Won' column is what I am trying to train my model to predict, 'Race_ID' is a unique text string for each race that was run, and 'Runner' is a text string for the name of each runner in that race. The other columns are various performance history or demographic data for that runner in the race.

    NO,TOD315,AMBER DREAM,N,30,10,1.7,7,2.8,54.3,
    YES,TOD315,THE PARK DANCER,N,6,16.7,2.5,4,0,55.7,
    NO,TOD315,CULLEN'S SHADOW,N,49,12.2,1.4,7,2.5,44.8,
    YES,TOD350,SAXON COAST,N,13,23.1,6.3,4,0,55,
    NO,TOD350,SALUTE THE SUN,Y,21,19,1.3,4,4.1,53.8,
    NO,TOD350,THYME FOR BUSINESS,N,8,25,2.2,5,2.5,49.7,
    NO,TOD350,THE FACTOR,N,15,13.3,1.6,5,3.5,50.8,
    NO,TOD425,ROMP TO FAME,Y,10,20,7.4,5,3.5,51.6,
    NO,TOD425,DRIVE WEST,N,15,6.7,2.7,4,2.4,51.3,
    YES,TOD425,FINNEGANS GOLD,N,29,10.3,1.8,6,0,54.2,

    And here is a CSV version of the same: https://dl.dropbox.com/u/17535287/october%202012%20results.csv
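
One possible way to keep one row per runner while still reflecting the strength of the rest of the field is to add race-relative columns, grouped by 'Race_ID'. The sketch below is an assumption-laden illustration, not a confirmed recipe from this thread: the column names other than 'Won', 'Race_ID' and 'Runner' are hypothetical, and `f6` (the last numeric column) is picked arbitrarily, without knowing what it measures or whether higher is better.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names; the thread only names Won, Race_ID and Runner.
COLUMNS = ["Won", "Race_ID", "Runner", "flag", "f1", "f2", "f3", "f4", "f5", "f6"]

SAMPLE = """\
NO,TOD315,AMBER DREAM,N,30,10,1.7,7,2.8,54.3,
YES,TOD315,THE PARK DANCER,N,6,16.7,2.5,4,0,55.7,
NO,TOD315,CULLEN'S SHADOW,N,49,12.2,1.4,7,2.5,44.8,
YES,TOD350,SAXON COAST,N,13,23.1,6.3,4,0,55,
NO,TOD350,SALUTE THE SUN,Y,21,19,1.3,4,4.1,53.8,
NO,TOD350,THYME FOR BUSINESS,N,8,25,2.2,5,2.5,49.7,
NO,TOD350,THE FACTOR,N,15,13.3,1.6,5,3.5,50.8,
"""

def relative_to_field(rows, race_col, value_col):
    """Add '<value_col>_vs_field' (value minus race mean) and '<value_col>_rank'
    (1 = largest value in that race) to every row."""
    by_race = defaultdict(list)
    for row in rows:
        by_race[row[race_col]].append(row)

    out = []
    for race_rows in by_race.values():
        values = [float(r[value_col]) for r in race_rows]
        mean = sum(values) / len(values)
        for r, v in zip(race_rows, values):
            enriched = dict(r)
            enriched[value_col + "_vs_field"] = v - mean
            # Rank counts how many runners in the same race have a larger value.
            enriched[value_col + "_rank"] = 1 + sum(other > v for other in values)
            out.append(enriched)
    return out

# zip() drops the empty field produced by the trailing comma on each line.
rows = [dict(zip(COLUMNS, rec)) for rec in csv.reader(io.StringIO(SAMPLE)) if rec]
result = relative_to_field(rows, "Race_ID", "f6")
```

Each runner's row then carries information about the other runners in the same race, while 'Race_ID' itself is used only for grouping.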