I'm new to RapidMiner and predictive analytics. I'm trying to move beyond the tutorials (which are great!) by using the US baby names (state-by-state) data set found on Kaggle. I'm able to load in a random sample (1000 records) of the state-by-state data.
Then I use another random selection to get 20 records without the state attribute. I'd like to make a prediction of birth state based on name, gender, and birth year. I'm sure this is a contrived example, but I thought I'd give it a try. Alternatively, I'd like to predict birth year given name, gender, and state. What would be some interesting models to try in this case?
I've tried using Decision Tree to generate a model from the training data and Apply Model to the random Test Data. As best I can tell, Decision Tree is only working on year and gender, ignoring name. Is there any way to get this model to consider name? Perhaps the issue is that I can't train on more than 1000 records due to licensing?
Thanks in advance,
Welcome to the community! This seems like a fun project to work on. It could also be some kind of marketing for us.
A few things:
First of all, the reason the tree is not considering the names is that they are not statistically significant. Most likely a split on one specific name covers too few records to produce enough gain to be counted as significant. You might want to reduce the min_gain to let the tree grow deeper. Be aware that this might lead to overfitting. I could imagine that using the Namsor Extension to get the origin of a name could be helpful.
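To make the point about small splits concrete, here is a minimal Python sketch that computes the information gain of a split on one name. The counts are made up for illustration (1000 records, two target classes, a hypothetical name that occurs 10 times), and plain information gain is used as a stand-in; the criterion actually configured in your Decision Tree operator may differ.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical counts: 1000 records, evenly split between two classes.
parent = [500, 500]
# A split on one specific name: only 10 records carry that name.
name_matches = [8, 2]
name_other = [492, 498]

gain = information_gain(parent, [name_matches, name_other])
print(f"information gain of the single-name split: {gain:.4f}")
```

Even though the 10 matching records lean strongly toward one class, the split touches only 1% of the data, so the gain stays tiny and is easily below a minimal-gain threshold.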
Another thing is that it will be very hard to predict each of the 50 states. I would boil it down to broader regions like West Coast, East Coast, South, Midwest or something similar. This makes the problem much easier.
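One way to do that grouping is sketched below in Python. It assumes the state column holds two-letter postal codes and uses the four US Census Bureau regions as the grouping, which is only one possible choice and not exactly the coast-based split mentioned above; the names REGIONS and to_region are made up for this example.

```python
# Four US Census Bureau regions (assumed grouping; DC is included in the South).
REGIONS = {
    "Northeast": "CT ME MA NH RI VT NJ NY PA".split(),
    "Midwest": "IL IN MI OH WI IA KS MN MO NE ND SD".split(),
    "South": "DE FL GA MD NC SC VA DC WV AL KY MS TN AR LA OK TX".split(),
    "West": "AZ CO ID MT NV NM UT WY AK CA HI OR WA".split(),
}

# Invert the mapping so each state code points to its region.
STATE_TO_REGION = {
    state: region for region, states in REGIONS.items() for state in states
}

def to_region(state_code: str) -> str:
    """Map a two-letter state code to its region; raises KeyError if unknown."""
    return STATE_TO_REGION[state_code.strip().upper()]

print(to_region("CA"), to_region("tx"), to_region("NY"))
```

The region then replaces the state as the label to predict, so the model has to separate 4 classes instead of about 50.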