RapidMiner

Newbie guy_davis

Predictions based on US baby names data?

Good day, 

 

I'm new to RapidMiner and predictive analytics. I'm trying to move beyond the tutorials (which are great!) by using the US baby names data (state-by-state) found on Kaggle. I'm able to load a random sample (1000 records) of the state-by-state data with these attributes:

  • id (ID type)
  • name (nominal type)
  • gender (binominal type)
  • state (nominal type)
  • year (integer type)
  • count (weight type)

Then I use another random selection to get 20 records without the state attribute. I'd like to make a prediction of birth state based on name, gender, and birth year. I'm sure this is a contrived example, but I thought I'd give it a try. Alternatively, I'd like to predict birth year given name, gender, and state. What would be some interesting models to try in this case?

 

I've tried using Decision Tree to generate a model from the training data and Apply Model to score the random test data. As best I can tell, Decision Tree is only splitting on year and gender, ignoring name. Is there any way to get this model to consider name? Perhaps the issue is that I can't train on more than 1000 records due to licensing?

 

Process so far...

Decision tree on year, then sometimes gender.

 results.png

Thanks in advance,

Guy

1 REPLY
RM Staff

Re: Predictions based on US baby names data?

Hi Guy,

 

Welcome to the community! This seems like a fun project to work on. It could also be some kind of marketing for us :)


A few things:


First of all, the tree is not considering the names because splits on them are not statistically significant. Most likely, a cut on one specific name affects too few records to produce enough gain to count as significant. You might want to reduce the min_gain to let the tree grow deeper. Be aware that this might lead to overfitting. I could imagine that using the Namsor Extension to get the origin of a name could be helpful.
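To make the point about gain concrete, here is a minimal Python sketch (outside RapidMiner) that compares the gain of a split on one rare name with the gain of a split on year. The toy data, the 0.01 threshold, and the helper names are all made up for illustration, and it uses plain information gain, which may not be the criterion your Decision Tree operator is configured with.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, mask):
    """Entropy reduction from splitting labels in two by a boolean mask."""
    left = [y for y, m in zip(labels, mask) if m]
    right = [y for y, m in zip(labels, mask) if not m]
    if not left or not right:
        return 0.0  # nothing was actually split off
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


# Toy data (invented): 1000 rows, two states, 500 rows each.
states = ["CA"] * 500 + ["NY"] * 500

# A rare name that occurs in only 4 rows, all of them in CA.
is_rare_name = [i < 4 for i in range(1000)]

# A year cut that puts 400 CA rows and 100 NY rows on the "early" side.
is_early_year = [i < 400 or 500 <= i < 600 for i in range(1000)]

gain_name = information_gain(states, is_rare_name)
gain_year = information_gain(states, is_early_year)

MIN_GAIN = 0.01  # hypothetical threshold, only for this illustration

print(f"gain from rare-name split: {gain_name:.4f}")
print(f"gain from year split:      {gain_year:.4f}")
print("name split passes threshold:", gain_name >= MIN_GAIN)
print("year split passes threshold:", gain_year >= MIN_GAIN)
```

Even though the rare name is a perfect indicator of the state for its 4 rows, it touches so little of the data that its gain stays below the threshold, while the noisier year split clears it easily. Lowering the threshold lets such splits through, at the price of a tree that memorizes individual names.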

 

Another thing is that it will be very hard to predict each of the 50 states individually. I would boil it down to broader regions such as West Coast, East Coast, South, Midwest, or something similar. This makes the problem much easier.
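As a rough sketch of that grouping, assuming the state attribute holds two-letter abbreviations, the label could be collapsed like this in Python before training. The mapping below is deliberately partial and the region names are only one possible choice; inside RapidMiner you would do the equivalent with a value-mapping step.

```python
# Partial, illustrative mapping from state abbreviation to a coarse region.
# Extend it to all states you have; the grouping itself is an assumption.
STATE_TO_REGION = {
    "CA": "West", "OR": "West", "WA": "West",
    "NY": "Northeast", "MA": "Northeast", "NJ": "Northeast",
    "TX": "South", "FL": "South", "GA": "South",
    "IL": "Midwest", "OH": "Midwest", "MI": "Midwest",
}


def to_region(state):
    """Return the coarse region for a state, or 'Other' if it is unmapped."""
    return STATE_TO_REGION.get(state.strip().upper(), "Other")


# Toy records (invented) in the shape described in the question.
records = [
    {"name": "Emma", "gender": "F", "state": "CA", "year": 2010},
    {"name": "Liam", "gender": "M", "state": "ny", "year": 2012},
    {"name": "Noah", "gender": "M", "state": "AK", "year": 2011},
]

# Add the new, coarser label next to the original state attribute.
for record in records:
    record["region"] = to_region(record["state"])

regions = [record["region"] for record in records]
print(regions)
```

The model then has a handful of classes to separate instead of 50, so each class has far more training examples in a 1000-record sample.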

 

~Martin

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner