Predictions based on US baby names data?

guy_davis Member Posts: 1 Contributor I
edited November 2018 in Help

Good day, 

 

I'm new to RapidMiner and predictive analytics.  I'm trying to move beyond the tutorials (which are great!) by using the US baby names (state-by-state) data set found on Kaggle.  I'm able to load a random sample (1000 records) of the state-by-state data with these attributes:

  • id (ID type)
  • name (nominal type)
  • gender (binominal type)
  • state (nominal type)
  • year (integer type)
  • count (weight type)

Then I use another random selection to get 20 records without the state attribute.  I'd like to make a prediction of birth state based on name, gender, and birth year.  I'm sure this is a contrived example, but I thought I'd give it a try.  Alternatively, I'd like to predict birth year given name, gender, and state.  What would be some interesting models to try in this case?

 

I've tried using Decision Tree to generate a model from the training data and Apply Model to score the random test data.  As best I can tell, Decision Tree is only splitting on year and gender, ignoring name.  Is there any way to get this model to consider name?  Perhaps the issue is that I can't train on more than 1000 records due to licensing?

 

[Image: process.png] Process so far...

[Image: decision_tree.png] Decision tree on year, then sometimes gender.

[Image: results.png]

Thanks in advance,

Guy

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi Guy_david,

     

    Welcome to the community! Seems like a fun project to work on. It could also be some kind of marketing for us :).


    A few things:


    First of all, the reason the tree is not considering the names themselves is that they are not statistically significant. Most likely a cut on a specific name is simply not "big" enough to be counted as significant: with 1000 records, each individual name covers only a handful of rows, so splitting on it barely improves the tree. You might want to reduce min_gain (the minimal gain parameter of the Decision Tree operator) to let the tree grow deeper. Be aware that this might lead to overfitting. I could imagine that using the Namsor Extension to get the origin of a name could be helpful.
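
To make the "not big enough" point concrete, here is a minimal Python sketch (not RapidMiner code) that computes plain information gain for two candidate binary splits. The eight rows are invented for illustration, the helper names `entropy` and `split_gain` are hypothetical, and the operator's actual criterion and thresholds may differ from what is shown here.

```python
from collections import Counter
from math import log2

# Toy rows, invented purely for illustration: (name, year, state).
ROWS = [
    ("Mary", 1950, "NY"), ("John", 1952, "NY"),
    ("Linda", 1955, "NY"), ("James", 1958, "NY"),
    ("Ava", 2005, "CA"), ("Liam", 2008, "CA"),
    ("Emma", 2010, "CA"), ("Noah", 2012, "CA"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def split_gain(labels, goes_left):
    """Information gain of a binary split described by a list of booleans."""
    left = [y for y, flag in zip(labels, goes_left) if flag]
    right = [y for y, flag in zip(labels, goes_left) if not flag]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder


states = [state for _, _, state in ROWS]

# Candidate 1: a cut on year, which isolates half of the rows.
gain_year = split_gain(states, [year < 1960 for _, year, _ in ROWS])

# Candidate 2: a cut on one specific name, which isolates a single row.
gain_name = split_gain(states, [name == "Ava" for name, _, _ in ROWS])

# Hypothetical threshold standing in for the tree's minimal gain setting.
MIN_GAIN = 0.2
print("year split kept:", gain_year >= MIN_GAIN)
print("name split kept:", gain_name >= MIN_GAIN)
```

In this toy data the name split isolates one row, so its gain stays well below the year split. Lowering the threshold lets such splits through, which is also why the tree can then start memorizing individual names.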

     

    Another thing is that it will be very hard to predict each of the 50 states. I would boil it down to broader regions like West Coast, East Coast, South, Midwest or something similar. With only a handful of classes instead of 50, the problem becomes much easier.
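
One way to do this grouping, sketched in Python rather than as a RapidMiner process: map each state to a region and use the region as the label. The grouping below follows the four US Census Bureau regions as one possible choice (not the exact regions named above), and it assumes the data set stores states as two-letter postal codes. `REGIONS` and `to_region` are hypothetical names.

```python
# Region grouping based on the four US Census Bureau regions (an assumption;
# any coarser grouping, such as West Coast / East Coast, works the same way).
REGIONS = {
    "Northeast": "CT ME MA NH RI VT NJ NY PA",
    "Midwest": "IL IN MI OH WI IA KS MN MO NE ND SD",
    "South": "DE FL GA MD NC SC VA DC WV AL KY MS TN AR LA OK TX",
    "West": "AZ CO ID MT NV NM UT WY AK CA HI OR WA",
}

# Invert the grouping into a state -> region lookup table.
STATE_TO_REGION = {
    state: region
    for region, states in REGIONS.items()
    for state in states.split()
}


def to_region(state_code):
    """Return the region for a two-letter state code (case-insensitive)."""
    return STATE_TO_REGION[state_code.strip().upper()]


# Example: relabel a few made-up records before training.
records = [("Emma", "F", 2010, "CA"), ("John", "M", 1952, "NY")]
relabeled = [(name, gender, year, to_region(state))
             for name, gender, year, state in records]
print(relabeled)
```

The same relabeling could be done inside RapidMiner with an attribute-generation or value-mapping step. The sketch only shows the idea of replacing the 50-class label with a 4-class one.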

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany