Noob question: multiple attributes, single value of interest, best model?

user197433 Member Posts: 2 Contributor I
I am new to data mining and have experimented a little with RapidMiner, mostly with decision trees. I now want to take some common situations from my line of work and analyze them to see what we can learn from the data. I'm interested in which models or methods the community considers best suited to this kind of analysis.

For example, let's say I have real-estate data on home buyers: a spreadsheet of all recent purchases within a zip code of interest, plus some basic demographic data. The columns of the sheet might be:

Address
Bedrooms
Bathrooms
Square Feet
Purchase Price
Buyer Age
Buyer Gender
Buyer Ethnicity
Buyer Annual Income
Buyer Marital Status

Let's say the question I'm interested in is: who do I market a $300k house to, versus a $700k house?

I know I could build scatter plots showing how each metric groups by price, but are there good statistical models for this type of analysis that would produce interesting facts or views of the data:

e.g. 90% of buyers of homes costing $700k are married couples with annual income > $120k.

Thoughts?

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    You have multiple target attributes (buyer age, gender, income, ...), so you basically have two possibilities:

    First, you could train a separate model for each target attribute: read in the data, set one target attribute (e.g. age) as the label, and remove the other target attributes. Then train a model, e.g. a decision tree, and repeat that for every target attribute.
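
    A minimal Python sketch of this first approach, outside RapidMiner. Everything in it is an assumption made for illustration: the rows are made-up toy data, the names ROWS, FEATURES, TARGETS, fit_stump and predict are hypothetical, and a one-split "decision stump" stands in for a full decision tree. The point is only the workflow: one model per target attribute, with the other targets never used as inputs.

```python
from collections import Counter

# Made-up toy rows for illustration only; not real market data.
ROWS = [
    {"price": 280_000, "sqft": 1300, "marital": "single",  "income_band": "under_120k"},
    {"price": 300_000, "sqft": 1450, "marital": "single",  "income_band": "under_120k"},
    {"price": 310_000, "sqft": 1500, "marital": "single",  "income_band": "under_120k"},
    {"price": 330_000, "sqft": 1600, "marital": "married", "income_band": "under_120k"},
    {"price": 650_000, "sqft": 2800, "marital": "married", "income_band": "over_120k"},
    {"price": 690_000, "sqft": 3000, "marital": "married", "income_band": "over_120k"},
    {"price": 700_000, "sqft": 3100, "marital": "married", "income_band": "over_120k"},
    {"price": 720_000, "sqft": 3300, "marital": "married", "income_band": "over_120k"},
]
FEATURES = ["price", "sqft"]          # house attributes used as inputs
TARGETS = ["marital", "income_band"]  # buyer attributes, each predicted separately


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, features, target):
    """Depth-1 tree: choose the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        # Candidate thresholds are midpoints between neighbouring observed values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r[target] for r in rows if r[f] <= thr]
            right = [r[target] for r in rows if r[f] > thr]
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best["errors"]:
                best = {"feature": f, "threshold": thr,
                        "left": l_lab, "right": r_lab, "errors": errors}
    return best


def predict(stump, row):
    """Apply the single split to one house."""
    side = "left" if row[stump["feature"]] <= stump["threshold"] else "right"
    return stump[side]


# One model per target attribute; the other targets are simply not in FEATURES.
models = {t: fit_stump(ROWS, FEATURES, t) for t in TARGETS}

house = {"price": 700_000, "sqft": 3000}
profile = {t: predict(m, house) for t, m in models.items()}
print(models)
print(profile)
```

    Each fitted stump reads as a single rule of the form "above this threshold, predict that value". A real decision tree learner would chain several such splits per target.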

    The second possibility is to encode all target attributes into one single label attribute like 48_male_120000_married. That way you would capture the dependencies between the different target values, but you would also end up with a multi-class problem with probably a lot of classes, so you would need a large amount of data for good results.
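
    To make the class-count problem concrete, here is a small Python sketch of the combined label. The buyer records are made up, and the helper names raw_label and binned_label, as well as the idea of binning age and income before concatenating, are assumptions for illustration rather than anything RapidMiner prescribes.

```python
from collections import Counter

# Made-up buyer records for illustration only.
BUYERS = [
    {"age": 48, "gender": "male",   "income": 120_000, "marital": "married"},
    {"age": 47, "gender": "male",   "income": 125_000, "marital": "married"},
    {"age": 31, "gender": "female", "income": 68_000,  "marital": "single"},
    {"age": 33, "gender": "female", "income": 71_000,  "marital": "single"},
    {"age": 52, "gender": "female", "income": 140_000, "marital": "married"},
]


def raw_label(b):
    """Concatenate the raw target values into one class label."""
    return f'{b["age"]}_{b["gender"]}_{b["income"]}_{b["marital"]}'


def binned_label(b, age_step=10, income_step=50_000):
    """Same idea, but round age and income down to coarse bands first (an assumption)."""
    age_band = b["age"] // age_step * age_step
    income_band = b["income"] // income_step * income_step
    return f'{age_band}s_{b["gender"]}_{income_band}+_{b["marital"]}'


raw_classes = Counter(raw_label(b) for b in BUYERS)
binned_classes = Counter(binned_label(b) for b in BUYERS)

# With raw values nearly every buyer becomes a class of its own.
print(len(BUYERS), "rows ->", len(raw_classes), "raw classes")
print(len(BUYERS), "rows ->", len(binned_classes), "binned classes")
print(binned_classes)
```

    With raw numeric values almost every row ends up in its own class, which is the "lot of classes" problem in its extreme form. Coarser bands reduce the number of classes, at the cost of less precise labels.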

    In both cases you should choose interpretable, human-readable models such as decision trees, Naive Bayes, or a linear SVM, so that you can read the desired information directly from the model.
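
    As an illustration of what "reading information from the model" can look like, here is a count-based Naive Bayes sketch in plain Python. The rows are made up, the names ROWS, fit_naive_bayes and posterior are hypothetical, and no smoothing is applied, so an attribute value never seen with any label would cause a division by zero. It is a sketch under those assumptions, not how RapidMiner implements its operator.

```python
from collections import Counter, defaultdict

# Made-up (price_band, bedrooms, marital_status) rows for illustration only.
ROWS = [
    ("300k", "2-3", "single"),
    ("300k", "2-3", "single"),
    ("300k", "2-3", "single"),
    ("300k", "4+",  "married"),
    ("300k", "2-3", "married"),
    ("700k", "4+",  "married"),
    ("700k", "4+",  "married"),
    ("700k", "4+",  "married"),
    ("700k", "2-3", "married"),
    ("700k", "4+",  "single"),
]


def fit_naive_bayes(rows):
    """Count label frequencies and per-label attribute value frequencies.

    The last column of each row is the label; all others are nominal attributes.
    """
    priors = Counter(r[-1] for r in rows)
    cond = defaultdict(Counter)  # (attribute index, label) -> value counts
    for r in rows:
        for i, value in enumerate(r[:-1]):
            cond[(i, r[-1])][value] += 1
    return priors, cond


def posterior(model, attrs):
    """P(label | attrs) under the Naive Bayes independence assumption (no smoothing)."""
    priors, cond = model
    total = sum(priors.values())
    scores = {}
    for label, n in priors.items():
        p = n / total
        for i, value in enumerate(attrs):
            p *= cond[(i, label)][value] / n
        scores[label] = p
    z = sum(scores.values())
    return {label: s / z for label, s in scores.items()}


model = fit_naive_bayes(ROWS)
priors, cond = model

# The stored counts are directly readable, e.g. how many married buyers bought in each band.
print(dict(cond[(0, "married")]))

# Estimated marital status of a buyer of a 700k home with 4+ bedrooms.
post = posterior(model, ("700k", "4+"))
print(post)
```

    The count tables are the model, which is what makes it readable: each entry answers a question of the form "among buyers with this label, how often does this attribute value occur?".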

    Best,
    Marius
  • user197433 Member Posts: 2 Contributor I
    Thank you, this is helpful insight.