Strategy to model, then predict / impute with very sparse target attribute?

ben_h Member Posts: 17 Contributor II
edited November 2018 in Help
Please excuse the vague title. I am currently using an unsupervised SOM clustering approach to try to determine values for a target attribute that is mostly missing. I am using SOM for several reasons I won't go into now; however, I'm also open to other suggestions.

I have ~8000 observations of 10 attributes, the last of which (the target) is about 99.8% missing: it has only about 17 observed values, spread far apart. The other attributes are mostly complete, and I think I can handle their missing values simply with means and medians.
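As an illustration of the "means & medians" idea for the predictor attributes, here is a minimal Python sketch of per-column median imputation. The helper name `impute_median` and the toy rows are made up for this example, and missing values are assumed to be encoded as `None`; the real data layout is not given in this thread.

```python
from statistics import median

def impute_median(rows):
    """Replace None in each column with the median of that column's observed values.

    Hypothetical helper for illustration. Assumes every column has at least
    one observed value (median() raises StatisticsError on an empty list).
    """
    n_cols = len(rows[0])
    fills = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        fills.append(median(observed))
    return [[fills[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]

# Toy predictor table (invented values): two attributes, two gaps.
rows = [
    [1.0, 10.0],
    [2.0, None],
    [None, 30.0],
    [4.0, 20.0],
]
filled = impute_median(rows)
```

This only makes sense for attributes that are mostly complete. For the target, with roughly 17 observed values out of ~8000, a median fill would assign nearly every row the same value.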

The 'typical' workflow I am aware of from Wikipedia (!) is to split the data into training (66%) and test sets, train the SOM with the training set, and then map or predict with the test set on the trained SOM. In my case I am putting the entire data set into the SOM minus the target attribute (because it's mostly missing values), and then I don't know what to do from there.

I may be on the wrong track here, but if I have <20 observations with which to 'calibrate' my model, how do I follow this strategy?

I am not a statistician, and am finding it difficult to follow answers to other questions here and elsewhere, so please dumb down any response :)

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    Your task ranges from very hard to impossible from a data mining point of view: you basically want to build a model from only 17 observations, which most probably won't deliver good results.

    The general procedure in a case like this is the following:
    - split your data into a training set with labelled data and a set with unlabelled data. In your case, the training set consists of the examples that have non-missing values for the target
    - in the training set, declare the target attribute as the label (use Set Role)
    - train a model (and validate it with X-Validation)
    - apply the model to the rest of the data, which has no value for the target
    - you're done :)
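The steps above can be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions: the target is taken to be numeric, the predictors are already complete, a simple k-nearest-neighbour average stands in for "a model", and leave-one-out validation stands in for X-Validation. The function names and the toy data are invented for the example.

```python
import math

def knn_predict(train_X, train_y, x, k=3):
    """Average the targets of the k labelled rows closest to x (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    return sum(t for _, t in nearest) / len(nearest)

def leave_one_out_mae(X, y, k=3):
    """Hold out each labelled row in turn, predict it from the rest,
    and return the mean absolute error."""
    errors = []
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        errors.append(abs(knn_predict(rest_X, rest_y, X[i], k) - y[i]))
    return sum(errors) / len(errors)

# Toy data (invented): (predictors, target); None marks a missing target.
data = [
    ([0.0, 0.0], 1.0),
    ([0.1, 0.2], 1.2),
    ([1.0, 1.0], 3.0),
    ([0.9, 1.1], 3.1),
    ([0.5, 0.5], None),
    ([0.02, 0.05], None),
    ([1.0, 0.9], None),
]

# Step 1: split into labelled (training) and unlabelled rows.
labelled = [(x, t) for x, t in data if t is not None]
unlabelled = [x for x, t in data if t is None]

# Step 2: declare the target as the label.
X = [x for x, _ in labelled]
y = [t for _, t in labelled]

# Step 3: validate the model on the labelled rows only.
mae = leave_one_out_mae(X, y, k=1)

# Step 4: apply the model to the rows that have no target value.
predictions = [knn_predict(X, y, x, k=1) for x in unlabelled]
```

With so few labelled rows, each held-out row removes a noticeable share of the training data, so the validation error itself will be a noisy estimate.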

    However, as stated before, with only 17 observations you will get a model, but you won't get a good model. You should really try to get more training data!

    Best regards,
    Marius