I am setting up a text classification process and have a question. I want to classify a set of online comments (examples) into two classes (call them online behaviour A and online behaviour b). My plan is to put the classified comments into a growth curve model to see how the frequencies of behaviour A and behaviour b change over time.

It occurs to me that if behaviour A diminishes and behaviour b increases over time (as hypothesized), then there will be many comments that exhibit aspects of both behaviours. My thinking is that if I calculate the probability of class membership (in A and/or b) for each comment, I will capture the examples that fall into both classes, and I can then use a cutoff to select comments for the growth curve model.

My question is this: when developing the training set, do I label only comments that I consider either A or b (and let RapidMiner assign class membership percentages to all comments on this basis), OR should I also label training comments that I consider as belonging to both A and b? I am assuming a binomial classification (A or b), but wonder if I need multi-label classification with a third class representing a blend of A and b, with blended comments identified in the training set. I would appreciate any insight you can provide.
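To make the cutoff idea concrete, here is a rough sketch in plain Python of what I have in mind. It is not RapidMiner code. It assumes a binary classifier has already produced a probability of class A for each comment, so that the probability of b is one minus that value. The function name `assign_labels`, the thresholds 0.4 and 0.6, and the example probabilities are all made up for illustration.

```python
def assign_labels(p_a, low=0.4, high=0.6):
    """Map a binary classifier's P(A) to 'A', 'b', or 'both'.

    Assumption: P(b) = 1 - P(A). Comments whose P(A) falls inside the
    band [low, high] are treated as blends of the two behaviours.
    The default thresholds are placeholders, not recommended values.
    """
    if not 0.0 <= p_a <= 1.0:
        raise ValueError("p_a must be a probability in [0, 1]")
    if p_a > high:
        return "A"
    if p_a < low:
        return "b"
    return "both"


# Hypothetical P(A) values for four comments (not real data).
probabilities = [0.95, 0.55, 0.10, 0.40]
labels = [assign_labels(p) for p in probabilities]
print(labels)
```

With this approach the training set would only contain comments labelled A or b, and "both" would be inferred from uncertain predictions. The alternative I am asking about would instead put explicitly blended comments into the training data.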
