Text mining for open-ended questions?

mafern76 Member Posts: 45 Contributor II
Hi, how are you?

I was recently presented with the following problem: a company has around 20.000 answers to one open question, "What are the aspects of your work you like the most?", and they would like to analyze those answers.

I have already analyzed around 300 of them manually and come up with several flags, for example HELPING_CUSTOMERS, SHORT_HOURS, etc.

My idea was to simply make a model for each flag and predict the remaining 20.000 answers, obtaining percentages regarding how many employees value each flag.

1. I was wondering whether there is another approach to this, and what its advantage would be over simply taking a sample of the 20.000, computing percentages and extrapolating those statistically, without any text-based predictive model.

2. Another valid question would be what the difference is between text mining and simply a tag cloud, but that is something that remains to be seen and I guess it depends on each individual problem. For example, a more neutral question like "What do you think about your job?" may contain positive and negative sentiments using the same words, but right now I'm working on a question biased towards receiving positive sentiments.

Thanks a lot for your insight, I'll make sure to share mine!

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,127  RM Data Scientist
    Hello!

    My idea was to simply make a model for each flag and predict the remaining 20.000 answers, obtaining percentages regarding how many employees value each flag.
    Sounds reasonable

    1. I was wondering whether there is another approach to this, and what its advantage would be over simply taking a sample of the 20.000, computing percentages and extrapolating those statistically, without any text-based predictive model.
    Do you mean you would simply extrapolate the percentages you found in the 300 examples to the 20.000? That would introduce big errors.

    Let's assume you have a lot of classes and worked manually through 100 examples. 9 of them are class A.
    The standard deviation of this count is 3, because it is Poisson distributed (the standard deviation is the square root of the count). Thus you have a relative error of about 33%. This relative error stays the same when you scale the count up.
    Let's say you have 10.000 examples in your full dataset. You would predict 900 for class A. Including the error you have to say: in ~68% of the cases, 900 +/- 300 persons are in class A. This is not a strong answer!
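
The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch under the same assumption as above (the count is Poisson distributed, so its standard deviation is the square root of the count); the function name is made up for illustration and is not a RapidMiner operator.

```python
import math

def scaled_count_estimate(sample_hits, sample_size, population_size):
    """Extrapolate a sample count to the full dataset with a 1-sigma (~68%) band.

    Assumes a Poisson-distributed count: sigma = sqrt(count).
    """
    sigma = math.sqrt(sample_hits)           # 9 hits -> sigma = 3
    relative_error = sigma / sample_hits     # 3 / 9 -> about 0.33
    scale = population_size / sample_size    # 10,000 / 100 -> 100
    estimate = sample_hits * scale           # 9 * 100 -> 900
    return estimate, estimate * relative_error, relative_error

estimate, error, relative_error = scaled_count_estimate(9, 100, 10_000)
print(f"{estimate:.0f} +/- {error:.0f} (relative error {relative_error:.0%})")
# prints: 900 +/- 300 (relative error 33%)
```

The relative error depends only on the number of hits in the hand-labelled sample, which is why scaling up to the full dataset does not make the estimate any tighter.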

    The classification, on the other hand, might give far more accurate results.

    2. Another valid question would be what the difference is between text mining and simply a tag cloud, but that is something that remains to be seen and I guess it depends on each individual problem. For example, a more neutral question like "What do you think about your job?" may contain positive and negative sentiments using the same words, but right now I'm working on a question biased towards receiving positive sentiments.
    Have you thought about using a clustering algorithm like k-Means with cosine similarity on your data? Might be worth a try to find groups of answers.
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
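
To make the clustering suggestion concrete, here is a minimal pure-Python sketch of k-means with cosine similarity (often called spherical k-means). It is not the RapidMiner operator: the whitespace tokenisation, the raw term counts, the example answers and the explicit `seeds` argument (indices of the answers used as starting centroids) are all assumptions made to keep the example short and deterministic.

```python
import math
from collections import Counter, defaultdict

def unit_vector(text):
    """Bag-of-words counts scaled to length 1, so a dot product is a cosine."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def mean_direction(vectors):
    """Centroid of a cluster: the summed vectors, rescaled to length 1."""
    total = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()} if norm else {}

def cosine_kmeans(texts, seeds, iterations=10):
    """Assign each text to the most similar centroid, then update centroids."""
    vectors = [unit_vector(t) for t in texts]
    centroids = [vectors[i] for i in seeds]
    labels = []
    for _ in range(iterations):
        labels = [max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
                  for v in vectors]
        for k in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:
                centroids[k] = mean_direction(members)
    return labels

# Hypothetical answers, invented for illustration only.
answers = [
    "helping customers every day",
    "i like helping customers",
    "short hours and flexible hours",
    "the short hours",
]
labels = cosine_kmeans(answers, seeds=(0, 2))
print(labels)  # [0, 0, 1, 1]
```

In practice the starting centroids would be chosen at random and the vectors would usually be TF-IDF weighted after stemming and pruning.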
  • mafern76 Member Posts: 45 Contributor II
    Hi mschmitz, thanks for your answer!

    Well in my case I found the classes somewhat hard to predict.

    For example, with 30 examples of class A in those 300, it's a 30/270 split. For speed I tried default Naive Bayes and Decision Tree with 10-fold cross-validation (10-VAL).

    Sometimes the models were bad overall; sometimes class precision was good (90%+) but recall was mediocre, around 50%. So even with the good models, the results wouldn't be very simple to interpret... would I predict all the examples, get a number, multiply it by 0.9 and then by 2? I really don't have the knowledge to decide whether I should trust a 10-VAL more than a simple 95% confidence interval... and for classes that can't get good models I simply use the confidence interval. I don't know how to communicate an X-VAL confidence interval.

    What do you think?

    Thanks!

    I haven't tried unsupervised methods, but I might give them a try if I have time. As always, time is a constraint.
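
For reference, the two calculations mentioned in this post can be written down as a short Python sketch. The first applies the "multiply by 0.9 and then by 2" correction (predicted positives times precision, divided by recall), which is only valid if the cross-validated precision and recall carry over to the unlabelled answers. The second is the usual normal-approximation 95% confidence interval for a sample proportion. The function names and the figure of 1,000 predicted positives are hypothetical.

```python
import math

def adjusted_positive_count(predicted_positives, precision, recall):
    """Estimate the true number of positives from a model's predicted positives.

    precision scales predictions down to true positives; dividing by recall
    adds back the positives the model missed.
    """
    true_positives = predicted_positives * precision
    return true_positives / recall

def proportion_confidence_interval(hits, n, z=1.96):
    """Normal-approximation confidence interval for a proportion (z=1.96 -> 95%)."""
    p = hits / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Hypothetical: the model flags 1,000 answers, with precision 0.9 and recall 0.5.
adjusted = adjusted_positive_count(1000, precision=0.9, recall=0.5)
print(f"{adjusted:.0f}")  # prints: 1800

# 30 class A answers among the 300 labelled by hand.
low, high = proportion_confidence_interval(30, 300)
print(f"{low:.3f} to {high:.3f}")  # prints: 0.066 to 0.134
```

The adjusted count inherits the uncertainty of the precision and recall estimates themselves, which come from only about 30 positive examples here.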
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,127  RM Data Scientist
    Hello mafern,

    Sorry for replying so late. Personally, I would trust the X-Val!

    I would definitely try other models. A decision tree is not really good on text data. I would recommend an SVM with a radial (RBF) kernel. Be sure to optimize gamma and C there.
  • mafern76 Member Posts: 45 Contributor II
    Thank you for your answer!

    How would you select features for the radial SVM in text mining?
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,127  RM Data Scientist
    Hi,

    First of all, I would use stemming and pruning. Pruning is an option of the Process Documents operator, while there are several stemming operators available.

    How many attributes are left after stemming and pruning? If it is something like 500, I would try to run on those. Otherwise my first approach would be Weight by SVM, using those weights for feature selection. As always: "There is no free lunch".

    Best,

    Martin
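
As an illustration of what pruning does, here is a small Python sketch that drops tokens occurring in too few or too many documents. It is not the Process Documents operator itself: the function name, the thresholds and the example answers are assumptions, and stemming is left out.

```python
from collections import Counter

def prune_vocabulary(documents, min_docs=2, max_fraction=0.9):
    """Keep tokens found in at least min_docs documents and in at most
    max_fraction of all documents."""
    doc_freq = Counter()
    for doc in documents:
        # set() so a token counts once per document, however often it repeats
        doc_freq.update(set(doc.lower().split()))
    upper_limit = max_fraction * len(documents)
    return sorted(t for t, df in doc_freq.items() if min_docs <= df <= upper_limit)

# Hypothetical answers, invented for illustration only.
docs = [
    "i like helping customers",
    "helping customers is what i like",
    "i like the short hours",
    "short hours i like",
]
vocabulary = prune_vocabulary(docs)
print(vocabulary)  # ['customers', 'helping', 'hours', 'short']
```

Tokens that appear in every answer ("i", "like") or in only one ("is", "what", "the") carry little information for separating the flags, so removing them shrinks the attribute set before any model is trained.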
  • mafern76 Member Posts: 45 Contributor II
    Hi, thanks for your answer!

    Yes, I did some stemming and pruning, and I also removed attributes with a correlation higher than 0.99. I ended up with about 800...

    I used Naive Bayes and Decision Tree to deal with the number of attributes.

    I could give SVM weighting a try before optimizing the proper SVM...

    Thanks!!
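
The correlation filter mentioned in this post can be sketched in plain Python as follows. This is a greedy version that keeps an attribute only if its absolute Pearson correlation with every attribute kept so far is at most the threshold; using the absolute value, the column order and the example columns are all assumptions, not a description of the RapidMiner operator.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)

def drop_correlated(columns, threshold=0.99):
    """Return the names of columns to keep, scanning in insertion order."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical term counts per answer, invented for illustration only.
columns = {
    "customer": [1, 0, 2, 0],
    "customers": [2, 0, 4, 0],   # exactly twice "customer", so correlation 1
    "hours": [0, 1, 0, 3],
}
kept = drop_correlated(columns)
print(kept)  # ['customer', 'hours']
```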