evaluating text

MarkusW Member Posts: 22 Contributor I
edited September 2021 in Help
Hi,
I'm trying to test how well a simple machine does at predicting a property of a text (specifically sarcasm).
I have my data in a massive table, where one column is the source, one is the label that should be predicted, and the last column is the text the algorithm(s) should analyze.
The problem is that, without some tool to extract meaning or sentiment, the results are (not surprisingly) abysmal.
Both the promotional texts on the RapidMiner main page and the professor who suggested I use RapidMiner imply that such tools are already part of RapidMiner; however, I have not yet found anything in the documentation/manual.

What are these tools called, and how are they used?

Best Answers

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Solution Accepted
    Hi @MarkusW,

    RapidMiner has a Marketplace that you can find in the menu ("Extensions"). There you will find the Text Processing and Web Mining extensions.

    There's a full Text Mining course in the Academy:
    https://academy.rapidminer.com/courses/text-and-web-mining-with-rapidminer

    Regards,
    Balázs
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Solution Accepted
    Hi!

    Yes, sarcasm detection is a big challenge and simple models don't cut it.

    Have you seen "Automatic Classification of Documents" in the Academy course? 

    It explains the Process Documents operator. The only addition you would need here is "Generate n-Grams (Terms)". This will create new attributes of term combinations like "not very good" and "i really liked it". Of course, all combinations of consecutive words will be created, so this gives you a massive number of new attributes. This may or may not help you with the sarcasm.

    Naive Bayes and SVM are the modeling algorithms well suited for this situation. Other algorithms will take ages and don't perform well on this kind of data, with the possible exception of Deep Learning, but you'll need massive resources to execute that.

    Regards,
    Balázs
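
To make the n-gram idea above concrete, here is a minimal sketch in plain Python. It is not the RapidMiner operator itself; the function names `tokenize` and `term_ngrams` and the `_` separator are assumptions chosen for illustration. It shows how every run of consecutive words becomes its own term, which is why the number of attributes grows so quickly.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def term_ngrams(tokens, max_n=2):
    """Return every run of 1..max_n consecutive tokens, joined with '_'."""
    terms = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            terms.append("_".join(tokens[i:i + n]))
    return terms

# A three-word text already yields six terms when max_n is 3.
print(term_ngrams(tokenize("Not very good"), max_n=3))
```

Each distinct term would then become one attribute (column) of the example set, with a count or weight per document.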

Answers

  • MarkusW Member Posts: 22 Contributor I
    Thanks @BalazsBarany for the quick response.
    It seems MeaningCloud Text Analytics is exactly what I'm looking for (though I'll need a while to actually use it).
  • MarkusW Member Posts: 22 Contributor I
    Okay @BalazsBarany, I'm afraid the course you sent me has two problems:
    One, it seems to be outdated, since it wants me to use the "Extract Content" operator (without actually explaining it), but no such operator exists in my version of RapidMiner.
    I assume the equivalent has a different name.
    The second problem is that it seems to have a different target than what I need. The course only shows how to handle a table with a single column of text and how to do superficial analysis on it.
    What I have is a table with multiple columns, only one of which contains text; another contains the label that is to be predicted.
    If I just let Auto Model go, it'll just look at the correlation of words. What I need is something that can analyze the CONTENT of the relevant column to a degree (there's a reason I tagged it as sentiment analysis).
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi!

    Extract Content is in the Web Mining extension.

    In the Operator Toolbox you have two sentiment-related operators; these work under some conditions (language etc.). You can take a look at them.

    If they are not good enough for your content, you'll need to build a sentiment model yourself using the methods in the Academy course. Sentiment will be the label here; if you don't have the labels yet, you'll need to score a couple of hundred typical texts yourself and use the manually assigned sentiment as the label. Then you would predict the sentiment in the first step, change the result to a normal attribute, and then use your label together with this new attribute.

    "Analyzing the content" is a very human-like activity. Text mining methods work by looking at terms or combinations of terms. You have full control over the process in RapidMiner, or you use an external service that does similar things in the background.

    Regards,
    Balázs
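
The two-step idea described above (predict sentiment first, then keep the prediction as a normal attribute next to the real label) can be sketched in plain Python. This is only an illustration under loud assumptions: the word lists `POSITIVE` and `NEGATIVE` are hypothetical stand-ins for a trained sentiment model, and `predict_sentiment` and `add_sentiment_attribute` are made-up names, not RapidMiner operators.

```python
import re

# Hypothetical word lists standing in for a real, trained sentiment model.
POSITIVE = {"great", "love", "wonderful"}
NEGATIVE = {"broken", "late", "terrible"}

def predict_sentiment(text):
    """Step one: return 'positive', 'negative' or 'neutral' for a text."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def add_sentiment_attribute(rows):
    """Copy each row and store the prediction as a regular attribute."""
    return [dict(row, sentiment=predict_sentiment(row["text"])) for row in rows]

# Toy rows in the source/label/text layout; the texts are invented.
rows = [
    {"source": "forum", "label": "sarc", "text": "Wonderful, I love waiting"},
    {"source": "forum", "label": "notsarc", "text": "The train is terrible"},
]
for row in add_sentiment_attribute(rows):
    print(row["label"], row["sentiment"], row["text"])
```

The second model would then be trained on the original label, with `sentiment` available as one more input attribute alongside the terms.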
  • MarkusW Member Posts: 22 Contributor I
    edited September 2021
    Hi again @BalazsBarany,
    You don't happen to know the names of these operators? Are you referring to "sentiment analysis" and "aspect based sentiment analysis"?
    Prefacing the Auto Model function with a simple text-mining method, so it'll look at terms instead of single words, would already be a huge step forward.
    I'm afraid hand-training a sentiment-analysis algorithm (since I do not have an appropriately labeled dataset) to preface the sarcasm detection would go far beyond what I could achieve within a few weeks.
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi,

    If you want to detect sarcasm as the label in your data but you don't have labeled data, then you can't use classical data mining here.

    You might be able to find a company that offers sarcasm detection as a service and use that. Or, if you really need this for a company, you could get some assistants to label a couple of hundred documents/texts so you can bootstrap a model.

    RapidMiner will help you when you have a labeled data set. The text mining operators are described in the Academy text mining course. You can use terms (n-grams) in the process.

    Regards,
    Balázs
  • MarkusW Member Posts: 22 Contributor I
    Thanks @BalazsBarany, for your patience.
    Let me try one last time to explain my situation: I have a table. Column one is the source, column two is the label (sarcasm/notsarcasm), and the last column is the text.
    I want to see how well a machine that I can train in a day on my laptop does at predicting column two.
    If I just use Auto Model, it does generate machines, but these machines are really bad, because they only look at the correlation between single words and the label.
    What I'd like to do is preface Auto Model with any kind of text processing whatsoever. The manual, the Academy, and the documentation from RapidMiner are of little to no help.
    I can't train any sentiment analysis other than sarcasm/not sarcasm, because I only have the label sarc/notsarc.
    There is an operator "Sentiment Analysis", but neither the manual nor the documentation says what it does or how to incorporate it into Auto Model.
    The tutorial you sent me is only good if I want to do exactly what it did, because it does not explain how, only what.

    Regards Markus
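
For readers with the same three-column layout (source, label, text), the setup discussed in this thread, word and n-gram terms fed into Naive Bayes and scored on held-out rows, can be sketched outside RapidMiner in plain Python. This is a minimal sketch and not the Auto Model pipeline: the helper names (`features`, `train_nb`, `predict_nb`, `accuracy`) and the toy rows are assumptions made up for illustration, and add-one smoothing is one common choice among several.

```python
import math
import re
from collections import Counter, defaultdict

def features(text, max_n=2):
    """Tokenize and return all terms of 1..max_n consecutive words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return ["_".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def train_nb(rows):
    """rows: (source, label, text) tuples. Count documents and terms per label."""
    class_docs, term_counts, vocab = Counter(), defaultdict(Counter), set()
    for _source, label, text in rows:
        class_docs[label] += 1
        for term in features(text):
            term_counts[label][term] += 1
            vocab.add(term)
    return class_docs, term_counts, vocab

def predict_nb(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)
        for term in features(text):
            if term in vocab:  # terms never seen in training are ignored
                count = term_counts[label][term]
                score += math.log((count + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, rows):
    """Share of rows whose predicted label matches the given label."""
    rows = list(rows)
    hits = sum(predict_nb(model, text) == label for _s, label, text in rows)
    return hits / len(rows)

# Invented toy data; a real run needs far more rows per label.
train_rows = [
    ("forum", "sarc", "oh great another monday"),
    ("forum", "sarc", "oh great just what i needed"),
    ("forum", "notsarc", "the meeting is on monday"),
    ("forum", "notsarc", "i needed a new laptop"),
]
test_rows = [
    ("blog", "sarc", "oh great another laptop"),
    ("blog", "notsarc", "the laptop is new"),
]
model = train_nb(train_rows)
print("holdout accuracy:", accuracy(model, test_rows))
```

The source column is carried along but not used as an input here; whether it should be is a modeling decision the thread leaves open.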