"text mining (classification)"

mksaad Member Posts: 42 Maven
edited May 2019 in Help
Hello all,

I have read many tutorials on text mining (TM), including tutorials on TM with RapidMiner (RM).

Most of these tutorials use support vector machines (SVM) and Naive Bayes (NB) as classification methods, so I conclude that they are the best algorithms for text classification.
Do you recommend these algorithms, or are there other suitable algorithms for text classification? (I am looking for algorithms that are implemented in RM.)
If SVM and NB are the best ones, any references on that would be appreciated.


I would also appreciate any recommendations for RM clustering algorithms for text.


Thanks in advance,
--
Motaz K. Saad

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I would suggest any clustering algorithm that supports cosine similarity. And, as always, k-Means is worth a try.

    Greetings,
      Sebastian
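
To make the cosine similarity suggestion concrete, here is a minimal sketch in plain Python. It is not RapidMiner's implementation: the function name `cosine_similarity` and the whitespace tokenization are assumptions made for illustration. It compares the word-count vectors of two texts, so the result depends on which words the texts share and in what proportions, not on how long the texts are.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts.

    Hypothetical helper for illustration; tokenization is a plain
    lowercase whitespace split.
    """
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Dot product over the words that occur in both texts.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())

    # Euclidean length of each count vector.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

Two texts with the same word proportions score close to 1 even if one is twice as long, and texts with no words in common score 0. That length independence is the usual reason this measure is preferred over Euclidean distance for documents.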
  • gunjanamit Member Posts: 28 Contributor II
    Motaz,

    Have you done anything on Text Classification?

    I need help there...
  • mksaad Member Posts: 42 Maven
    Hello,

    You can take a look at http://sites.google.com/site/motazsite/publications

    There you can find conclusions on Arabic text classification and on text classification in general.


    Regards,
    Motaz
  • jforr Member Posts: 7 Contributor II
    Is there a good algorithm to use when my documents can have multiple categories assigned to them? An example might be resumes, where some belong to Java developers, some to SQL developers, and some to developers who know both Java and SQL.
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi, you can use the Polynominal by Binominal Classification operator for this. It uses its inner process to train models that each try to discriminate between one class and all other classes. During application, a confidence is calculated for each class, and the class with the highest confidence is predicted. Please have a look at the attached process.

    Best, Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.006">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.006" expanded="true" name="Process">
        <process expanded="true" height="494" width="752">
          <operator activated="true" class="generate_data" compatibility="5.2.006" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
            <parameter key="target_function" value="three ring clusters"/>
            <parameter key="number_of_attributes" value="2"/>
          </operator>
          <operator activated="true" class="polynomial_by_binomial_classification" compatibility="5.2.006" expanded="true" height="76" name="Polynominal by Binominal Classification" width="90" x="246" y="30">
            <process expanded="true" height="512" width="770">
              <operator activated="true" class="naive_bayes" compatibility="5.2.006" expanded="true" height="76" name="Naive Bayes" width="90" x="313" y="30"/>
              <connect from_port="training set" to_op="Naive Bayes" to_port="training set"/>
              <connect from_op="Naive Bayes" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="apply_model" compatibility="5.2.006" expanded="true" height="76" name="Apply Model" width="90" x="461" y="30">
            <list key="application_parameters"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Polynominal by Binominal Classification" to_port="training set"/>
          <connect from_op="Polynominal by Binominal Classification" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Polynominal by Binominal Classification" from_port="example set" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Apply Model" from_port="labelled data" to_port="result 2"/>
          <connect from_op="Apply Model" from_port="model" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
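
The one-class-against-all-others idea described above can be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions, not the operator's actual implementation: `train_binary` is a hypothetical toy learner that stands in for the inner process (Naive Bayes in the attached process), and `train_one_vs_rest` and `predict` are names invented here.

```python
from collections import Counter


def train_binary(docs, is_positive):
    """Toy binary learner (hypothetical stand-in for the inner process).

    Returns a function that maps a text to a confidence in (0, 1) that
    the text belongs to the positive class.
    """
    pos, neg = Counter(), Counter()
    for text, flag in zip(docs, is_positive):
        (pos if flag else neg).update(text.lower().split())

    def confidence(text):
        words = text.lower().split()
        if not words:
            return 0.0
        # Smoothed share of each word's occurrences that were positive,
        # averaged over the words of the text.
        return sum((pos[w] + 1) / (pos[w] + neg[w] + 2) for w in words) / len(words)

    return confidence


def train_one_vs_rest(docs, labels):
    """Train one binary model per class: that class against all others."""
    return {
        cls: train_binary(docs, [label == cls for label in labels])
        for cls in sorted(set(labels))
    }


def predict(models, text):
    """Return the class with the highest confidence plus all confidences."""
    scores = {cls: model(text) for cls, model in models.items()}
    return max(scores, key=scores.get), scores


# Toy usage with made-up training texts.
docs = ["java spring developer", "sql database developer", "python data analyst"]
labels = ["java", "sql", "python"]
models = train_one_vs_rest(docs, labels)
label, scores = predict(models, "java spring")  # label is "java" for this data
```

Because `predict` also returns the per-class confidences, a document that may carry several categories could in principle be given every class whose confidence exceeds a chosen threshold. That is an assumption of this sketch, not something the attached process does.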
  • jforr Member Posts: 7 Contributor II
    Thanks, I'll try that.