RapidMiner

Auto Categorization of documents

Contributor II


 

Can you guide me on auto-categorization of documents?

So, in our DB we have many long ticket descriptions (email conversations, resolution data, etc.). I need to train a classifier so that any new incoming ticket is automatically categorized into the right category.

 

Steps taken so far:

1) Tried unsupervised learning, to form clusters of words.

2) Used a Naive Bayes classifier, but here I had to manually label the training data set.

 

Can you suggest any way to auto-label the text so that it can be used as training data?

 

Eagerly looking for your help.


11 REPLIES
Community Manager

Re: Auto Categorization of documents

Do a search through the community forums for some sample processes; that will get you started.

Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Contributor II

Re: Auto Categorization of documents

Thanks Master.

 

Is it possible to get the clusters (with their keywords) and then classify new text into the respective cluster?

Community Manager

Re: Auto Categorization of documents


Yes. 


Give this a try.

 

<?xml version="1.0" encoding="UTF-8"?>
<process version="7.3.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="34">
        <parameter key="connection" value="Twitter Connection"/>
        <parameter key="query" value="DonaldTrump"/>
        <parameter key="limit" value="1000"/>
        <parameter key="language" value="en"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Text|Id"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.3.001" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.3.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="7.3.000" expanded="true" height="68" name="Tokenize" width="90" x="313" y="34"/>
          <operator activated="true" class="text:transform_cases" compatibility="7.3.000" expanded="true" height="68" name="Transform Cases" width="90" x="447" y="34"/>
          <operator activated="true" class="text:filter_by_length" compatibility="7.3.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="581" y="34"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.3.001" expanded="true" height="103" name="Multiply" width="90" x="514" y="210"/>
      <operator activated="true" class="transpose" compatibility="7.3.001" expanded="true" height="82" name="Transpose" width="90" x="715" y="30"/>
      <operator activated="true" class="x_means" compatibility="7.3.001" expanded="true" height="82" name="X-Means" width="90" x="849" y="30"/>
      <operator activated="true" class="select_attributes" compatibility="7.3.001" expanded="true" height="82" name="Select Attributes (3)" width="90" x="648" y="300">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="cluster"/>
        <parameter key="attributes" value="|sentiment|cluster"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="numerical_to_binominal" compatibility="7.3.001" expanded="true" height="82" name="Numerical to Binominal" width="90" x="648" y="390"/>
      <operator activated="true" class="fp_growth" compatibility="7.3.001" expanded="true" height="82" name="FP-Growth" width="90" x="648" y="493">
        <parameter key="find_min_number_of_itemsets" value="false"/>
        <parameter key="min_number_of_itemsets" value="10"/>
        <parameter key="min_support" value="0.6"/>
        <parameter key="max_items" value="5"/>
      </operator>
      <operator activated="true" class="create_association_rules" compatibility="7.3.001" expanded="true" height="82" name="Create Association Rules" width="90" x="782" y="480"/>
      <operator activated="true" class="item_sets_to_data" compatibility="7.3.001" expanded="true" height="82" name="Item Sets to Data" width="90" x="916" y="544"/>
      <connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Transpose" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Select Attributes (3)" to_port="example set input"/>
      <connect from_op="Transpose" from_port="example set output" to_op="X-Means" to_port="example set"/>
      <connect from_op="X-Means" from_port="cluster model" to_port="result 1"/>
      <connect from_op="Select Attributes (3)" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
      <connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
      <connect from_op="FP-Growth" from_port="frequent sets" to_op="Create Association Rules" to_port="item sets"/>
      <connect from_op="Create Association Rules" from_port="rules" to_port="result 2"/>
      <connect from_op="Create Association Rules" from_port="item sets" to_op="Item Sets to Data" to_port="frequent item sets"/>
      <connect from_op="Item Sets to Data" from_port="example set" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Contributor II

Re: Auto Categorization of documents


My use case goes like this:

 

 

1) Long descriptions of problem statements (uncategorized).

2) Form categories out of them, based on keywords/phrases/POS tagging.

3) Assign the above categories to new incoming text.

Contributor II

Re: Auto Categorization of documents

I have the cluster model in place, with the keywords and the cluster each falls into.

Now, how can I take this unsupervised learning result and build a supervised learning model that classifies an incoming text into a cluster or category?

Elite III

Re: Auto Categorization of documents

You can add the cluster as a label (there's an option for that), and then use that label to build a predictive model, if you want to try to replicate document classification into those same clusters in the future.
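The cluster-as-label idea can be sketched outside RapidMiner as well. The snippet below is a minimal illustration using scikit-learn (an assumption, since the thread is about RapidMiner operators), and the ticket texts are invented placeholders, not real data from the poster's DB: cluster the vectorized texts, treat the cluster ids as labels, then train a Naive Bayes model that routes new tickets to a cluster.

```python
# Sketch of "use the cluster as a label": unsupervised step produces the
# labels, supervised step learns to reproduce them for new documents.
# Ticket texts are invented placeholders for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB

tickets = [
    "server cpu usage is very high",
    "cpu load spiked on the app server",
    "user cannot reset password for login",
    "password reset link not working at login",
]

# Step 1: vectorize and cluster (unsupervised).
vec = TfidfVectorizer()
X = vec.fit_transform(tickets)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 2: treat the cluster ids as class labels and train a classifier.
clf = MultinomialNB().fit(X, km.labels_)

# Step 3: assign a new incoming ticket to a cluster/category.
new_ticket = vec.transform(["cpu utilization alarm on server"])
predicted_cluster = clf.predict(new_ticket)[0]
```

In RapidMiner the equivalent is the "add cluster attribute" option plus Set Role to turn `cluster` into the label before feeding a learner.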

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
Contributor II

Re: Auto Categorization of documents

Can you please elaborate on the steps after clustering is done (Clustered Set and Cluster Model)? How can we make use of these to do supervised learning?

Elite III

Re: Auto Categorization of documents

Once you have the cluster assigned as a label, you would then have a dataset that you could use with any of the standard machine learning approaches to classification.  There are a number of helpful RapidMiner video tutorials on building such models available in the resources on this site: https://rapidminer.com/getting-started-central/

There are also guided processes available directly from within RapidMiner (just click on the "Learn" button on the splash screen after startup).
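Once the cluster ids are attached as labels, the dataset really can be treated like any ordinary classification problem, including validation. A small sketch (scikit-learn assumed; texts and labels are invented stand-ins for cluster-labeled tickets):

```python
# Sketch: cross-validate a classifier trained on cluster-derived labels,
# exactly as one would for a manually labeled dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = [
    "disk is full on database server", "database disk space alert",
    "vpn connection drops frequently", "vpn tunnel keeps disconnecting",
] * 3  # repeated so 3-fold cross-validation has enough examples per class
labels = ["cluster_0", "cluster_0", "cluster_1", "cluster_1"] * 3

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(model, texts, labels, cv=3)
```

In RapidMiner this corresponds to wrapping the learner in a Cross Validation operator after Set Role has made `cluster` the label.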

 

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
Contributor II

Re: Auto Categorization of documents

Hello Brian, 

 

 

So the use case, in a simple sense, is a combination of supervised and unsupervised learning, similar to topic modelling (LDA).

 

1) Documents (rows) typically contain a 'detailed description' of a problem pertaining to some field [e.g. CPU usage, memory issue, network error]. This data is untagged (unlabelled).

2) Now we need to find keywords (n-grams, POS) for each category and build a rule book which says that these kinds of words/phrases fall into a certain category (in short, we are clustering by fetching relevant words/phrases for each category; unsupervised learning).

3) Based on the above step, we want to tag a new incoming document by analysing its content (keywords/phrases) (supervised learning).