RapidMiner

Auto Categorization of documents

Contributor II


 

Can you guide me on auto-categorization of documents?

So, in our DB we have many long ticket descriptions (email conversations, resolution data, etc.). I need to train a classifier so that any new incoming ticket is automatically categorized into the right category.

 

Steps taken so far:

1) Tried unsupervised learning, to form clusters of words.

2) Used a Naive Bayes classifier, but here I had to manually label the training data set.

 

Can you suggest any way to auto-label the text so that it can be used as training data?

 

Eagerly looking for your help.


11 REPLIES
Community Manager

Re: Auto Categorization of documents

Do a search through the community forums for some sample processes; that will get you started.

Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Contributor II

Re: Auto Categorization of documents

Thanks Master.

 

Is it possible to get the clusters (with their keywords) and then classify new text into the respective cluster?

Community Manager

Re: Auto Categorization of documents


Yes. 


Give this a try.

 

<?xml version="1.0" encoding="UTF-8"?>
<process version="7.3.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="34">
        <parameter key="connection" value="Twitter Connection"/>
        <parameter key="query" value="DonaldTrump"/>
        <parameter key="limit" value="1000"/>
        <parameter key="language" value="en"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Text|Id"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.3.001" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.3.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="7.3.000" expanded="true" height="68" name="Tokenize" width="90" x="313" y="34"/>
          <operator activated="true" class="text:transform_cases" compatibility="7.3.000" expanded="true" height="68" name="Transform Cases" width="90" x="447" y="34"/>
          <operator activated="true" class="text:filter_by_length" compatibility="7.3.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="581" y="34"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.3.001" expanded="true" height="103" name="Multiply" width="90" x="514" y="210"/>
      <operator activated="true" class="transpose" compatibility="7.3.001" expanded="true" height="82" name="Transpose" width="90" x="715" y="30"/>
      <operator activated="true" class="x_means" compatibility="7.3.001" expanded="true" height="82" name="X-Means" width="90" x="849" y="30"/>
      <operator activated="true" class="select_attributes" compatibility="7.3.001" expanded="true" height="82" name="Select Attributes (3)" width="90" x="648" y="300">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="cluster"/>
        <parameter key="attributes" value="|sentiment|cluster"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="numerical_to_binominal" compatibility="7.3.001" expanded="true" height="82" name="Numerical to Binominal" width="90" x="648" y="390"/>
      <operator activated="true" class="fp_growth" compatibility="7.3.001" expanded="true" height="82" name="FP-Growth" width="90" x="648" y="493">
        <parameter key="find_min_number_of_itemsets" value="false"/>
        <parameter key="min_number_of_itemsets" value="10"/>
        <parameter key="min_support" value="0.6"/>
        <parameter key="max_items" value="5"/>
      </operator>
      <operator activated="true" class="create_association_rules" compatibility="7.3.001" expanded="true" height="82" name="Create Association Rules" width="90" x="782" y="480"/>
      <operator activated="true" class="item_sets_to_data" compatibility="7.3.001" expanded="true" height="82" name="Item Sets to Data" width="90" x="916" y="544"/>
      <connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Transpose" to_port="example set input"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Select Attributes (3)" to_port="example set input"/>
      <connect from_op="Transpose" from_port="example set output" to_op="X-Means" to_port="example set"/>
      <connect from_op="X-Means" from_port="cluster model" to_port="result 1"/>
      <connect from_op="Select Attributes (3)" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
      <connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
      <connect from_op="FP-Growth" from_port="frequent sets" to_op="Create Association Rules" to_port="item sets"/>
      <connect from_op="Create Association Rules" from_port="rules" to_port="result 2"/>
      <connect from_op="Create Association Rules" from_port="item sets" to_op="Item Sets to Data" to_port="frequent item sets"/>
      <connect from_op="Item Sets to Data" from_port="example set" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Contributor II

Re: Auto Categorization of documents


My use case goes like this:

 

 

1) Long descriptions of problem statements (uncategorized).

2) Form categories out of them, based on keywords/phrases/POS tagging.

3) Assign the above categories to new incoming text.

Contributor II

Re: Auto Categorization of documents

I have the cluster model in place, with the keywords and the cluster each falls into.

Now, how can I take this unsupervised learning result and build a supervised learning model that classifies an incoming text into a cluster or category?

Elite III

Re: Auto Categorization of documents

You can add the cluster as a label (there's an option for that), and then use that label to build a predictive model, if you want to try to replicate document classification into those same clusters in the future.
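The cluster-as-label idea can be sketched outside RapidMiner as well. The snippet below is a minimal illustration using scikit-learn (an assumption, since the thread is about RapidMiner operators), and the ticket texts are invented placeholders, not real data from the poster's DB: cluster the vectorized texts, treat the cluster ids as labels, then train a Naive Bayes model that routes new tickets to a cluster.

```python
# Sketch of "use the cluster as a label": unsupervised step produces the
# labels, supervised step learns to reproduce them for new documents.
# Ticket texts are invented placeholders for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB

tickets = [
    "server cpu usage is very high",
    "cpu load spiked on the app server",
    "user cannot reset password for login",
    "password reset link not working at login",
]

# Step 1: vectorize and cluster (unsupervised).
vec = TfidfVectorizer()
X = vec.fit_transform(tickets)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 2: treat the cluster ids as class labels and train a classifier.
clf = MultinomialNB().fit(X, km.labels_)

# Step 3: assign a new incoming ticket to a cluster/category.
new_ticket = vec.transform(["cpu utilization alarm on server"])
predicted_cluster = clf.predict(new_ticket)[0]
```

In RapidMiner the equivalent is the "add cluster attribute" option plus Set Role to turn `cluster` into the label before feeding a learner.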

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
Contributor II

Re: Auto Categorization of documents

Can you please elaborate on the steps after clustering is done (Clustered Set and Cluster Model)? How can we make use of these to do supervised learning?

Elite III

Re: Auto Categorization of documents

Once you have the cluster assigned as a label, you would then have a dataset that you could use with any of the standard machine learning approaches to classification.  There are a number of helpful RapidMiner video tutorials on building such models available in the resources on this site: https://rapidminer.com/getting-started-central/

There are also guided processes available directly from within RapidMiner (just click on the "Learn" button on the splash screen after startup).
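Once the cluster ids are attached as labels, the dataset really can be treated like any ordinary classification problem, including validation. A small sketch (scikit-learn assumed; texts and labels are invented stand-ins for cluster-labeled tickets):

```python
# Sketch: cross-validate a classifier trained on cluster-derived labels,
# exactly as one would for a manually labeled dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = [
    "disk is full on database server", "database disk space alert",
    "vpn connection drops frequently", "vpn tunnel keeps disconnecting",
] * 3  # repeated so 3-fold cross-validation has enough examples per class
labels = ["cluster_0", "cluster_0", "cluster_1", "cluster_1"] * 3

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(model, texts, labels, cv=3)
```

In RapidMiner this corresponds to wrapping the learner in a Cross Validation operator after Set Role has made `cluster` the label.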

 

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
Contributor II

Re: Auto Categorization of documents

Hello Brian, 

 

 

So the use case, in a simple sense, is a combination of supervised and unsupervised learning, similar to topic modelling (LDA).

 

1) Documents (rows) typically contain a 'detailed description' of a problem pertaining to some field [e.g. CPU usage, memory issue, network error]. This data is untagged (unlabelled).

2) Now we need to find keywords (n-grams, POS) for each category and build a rule book which says that these kinds of words/phrases fall into a certain category (in short, we are clustering by fetching relevant words/phrases for each category; unsupervised learning).

3) Based on the above step, we want to tag a new incoming document by analysing its content (keywords/phrases) (supervised learning).