Auto Categorization of documents

sangeet Member Posts: 10 Contributor I
edited November 2018 in Help


Can you guide me on auto-categorization of documents?

In our DB we have a lot of long ticket descriptions (email conversations, resolution data, etc.). I need to train a classifier so that any new incoming ticket is automatically assigned to the right category.


Steps taken so far:

1) Tried unsupervised learning to form clusters of words.

2) Used a Naive Bayes classifier, but for this I had to label the training data set manually.


Can you suggest a way to auto-label the text so that it can be used as training data?


Eagerly looking for your help.


  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Search the community forums for some sample processes; that'll get you started.

  • sangeet Member Posts: 10 Contributor I

    Thanks, Master.


    Is it possible to get the clusters (with the keywords in each one) and then classify new text into the corresponding cluster?

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn


    Give this a try.


    <?xml version="1.0" encoding="UTF-8"?><process version="7.3.001">
    <operator activated="true" class="process" compatibility="7.3.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="34">
    <parameter key="connection" value="Twitter Connection"/>
    <parameter key="query" value="DonaldTrump"/>
    <parameter key="limit" value="1000"/>
    <parameter key="language" value="en"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Text|Id"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="7.3.001" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Text"/>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.3.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.3.000" expanded="true" height="68" name="Tokenize" width="90" x="313" y="34"/>
    <operator activated="true" class="text:transform_cases" compatibility="7.3.000" expanded="true" height="68" name="Transform Cases" width="90" x="447" y="34"/>
    <operator activated="true" class="text:filter_by_length" compatibility="7.3.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="581" y="34"/>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
    <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.3.001" expanded="true" height="103" name="Multiply" width="90" x="514" y="210"/>
    <operator activated="true" class="transpose" compatibility="7.3.001" expanded="true" height="82" name="Transpose" width="90" x="715" y="30"/>
    <operator activated="true" class="x_means" compatibility="7.3.001" expanded="true" height="82" name="X-Means" width="90" x="849" y="30"/>
    <operator activated="true" class="select_attributes" compatibility="7.3.001" expanded="true" height="82" name="Select Attributes (3)" width="90" x="648" y="300">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attribute" value="cluster"/>
    <parameter key="attributes" value="|sentiment|cluster"/>
    <parameter key="invert_selection" value="true"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <operator activated="true" class="numerical_to_binominal" compatibility="7.3.001" expanded="true" height="82" name="Numerical to Binominal" width="90" x="648" y="390"/>
    <operator activated="true" class="fp_growth" compatibility="7.3.001" expanded="true" height="82" name="FP-Growth" width="90" x="648" y="493">
    <parameter key="find_min_number_of_itemsets" value="false"/>
    <parameter key="min_number_of_itemsets" value="10"/>
    <parameter key="min_support" value="0.6"/>
    <parameter key="max_items" value="5"/>
    </operator>
    <operator activated="true" class="create_association_rules" compatibility="7.3.001" expanded="true" height="82" name="Create Association Rules" width="90" x="782" y="480"/>
    <operator activated="true" class="item_sets_to_data" compatibility="7.3.001" expanded="true" height="82" name="Item Sets to Data" width="90" x="916" y="544"/>
    <connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Transpose" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Select Attributes (3)" to_port="example set input"/>
    <connect from_op="Transpose" from_port="example set output" to_op="X-Means" to_port="example set"/>
    <connect from_op="X-Means" from_port="cluster model" to_port="result 1"/>
    <connect from_op="Select Attributes (3)" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
    <connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
    <connect from_op="FP-Growth" from_port="frequent sets" to_op="Create Association Rules" to_port="item sets"/>
    <connect from_op="Create Association Rules" from_port="rules" to_port="result 2"/>
    <connect from_op="Create Association Rules" from_port="item sets" to_op="Item Sets to Data" to_port="frequent item sets"/>
    <connect from_op="Item Sets to Data" from_port="example set" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    </process>
  • sangeet Member Posts: 10 Contributor I

    My use case goes like this:


    1) Start with long descriptions of problem statements (uncategorized).

    2) Form categories from them, based on keywords/phrases/POS tagging.

    3) Assign one of those categories to each new incoming text.

  • sangeet Member Posts: 10 Contributor I

    I have the cluster model in place, so I know the keywords and which cluster each one falls into.

    Now, how can I turn this unsupervised result into a supervised learning model that classifies incoming text into one of those clusters or categories?

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    You can add the cluster as a label (there's an option for that), and then use that label to build a predictive model, if you want to try to replicate document classification into those same clusters in the future.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • sangeet Member Posts: 10 Contributor I

    Can you please elaborate on the steps to take after clustering is done (Clustered Set and Cluster Model)? How can we use these outputs for supervised learning?

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Once you have the cluster assigned as a label, you would then have a dataset that you could use with any of the standard machine learning approaches to classification.  There are a number of helpful RapidMiner video tutorials on building such models available in the resources on this site.

    There are also guided processes available directly from within RapidMiner (just click on the "Learn" button on the splash screen after startup).


  • sangeet Member Posts: 10 Contributor I

    Hello Brian,


    In simple terms, the use case is a combination of supervised and unsupervised learning and topic modelling (LDA).


    1) Documents (rows) typically contain a detailed description of a problem in some area (e.g. CPU Usage, Memory Issue, Network Error). This data is untagged (unlabelled).

    2) We need to find the keywords (n-grams, POS) for each category and build a rule book that says which kinds of words/phrases fall into which category. In short, we are clustering by extracting the relevant words/phrases for each category (unsupervised learning).

    3) Based on the step above, we want to tag each new incoming document by analysing its content (keywords/phrases) (supervised learning).
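
To make step 3 concrete, here is a minimal Python sketch of rule-book tagging. It is only an illustration outside RapidMiner: the `RULE_BOOK` contents, the `tag_document` name and the "Uncategorized" fallback are all hypothetical, and simple keyword counting stands in for whatever n-gram or POS features the real rule book would use.

```python
import re
from collections import Counter

# Hypothetical rule book: category -> keywords/phrases (illustrative values only).
RULE_BOOK = {
    "CPU Usage": ["cpu", "processor", "high load"],
    "Memory Issue": ["memory", "ram", "out of memory", "heap"],
    "Network Error": ["network", "timeout", "connection", "dns"],
}


def tag_document(text, rule_book=RULE_BOOK, default="Uncategorized"):
    """Return the category whose keywords occur most often in the text."""
    # Lowercase and collapse the text to single-space-separated tokens.
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    scores = Counter()
    for category, keywords in rule_book.items():
        for keyword in keywords:
            # Count whole-word (or whole-phrase) matches only.
            pattern = r"\b" + re.escape(keyword) + r"\b"
            scores[category] += len(re.findall(pattern, normalized))
    # Fall back to the default when no keyword matched at all.
    if not scores or max(scores.values()) == 0:
        return default
    return max(scores, key=scores.get)


memory_tag = tag_document("Server ran out of memory and the heap dump failed")
network_tag = tag_document("DNS lookup timeout on the connection")
unknown_tag = tag_document("Printer is jammed")
```

A rule book like this is easy to inspect, but it only finds categories whose keywords someone has already listed; documents that match nothing fall back to the default tag.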

  • sangeet Member Posts: 10 Contributor I

    Any updates?

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Hi @sangeet,


    @Thomas_Ott and I have already provided some direction about how we'd approach the problem.  Based on your described use case, we recommended the following:

    1. Process your text documents and create a word vector and wordlist
    2. Cluster your documents (Tom provided a sample process to cover these 2 steps)
    3. Once you have those clusters defined, assign them as labels (use the "Set Role" operator for that)
    4. Then use that same dataset to create a supervised learning model to predict the clusters (as I noted earlier, there are plenty of tutorials available for this step)
    5. You can then store that model and apply that model to any new documents (you'll need to do the same set of text processing and use the same wordlist as well)
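
The five steps above can be sketched outside RapidMiner in plain Python. This is a rough illustration under stated assumptions, not the process from this thread: the sample tickets are made up, plain k-means with fixed seeds stands in for X-Means, a small multinomial Naive Bayes stands in for "any supervised learner", and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase, keep alphabetic tokens, drop very short ones.
    return [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) > 2]


def build_wordlist(docs):
    # Step 1: the wordlist is fixed once, from the training documents.
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def vectorize(doc, wordlist):
    # Term counts over the fixed wordlist; unseen words are ignored.
    counts = Counter(tokenize(doc))
    return [counts[w] for w in wordlist]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, seeds, iterations=10):
    # Step 2: plain k-means; `seeds` are indices of the initial centroids.
    centroids = [list(vectors[i]) for i in seeds]
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def train_naive_bayes(vectors, labels):
    # Step 4: multinomial Naive Bayes with add-one smoothing.
    model = {}
    for label in set(labels):
        rows = [v for v, l in zip(vectors, labels) if l == label]
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + len(totals)
        log_prior = math.log(len(rows) / len(vectors))
        model[label] = (log_prior, [math.log((t + 1) / denom) for t in totals])
    return model


def predict(model, vector):
    # Step 5: pick the label with the highest log-probability score.
    def score(label):
        log_prior, log_probs = model[label]
        return log_prior + sum(x * lp for x, lp in zip(vector, log_probs))
    return max(model, key=score)


# Made-up training tickets (assumption, for illustration only).
docs = [
    "CPU usage is very high on the server",
    "High CPU load after the update",
    "Network connection timeout when reaching the gateway",
    "Network timeout and connection errors",
]
wordlist = build_wordlist(docs)
vectors = [vectorize(d, wordlist) for d in docs]
labels = kmeans(vectors, seeds=[0, 2])      # step 3: cluster ids become labels
model = train_naive_bayes(vectors, labels)

# New documents must go through the same tokenizer and the same wordlist.
new_cpu = predict(model, vectorize("CPU load is high again", wordlist))
new_net = predict(model, vectorize("Connection timeout on the network", wordlist))
```

The point to carry over is the last two lines: the new text is vectorized with the wordlist built at training time, which is why step 5 asks for the same text processing and the same wordlist.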


    So I'm not sure what else you are expecting at this point. Did you have a more specific question, or a problem that you ran into when you tried to complete the steps above?  Please remember that this is a free user community forum.  If you are interested in a more detailed consulting project, you can feel free to PM me.  


