"Text Mining - Document Similarity/Clustering"

rahi84 Member Posts: 3 Contributor I
edited June 2019 in Help
Hello All,

I am trying to perform document similarity/clustering in RapidMiner on a survey text field and am having problems so far. The data is saved in an Excel file (.xlsx), and I need to process the documents so that the case is lowered, the words are tokenized and stemmed, and the stop words are filtered out. Could you please walk me through the operators I need to apply to the data so that I can perform document similarity and clustering? I have watched the 'el chief' tutorials on YouTube, but unfortunately that hasn't worked out. I have tried the following operators (in order) and I get a blank output:

1. Read Excel
2. Data to Documents
3. Process Documents (+ Tokenize, Filter Stopwords (English), Transform Cases, Stem (Porter))
4. Data to Similarity
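
For readers outside RapidMiner, the preprocess-then-compare pipeline in steps 1–4 can be sketched in plain Python (a minimal illustration: the tokenizing, lowercasing, and stop-word filtering mirror the operators above; the Porter stemming step is omitted for brevity, and the tiny stop-word list is a stand-in for a real one):

```python
import math
import re
from collections import Counter

# Stand-in stop-word list; a real pipeline would use a full English list.
STOPWORDS = {"the", "a", "an", "and", "or", "is", "are", "to", "of", "in"}

def preprocess(text):
    """Tokenize on letter runs, lowercase, and drop stop words
    (mirrors Tokenize -> Transform Cases -> Filter Stopwords)."""
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

doc1 = "The survey responses are stored in an Excel file."
doc2 = "Survey responses stored in Excel."
sim = cosine_similarity(preprocess(doc1), preprocess(doc2))
```

A blank output in this kind of pipeline usually means the documents never made it into the word-vector stage at all, which is what the accepted answer below diagnoses.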

Best Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,083  RM Data Scientist
    Solution Accepted
    Hi,

    Is your text attribute of type text or nominal? It needs to be of type text in order to use Data to Documents. Furthermore, I would recommend using Cross Distances instead of Data to Similarity.

    Attached is a sample process.

    Best,
    Martin

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="false" class="read_excel" compatibility="6.4.000" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
            <parameter key="excel_file" value="C:\Users\elie.rahi\Desktop\############\###############\###########.xlsx"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <operator activated="true" class="subprocess" compatibility="6.4.000" expanded="true" height="76" name="Get Data" width="90" x="45" y="120">
            <process expanded="true">
              <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="179" y="75">
                <list key="attribute_values">
                  <parameter key="Text" value="&quot;Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.&quot;"/>
                </list>
                <list key="set_additional_roles"/>
              </operator>
              <operator activated="true" class="generate_data_user_specification" compatibility="6.4.000" expanded="true" height="60" name="Generate Data by User Specification (2)" width="90" x="179" y="165">
                <list key="attribute_values">
                  <parameter key="Text" value="&quot;Lorem ipsum&quot;"/>
                </list>
                <list key="set_additional_roles"/>
              </operator>
              <operator activated="true" class="append" compatibility="6.4.000" expanded="true" height="94" name="Append" width="90" x="313" y="75"/>
              <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
              <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
              <connect from_op="Append" from_port="merged set" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
            <description align="center" color="transparent" colored="false" width="126">Simply generate some test data</description>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="6.4.000" expanded="true" height="76" name="Nominal to Text" width="90" x="179" y="120"/>
          <operator activated="true" class="text:data_to_documents" compatibility="6.4.001" expanded="true" height="60" name="Data to Documents" width="90" x="313" y="120">
            <list key="specify_weights"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="6.4.001" expanded="true" height="94" name="Process Documents" width="90" x="447" y="120">
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="6.4.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
              <operator activated="true" class="text:transform_cases" compatibility="6.4.001" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="30"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="6.4.001" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="30"/>
              <operator activated="true" class="text:stem_porter" compatibility="6.4.001" expanded="true" height="60" name="Stem (Porter)" width="90" x="447" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
              <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="multiply" compatibility="6.4.000" expanded="true" height="94" name="Multiply" width="90" x="648" y="120"/>
          <operator activated="true" class="cross_distances" compatibility="6.4.000" expanded="true" height="94" name="Cross Distances" width="90" x="782" y="120">
            <parameter key="measure_types" value="NumericalMeasures"/>
          </operator>
          <connect from_op="Get Data" from_port="out 1" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Cross Distances" to_port="request set"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Cross Distances" to_port="reference set"/>
          <connect from_op="Cross Distances" from_port="result set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • rahi84 Member Posts: 3 Contributor I
    Solution Accepted
    Thank you, I've solved this. The issue was that the data was not of type text; adding the Nominal to Text operator fixed it.

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,083  RM Data Scientist
    This sounds pretty reasonable.

    Could you post the XML of your process? That would make it much easier for me to find the mistake.

    Cheers,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • rahi84 Member Posts: 3 Contributor I
    Hi Martin,

    I have 'blacked out' the directory for privacy.

    Please see below the XML code:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="6.4.000" expanded="true" height="60" name="Read Excel" width="90" x="112" y="120">
            <parameter key="excel_file" value="C:\Users\elie.rahi\Desktop\############\###############\###########.xlsx"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <operator activated="true" class="text:data_to_documents" compatibility="6.4.001" expanded="true" height="60" name="Data to Documents" width="90" x="246" y="120">
            <list key="specify_weights"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="6.4.001" expanded="true" height="94" name="Process Documents" width="90" x="380" y="255">
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="6.4.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
              <operator activated="true" class="text:transform_cases" compatibility="6.4.001" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="30"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="6.4.001" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="313" y="30"/>
              <operator activated="true" class="text:stem_porter" compatibility="6.4.001" expanded="true" height="60" name="Stem (Porter)" width="90" x="447" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
              <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="data_to_similarity" compatibility="6.4.000" expanded="true" height="76" name="Data to Similarity" width="90" x="581" y="255">
            <parameter key="measure_types" value="NumericalMeasures"/>
            <parameter key="numerical_measure" value="CosineSimilarity"/>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_op="Data to Similarity" to_port="example set"/>
          <connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • mehak Member Posts: 6 Contributor I

    Hello, could you please help me cluster words with similar meanings in a document? It's quite urgent.

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,625   Unicorn

    There are a number of different ways you might approach that, but if you have a relatively short list of synonymous words/tokens, you can use the "Replace Token" operator inside the "Process Documents" operator. It lets you map a set of related tokens to a single token that represents the set, and you can create as many entries as you want.
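
The token-mapping idea can be sketched outside RapidMiner as a plain dictionary lookup applied after tokenization (the synonym entries below are hypothetical examples, not taken from this thread):

```python
# Map each synonymous token to one canonical token that represents the
# set, the same idea as the "Replace Token" operator described above.
# These example mappings are hypothetical.
SYNONYMS = {
    "car": "vehicle",
    "automobile": "vehicle",
    "truck": "vehicle",
    "happy": "glad",
    "joyful": "glad",
}

def canonicalize(tokens):
    """Replace every token that appears in the synonym map,
    leaving all other tokens unchanged."""
    return [SYNONYMS.get(t, t) for t in tokens]

merged = canonicalize(["car", "automobile", "bike", "happy"])
```

After this replacement, synonymous tokens count as the same word in the word vector, so documents using different synonyms end up closer together.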

     

    If you need something more sophisticated, there is a synonym-finding operator in the WordNet extension, which is available for free in the RapidMiner Marketplace.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • mehak Member Posts: 6 Contributor I

    Thank you so much for your response. Can you please tell me how to cluster all of them?
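
The clustering step asked about here can be sketched in plain Python as a greedy grouping by cosine similarity over raw token counts (a minimal illustration under assumed tokenization and an assumed threshold; within RapidMiner you would instead connect a clustering operator such as k-Means to the Process Documents output):

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Lowercase bag-of-words count vector for one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(va, vb):
    """Cosine similarity between two count vectors."""
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.5):
    """Greedy clustering: assign each document to the first cluster
    whose representative (first member) it resembles closely enough,
    otherwise start a new cluster. The threshold is illustrative."""
    clusters = []
    for doc in docs:
        v = vectorize(doc)
        for members in clusters:
            if cosine(v, vectorize(members[0])) >= threshold:
                members.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters

docs = [
    "the cat sat",
    "cat sat down",
    "stock market rises",
    "market rises today",
]
groups = cluster(docs)
```

This greedy pass is order-dependent and only a sketch; proper k-Means or agglomerative clustering on TF-IDF vectors gives more stable groups.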
