How to compare the similarity of a large number of documents
Hello,
I'm looking for a way to compute the pairwise similarity of a large number of documents, i.e., the similarity of document A to B, A to C, B to C, etc. I have been using the Text Mining extension.
The process I have been using is:
Retrieve > Nominal to Text > Data to Documents > Process Documents (TF-IDF, with Tokenize inside) > Data to Similarity (CosineSimilarity)
The documents are short, under 30 words.
There are about 1200 documents.
This works for a small number of documents, normally in 2-3 seconds. However, when I run it on all 1200 documents, RapidMiner reports that the process completed in 0 seconds and then shows no results. The status bar at the bottom right stays frozen on "Creating Displays" and the program stops responding.
Does this happen because there are too many results for the operation? If so, what is the correct approach?
Help would be very much appreciated.
This is the full process:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.014">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.014" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="parallelize_main_process" value="true"/>
<process expanded="true" height="521" width="748">
<operator activated="true" class="retrieve" compatibility="5.1.014" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
<parameter key="repository_entry" value="//Repository1/Martyrs/Data/document similarity test data"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="5.1.014" expanded="true" height="76" name="Nominal to Text" width="90" x="112" y="120">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="C"/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="text:data_to_documents" compatibility="5.1.004" expanded="true" height="60" name="Data to Documents" width="90" x="179" y="210">
<parameter key="select_attributes_and_weights" value="false"/>
<list key="specify_weights"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.1.004" expanded="true" height="94" name="Process Documents" width="90" x="246" y="300">
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="TF-IDF"/>
<parameter key="add_meta_information" value="true"/>
<parameter key="keep_text" value="false"/>
<parameter key="prune_method" value="none"/>
<parameter key="prunde_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_rank" value="5.0"/>
<parameter key="prune_above_rank" value="5.0"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="parallelize_vector_creation" value="false"/>
<process expanded="true" height="610" width="980">
<operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="181" y="42">
<parameter key="mode" value="non letters"/>
<parameter key="characters" value=".:"/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="data_to_similarity" compatibility="5.1.014" expanded="true" height="76" name="Data to Similarity" width="90" x="313" y="435">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
<parameter key="divergence" value="GeneralizedIDivergence"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
</operator>
<connect from_op="Retrieve" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
<connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_op="Data to Similarity" to_port="example set"/>
<connect from_op="Data to Similarity" from_port="similarity" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Answers
First, RapidMiner compiles a list of the tokens in all the documents.
Second, based on that list, RapidMiner computes the similarity of document A to document B, then C, then D, ...
Third, RapidMiner computes the similarity of document B to document C, then D, ...
Fourth, RapidMiner computes the similarity of document C to document D, then E, ...
The problem is, I have no idea how to do this!
Eagerly awaiting your thoughts, and thank you.
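For readers who want to see the steps listed above outside of RapidMiner, here is a minimal Python sketch using only the standard library. It is not how RapidMiner implements its operators: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `pairwise_similarities`) are made up for this illustration, and the TF-IDF formula used (term frequency times log(N / document frequency)) is just one common variant, which may differ from the one the Text Mining extension applies.

```python
import math
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Split on non-letter characters and lowercase (a rough stand-in for Tokenize)."""
    tokens, word = [], []
    for ch in text.lower():
        if ch.isalpha():
            word.append(ch)
        elif word:
            tokens.append("".join(word))
            word = []
    if word:
        tokens.append("".join(word))
    return tokens


def tfidf_vectors(docs):
    """Return one sparse {token: weight} dict per document, L2-normalised."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Step 1: document frequency of every token across the whole collection.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors (a plain dot product)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def pairwise_similarities(docs):
    """Steps 2-4: compare A to B, C, ...; then B to C, D, ...; each pair once."""
    vectors = tfidf_vectors(docs)
    return {(i, j): cosine(vectors[i], vectors[j])
            for i, j in combinations(range(len(docs)), 2)}


docs = ["red apple pie", "green apple pie", "blue sky today"]
sims = pairwise_similarities(docs)
for pair, value in sims.items():
    print(pair, round(value, 3))
```

Each pair is computed once (A to B but not B to A), since cosine similarity is symmetric. Documents that share no tokens get a similarity of 0.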
If you feed 1200 examples into the Data to Similarity operator you will get 1200*1199 pairs, about 1.4 million rows, so you are probably running into memory issues. My suggestion is to use the Similarity to Data operator to turn the similarity result back into an example set and see whether that displays more efficiently. If not, I would write the result to the repository, a database, or a file, and disconnect the result from the output port so that it is not displayed at all.
You can then read the result back later and use the Filter or Sample operators to extract the parts you are interested in.
regards
Andrew
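To make the arithmetic in the reply above concrete, here is a small Python sketch of how the number of pairs grows with the number of documents. The helper names are made up for this illustration. The 1200*1199 figure counts ordered pairs (A to B and B to A separately); if each pair is stored only once the count halves, but the growth is quadratic either way. The other two sizes are the document counts mentioned elsewhere in this thread.

```python
def ordered_pairs(n):
    """Pairs where (A, B) and (B, A) are counted separately."""
    return n * (n - 1)


def unordered_pairs(n):
    """Pairs where each combination is counted once."""
    return n * (n - 1) // 2


for n in (1200, 4100, 50000):
    print(f"{n:>6} documents: {ordered_pairs(n):>13,} ordered, "
          f"{unordered_pairs(n):>13,} unordered pairs")
# 1200 documents  ->     1,438,800 ordered,       719,400 unordered
# 4100 documents  ->    16,805,900 ordered,     8,402,950 unordered
# 50000 documents -> 2,499,950,000 ordered, 1,249,975,000 unordered
```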
I am able to get similarity results (three columns: first, second, similarity) for a small number of rows in RapidMiner. But when I try to get a larger number of rows, I run into the same problem: it says "Creating Displays" and waits forever.
Following your suggestion, I want to store the similarity results in an Excel file or a database. But when I add a Write Excel operator, for example, it does not accept the similarity object as input. How can I export these similarity results to an Excel file?
Use the "Similarity to Data" operator to convert the result to an example set.
regards
Andrew
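As an illustration of what this conversion amounts to, the following Python sketch flattens a set of pairwise similarities into rows with the three columns mentioned above (first, second, similarity) and writes them as CSV instead of displaying them. This is an analogy, not RapidMiner's implementation: the function names, the `min_similarity` threshold, and the sample values are all made up for this example.

```python
import csv
import io


def similarity_rows(sims, min_similarity=0.0):
    """Yield (first, second, similarity) rows from a {(i, j): value} mapping."""
    for (first, second), value in sorted(sims.items()):
        if value >= min_similarity:
            yield first, second, value


def write_similarity_csv(sims, handle, min_similarity=0.0):
    """Write the rows to an open text handle and return how many were written."""
    writer = csv.writer(handle)
    writer.writerow(["first", "second", "similarity"])
    count = 0
    for row in similarity_rows(sims, min_similarity):
        writer.writerow(row)
        count += 1
    return count


# Made-up similarity values, for illustration only.
sims = {(0, 1): 0.82, (0, 2): 0.05, (1, 2): 0.40}

buffer = io.StringIO()  # stands in for a real file opened with newline=""
written = write_similarity_csv(sims, buffer, min_similarity=0.1)
print(written)
print(buffer.getvalue())
```

Dropping low-similarity pairs at write time, as the hypothetical `min_similarity` parameter does here, is one way to keep the output small when only the closest matches matter.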
I would like to compare around 50,000 text cells from a CSV file and find the 5 items most similar to the first text item.
As I understand it, the Similarity to Data operator compares everything with everything, but I only want to compare the first item with the rest.
Which other operator can I use?
Thank you very much for your help!
You could use the "Cross Distances" operator. It takes two example sets: the first would contain the single item, the second the examples to match against it. The result is the distances between the single example and all the others.
regards
Andrew
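The one-against-all idea described above can be sketched in Python as follows. This is only an illustration of the approach, not the Cross Distances operator itself: the function names and sample texts are invented, and for brevity it scores with cosine similarity over raw word counts rather than distances over TF-IDF vectors. The point is that comparing one item with n others needs n comparisons, not n*(n-1).

```python
import heapq
import math
from collections import Counter


def bag_of_words(text):
    """Raw term counts, splitting on whitespace."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two term-count mappings."""
    dot = sum(count * b.get(token, 0) for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_similar(query, candidates, k=5):
    """Return the k best (similarity, candidate_index) pairs, best first."""
    q = bag_of_words(query)
    scored = ((cosine(q, bag_of_words(text)), index)
              for index, text in enumerate(candidates))
    # nlargest keeps only k items in memory while scanning all candidates.
    return heapq.nlargest(k, scored)


# Invented sample texts; the first one plays the role of the query item.
texts = [
    "cheap flights to paris",
    "cheap flights to rome",
    "flights to paris from london",
    "garden furniture sale",
]
best = top_k_similar(texts[0], texts[1:], k=2)
for score, index in best:
    print(round(score, 3), texts[1:][index])
```

The returned indices refer to positions in the candidate list, not in the original list that still includes the query.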
Hi, I found this thread because I am facing the same issue. It takes forever to get the output of a cosine similarity analysis of 4100 documents. I followed some of the suggestions above, and my flow is:
Read CSV --> Process Documents from Data --> Data to Similarity --> Similarity to Data --> Write Excel
After 24 hours it is still in the "Similarity to Data" step.
Does anyone have an idea how long this will take? My PC specifications are as follows:
Windows 10 Enterprise, version 1607, 64-bit
Processor Intel Core i5-4310U
CPU 2.60 GHz
RAM 8 GB
Thanks for any tips
Hello @roberto_r_herma, processing time varies a lot depending on many factors, including your machine and the size and scope of the documents. One thing I can definitely tell you is that RapidMiner loves RAM and multi-core processors. FWIW, I just upgraded to 64 GB of RAM with my 6-core Intel Xeon E5 to keep things humming along.
If I were you, I'd use the Sample operator to grab a small sample of your documents first. Benchmark the sample, then gradually increase its size so you can get a sense of whether the full 4100 docs will take 2 days or 2 years.
Scott
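The benchmarking advice above can be sketched in Python like this. It is a rough illustration under a stated assumption: that run time is dominated by the pairwise comparisons and therefore grows with the number of pairs. Real runs also depend on memory pressure and on writing the results, so treat the estimate as a lower bound at best. The function names and the synthetic documents are made up for this example.

```python
import time


def time_pairwise(items, similarity):
    """Time one pass over all unordered pairs of items."""
    start = time.perf_counter()
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            similarity(items[i], items[j])
    return time.perf_counter() - start


def extrapolate_quadratic(sample_size, sample_seconds, full_size):
    """Scale a measured time by the ratio of pair counts (assumes O(n^2) cost)."""
    sample_pairs = sample_size * (sample_size - 1) / 2
    full_pairs = full_size * (full_size - 1) / 2
    return sample_seconds * full_pairs / sample_pairs


def jaccard(a, b):
    """A cheap stand-in similarity on token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Synthetic token sets standing in for a sample of 200 documents.
docs = [set(f"word{i % 7} word{i % 11} word{i % 13}".split()) for i in range(200)]

elapsed = time_pairwise(docs, jaccard)
estimate = extrapolate_quadratic(len(docs), elapsed, 4100)
print(f"sample of {len(docs)}: {elapsed:.4f} s, "
      f"estimated for 4100 documents: {estimate:.2f} s")
```

Repeating the measurement at two or three sample sizes and checking that the times roughly follow the pair counts is a quick way to test whether the quadratic assumption holds before trusting the extrapolation.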
Thanks for the tip!