"Two Documents Similarity using Cross distance"

asafwat Member Posts: 4 Contributor I
edited June 2019 in Help
I am using RapidMiner to compare the similarity between two text fields held in two sheets of the same Excel file, using Cross Distances. I want to compare one request against all references and get the cosine similarity for each pair. The problem is that the distance comes back as a question mark '?', and I don't know why.
<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="8.2.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="391">
<parameter key="excel_file" value="/Users/macbook/Desktop/ULS/Change Management in ULS/WASP_Requirements.xlsx"/>
<parameter key="sheet_selection" value="sheet number"/>
<parameter key="sheet_number" value="1"/>
<parameter key="imported_cell_range" value="A1:B72"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations"/>
<parameter key="date_format" value=""/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="English (United States)"/>
<parameter key="read_all_values_as_polynominal" value="false"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="A.true.polynominal.id"/>
<parameter key="1" value="B.true.polynominal.attribute"/>
</list>
<parameter key="read_not_matching_values_as_missings" value="true"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
</operator>
<operator activated="true" class="remove_duplicates" compatibility="8.2.001" expanded="true" height="103" name="Remove Duplicates" width="90" x="45" y="493">
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="treat_missing_values_as_duplicates" value="false"/>
</operator>
<operator activated="true" class="set_role" compatibility="8.2.001" expanded="true" height="82" name="Set Role" width="90" x="179" y="391">
<parameter key="attribute_name" value="A"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles">
<parameter key="B" value="regular"/>
<parameter key="A" value="id"/>
</list>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="8.2.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="493">
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="391">
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="TF-IDF"/>
<parameter key="add_meta_information" value="true"/>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="9999"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="select_attributes_and_weights" value="false"/>
<list key="specify_weights">
<parameter key="B" value="1.0"/>
</list>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
<parameter key="mode" value="linguistic tokens"/>
<parameter key="characters" value=".:"/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="112" y="136"/>
<operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="112" y="238">
<parameter key="transform_to" value="lower case"/>
</operator>
<operator activated="true" class="wordnet:open_wordnet_dictionary" compatibility="5.3.000" expanded="true" height="68" name="Open WordNet Dictionary" width="90" x="313" y="391">
<parameter key="resource_type" value="directory"/>
<parameter key="directory" value="/Users/macbook/Downloads/WordNet-3.0/dict"/>
</operator>
<operator activated="true" class="wordnet:stem_wordnet" compatibility="5.3.000" expanded="true" height="82" name="Stem (WordNet)" width="90" x="313" y="238">
<parameter key="allow_ambiguity" value="true"/>
<parameter key="keep_unmatched_stems" value="true"/>
<parameter key="keep_unmatched_tokens" value="true"/>
<parameter key="work_on_type_noun" value="true"/>
<parameter key="work_on_type_verb" value="true"/>
<parameter key="work_on_type_adjective" value="true"/>
<parameter key="work_on_type_adverb" value="true"/>
</operator>
<operator activated="true" class="wordnet:find_synonym_wordnet" compatibility="5.3.000" expanded="true" height="82" name="Find Synonyms (WordNet)" width="90" x="447" y="238">
<parameter key="use_prefix" value="false"/>
<parameter key="synset_word_prefix" value="syn:"/>
<parameter key="maximum_recursion_depth" value="1"/>
<parameter key="multiple_meanings_per_word_policy" value="Take only first meaning"/>
<parameter key="multiple_synsets_policy" value="Take only first synset per meaning"/>
<parameter key="multiple_synset_words_policy" value="Take only first synset word"/>
<parameter key="concatenation" value="Concatenate result per synset"/>
<parameter key="keep_original_tokens" value="true"/>
<parameter key="keep_unmatched_tokens" value="true"/>
<parameter key="take_ID_instead_of_words" value="false"/>
<parameter key="work_on_type_noun" value="true"/>
<parameter key="work_on_type_verb" value="true"/>
<parameter key="work_on_type_adjective" value="true"/>
<parameter key="work_on_type_adverb" value="true"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Stem (WordNet)" to_port="document"/>
<connect from_op="Open WordNet Dictionary" from_port="dictionary" to_op="Stem (WordNet)" to_port="dictionary"/>
<connect from_op="Stem (WordNet)" from_port="document" to_op="Find Synonyms (WordNet)" to_port="document"/>
<connect from_op="Stem (WordNet)" from_port="dictionary" to_op="Find Synonyms (WordNet)" to_port="dictionary"/>
<connect from_op="Find Synonyms (WordNet)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="false" class="concurrency:k_means" compatibility="8.2.001" expanded="true" height="82" name="Clustering" width="90" x="782" y="34">
<parameter key="add_cluster_attribute" value="true"/>
<parameter key="add_as_label" value="false"/>
<parameter key="remove_unlabeled" value="false"/>
<parameter key="k" value="40"/>
<parameter key="max_runs" value="10"/>
<parameter key="determine_good_start_values" value="true"/>
<parameter key="measure_types" value="BregmanDivergences"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="EuclideanDistance"/>
<parameter key="divergence" value="SquaredEuclideanDistance"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
<parameter key="max_optimization_steps" value="100"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
</operator>
<operator activated="true" class="read_excel" compatibility="8.2.001" expanded="true" height="68" name="Read Excel (2)" width="90" x="45" y="85">
<parameter key="excel_file" value="/Users/macbook/Desktop/ULS/Change Management in ULS/WASP_Requirements.xlsx"/>
<parameter key="sheet_selection" value="sheet number"/>
<parameter key="sheet_number" value="2"/>
<parameter key="imported_cell_range" value="A1:B1"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations"/>
<parameter key="date_format" value=""/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="English (United States)"/>
<parameter key="read_all_values_as_polynominal" value="false"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="A.true.polynominal.id"/>
<parameter key="1" value="B.true.polynominal.attribute"/>
</list>
<parameter key="read_not_matching_values_as_missings" value="true"/>
<parameter key="datamanagement" value="double_array"/>
<parameter key="data_management" value="auto"/>
</operator>
<operator activated="true" class="remove_duplicates" compatibility="8.2.001" expanded="true" height="103" name="Remove Duplicates (2)" width="90" x="45" y="187">
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="treat_missing_values_as_duplicates" value="false"/>
</operator>
<operator activated="true" class="set_role" compatibility="8.2.001" expanded="true" height="82" name="Set Role (2)" width="90" x="179" y="85">
<parameter key="attribute_name" value="A"/>
<parameter key="target_role" value="id"/>
<list key="set_additional_roles">
<parameter key="B" value="regular"/>
<parameter key="A" value="id"/>
</list>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="8.2.001" expanded="true" height="82" name="Nominal to Text (2)" width="90" x="179" y="187">
<parameter key="attribute_filter_type" value="all"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="313" y="85">
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="TF-IDF"/>
<parameter key="add_meta_information" value="true"/>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="absolute"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_absolute" value="2"/>
<parameter key="prune_above_absolute" value="9999"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="data_management" value="auto"/>
<parameter key="select_attributes_and_weights" value="false"/>
<list key="specify_weights">
<parameter key="B" value="1.0"/>
</list>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="112" y="34">
<parameter key="mode" value="linguistic tokens"/>
<parameter key="characters" value=".:"/>
<parameter key="language" value="English"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="112" y="136"/>
<operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases (2)" width="90" x="112" y="238">
<parameter key="transform_to" value="lower case"/>
</operator>
<operator activated="true" class="wordnet:open_wordnet_dictionary" compatibility="5.3.000" expanded="true" height="68" name="Open WordNet Dictionary (2)" width="90" x="313" y="391">
<parameter key="resource_type" value="directory"/>
<parameter key="directory" value="/Users/macbook/Downloads/WordNet-3.0/dict"/>
</operator>
<operator activated="true" class="wordnet:stem_wordnet" compatibility="5.3.000" expanded="true" height="82" name="Stem (2)" width="90" x="313" y="238">
<parameter key="allow_ambiguity" value="true"/>
<parameter key="keep_unmatched_stems" value="true"/>
<parameter key="keep_unmatched_tokens" value="true"/>
<parameter key="work_on_type_noun" value="true"/>
<parameter key="work_on_type_verb" value="true"/>
<parameter key="work_on_type_adjective" value="true"/>
<parameter key="work_on_type_adverb" value="true"/>
</operator>
<operator activated="true" class="wordnet:find_synonym_wordnet" compatibility="5.3.000" expanded="true" height="82" name="Find Synonyms (2)" width="90" x="447" y="238">
<parameter key="use_prefix" value="false"/>
<parameter key="synset_word_prefix" value="syn:"/>
<parameter key="maximum_recursion_depth" value="1"/>
<parameter key="multiple_meanings_per_word_policy" value="Take only first meaning"/>
<parameter key="multiple_synsets_policy" value="Take only first synset per meaning"/>
<parameter key="multiple_synset_words_policy" value="Take only first synset word"/>
<parameter key="concatenation" value="Concatenate result per synset"/>
<parameter key="keep_original_tokens" value="true"/>
<parameter key="keep_unmatched_tokens" value="true"/>
<parameter key="take_ID_instead_of_words" value="false"/>
<parameter key="work_on_type_noun" value="true"/>
<parameter key="work_on_type_verb" value="true"/>
<parameter key="work_on_type_adjective" value="true"/>
<parameter key="work_on_type_adverb" value="true"/>
</operator>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
<connect from_op="Open WordNet Dictionary (2)" from_port="dictionary" to_op="Stem (2)" to_port="dictionary"/>
<connect from_op="Stem (2)" from_port="document" to_op="Find Synonyms (2)" to_port="document"/>
<connect from_op="Stem (2)" from_port="dictionary" to_op="Find Synonyms (2)" to_port="dictionary"/>
<connect from_op="Find Synonyms (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="multiply" compatibility="8.2.001" expanded="true" height="103" name="Multiply" width="90" x="447" y="85"/>
<operator activated="true" class="order_attributes" compatibility="8.2.001" expanded="true" height="82" name="Reorder Attributes" width="90" x="447" y="238">
<parameter key="sort_mode" value="reference data"/>
<parameter key="attribute_ordering" value=""/>
<parameter key="use_regular_expressions" value="false"/>
<parameter key="handle_unmatched" value="append"/>
<parameter key="sort_direction" value="ascending"/>
</operator>
<operator activated="true" class="multiply" compatibility="8.2.001" expanded="true" height="103" name="Multiply (2)" width="90" x="581" y="340"/>
<operator activated="true" class="order_attributes" compatibility="8.2.001" expanded="true" height="82" name="Reorder Attributes (2)" width="90" x="581" y="238">
<parameter key="sort_mode" value="reference data"/>
<parameter key="attribute_ordering" value=""/>
<parameter key="use_regular_expressions" value="false"/>
<parameter key="handle_unmatched" value="append"/>
<parameter key="sort_direction" value="ascending"/>
</operator>
<operator activated="true" class="cross_distances" compatibility="8.2.001" expanded="true" height="103" name="Cross Distances" width="90" x="715" y="238">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="SimpleMatchingSimilarity"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
<parameter key="divergence" value="GeneralizedIDivergence"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
<parameter key="only_top_k" value="false"/>
<parameter key="k" value="10"/>
<parameter key="search_for" value="nearest"/>
<parameter key="compute_similarities" value="true"/>
</operator>
<operator activated="false" class="data_to_similarity" compatibility="8.2.001" expanded="true" height="82" name="Data to Similarity" width="90" x="648" y="34">
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
<parameter key="divergence" value="GeneralizedIDivergence"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Remove Duplicates" to_port="example set input"/>
<connect from_op="Remove Duplicates" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Reorder Attributes" to_port="example set input"/>
<connect from_op="Read Excel (2)" from_port="output" to_op="Remove Duplicates (2)" to_port="example set input"/>
<connect from_op="Remove Duplicates (2)" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
<connect from_op="Set Role (2)" from_port="example set output" to_op="Nominal to Text (2)" to_port="example set input"/>
<connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
<connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Reorder Attributes (2)" to_port="example set input"/>
<connect from_op="Multiply" from_port="output 2" to_op="Reorder Attributes" to_port="reference_data"/>
<connect from_op="Reorder Attributes" from_port="example set output" to_op="Multiply (2)" to_port="input"/>
<connect from_op="Multiply (2)" from_port="output 1" to_op="Reorder Attributes (2)" to_port="reference_data"/>
<connect from_op="Multiply (2)" from_port="output 2" to_op="Cross Distances" to_port="reference set"/>
<connect from_op="Reorder Attributes (2)" from_port="example set output" to_op="Cross Distances" to_port="request set"/>
<connect from_op="Cross Distances" from_port="result set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<description align="center" color="gray" colored="true" height="163" resized="true" width="142" x="28" y="320">Read Requirements Document</description>
<description align="center" color="gray" colored="true" height="147" resized="true" width="126" x="30" y="14">Read Requirements Change Requests</description>
</process>
</operator>
</process>

[Attached screenshots: Screen Shot 2018-07-29 at 12.45.20 PM.png, Screen Shot 2018-07-29 at 12.46.20 PM.png, Screen Shot 2018-07-29 at 12.46.37 PM.png]

Best Answers

  • lionelderkrikor Posts: 1,051   Unicorn
    Solution Accepted

    Hi @asafwat,

     

    I think I found part of the answer: the calculated distances/similarities now have numerical values.

     

    The documentation of the Cross Distances operator states:

    "Please note that both input ExampleSets should have the same attributes and in the same order".

    So you have to use a Superset operator (see its documentation) to feed the request set and reference set ports of the Cross Distances operator with two datasets that have exactly the same attributes.
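To illustrate why the two inputs must share the same attributes in the same order, here is a minimal Python sketch (not RapidMiner code). The function names `align_to_superset` and `cosine_similarity` and the sample word counts are made up for the example. It expands two word vectors to the union of their attributes, filling absent words with 0, and then computes the cosine similarity. It also returns NaN when one vector is all zeros, which is one plausible way an undefined value could arise; whether that is the cause in the original process is an assumption.

```python
import math


def align_to_superset(vec_a, vec_b):
    """Expand both word vectors to the union of their attributes, in one shared order."""
    attributes = sorted(set(vec_a) | set(vec_b))
    a = [float(vec_a.get(name, 0.0)) for name in attributes]
    b = [float(vec_b.get(name, 0.0)) for name in attributes]
    return attributes, a, b


def cosine_similarity(a, b):
    """Cosine similarity of two equally long vectors; NaN if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The cosine is undefined for an all-zero vector.
        return float("nan")
    return dot / (norm_a * norm_b)


# Hypothetical term counts for one request and one reference document.
request = {"login": 1, "password": 1, "reset": 1}
reference = {"password": 2, "reset": 1, "email": 1}

attributes, a, b = align_to_superset(request, reference)
similarity = cosine_similarity(a, b)
print(attributes)
print(similarity)
```

Without the alignment step, position 0 of one vector could stand for a different word than position 0 of the other, so the dot product would compare unrelated terms.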

    I also made some modifications to your process:

     - in the Process Documents from Data operators: vector creation -> Term Occurrences.

     - in the Tokenize operators: mode -> non letters.
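As a rough picture of what these two settings produce, the following Python sketch splits a text on non-letter characters and counts how often each lower-cased token occurs. It only approximates the RapidMiner operators: the regular expression covers ASCII letters only, and the function names and the sample sentence are invented for the example.

```python
import re
from collections import Counter


def tokenize_non_letters(text):
    """Split on every run of non-letter characters (ASCII-only approximation)."""
    return [token for token in re.split(r"[^A-Za-z]+", text) if token]


def term_occurrences(text):
    """Absolute count of each lower-cased token, i.e. a raw term-occurrence vector."""
    return Counter(token.lower() for token in tokenize_non_letters(text))


# Hypothetical requirement text.
counts = term_occurrences("Reset the user's password; password reset e-mail.")
print(counts)
```

Term occurrences are plain counts, so a word that appears in a document always gets a non-zero entry, unlike weighting schemes that can scale a term down to zero.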

     

    The process:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.000-BETA">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.000-BETA" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_excel" compatibility="9.0.000-BETA" expanded="true" height="68" name="Read Excel" width="90" x="45" y="391">
    <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Cross_Distances\Cross_Distances.xlsx"/>
    <list key="annotations"/>
    <parameter key="date_format" value="MMM d, yyyy h:mm:ss a z"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Id.true.integer.attribute"/>
    <parameter key="1" value="Text.true.polynominal.attribute"/>
    </list>
    <parameter key="read_not_matching_values_as_missings" value="false"/>
    </operator>
    <operator activated="true" class="remove_duplicates" compatibility="9.0.000-BETA" expanded="true" height="103" name="Remove Duplicates" width="90" x="45" y="493"/>
    <operator activated="true" class="set_role" compatibility="9.0.000-BETA" expanded="true" height="82" name="Set Role" width="90" x="179" y="391">
    <parameter key="attribute_name" value="Id"/>
    <parameter key="target_role" value="id"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="9.0.000-BETA" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="493"/>
    <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="391">
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_below_absolute" value="2"/>
    <parameter key="prune_above_absolute" value="9999"/>
    <list key="specify_weights">
    <parameter key="B" value="1.0"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34"/>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="112" y="136"/>
    <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="112" y="238"/>
    <operator activated="false" class="wordnet:open_wordnet_dictionary" compatibility="5.3.000" expanded="true" height="68" name="Open WordNet Dictionary" width="90" x="313" y="493">
    <parameter key="directory" value="/Users/macbook/Downloads/WordNet-3.0/dict"/>
    </operator>
    <operator activated="false" class="wordnet:stem_wordnet" compatibility="5.3.000" expanded="true" height="82" name="Stem (WordNet)" width="90" x="313" y="340">
    <parameter key="allow_ambiguity" value="true"/>
    <parameter key="keep_unmatched_stems" value="true"/>
    <parameter key="keep_unmatched_tokens" value="true"/>
    </operator>
    <operator activated="false" class="wordnet:find_synonym_wordnet" compatibility="5.3.000" expanded="true" height="82" name="Find Synonyms (WordNet)" width="90" x="447" y="340">
    <parameter key="use_prefix" value="false"/>
    <parameter key="multiple_meanings_per_word_policy" value="Take only first meaning"/>
    <parameter key="keep_original_tokens" value="true"/>
    <parameter key="keep_unmatched_tokens" value="true"/>
    </operator>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
    <connect from_op="Open WordNet Dictionary" from_port="dictionary" to_op="Stem (WordNet)" to_port="dictionary"/>
    <connect from_op="Stem (WordNet)" from_port="document" to_op="Find Synonyms (WordNet)" to_port="document"/>
    <connect from_op="Stem (WordNet)" from_port="dictionary" to_op="Find Synonyms (WordNet)" to_port="dictionary"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="false" class="concurrency:k_means" compatibility="9.0.000-BETA" expanded="true" height="82" name="Clustering" width="90" x="782" y="34">
    <parameter key="k" value="40"/>
    <parameter key="determine_good_start_values" value="true"/>
    </operator>
    <operator activated="true" class="read_excel" compatibility="9.0.000-BETA" expanded="true" height="68" name="Read Excel (2)" width="90" x="45" y="85">
    <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Cross_Distances\Cross_Distances.xlsx"/>
    <parameter key="sheet_number" value="2"/>
    <list key="annotations"/>
    <parameter key="date_format" value="MMM d, yyyy h:mm:ss a z"/>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="Id.true.integer.attribute"/>
    <parameter key="1" value="text.true.polynominal.attribute"/>
    </list>
    <parameter key="read_not_matching_values_as_missings" value="false"/>
    </operator>
    <operator activated="true" class="remove_duplicates" compatibility="9.0.000-BETA" expanded="true" height="103" name="Remove Duplicates (2)" width="90" x="45" y="187"/>
    <operator activated="true" class="set_role" compatibility="9.0.000-BETA" expanded="true" height="82" name="Set Role (2)" width="90" x="179" y="85">
    <parameter key="attribute_name" value="Id"/>
    <parameter key="target_role" value="id"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="9.0.000-BETA" expanded="true" height="82" name="Nominal to Text (2)" width="90" x="179" y="187">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data (2)" width="90" x="313" y="85">
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_below_absolute" value="2"/>
    <parameter key="prune_above_absolute" value="9999"/>
    <list key="specify_weights">
    <parameter key="B" value="1.0"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="112" y="34"/>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="112" y="136"/>
    <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases (2)" width="90" x="112" y="238"/>
    <operator activated="false" class="wordnet:open_wordnet_dictionary" compatibility="5.3.000" expanded="true" height="68" name="Open WordNet Dictionary (2)" width="90" x="313" y="544">
    <parameter key="directory" value="/Users/macbook/Downloads/WordNet-3.0/dict"/>
    </operator>
    <operator activated="false" class="wordnet:stem_wordnet" compatibility="5.3.000" expanded="true" height="82" name="Stem (2)" width="90" x="313" y="391">
    <parameter key="allow_ambiguity" value="true"/>
    <parameter key="keep_unmatched_stems" value="true"/>
    <parameter key="keep_unmatched_tokens" value="true"/>
    </operator>
    <operator activated="false" class="wordnet:find_synonym_wordnet" compatibility="5.3.000" expanded="true" height="82" name="Find Synonyms (2)" width="90" x="447" y="391">
    <parameter key="use_prefix" value="false"/>
    <parameter key="multiple_meanings_per_word_policy" value="Take only first meaning"/>
    <parameter key="keep_original_tokens" value="true"/>
    <parameter key="keep_unmatched_tokens" value="true"/>
    </operator>
    <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
    <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
    <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
    <connect from_op="Transform Cases (2)" from_port="document" to_port="document 1"/>
    <connect from_op="Open WordNet Dictionary (2)" from_port="dictionary" to_op="Stem (2)" to_port="dictionary"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="multiply" compatibility="9.0.000-BETA" expanded="true" height="103" name="Multiply" width="90" x="447" y="85"/>
    <operator activated="true" class="order_attributes" compatibility="9.0.000-BETA" expanded="true" height="82" name="Reorder Attributes" width="90" x="447" y="238">
    <parameter key="sort_mode" value="reference data"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="9.0.000-BETA" expanded="true" height="103" name="Multiply (2)" width="90" x="581" y="340"/>
    <operator activated="true" class="order_attributes" compatibility="9.0.000-BETA" expanded="true" height="82" name="Reorder Attributes (2)" width="90" x="581" y="187">
    <parameter key="sort_mode" value="reference data"/>
    </operator>
    <operator activated="false" class="data_to_similarity" compatibility="9.0.000-BETA" expanded="true" height="82" name="Data to Similarity" width="90" x="648" y="34">
    <parameter key="measure_types" value="NumericalMeasures"/>
    <parameter key="numerical_measure" value="CosineSimilarity"/>
    </operator>
    <operator activated="true" class="superset" compatibility="9.0.000-BETA" expanded="true" height="82" name="Superset" width="90" x="782" y="238"/>
    <operator activated="true" class="cross_distances" compatibility="9.0.000-BETA" expanded="true" height="103" name="Cross Distances" width="90" x="916" y="238">
    <parameter key="measure_types" value="NumericalMeasures"/>
    <parameter key="nominal_measure" value="SimpleMatchingSimilarity"/>
    <parameter key="numerical_measure" value="CosineSimilarity"/>
    </operator>
    <connect from_op="Read Excel" from_port="output" to_op="Remove Duplicates" to_port="example set input"/>
    <connect from_op="Remove Duplicates" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="Reorder Attributes" to_port="example set input"/>
    <connect from_op="Read Excel (2)" from_port="output" to_op="Remove Duplicates (2)" to_port="example set input"/>
    <connect from_op="Remove Duplicates (2)" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
    <connect from_op="Set Role (2)" from_port="example set output" to_op="Nominal to Text (2)" to_port="example set input"/>
    <connect from_op="Nominal to Text (2)" from_port="example set output" to_op="Process Documents from Data (2)" to_port="example set"/>
    <connect from_op="Process Documents from Data (2)" from_port="example set" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Reorder Attributes (2)" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Reorder Attributes" to_port="reference_data"/>
    <connect from_op="Reorder Attributes" from_port="example set output" to_op="Multiply (2)" to_port="input"/>
    <connect from_op="Multiply (2)" from_port="output 1" to_op="Reorder Attributes (2)" to_port="reference_data"/>
    <connect from_op="Multiply (2)" from_port="output 2" to_op="Superset" to_port="example set 2"/>
    <connect from_op="Reorder Attributes (2)" from_port="example set output" to_op="Superset" to_port="example set 1"/>
    <connect from_op="Superset" from_port="superset 1" to_op="Cross Distances" to_port="request set"/>
    <connect from_op="Superset" from_port="superset 2" to_op="Cross Distances" to_port="reference set"/>
    <connect from_op="Cross Distances" from_port="result set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <description align="center" color="gray" colored="true" height="163" resized="true" width="142" x="28" y="320">Read Requirements Document</description>
    <description align="center" color="gray" colored="true" height="147" resized="true" width="126" x="30" y="14">Read Requirements Change Requests</description>
    </process>
    </operator>
    </process>
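
    For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of what the second half of the process above is meant to do: build term-occurrence vectors for both documents over one shared vocabulary, then compute the cosine similarity of every request against every reference. The function names, the tiny stopword list and the sample sentences are all hypothetical, and the sketch assumes that a term absent from a text counts as 0. It also shows one way an undefined value can arise: if a vector is all zeros, the cosine is not defined, and RapidMiner displays such missing values as '?'.

```python
import math
import re
from collections import Counter

# Hypothetical mini stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "to", "of", "and", "shall", "be"}


def term_occurrences(text):
    """Tokenize on non-letters, lowercase, drop stopwords, count terms."""
    tokens = (t.lower() for t in re.findall(r"[A-Za-z]+", text))
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors.

    Returns NaN when either vector has zero length (all zeros),
    because the similarity is undefined in that case.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return float("nan")
    return dot / (norm_u * norm_v)


def cross_similarities(requests, references):
    """Compare every request text with every reference text."""
    req_counts = [term_occurrences(t) for t in requests]
    ref_counts = [term_occurrences(t) for t in references]

    # Shared, consistently ordered vocabulary, so that position k means
    # the same term in every vector (assumption: absent terms count as 0).
    vocabulary = sorted(set().union(*req_counts, *ref_counts))

    def to_vector(counts):
        return [counts[term] for term in vocabulary]

    results = []
    for i, req in enumerate(req_counts):
        for j, ref in enumerate(ref_counts):
            sim = cosine_similarity(to_vector(req), to_vector(ref))
            results.append((i, j, sim))
    return results


if __name__ == "__main__":
    # Made-up sample texts, not taken from the attached data.
    requests = ["Export the report to PDF"]
    references = ["The system shall export a report",
                  "User login and password reset"]
    for i, j, sim in cross_similarities(requests, references):
        print(f"request {i} vs reference {j}: {sim:.3f}")
```

    The important design point is the shared vocabulary: both sets of vectors must have the same attributes in the same order before the similarity is computed, otherwise the numbers being multiplied do not refer to the same terms.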

    I hope it helps,

    Regards,

    Lionel

  • lionelderkrikor Posts: 1,051   Unicorn
    Solution Accepted

    Hi (one more time...) @asafwat,

    Just one (last?) small piece of advice: you don't need to specify that an attribute is "regular" in the Set Role operator. By default, RapidMiner automatically assigns the "regular" role to every attribute.

    Regards,

    Lionel

Answers

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,051   Unicorn

    Hi @asafwat,

    Are your attributes "numerical"?

    Can you share your dataset(s) so that we can reproduce what you observe?

    Regards,

    Lionel

  • asafwat Member Posts: 4 Contributor I

    Sure, here it is. I have converted it to CSV in order to attach it.

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,051   Unicorn

    Hi again @asafwat,

    I am having difficulties with your CSV file. Can you send me your original Excel file, either by:

     - zipping it and attaching it to this post, or

     - uploading it to Google Drive and sharing the link here in the forum?

    Regards,

    Lionel

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,051   Unicorn

    Hi again (and again) @asafwat,

    Can you also send me your WordNet dictionary (zipped, for example)?

    Regards,

    Lionel

  • asafwat Member Posts: 4 Contributor I

    @lionelderkrikor Wow, it works! Great effort, you really made my day. Much appreciated.

    Thanks a lot
