Cross Distances between two identical vectors is not 0

tiramisusann (Member, Posts: 9, Contributor II)
edited November 2018 in Help
Hi,

I'm using two identical texts, both processed the same way, and calculating the distance between them. The resulting vectors should be absolutely identical, but the Cross Distances operator (numerical measure: CosineSimilarity) reports a distance of 0.045 instead of 0.

Why? Do you have an idea?

All the best!
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="retrieve" compatibility="5.3.015" expanded="true" height="60" name="Retrieve Repository_BI." width="90" x="45" y="30">
       <parameter key="repository_entry" value="//Test/Repository_BI."/>
     </operator>
     <operator activated="true" class="replace_missing_values" compatibility="5.3.015" expanded="true" height="94" name="Replace Missing Values" width="90" x="315" y="30">
       <parameter key="attribute_filter_type" value="single"/>
       <parameter key="attribute" value="TAKE_TEXT"/>
       <parameter key="default" value="value"/>
       <list key="columns"/>
       <parameter key="replenishment_value" value="No"/>
     </operator>
     <operator activated="true" class="filter_examples" compatibility="5.3.015" expanded="true" height="76" name="Filter Examples (2)" width="90" x="450" y="30">
       <parameter key="condition_class" value="attribute_value_filter"/>
       <parameter key="parameter_string" value="TAKE_TEXT=No"/>
       <parameter key="invert_filter" value="true"/>
     </operator>
     <operator activated="true" class="sample" compatibility="5.3.015" expanded="true" height="76" name="Sample" width="90" x="585" y="30">
       <parameter key="balance_data" value="true"/>
       <list key="sample_size_per_class">
         <parameter key="UP" value="400"/>
         <parameter key="DOWN" value="400"/>
       </list>
       <list key="sample_ratio_per_class"/>
       <list key="sample_probability_per_class"/>
     </operator>
     <operator activated="true" class="select_attributes" compatibility="5.3.015" expanded="true" height="76" name="Select Attributes" width="90" x="45" y="120">
       <parameter key="attribute_filter_type" value="subset"/>
       <parameter key="attributes" value="|TAKE_TEXT|HEADLINE"/>
     </operator>
     <operator activated="true" class="text:create_document" compatibility="5.3.002" expanded="true" height="60" name="Create Document" width="90" x="45" y="300">
       <parameter key="text" value="The recently launched corporate foundation of VWR International, LLC, celebrates&#10;its most charitable quarter yet&#10;PHILADELPHIA--(Business Wire)--&#10;VWR International, LLC announced today that its recently launched corporate&#10;foundation, the VWR Foundation, has granted over $60,000 since late December&#10;2009, making it the most charitable quarter yet for the one-year-old Foundation.&#10;The Foundation, which seeks to support research, science education, health and&#10;well-being initiatives across the globe, fulfilled grants to a diverse group of&#10;organizations from The Scripps Research Institute to Doctors Without Borders. &#10;&#10;&quot;Corporate responsibility has been a longstanding element that differentiates&#10;good organizations from great organizations; for VWR, this inspired our&#10;associates to establish the VWR Foundation,&quot; stated John M. Ballbach, Chairman,&#10;President and CEO of VWR and President of the VWR Foundation. &quot;Because enhancing&#10;the environments in which we work and live remains the Foundation`s paramount&#10;objective, our associates designed the Foundation to support areas of research,&#10;science education, health and well-being. These priorities are consistent with&#10;the synergies generated as a distributor of scientific supplies.&quot; &#10;&#10;Holding true to the Foundation`s mission to support innovative research&#10;initiatives, a contribution was made to The Scripps Research Institute, a&#10;research organization that is internationally recognized for its basic research&#10;in immunology, molecular and cellular biology, chemistry, neurosciences,&#10;autoimmune diseases, cardiovascular diseases, virology and synthetic vaccine&#10;development. &#10;&#10;Two grants were awarded this quarter in the area of Science Education. 
The first&#10;was awarded to Schmahl Science Workshop, an organization that networks with&#10;teachers and scientists throughout the country to provide hands-on science&#10;activities for kids in a free-form environment. The second grant was awarded to&#10;the Science Museum of Minnesota, a large regional science museum located in&#10;downtown St. Paul that provides science education to an audience of more than&#10;one million students and science enthusiasts per year. &#10;&#10;The VWR Foundation also made health and well-being a primary focus of its&#10;giving. In the wake of the devastating earthquake in Port-au-Prince earlier this&#10;year, the Foundation made a contribution to Doctors Without Borders to support&#10;the volunteer doctors and nurses providing urgent medical care to Haitian&#10;victims. In addition, the Foundation donated to Professionals Analyzing Pap&#10;Smears, Inc., a healthcare team composed of volunteer physicians, nurse&#10;practitioners, nurses and cyto-technologists that establish cervical cancer&#10;screening clinics in developing countries. &#10;&#10;Most notably, VWR International, LLC and the VWR Foundation joined together to&#10;host a silent auction at the company`s North American Sales Meeting earlier this&#10;year. All proceeds from this event were donated to the Center for Cancer and&#10;Blood Disorders at the Children`s Medical Center Dallas.&#10;&#10;About VWR Foundation&#10;&#10;The VWR Foundation was started by five associates of VWR International, LLC who&#10;wanted to make a difference in the areas in which they worked and lived. The&#10;Foundation was officially established in January 2009 and focuses on research,&#10;health and well-being and science education. For more information about the VWR&#10;Foundation, visit www.VWRfoundation.org. 
&#10;&#10;About VWR International, LLC&#10;&#10;VWR International, LLC, headquartered in West Chester, Pennsylvania, is a global&#10;laboratory supply and distribution company with worldwide sales in excess of&#10;$3.5 billion in 2009. VWR enables the advancement of the world`s most critical&#10;research through the distribution of a highly diversified product line to most&#10;of the world`s top pharmaceutical and biotech companies, as well as industrial,&#10;educational, and governmental organizations. With 150 years of industry&#10;experience, VWR offers a well-established distribution network that reaches&#10;thousands of specialized labs and facilities spanning the globe. VWR has over&#10;6,500 associates around the world working to streamline the way researchers&#10;across North America, Europe, and Asia stock and maintain their labs. In&#10;addition, VWR further supports its customers by providing onsite services,&#10;storeroom management, product procurement, supply chain systems integration, and&#10;technical services. &#10;&#10;For more information on VWR International, phone 1-800-932-5000, visit&#10;www.vwr.com, or write, VWR International, LLC, 1310 Goshen Parkway, P.O. Box&#10;2656, West Chester, PA 19380-0906. &#10;&#10;VWR and design are registered trademarks of VWR International, LLC. &#10; &#10;VWR International, LLC&#10;Valerie Collado, 610-429-2796&#10;valerie_collado@vwr.com&#10;or&#10;Brownstein Group&#10;Laura Van De Pette, 267-238-4118&#10;lvandepette@brownsteingroup.com&#10;&#10;&#10;&#10;Copyright Business Wire 2010 &#10; &#10;"/>
     </operator>
     <operator activated="true" class="select_attributes" compatibility="5.3.015" expanded="true" height="76" name="Select Attributes (2)" width="90" x="246" y="120">
       <parameter key="attribute_filter_type" value="single"/>
       <parameter key="attribute" value="TREND"/>
       <parameter key="attributes" value="|TAKE_TEXT|HEADLINE"/>
       <parameter key="invert_selection" value="true"/>
       <parameter key="include_special_attributes" value="true"/>
     </operator>
     <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="380" y="120">
       <parameter key="keep_text" value="true"/>
       <parameter key="prune_method" value="absolute"/>
       <parameter key="prune_below_absolute" value="3"/>
       <parameter key="prune_above_absolute" value="9999"/>
       <parameter key="select_attributes_and_weights" value="true"/>
       <list key="specify_weights">
         <parameter key="TAKE_TEXT" value="1.0"/>
         <parameter key="HEADLINE" value="1.0"/>
       </list>
       <parameter key="parallelize_vector_creation" value="true"/>
       <process expanded="true">
         <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
         <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="180" y="30"/>
         <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="315" y="30"/>
         <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (Porter)" width="90" x="450" y="30"/>
         <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.002" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="606" y="30">
           <parameter key="max_length" value="3"/>
         </operator>
         <connect from_port="document" to_op="Tokenize" to_port="document"/>
         <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
         <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
         <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
         <connect from_op="Stem (Porter)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
         <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <operator activated="true" class="k_means" compatibility="5.3.015" expanded="true" height="76" name="Clustering (2)" width="90" x="514" y="120">
       <parameter key="k" value="60"/>
       <parameter key="measure_types" value="NumericalMeasures"/>
       <parameter key="max_optimization_steps" value="200"/>
     </operator>
     <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Process Documents" width="90" x="246" y="255">
       <parameter key="keep_text" value="true"/>
       <process expanded="true">
         <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="313" y="210"/>
         <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (2)" width="90" x="314" y="120"/>
         <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="315" y="30"/>
         <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (2)" width="90" x="450" y="30"/>
         <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.3.002" expanded="true" height="60" name="Generate n-Grams (2)" width="90" x="585" y="30">
           <parameter key="max_length" value="3"/>
         </operator>
         <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
         <connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
         <connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
         <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
         <connect from_op="Stem (2)" from_port="document" to_op="Generate n-Grams (2)" to_port="document"/>
         <connect from_op="Generate n-Grams (2)" from_port="document" to_port="document 1"/>
         <portSpacing port="source_document" spacing="0"/>
         <portSpacing port="sink_document 1" spacing="0"/>
         <portSpacing port="sink_document 2" spacing="0"/>
       </process>
     </operator>
     <operator activated="true" class="multiply" compatibility="5.3.015" expanded="true" height="94" name="Multiply" width="90" x="447" y="255"/>
     <operator activated="true" class="apply_model" compatibility="5.3.015" expanded="true" height="76" name="Apply Model" width="90" x="648" y="165">
       <list key="application_parameters"/>
     </operator>
     <operator activated="true" class="join" compatibility="5.3.015" expanded="true" height="76" name="Join" width="90" x="179" y="435">
       <parameter key="use_id_attribute_as_key" value="false"/>
       <list key="key_attributes">
         <parameter key="cluster" value="cluster"/>
       </list>
     </operator>
     <operator activated="true" class="select_attributes" compatibility="5.3.015" expanded="true" height="76" name="Select Attributes (3)" width="90" x="313" y="435">
       <parameter key="attribute_filter_type" value="single"/>
       <parameter key="attribute" value="id"/>
       <parameter key="invert_selection" value="true"/>
       <parameter key="include_special_attributes" value="true"/>
     </operator>
     <operator activated="true" class="select_attributes" compatibility="5.3.015" expanded="true" height="76" name="Select Attributes (4)" width="90" x="447" y="435">
       <parameter key="attribute_filter_type" value="single"/>
       <parameter key="attribute" value="cluster"/>
       <parameter key="invert_selection" value="true"/>
       <parameter key="include_special_attributes" value="true"/>
     </operator>
     <operator activated="true" class="cross_distances" compatibility="5.3.015" expanded="true" height="94" name="Cross Distances" width="90" x="581" y="435">
       <parameter key="measure_types" value="NumericalMeasures"/>
       <parameter key="nominal_measure" value="DiceSimilarity"/>
       <parameter key="numerical_measure" value="CosineSimilarity"/>
       <parameter key="only_top_k" value="true"/>
       <parameter key="k" value="3"/>
     </operator>
     <connect from_op="Retrieve Repository_BI." from_port="output" to_op="Replace Missing Values" to_port="example set input"/>
     <connect from_op="Replace Missing Values" from_port="example set output" to_op="Filter Examples (2)" to_port="example set input"/>
     <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Sample" to_port="example set input"/>
     <connect from_op="Sample" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
     <connect from_op="Select Attributes" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
     <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
     <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
     <connect from_op="Process Documents from Data" from_port="example set" to_op="Clustering (2)" to_port="example set"/>
     <connect from_op="Process Documents from Data" from_port="word list" to_op="Process Documents" to_port="word list"/>
     <connect from_op="Clustering (2)" from_port="cluster model" to_op="Apply Model" to_port="model"/>
     <connect from_op="Clustering (2)" from_port="clustered set" to_op="Join" to_port="left"/>
     <connect from_op="Process Documents" from_port="example set" to_op="Multiply" to_port="input"/>
     <connect from_op="Multiply" from_port="output 1" to_op="Cross Distances" to_port="request set"/>
     <connect from_op="Multiply" from_port="output 2" to_op="Apply Model" to_port="unlabelled data"/>
     <connect from_op="Apply Model" from_port="labelled data" to_op="Join" to_port="right"/>
     <connect from_op="Join" from_port="join" to_op="Select Attributes (3)" to_port="example set input"/>
     <connect from_op="Select Attributes (3)" from_port="example set output" to_op="Select Attributes (4)" to_port="example set input"/>
     <connect from_op="Select Attributes (4)" from_port="example set output" to_op="Cross Distances" to_port="reference set"/>
     <connect from_op="Cross Distances" from_port="result set" to_port="result 1"/>
     <connect from_op="Cross Distances" from_port="request set" to_port="result 2"/>
     <connect from_op="Cross Distances" from_port="reference set" to_port="result 3"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
     <portSpacing port="sink_result 3" spacing="0"/>
     <portSpacing port="sink_result 4" spacing="0"/>
   </process>
 </operator>
</process>
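
For comparison with the process above, here is a minimal Python sketch of the expectation stated in the question: two identical vectors should have a cosine similarity of 1. It assumes the distance is derived as 1 - similarity, which is an illustrative convention and not necessarily what the Cross Distances operator computes internally. The function names and the sample vector are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Assumed conversion for illustration: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)


# Hypothetical term-weight vector standing in for one processed document.
v = [0.12, 0.0, 0.37, 0.05]
w = list(v)  # an exact copy, i.e. the "identical text" case

print(cosine_similarity(v, w))  # 1 up to floating-point rounding
print(cosine_distance(v, w))    # 0 up to floating-point rounding
```

Under these assumptions, truly identical inputs give a distance that differs from 0 only by floating-point rounding. A value as large as 0.045 may therefore indicate that the request and reference vectors are not identical after processing, or that the operator reports a different quantity than 1 - similarity.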