
How to speed up the 'Cross Distances' operator with binary data?

sfmorais Member Posts: 13 Contributor II
edited November 2018 in Help
Hi!

In the middle of my process I have to run the 'Cross Distances' operator many times, to calculate the distance from one example to all the others (inside a 'Loop Examples' operator).

My example sets have on average 10,000 rows (examples) by 15,000 attributes, all binary (0 or 1).

The problem is that the 'Cross Distances' operator takes a long time to compute the distances, and it gets slower and slower.

My computer is recent (4 GB RAM and an i7 processor).

Since my data has a particular scale (0 or 1), is there any other way to speed this up? Or is there a quicker operator, for example one used in text mining?
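To illustrate what I mean by exploiting the 0/1 scale, here is a minimal sketch in plain Python (not RapidMiner code; the function names `pack_bits`, `popcount`, `binary_cosine` and `similarities_to_others` are hypothetical, made up for this example). For binary rows, the dot product is simply the number of positions where both rows are 1, and each squared norm is the number of 1s in the row, so the cosine similarity reduces to bit counting. I am assuming here that an all-zero row should get similarity 0.

```python
import math

def pack_bits(row):
    """Pack a sequence of 0/1 values into one Python int (bit i = row[i])."""
    bits = 0
    for i, value in enumerate(row):
        if value:
            bits |= 1 << i
    return bits

def popcount(bits):
    """Count the 1 bits in a non-negative int."""
    return bin(bits).count("1")

def binary_cosine(a_bits, b_bits):
    """Cosine similarity of two 0/1 vectors given as packed ints."""
    ones_a = popcount(a_bits)
    ones_b = popcount(b_bits)
    if ones_a == 0 or ones_b == 0:
        return 0.0  # assumption: an all-zero row is treated as similarity 0
    shared = popcount(a_bits & b_bits)  # positions where both rows are 1
    return shared / math.sqrt(ones_a * ones_b)

def similarities_to_others(packed, index):
    """Similarity of the row at `index` to every other row."""
    ref = packed[index]
    return {i: binary_cosine(ref, p) for i, p in enumerate(packed) if i != index}

# Tiny made-up data set, only for illustration.
rows = [
    [1, 0, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
packed = [pack_bits(r) for r in rows]  # pack once, outside the loop
sims = similarities_to_others(packed, 0)
```

In this sketch the rows are packed once before looping, so each per-example pass only does bitwise ANDs and bit counts. Whether an existing operator already works this way internally is exactly what I am asking.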


Here is a short sketch of my process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
    <parameter key="random_seed" value="-1"/>
    <process expanded="true" height="535" width="480">
      <operator activated="true" class="generate_data" compatibility="5.2.003" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
        <parameter key="number_examples" value="10000"/>
        <parameter key="number_of_attributes" value="15000"/>
        <parameter key="attributes_lower_bound" value="0.0"/>
        <parameter key="attributes_upper_bound" value="2.0"/>
        <parameter key="datamanagement" value="int_array"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="5.2.003" expanded="true" height="76" name="Generate ID" width="90" x="179" y="30"/>
      <operator activated="true" class="loop_examples" compatibility="5.2.003" expanded="true" height="94" name="Loop Examples" width="90" x="313" y="30">
        <process expanded="true" height="535" width="567">
          <operator activated="true" class="extract_macro" compatibility="5.2.003" expanded="true" height="60" name="Extract Macro" width="90" x="45" y="30">
            <parameter key="macro" value="id_value"/>
            <parameter key="macro_type" value="data_value"/>
            <parameter key="attribute_name" value="id"/>
            <parameter key="example_index" value="%{example}"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="5.2.003" expanded="true" height="76" name="Filter Examples" width="90" x="179" y="30">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="id=%{id_value}"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="5.2.003" expanded="true" height="76" name="Filter Examples (2)" width="90" x="313" y="165">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="id=%{id_value}"/>
            <parameter key="invert_filter" value="true"/>
          </operator>
          <operator activated="true" class="cross_distances" compatibility="5.2.003" expanded="true" height="94" name="Cross Distances" width="90" x="447" y="30">
            <parameter key="measure_types" value="NumericalMeasures"/>
            <parameter key="numerical_measure" value="CosineSimilarity"/>
            <parameter key="only_top_k" value="true"/>
            <parameter key="search_for" value="farthest"/>
            <parameter key="compute_similarities" value="true"/>
          </operator>
          <connect from_port="example set" to_op="Extract Macro" to_port="example set"/>
          <connect from_op="Extract Macro" from_port="example set" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Cross Distances" to_port="reference set"/>
          <connect from_op="Filter Examples" from_port="original" to_op="Filter Examples (2)" to_port="example set input"/>
          <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Cross Distances" to_port="request set"/>
          <connect from_op="Cross Distances" from_port="result set" to_port="output 1"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_example set" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Loop Examples" to_port="example set"/>
      <connect from_op="Loop Examples" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Many thanks

Regards,
Sérgio