
How can I speed up the 'Cross Distances' operator on binary data?

sfmorais Member Posts: 13 Contributor II
edited November 2018 in Help

In the middle of my process I have to use the 'Cross Distances' operator many times to calculate the distance from one example to all the others (inside a 'Loop Examples' operator).

My example sets have on average 10,000 rows (examples) by 15,000 attributes of binary data (0 or 1).

The problem is that the 'Cross Distances' operator takes a long time to compute the distances, and it gets slower and slower.

My computer is recent (4 GB RAM and an i7 processor).

Since my data has a particular scale (0 or 1), is there any way to speed this up? Or is there a faster operator, for example one used in the text mining area?
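To make concrete what I mean by exploiting the 0/1 scale: for binary vectors, cosine similarity reduces to the number of shared ones divided by the square root of the product of the two counts of ones. Below is a minimal sketch of that idea outside RapidMiner, in plain Python. The function names (`pack_bits`, `binary_cosine`, `similarities_to_all`) are hypothetical, not RapidMiner operators or API calls, and returning 0 for an all-zero row is an assumption of this sketch.

```python
import math

def pack_bits(row):
    """Pack a sequence of 0/1 values into one Python int used as a bitset."""
    bits = 0
    for i, value in enumerate(row):
        if value:
            bits |= 1 << i
    return bits

def popcount(x):
    """Count the bits set to 1 in a non-negative int."""
    return bin(x).count("1")

def binary_cosine(a_bits, b_bits):
    """Cosine similarity of two 0/1 vectors stored as bitsets."""
    ones_a, ones_b = popcount(a_bits), popcount(b_bits)
    if ones_a == 0 or ones_b == 0:
        return 0.0  # assumption: an all-zero row is treated as similarity 0
    shared = popcount(a_bits & b_bits)  # dot product of two 0/1 vectors
    return shared / math.sqrt(ones_a * ones_b)

def similarities_to_all(rows, query_index):
    """Similarity of one example to every other example, as (index, value) pairs."""
    packed = [pack_bits(r) for r in rows]  # pack once, reuse for every query
    query = packed[query_index]
    return [(i, binary_cosine(query, p))
            for i, p in enumerate(packed) if i != query_index]

# Tiny illustrative data set (made up for this sketch)
rows = [[1, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 0]]
result = similarities_to_all(rows, 0)
print(result)
```

The point of the sketch is that the rows are packed once and then reused for every query example, instead of filtering and recomputing on the full numeric table in every loop iteration. Whether something equivalent can be done inside RapidMiner is exactly what I am asking.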

A short sketch of my process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.003">
  <operator activated="true" class="process" compatibility="5.2.003" expanded="true" name="Process">
    <parameter key="random_seed" value="-1"/>
    <process expanded="true" height="535" width="480">
      <operator activated="true" class="generate_data" compatibility="5.2.003" expanded="true" height="60" name="Generate Data" width="90" x="45" y="30">
        <parameter key="number_examples" value="10000"/>
        <parameter key="number_of_attributes" value="15000"/>
        <parameter key="attributes_lower_bound" value="0.0"/>
        <parameter key="attributes_upper_bound" value="2.0"/>
        <parameter key="datamanagement" value="int_array"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="5.2.003" expanded="true" height="76" name="Generate ID" width="90" x="179" y="30"/>
      <operator activated="true" class="loop_examples" compatibility="5.2.003" expanded="true" height="94" name="Loop Examples" width="90" x="313" y="30">
        <process expanded="true" height="535" width="567">
          <operator activated="true" class="extract_macro" compatibility="5.2.003" expanded="true" height="60" name="Extract Macro" width="90" x="45" y="30">
            <parameter key="macro" value="id_value"/>
            <parameter key="macro_type" value="data_value"/>
            <parameter key="attribute_name" value="id"/>
            <parameter key="example_index" value="%{example}"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="5.2.003" expanded="true" height="76" name="Filter Examples" width="90" x="179" y="30">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="id=%{id_value}"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="5.2.003" expanded="true" height="76" name="Filter Examples (2)" width="90" x="313" y="165">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="id=%{id_value}"/>
            <parameter key="invert_filter" value="true"/>
          </operator>
          <operator activated="true" class="cross_distances" compatibility="5.2.003" expanded="true" height="94" name="Cross Distances" width="90" x="447" y="30">
            <parameter key="measure_types" value="NumericalMeasures"/>
            <parameter key="numerical_measure" value="CosineSimilarity"/>
            <parameter key="only_top_k" value="true"/>
            <parameter key="search_for" value="farthest"/>
            <parameter key="compute_similarities" value="true"/>
          </operator>
          <connect from_port="example set" to_op="Extract Macro" to_port="example set"/>
          <connect from_op="Extract Macro" from_port="example set" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Cross Distances" to_port="reference set"/>
          <connect from_op="Filter Examples" from_port="original" to_op="Filter Examples (2)" to_port="example set input"/>
          <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Cross Distances" to_port="request set"/>
          <connect from_op="Cross Distances" from_port="result set" to_port="output 1"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="sink_example set" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Loop Examples" to_port="example set"/>
      <connect from_op="Loop Examples" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Many thanks
