How long can the process be?

Newplayer (Member, Posts: 2, Newbie)
Hi, I have a dataset with 5000 rows and 9 columns. I am trying to run a process that fills in wrong/missing values with the average. The process has not finished: I have waited at least 1 hour and it is still running. Is this normal? By the way, my computer is a Mac Pro that was produced in 2014.

Answers

  • rfuentealba (Moderator, RapidMiner Certified Analyst, Member, University Professor; Posts: 568; Unicorn)
    It is hard to know whether this is normal if you don't provide more details. With my MacBook Air I could typically fill the missing values for 12000 columns in less than 9 minutes. Please post your XML so that we can take a look and see if there is anything we can do to help you optimize this process.

    Hope this helps.
  • Newplayer (Member, Posts: 2, Newbie)
    <?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" breakpoints="after" class="retrieve" compatibility="9.1.000" expanded="true" height="68" name="Retrieve adult" width="90" x="112" y="187">
            <parameter key="repository_entry" value="//Local Repository/data/adult"/>
          </operator>
          <operator activated="true" class="subprocess" compatibility="9.1.000" expanded="true" height="103" name="Subprocess" width="90" x="380" y="136">
            <process expanded="true">
              <operator activated="true" class="replace_missing_values" compatibility="9.1.000" expanded="true" height="103" name="Replace Missing Values" width="90" x="45" y="238">
                <parameter key="return_preprocessing_model" value="false"/>
                <parameter key="create_view" value="false"/>
                <parameter key="attribute_filter_type" value="all"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
                <parameter key="default" value="average"/>
                <list key="columns"/>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="238">
                <parameter key="parameter_expression" value=""/>
                <parameter key="condition_class" value="no_missing_labels"/>
                <parameter key="invert_filter" value="false"/>
                <list key="filters_list"/>
                <parameter key="filters_logic_and" value="true"/>
                <parameter key="filters_check_metadata" value="true"/>
              </operator>
              <operator activated="true" class="discretize_by_bins" compatibility="9.1.000" expanded="true" height="103" name="Discretize" width="90" x="246" y="34">
                <parameter key="return_preprocessing_model" value="false"/>
                <parameter key="create_view" value="false"/>
                <parameter key="attribute_filter_type" value="subset"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value="hours-per-week|education-num|age"/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="numeric"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="real"/>
                <parameter key="block_type" value="value_series"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_series_end"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
                <parameter key="number_of_bins" value="4"/>
                <parameter key="define_boundaries" value="false"/>
                <parameter key="range_name_type" value="long"/>
                <parameter key="automatic_number_of_digits" value="true"/>
                <parameter key="number_of_digits" value="3"/>
              </operator>
              <operator activated="true" class="detect_outlier_distances" compatibility="9.1.000" expanded="true" height="82" name="Detect Outlier (Distances)" width="90" x="447" y="391">
                <parameter key="number_of_neighbors" value="1"/>
                <parameter key="number_of_outliers" value="2"/>
                <parameter key="distance_function" value="euclidian distance"/>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="9.1.000" expanded="true" height="103" name="Filter Examples (2)" width="90" x="447" y="238">
                <parameter key="parameter_expression" value=""/>
                <parameter key="condition_class" value="custom_filters"/>
                <parameter key="invert_filter" value="false"/>
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="outlier.does_not_equal.true"/>
                </list>
                <parameter key="filters_logic_and" value="true"/>
                <parameter key="filters_check_metadata" value="true"/>
              </operator>
              <operator activated="true" class="multiply" compatibility="9.1.000" expanded="true" height="82" name="Multiply" width="90" x="447" y="34"/>
              <operator activated="true" class="select_attributes" compatibility="9.1.000" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="85">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="outlier"/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
              </operator>
              <operator activated="true" class="weight_by_information_gain" compatibility="9.1.000" expanded="true" height="82" name="Weight by Information Gain" width="90" x="581" y="238">
                <parameter key="normalize_weights" value="true"/>
                <parameter key="sort_weights" value="true"/>
                <parameter key="sort_direction" value="descending"/>
              </operator>
              <operator activated="true" class="select_by_weights" compatibility="9.1.000" expanded="true" height="103" name="Select by Weights" width="90" x="715" y="238">
                <parameter key="weight_relation" value="top k"/>
                <parameter key="weight" value="1.0"/>
                <parameter key="k" value="5"/>
                <parameter key="p" value="0.5"/>
                <parameter key="deselect_unknown" value="true"/>
                <parameter key="use_absolute_weights" value="true"/>
              </operator>
              <connect from_port="in 1" to_op="Replace Missing Values" to_port="example set input"/>
              <connect from_op="Replace Missing Values" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
              <connect from_op="Filter Examples" from_port="example set output" to_op="Discretize" to_port="example set input"/>
              <connect from_op="Discretize" from_port="example set output" to_op="Detect Outlier (Distances)" to_port="example set input"/>
              <connect from_op="Detect Outlier (Distances)" from_port="example set output" to_op="Filter Examples (2)" to_port="example set input"/>
              <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Multiply" to_port="input"/>
              <connect from_op="Multiply" from_port="output 1" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Weight by Information Gain" to_port="example set"/>
              <connect from_op="Weight by Information Gain" from_port="weights" to_op="Select by Weights" to_port="weights"/>
              <connect from_op="Weight by Information Gain" from_port="example set" to_op="Select by Weights" to_port="example set input"/>
              <connect from_op="Select by Weights" from_port="example set output" to_port="out 1"/>
              <connect from_op="Select by Weights" from_port="weights" to_port="out 2"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
              <portSpacing port="sink_out 3" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve adult" from_port="output" to_op="Subprocess" to_port="in 1"/>
          <connect from_op="Subprocess" from_port="out 1" to_port="result 1"/>
          <connect from_op="Subprocess" from_port="out 2" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>

  • sgenzer (Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator; Posts: 2,959; Community Manager)
    Hi @Newplayer - so I looked at your XML and there is nothing wrong from what I can see without your data set. But why not just try it with a reduced number of attributes and see how long that takes first?

    Scott

  • David_A (Administrator, Moderator, Employee, RMResearcher, Member; Posts: 297; RM Research)
    Hi,

    I also tested your process with some similar test data. Your setup looks good. What takes so long is the outlier detection, as it has to compare every pair of points.
    Take a look at the "Anomaly Detection" extension on the Marketplace. It offers several more performant algorithms. The only change you have to account for is that they usually return an outlier score ("how outlier-ish is that point") rather than a binary decision (outlier = yes/no).
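
To make the cost argument concrete, here is a minimal Python sketch of a brute-force, distance-based outlier check. It is not RapidMiner's implementation and not the Anomaly Detection extension's API; the function names (`pair_count`, `knn_outlier_scores`, `top_n_outliers`) and the toy points are made up for illustration. The `k` and `n` arguments only loosely mirror the `number_of_neighbors` and `number_of_outliers` parameters in the posted XML. It shows how the number of point pairs grows with the row count, and how a per-point score differs from a yes/no flag.

```python
import math


def pair_count(n):
    """Number of unordered point pairs a naive pairwise comparison visits."""
    return n * (n - 1) // 2


def knn_outlier_scores(points, k=1):
    """Score each point by the distance to its k-th nearest neighbor.

    Brute force: every point is compared with every other point,
    so the work grows quadratically with the number of rows.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_n_outliers(scores, n=2):
    """Turn scores into a binary decision by flagging the n highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n])
    return [i in flagged for i in range(len(scores))]


# Pairs to compare for the 5000 rows mentioned in the question.
print(pair_count(5000))

# Hypothetical toy data: four points on a unit square plus one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = knn_outlier_scores(points, k=1)  # continuous "how outlier-ish" score
flags = top_n_outliers(scores, n=1)       # binary outlier = yes/no
print(scores)
print(flags)
```

A score-based result can be cut into a binary flag afterwards, either by taking the top n points as above or by choosing a threshold, so a downstream filter on a true/false outlier attribute can still be kept.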