Two data sets

Far Member Posts: 3 Newbie
Hi everyone,

I am trying to use two data sets (training and testing) to apply a model. Because my data consists of both text and structured attributes, I divided it into two parts (text and structured) and stored them separately. I need to use three models (multiple regression, GBT, and Neural Net), and I want to test each model on another data set, test.data. However, I don't know how to apply all the processing steps to the test data and then check the model.


So I used a Subprocess operator, put all the steps used for the training data set inside it, and just connected it to Apply Model.

But I'm not sure whether I'm doing the right thing.
However, I have to use both data sets, and I cannot use the Split Data operator instead.

Can anyone help me with that?

Answers

  • varunm1 Member Posts: 733   Unicorn
    Hello @Far,

    Can you share the XML code? To access it, go to View --> Show Panel --> XML, then copy the code and paste it here.

    Thanks
  • Far Member Posts: 3 Newbie
    edited September 19
    Here is my XML:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve Zomato-Text data" width="90" x="45" y="34">
            <parameter key="repository_entry" value="../../Zomato-Text data"/>
          </operator>
          <operator activated="true" class="split_data" compatibility="9.2.001" expanded="true" height="103" name="Split Data" width="90" x="45" y="187">
            <enumeration key="partitions">
              <parameter key="ratio" value="0.7"/>
              <parameter key="ratio" value="0.3"/>
            </enumeration>
            <parameter key="sampling_type" value="automatic"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
          </operator>
          <operator activated="true" class="h2o:gradient_boosted_trees" compatibility="9.2.000" expanded="true" height="103" name="Gradient Boosted Trees" width="90" x="179" y="34">
            <parameter key="number_of_trees" value="20"/>
            <parameter key="reproducible" value="false"/>
            <parameter key="maximum_number_of_threads" value="4"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="maximal_depth" value="5"/>
            <parameter key="min_rows" value="10.0"/>
            <parameter key="min_split_improvement" value="0.0"/>
            <parameter key="number_of_bins" value="20"/>
            <parameter key="learning_rate" value="0.1"/>
            <parameter key="sample_rate" value="1.0"/>
            <parameter key="distribution" value="AUTO"/>
            <parameter key="early_stopping" value="false"/>
            <parameter key="stopping_rounds" value="1"/>
            <parameter key="stopping_metric" value="AUTO"/>
            <parameter key="stopping_tolerance" value="0.001"/>
            <parameter key="max_runtime_seconds" value="0"/>
            <list key="expert_parameters"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="9.2.001" expanded="true" height="82" name="Apply Model" width="90" x="313" y="187">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <operator activated="true" class="performance_regression" compatibility="9.2.001" expanded="true" height="82" name="Performance" width="90" x="380" y="34">
            <parameter key="main_criterion" value="first"/>
            <parameter key="root_mean_squared_error" value="true"/>
            <parameter key="absolute_error" value="true"/>
            <parameter key="relative_error" value="false"/>
            <parameter key="relative_error_lenient" value="false"/>
            <parameter key="relative_error_strict" value="false"/>
            <parameter key="normalized_absolute_error" value="false"/>
            <parameter key="root_relative_squared_error" value="false"/>
            <parameter key="squared_error" value="false"/>
            <parameter key="correlation" value="true"/>
            <parameter key="squared_correlation" value="true"/>
            <parameter key="prediction_average" value="false"/>
            <parameter key="spearman_rho" value="false"/>
            <parameter key="kendall_tau" value="false"/>
            <parameter key="skip_undefined_labels" value="true"/>
            <parameter key="use_example_weights" value="true"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.2.001" expanded="true" height="82" name="Calculate Residuals" width="90" x="447" y="136">
            <list key="function_descriptions">
              <parameter key="Residual" value="rate-[prediction(rate)]"/>
              <parameter key="AbsResidual" value="abs(rate-[prediction(rate)])"/>
            </list>
            <parameter key="keep_all" value="true"/>
          </operator>
          <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve Zomato-testing" width="90" x="112" y="340">
            <parameter key="repository_entry" value="../../Zomato-testing"/>
          </operator>
          <operator activated="true" class="subprocess" compatibility="9.2.001" expanded="true" height="82" name="Subprocess" width="90" x="246" y="340">
            <process expanded="true">
              <operator activated="true" class="filter_examples" compatibility="9.2.001" expanded="true" height="103" name="Filter Examples (2)" width="90" x="45" y="238">
                <parameter key="parameter_expression" value=""/>
                <parameter key="condition_class" value="custom_filters"/>
                <parameter key="invert_filter" value="false"/>
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="reviews_text.is_not_missing."/>
                  <parameter key="filters_entry_key" value="rate.is_not_missing."/>
                  <parameter key="filters_entry_key" value="average_cost.is_not_missing."/>
                  <parameter key="filters_entry_key" value="menu_item.is_not_missing."/>
                  <parameter key="filters_entry_key" value="meal_type.is_not_missing."/>
                </list>
                <parameter key="filters_logic_and" value="true"/>
                <parameter key="filters_check_metadata" value="true"/>
              </operator>
              <operator activated="true" class="sample" compatibility="9.2.001" expanded="true" height="82" name="Sample" width="90" x="179" y="340">
                <parameter key="sample" value="relative"/>
                <parameter key="balance_data" value="false"/>
                <parameter key="sample_size" value="100"/>
                <parameter key="sample_ratio" value="0.5"/>
                <parameter key="sample_probability" value="0.1"/>
                <list key="sample_size_per_class"/>
                <list key="sample_ratio_per_class"/>
                <list key="sample_probability_per_class"/>
                <parameter key="use_local_random_seed" value="false"/>
                <parameter key="local_random_seed" value="1992"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="9.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="112" y="34">
                <parameter key="attribute_filter_type" value="subset"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value="|meal_type|rate|reviews_text|average_cost|votes"/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
              </operator>
              <operator activated="true" class="set_role" compatibility="9.2.001" expanded="true" height="82" name="Set Role" width="90" x="246" y="136">
                <parameter key="attribute_name" value="rate"/>
                <parameter key="target_role" value="label"/>
                <list key="set_additional_roles"/>
              </operator>
              <operator activated="true" class="nominal_to_text" compatibility="9.2.001" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="34">
                <parameter key="attribute_filter_type" value="subset"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value="|reviews_text|meal_type"/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="nominal"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="file_path"/>
                <parameter key="block_type" value="single_value"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="single_value"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
              </operator>
              <operator activated="true" class="text:process_document_from_data" compatibility="8.2.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
                <parameter key="create_word_vector" value="true"/>
                <parameter key="vector_creation" value="TF-IDF"/>
                <parameter key="add_meta_information" value="true"/>
                <parameter key="keep_text" value="false"/>
                <parameter key="prune_method" value="none"/>
                <parameter key="prune_below_percent" value="3.0"/>
                <parameter key="prune_above_percent" value="30.0"/>
                <parameter key="prune_below_rank" value="0.05"/>
                <parameter key="prune_above_rank" value="0.95"/>
                <parameter key="datamanagement" value="double_sparse_array"/>
                <parameter key="data_management" value="auto"/>
                <parameter key="select_attributes_and_weights" value="false"/>
                <list key="specify_weights"/>
                <process expanded="true">
                  <operator activated="true" class="text:transform_cases" compatibility="8.2.000" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="34">
                    <parameter key="transform_to" value="lower case"/>
                  </operator>
                  <operator activated="true" class="text:tokenize" compatibility="8.2.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
                    <parameter key="mode" value="non letters"/>
                    <parameter key="characters" value=".:"/>
                    <parameter key="language" value="English"/>
                    <parameter key="max_token_length" value="3"/>
                  </operator>
                  <operator activated="true" class="text:stem_snowball" compatibility="8.2.000" expanded="true" height="68" name="Stem (Snowball)" width="90" x="313" y="34">
                    <parameter key="language" value="English"/>
                  </operator>
                  <operator activated="true" class="text:filter_stopwords_english" compatibility="8.2.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="447" y="34"/>
                  <operator activated="true" class="text:filter_by_length" compatibility="8.2.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="581" y="34">
                    <parameter key="min_chars" value="4"/>
                    <parameter key="max_chars" value="25"/>
                  </operator>
                  <connect from_port="document" to_op="Transform Cases" to_port="document"/>
                  <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
                  <connect from_op="Tokenize" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
                  <connect from_op="Stem (Snowball)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
                  <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
                  <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_document" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="weight_by_correlation" compatibility="9.2.001" expanded="true" height="82" name="Weight by Correlation" width="90" x="380" y="289">
                <parameter key="normalize_weights" value="false"/>
                <parameter key="sort_weights" value="true"/>
                <parameter key="sort_direction" value="ascending"/>
                <parameter key="squared_correlation" value="false"/>
              </operator>
              <operator activated="true" class="select_by_weights" compatibility="9.2.001" expanded="true" height="103" name="Select by Weights" width="90" x="581" y="34">
                <parameter key="weight_relation" value="top k"/>
                <parameter key="weight" value="1.0"/>
                <parameter key="k" value="30"/>
                <parameter key="p" value="0.5"/>
                <parameter key="deselect_unknown" value="true"/>
                <parameter key="use_absolute_weights" value="true"/>
              </operator>
              <operator activated="true" class="normalize" compatibility="9.2.001" expanded="true" height="103" name="Normalize" width="90" x="581" y="187">
                <parameter key="return_preprocessing_model" value="false"/>
                <parameter key="create_view" value="false"/>
                <parameter key="attribute_filter_type" value="all"/>
                <parameter key="attribute" value=""/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="numeric"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="real"/>
                <parameter key="block_type" value="value_series"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_series_end"/>
                <parameter key="invert_selection" value="false"/>
                <parameter key="include_special_attributes" value="false"/>
                <parameter key="method" value="range transformation"/>
                <parameter key="min" value="0.0"/>
                <parameter key="max" value="1.0"/>
                <parameter key="allow_negative_values" value="false"/>
              </operator>
              <operator activated="true" class="detect_outlier_distances" compatibility="9.2.001" expanded="true" height="82" name="Detect Outlier (Distances)" width="90" x="715" y="34">
                <parameter key="number_of_neighbors" value="7"/>
                <parameter key="number_of_outliers" value="10"/>
                <parameter key="distance_function" value="euclidian distance"/>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="9.2.001" expanded="true" height="103" name="Filter Examples" width="90" x="715" y="187">
                <parameter key="parameter_expression" value=""/>
                <parameter key="condition_class" value="custom_filters"/>
                <parameter key="invert_filter" value="false"/>
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="outlier.equals.false"/>
                </list>
                <parameter key="filters_logic_and" value="true"/>
                <parameter key="filters_check_metadata" value="true"/>
              </operator>
              <operator activated="true" class="principal_component_analysis" compatibility="9.2.001" expanded="true" height="103" name="PCA" width="90" x="849" y="34">
                <parameter key="dimensionality_reduction" value="fixed number"/>
                <parameter key="variance_threshold" value="0.95"/>
                <parameter key="number_of_components" value="5"/>
              </operator>
              <connect from_port="in 1" to_op="Filter Examples (2)" to_port="example set input"/>
              <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Sample" to_port="example set input"/>
              <connect from_op="Sample" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
              <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
              <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
              <connect from_op="Process Documents from Data" from_port="example set" to_op="Weight by Correlation" to_port="example set"/>
              <connect from_op="Weight by Correlation" from_port="weights" to_op="Select by Weights" to_port="weights"/>
              <connect from_op="Weight by Correlation" from_port="example set" to_op="Select by Weights" to_port="example set input"/>
              <connect from_op="Select by Weights" from_port="example set output" to_op="Normalize" to_port="example set input"/>
              <connect from_op="Normalize" from_port="example set output" to_op="Detect Outlier (Distances)" to_port="example set input"/>
              <connect from_op="Detect Outlier (Distances)" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
              <connect from_op="Filter Examples" from_port="example set output" to_op="PCA" to_port="example set input"/>
              <connect from_op="PCA" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve Zomato-Text data" from_port="output" to_op="Split Data" to_port="example set"/>
          <connect from_op="Split Data" from_port="partition 1" to_op="Gradient Boosted Trees" to_port="training set"/>
          <connect from_op="Split Data" from_port="partition 2" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Gradient Boosted Trees" from_port="model" to_op="Apply Model" to_port="model"/>
          <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
          <connect from_op="Apply Model" from_port="model" to_port="result 3"/>
          <connect from_op="Performance" from_port="performance" to_port="result 1"/>
          <connect from_op="Performance" from_port="example set" to_op="Calculate Residuals" to_port="example set input"/>
          <connect from_op="Calculate Residuals" from_port="example set output" to_port="result 2"/>
          <connect from_op="Retrieve Zomato-testing" from_port="output" to_op="Subprocess" to_port="in 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>


    Thanks
  • Far Member Posts: 3 Newbie
    This is the path I'm following to use both the training and testing data sets.
  • varunm1 Member Posts: 733   Unicorn
    Hello @Far

    Are you encountering an error, or are you just asking whether this is the right way to do it?

    Your process looks fine, assuming you already processed the training data earlier in the same way as the test data.
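
    To make "processed in the same way" concrete for one step, here is a minimal sketch, not a drop-in fix. It assumes the Normalize operator exposes a "preprocessing model" output port that Apply Model can consume, as I understand to be the case in RapidMiner 9.x; please verify the port names in Studio. The operator name "Apply Normalization (test)" is made up for illustration, the repository paths are copied from the process above, and the layout attributes are left out. The idea is to fit the normalization on the training data only and then reuse that fitted model on the test data, rather than re-fitting it on the test set.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
      <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" name="Retrieve Zomato-Text data">
            <parameter key="repository_entry" value="../../Zomato-Text data"/>
          </operator>
          <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" name="Retrieve Zomato-testing">
            <parameter key="repository_entry" value="../../Zomato-testing"/>
          </operator>
          <!-- Fit the range transformation on the training data only -->
          <operator activated="true" class="normalize" compatibility="9.2.001" expanded="true" name="Normalize">
            <parameter key="method" value="range transformation"/>
            <parameter key="min" value="0.0"/>
            <parameter key="max" value="1.0"/>
          </operator>
          <!-- Hypothetical operator name: reuses the fitted normalization on the test data -->
          <operator activated="true" class="apply_model" compatibility="9.2.001" expanded="true" name="Apply Normalization (test)">
            <list key="application_parameters"/>
          </operator>
          <connect from_op="Retrieve Zomato-Text data" from_port="output" to_op="Normalize" to_port="example set input"/>
          <!-- Assumed port name: "preprocessing model" -->
          <connect from_op="Normalize" from_port="preprocessing model" to_op="Apply Normalization (test)" to_port="model"/>
          <connect from_op="Retrieve Zomato-testing" from_port="output" to_op="Apply Normalization (test)" to_port="unlabelled data"/>
          <connect from_op="Normalize" from_port="example set output" to_port="result 1"/>
          <connect from_op="Apply Normalization (test)" from_port="labelled data" to_port="result 2"/>
        </process>
      </operator>
    </process>
    ```

    The same pattern should carry over to the other fitted steps in the Subprocess: as far as I know, PCA also offers a preprocessing model port, and Process Documents from Data has a word list port that can be passed to a second copy of the operator so the test data gets the same word vector attributes. Steps such as Sample and the outlier filter are normally applied to training data only.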