Clustering

MarlaBot The Friendly RapidMiner Dog Bot Administrator, Moderator, Employee, Member Posts: 57 Community Manager
edited June 2019 in Help
A RapidMiner user wants to know the answer to this question: "Hey, I am looking to run a clustering model but all my data is qualitative. I was wondering if RapidMiner supports clustering algorithms for qualitative data?"

Answers

  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hello @MarlaBot

    If all the attributes are nominal (qualitative), the k-Means operator works when you set the measure types to NominalMeasures and the nominal measure to NominalDistance.
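
To illustrate why a nominal distance makes clustering possible on purely qualitative data, here is a minimal Python sketch. It assumes the distance between two rows is simply the number of attributes whose values differ. That definition, the function name `nominal_distance`, and the sample rows are illustrative assumptions, not RapidMiner's internal implementation.

```python
def nominal_distance(row_a, row_b):
    """Count the attributes on which two rows of nominal values disagree."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same number of attributes")
    return sum(1 for a, b in zip(row_a, row_b) if a != b)


# Hypothetical rows: (career, investment category, desired destination)
r1 = ("Derecho", "alta", "Chile")
r2 = ("Derecho", "basica", "Chile")
r3 = ("Marketing", "basica", "Francia")

print(nominal_distance(r1, r2))  # 1: only the investment category differs
print(nominal_distance(r1, r3))  # 3: all three attributes differ
```

With a distance like this, no numeric encoding of the categories is needed; rows are compared value by value.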

    Hope this helps
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Yes, many of our clustering algorithms do. Just make sure you use a distance measure that supports nominal (qualitative) column types. This is the default for most of them, and if you use clustering in Auto Model, it will take care of this for you. In addition, there are data transformations you can apply to convert your data into numerical formats before you use any of the clustering algorithms.
    Hope this helps,
    Ingo
  • WalterRioja Member Posts: 8 Contributor I
    @IngoRM when I run Auto Model, it only considers the quantitative attribute for clustering. What would you suggest? Thanks in advance!

  • WalterRioja Member Posts: 8 Contributor I
    edited June 2019
    @varunm1 would you mind helping me a bit more? I tried, but it reports a non-nominal attribute even though the attribute is of text type. Any suggestions?


  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hello @WalterRioja

    Did you try adding the "Text to Nominal" operator before the clustering algorithm?

    I think that will do it. 
    Regards,
    Varun

  • WalterRioja Member Posts: 8 Contributor I
    @varunm1 it's still not working. Would you mind running it as you explained? I can upload my data here.
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    edited June 2019
    Sure, please provide your data and the XML of your process (View --> Show Panel --> XML).
    Regards,
    Varun

  • WalterRioja Member Posts: 8 Contributor I
    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve ClustersNominalData" width="90" x="112" y="34">
            <parameter key="repository_entry" value="//Local Repository/ClustersNominalData"/>
          </operator>
          <operator activated="true" class="text_to_nominal" compatibility="9.3.001" expanded="true" height="82" name="Text to Nominal" width="90" x="246" y="34">
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="text"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="text"/>
            <parameter key="block_type" value="value_matrix"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" class="concurrency:k_means" compatibility="9.3.001" expanded="true" height="82" name="Clustering" width="90" x="380" y="34">
            <parameter key="add_cluster_attribute" value="true"/>
            <parameter key="add_as_label" value="false"/>
            <parameter key="remove_unlabeled" value="false"/>
            <parameter key="k" value="6"/>
            <parameter key="max_runs" value="10"/>
            <parameter key="determine_good_start_values" value="true"/>
            <parameter key="measure_types" value="NominalMeasures"/>
            <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
            <parameter key="nominal_measure" value="NominalDistance"/>
            <parameter key="numerical_measure" value="EuclideanDistance"/>
            <parameter key="divergence" value="SquaredEuclideanDistance"/>
            <parameter key="kernel_type" value="radial"/>
            <parameter key="kernel_gamma" value="1.0"/>
            <parameter key="kernel_sigma1" value="1.0"/>
            <parameter key="kernel_sigma2" value="0.0"/>
            <parameter key="kernel_sigma3" value="2.0"/>
            <parameter key="kernel_degree" value="3.0"/>
            <parameter key="kernel_shift" value="1.0"/>
            <parameter key="kernel_a" value="1.0"/>
            <parameter key="kernel_b" value="0.0"/>
            <parameter key="max_optimization_steps" value="100"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="9.3.001" expanded="true" height="103" name="Multiply" width="90" x="380" y="136"/>
          <operator activated="true" class="model_simulator:cluster_model_visualizer" compatibility="9.3.001" expanded="true" height="82" name="Cluster Model Visualizer" width="90" x="581" y="34"/>
          <operator activated="true" class="sort" compatibility="9.3.001" expanded="true" height="82" name="Sort" width="90" x="447" y="238">
            <parameter key="attribute_name" value="cluster"/>
            <parameter key="sorting_direction" value="increasing"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.3.001" expanded="true" height="82" name="Generate Attributes" width="90" x="581" y="238">
            <list key="function_descriptions">
              <parameter key="cluster_label" value="cluster"/>
            </list>
            <parameter key="keep_all" value="true"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="9.3.001" expanded="true" height="82" name="Set Role" width="90" x="715" y="238">
            <parameter key="attribute_name" value="cluster_label"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.3.001" expanded="true" height="103" name="Decision Tree" width="90" x="849" y="136">
            <parameter key="criterion" value="gain_ratio"/>
            <parameter key="maximal_depth" value="20"/>
            <parameter key="apply_pruning" value="true"/>
            <parameter key="confidence" value="0.25"/>
            <parameter key="apply_prepruning" value="false"/>
            <parameter key="minimal_gain" value="0.01"/>
            <parameter key="minimal_leaf_size" value="2"/>
            <parameter key="minimal_size_for_split" value="4"/>
            <parameter key="number_of_prepruning_alternatives" value="3"/>
          </operator>
          <connect from_op="Retrieve ClustersNominalData" from_port="output" to_op="Text to Nominal" to_port="example set input"/>
          <connect from_op="Text to Nominal" from_port="example set output" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_op="Cluster Model Visualizer" to_port="model"/>
          <connect from_op="Clustering" from_port="clustered set" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Sort" to_port="example set input"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Cluster Model Visualizer" to_port="clustered data"/>
          <connect from_op="Cluster Model Visualizer" from_port="visualizer output" to_port="result 3"/>
          <connect from_op="Sort" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Set Role" from_port="original" to_port="result 2"/>
          <connect from_op="Decision Tree" from_port="model" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>

  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hello @WalterRioja

    It works fine for me except for the cluster visualization part, which fails because of some missing values in the centroid table. I am not sure about the cause; maybe my friend @lionelderkrikor can help with this.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve Data para clusterizar" width="90" x="45" y="34">
    <parameter key="repository_entry" value="//Local Repository/data/Data para clusterizar"/>
    </operator>
    <operator activated="true" class="text_to_nominal" compatibility="9.3.001" expanded="true" height="82" name="Text to Nominal" width="90" x="246" y="34">
    <parameter key="attribute_filter_type" value="all"/>
    <parameter key="attribute" value=""/>
    <parameter key="attributes" value=""/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="text"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="text"/>
    <parameter key="block_type" value="value_matrix"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="false"/>
    </operator>
    <operator activated="true" class="concurrency:k_means" compatibility="9.3.001" expanded="true" height="82" name="Clustering" width="90" x="514" y="34">
    <parameter key="add_cluster_attribute" value="true"/>
    <parameter key="add_as_label" value="false"/>
    <parameter key="remove_unlabeled" value="false"/>
    <parameter key="k" value="6"/>
    <parameter key="max_runs" value="10"/>
    <parameter key="determine_good_start_values" value="true"/>
    <parameter key="measure_types" value="MixedMeasures"/>
    <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
    <parameter key="nominal_measure" value="NominalDistance"/>
    <parameter key="numerical_measure" value="EuclideanDistance"/>
    <parameter key="divergence" value="SquaredEuclideanDistance"/>
    <parameter key="kernel_type" value="radial"/>
    <parameter key="kernel_gamma" value="1.0"/>
    <parameter key="kernel_sigma1" value="1.0"/>
    <parameter key="kernel_sigma2" value="0.0"/>
    <parameter key="kernel_sigma3" value="2.0"/>
    <parameter key="kernel_degree" value="3.0"/>
    <parameter key="kernel_shift" value="1.0"/>
    <parameter key="kernel_a" value="1.0"/>
    <parameter key="kernel_b" value="0.0"/>
    <parameter key="max_optimization_steps" value="100"/>
    <parameter key="use_local_random_seed" value="false"/>
    <parameter key="local_random_seed" value="1992"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="9.3.001" expanded="true" height="103" name="Multiply" width="90" x="380" y="136"/>
    <operator activated="true" class="model_simulator:cluster_model_visualizer" compatibility="9.3.001" expanded="true" height="82" name="Cluster Model Visualizer" width="90" x="782" y="34"/>
    <operator activated="true" class="sort" compatibility="9.3.001" expanded="true" height="82" name="Sort" width="90" x="514" y="187">
    <parameter key="attribute_name" value="cluster"/>
    <parameter key="sorting_direction" value="increasing"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="9.3.001" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="238">
    <list key="function_descriptions">
    <parameter key="cluster_label" value="cluster"/>
    </list>
    <parameter key="keep_all" value="true"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="9.3.001" expanded="true" height="82" name="Set Role" width="90" x="849" y="238">
    <parameter key="attribute_name" value="cluster_label"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="9.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="983" y="187">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attribute" value=""/>
    <parameter key="attributes" value="cluster_label|id|idCarrera|idCat_inversion|idDestinoDeseado"/>
    <parameter key="use_except_expression" value="false"/>
    <parameter key="value_type" value="attribute_value"/>
    <parameter key="use_value_type_exception" value="false"/>
    <parameter key="except_value_type" value="time"/>
    <parameter key="block_type" value="attribute_block"/>
    <parameter key="use_block_type_exception" value="false"/>
    <parameter key="except_block_type" value="value_matrix_row_start"/>
    <parameter key="invert_selection" value="false"/>
    <parameter key="include_special_attributes" value="true"/>
    </operator>
    <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.3.001" expanded="true" height="103" name="Decision Tree" width="90" x="1117" y="136">
    <parameter key="criterion" value="gain_ratio"/>
    <parameter key="maximal_depth" value="20"/>
    <parameter key="apply_pruning" value="true"/>
    <parameter key="confidence" value="0.25"/>
    <parameter key="apply_prepruning" value="false"/>
    <parameter key="minimal_gain" value="0.01"/>
    <parameter key="minimal_leaf_size" value="2"/>
    <parameter key="minimal_size_for_split" value="4"/>
    <parameter key="number_of_prepruning_alternatives" value="3"/>
    </operator>
    <connect from_op="Retrieve Data para clusterizar" from_port="output" to_op="Text to Nominal" to_port="example set input"/>
    <connect from_op="Text to Nominal" from_port="example set output" to_op="Clustering" to_port="example set"/>
    <connect from_op="Clustering" from_port="cluster model" to_op="Cluster Model Visualizer" to_port="model"/>
    <connect from_op="Clustering" from_port="clustered set" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_op="Sort" to_port="example set input"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Cluster Model Visualizer" to_port="clustered data"/>
    <connect from_op="Cluster Model Visualizer" from_port="visualizer output" to_port="result 4"/>
    <connect from_op="Cluster Model Visualizer" from_port="model output" to_port="result 3"/>
    <connect from_op="Sort" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Set Role" from_port="original" to_port="result 2"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
    <connect from_op="Decision Tree" from_port="model" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    <portSpacing port="sink_result 5" spacing="0"/>
    </process>
    </operator>
    </process>


    Regards,
    Varun

  • WalterRioja Member Posts: 8 Contributor I
    @lionelderkrikor, I need your help, please. Thanks!
  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,193 Unicorn
    Hi @varunm1, hi @WalterRioja,

    Yes, indeed, there is something weird with this process, and it is linked to the fact that the features are "nominal".
    Honestly, I don't know how RapidMiner handles nominal features internally. So to avoid this bug, I used the Nominal to Numerical operator (dummy coding). In passing, I updated the Select Attributes operator with the newly generated dummy variables, and now the process and the visualizations are working.
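
For readers unfamiliar with dummy coding, here is a minimal Python sketch of the idea (not RapidMiner code): each nominal column is replaced by one 0/1 column per distinct value. The `attribute = value` column naming mirrors the names that appear in the Select Attributes list of the process. The function name `dummy_code` and the sample rows are hypothetical, and the real operator may differ in details such as column ordering.

```python
def dummy_code(rows, columns):
    """Replace each nominal column by one 0/1 column per distinct value."""
    # Collect the distinct values of every column, sorted for a stable order.
    levels = {c: sorted({row[c] for row in rows}) for c in columns}
    encoded = []
    for row in rows:
        out = {}
        for c in columns:
            for value in levels[c]:
                # 1 if this row carries the value, 0 otherwise.
                out[f"{c} = {value}"] = 1 if row[c] == value else 0
        encoded.append(out)
    return encoded


# Hypothetical sample rows using attribute names from the process.
rows = [
    {"idCarrera": "Derecho", "idCat_inversion": "alta"},
    {"idCarrera": "Marketing", "idCat_inversion": "alta"},
]
encoded = dummy_code(rows, ["idCarrera", "idCat_inversion"])
for r in encoded:
    print(r)
```

After this transformation every attribute is numerical, so a numerical or mixed distance measure can be applied. This is also why the attribute list handed to Select Attributes grows to one entry per category value.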
    The process:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="9.3.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
            <parameter key="excel_file" value="C:\Users\Lionel\Downloads\Data para clusterizar.xlsx"/>
            <parameter key="sheet_selection" value="sheet number"/>
            <parameter key="sheet_number" value="1"/>
            <parameter key="imported_cell_range" value="A1"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="first_row_as_names" value="true"/>
            <list key="annotations"/>
            <parameter key="date_format" value=""/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="locale" value="English (United States)"/>
            <parameter key="read_all_values_as_polynominal" value="false"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="idCat_inversion.true.polynominal.attribute"/>
              <parameter key="1" value="idCarrera.true.polynominal.attribute"/>
              <parameter key="2" value="idDestinoDeseado.true.polynominal.attribute"/>
            </list>
            <parameter key="read_not_matching_values_as_missings" value="false"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="text_to_nominal" compatibility="9.3.001" expanded="true" height="82" name="Text to Nominal" width="90" x="179" y="34">
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="text"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="text"/>
            <parameter key="block_type" value="value_matrix"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" class="nominal_to_numerical" compatibility="9.3.001" expanded="true" height="103" name="Nominal to Numerical" width="90" x="313" y="34">
            <parameter key="return_preprocessing_model" value="false"/>
            <parameter key="create_view" value="false"/>
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value="idCat_inversion"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="coding_type" value="dummy coding"/>
            <parameter key="use_comparison_groups" value="false"/>
            <list key="comparison_groups"/>
            <parameter key="unexpected_value_handling" value="all 0 and warning"/>
            <parameter key="use_underscore_in_name" value="false"/>
          </operator>
          <operator activated="true" breakpoints="after" class="concurrency:k_means" compatibility="9.3.001" expanded="true" height="82" name="Clustering" width="90" x="514" y="34">
            <parameter key="add_cluster_attribute" value="true"/>
            <parameter key="add_as_label" value="false"/>
            <parameter key="remove_unlabeled" value="false"/>
            <parameter key="k" value="6"/>
            <parameter key="max_runs" value="10"/>
            <parameter key="determine_good_start_values" value="true"/>
            <parameter key="measure_types" value="MixedMeasures"/>
            <parameter key="mixed_measure" value="MixedEuclideanDistance"/>
            <parameter key="nominal_measure" value="NominalDistance"/>
            <parameter key="numerical_measure" value="EuclideanDistance"/>
            <parameter key="divergence" value="SquaredEuclideanDistance"/>
            <parameter key="kernel_type" value="radial"/>
            <parameter key="kernel_gamma" value="1.0"/>
            <parameter key="kernel_sigma1" value="1.0"/>
            <parameter key="kernel_sigma2" value="0.0"/>
            <parameter key="kernel_sigma3" value="2.0"/>
            <parameter key="kernel_degree" value="3.0"/>
            <parameter key="kernel_shift" value="1.0"/>
            <parameter key="kernel_a" value="1.0"/>
            <parameter key="kernel_b" value="0.0"/>
            <parameter key="max_optimization_steps" value="100"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="9.3.001" expanded="true" height="103" name="Multiply" width="90" x="380" y="238"/>
          <operator activated="true" class="model_simulator:cluster_model_visualizer" compatibility="9.3.001" expanded="true" height="82" name="Cluster Model Visualizer" width="90" x="782" y="34"/>
          <operator activated="true" class="sort" compatibility="9.3.001" expanded="true" height="82" name="Sort" width="90" x="514" y="238">
            <parameter key="attribute_name" value="cluster"/>
            <parameter key="sorting_direction" value="increasing"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.3.001" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="238">
            <list key="function_descriptions">
              <parameter key="cluster_label" value="cluster"/>
            </list>
            <parameter key="keep_all" value="true"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="9.3.001" expanded="true" height="82" name="Set Role" width="90" x="849" y="238">
            <parameter key="attribute_name" value="cluster_label"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="9.3.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="983" y="187">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value="cluster_label|id|idCarrera = Administración|idCarrera = Antropología|idCarrera = Arqueología|idCarrera = ArquitecturayUrbanismo|idCarrera = ArtesEscénicas|idCarrera = ArteyDiseño|idCarrera = Biología|idCarrera = CienciasdelaComunicación|idCarrera = CienciasdelaSalud|idCarrera = CienciasSociales|idCarrera = Contabilidad|idCarrera = Derecho|idCarrera = DiseñoGráfico|idCarrera = Economía|idCarrera = Educación|idCarrera = Enfermería|idCarrera = Finanzas|idCarrera = GestiónyAltaDirección|idCarrera = Hoteleríayturismo|idCarrera = Idiomas|idCarrera = IngenieríaAmbiental|idCarrera = IngenieríaCivil|idCarrera = IngenieríadeSistemas|idCarrera = IngenieríaIndustrial|idCarrera = IngenieríaMecanica|idCarrera = IngenieríaQuimica|idCarrera = Marketing|idCarrera = MedicinaHumana|idCarrera = MedicinaVeterinaria|idCarrera = Música|idCarrera = NegociosInternacionales|idCarrera = Otros|idCarrera = Psicología|idCarrera = Publicidadyafines|idCarrera = TrabajoSocial|idCat_inversion = Inversiónalta|idCat_inversion = Inversiónbásica|idCat_inversion = Inversiónpromedio|idDestinoDeseado = Afganistán|idDestinoDeseado = Albania|idDestinoDeseado = Alemania|idDestinoDeseado = Argelia|idDestinoDeseado = Argentina|idDestinoDeseado = Armenia|idDestinoDeseado = Australia|idDestinoDeseado = Austria|idDestinoDeseado = Azerbaiyán|idDestinoDeseado = Bahrein|idDestinoDeseado = Bangladesh|idDestinoDeseado = Benin|idDestinoDeseado = Bielorrusia|idDestinoDeseado = Bolivia|idDestinoDeseado = BosniaHerzegovina|idDestinoDeseado = Botsuana|idDestinoDeseado = Brasil|idDestinoDeseado = Bulgaria|idDestinoDeseado = BurkinaFaso|idDestinoDeseado = Bélgica|idDestinoDeseado = CaboVerde|idDestinoDeseado = Camboya|idDestinoDeseado = Camerún|idDestinoDeseado = Canadá|idDestinoDeseado = Chile|idDestinoDeseado = ChinaContinental|idDestinoDeseado = Colombia|idDestinoDeseado = Corea|idDestinoDeseado = CostadeMarfil|idDestinoDeseado = CostaRica|idDestinoDeseado = Croacia|idDestinoDeseado = Dinamarca|idDestinoDeseado = EAU|idDestinoDeseado = Ecuador|idDestinoDeseado = Egipto|idDestinoDeseado = ElSalvador|idDestinoDeseado = Eslovaquia|idDestinoDeseado = Eslovenia|idDestinoDeseado = España|idDestinoDeseado = EstadosUnidos|idDestinoDeseado = Estonia|idDestinoDeseado = Etiopía|idDestinoDeseado = Fiji|idDestinoDeseado = Filipinas|idDestinoDeseado = Finlandia|idDestinoDeseado = Francia|idDestinoDeseado = Gabón|idDestinoDeseado = Georgia|idDestinoDeseado = Ghana|idDestinoDeseado = Grecia|idDestinoDeseado = Guatemala|idDestinoDeseado = HongKong|idDestinoDeseado = Hungría|idDestinoDeseado = India|idDestinoDeseado = Indonesia|idDestinoDeseado = Irlanda|idDestinoDeseado = Irán|idDestinoDeseado = Islandia|idDestinoDeseado = Italia|idDestinoDeseado = Japón|idDestinoDeseado = Jordán|idDestinoDeseado = Kazajstán|idDestinoDeseado = Kenia|idDestinoDeseado = Kirguizstán|idDestinoDeseado = Kuwait|idDestinoDeseado = Laos|idDestinoDeseado = Letonia|idDestinoDeseado = Liberia|idDestinoDeseado = Lituania|idDestinoDeseado = Líbano|idDestinoDeseado = Macedonia|idDestinoDeseado = Malasia|idDestinoDeseado = Malawi|idDestinoDeseado = Malta|idDestinoDeseado = Marruecos|idDestinoDeseado = Mauricio|idDestinoDeseado = Moldavia|idDestinoDeseado = Mongolia|idDestinoDeseado = Montenegro|idDestinoDeseado = Mozambique|idDestinoDeseado = Myanmar|idDestinoDeseado = México|idDestinoDeseado = Namibia|idDestinoDeseado = Nepal|idDestinoDeseado = Nicaragua|idDestinoDeseado = Nigeria|idDestinoDeseado = Noruega|idDestinoDeseado = NuevaZelanda|idDestinoDeseado = Omán|idDestinoDeseado = Pakistán|idDestinoDeseado = Panamá|idDestinoDeseado = Paraguay|idDestinoDeseado = PaísesBajos|idDestinoDeseado = Perú|idDestinoDeseado = Polonia|idDestinoDeseado = Portugal|idDestinoDeseado = PuertoRico|idDestinoDeseado = ReinoUnido|idDestinoDeseado = RepublicaCheca|idDestinoDeseado = RepúblicaDominicana|idDestinoDeseado = Ruanda|idDestinoDeseado = Rumania|idDestinoDeseado = Rusia|idDestinoDeseado = Senegal|idDestinoDeseado = Serbia|idDestinoDeseado = Seychelles|idDestinoDeseado = Singapur|idDestinoDeseado = SriLanka|idDestinoDeseado = Sudáfrica|idDestinoDeseado = Suecia|idDestinoDeseado = Suiza|idDestinoDeseado = Tailandia|idDestinoDeseado = Taiwán|idDestinoDeseado = Tanzania|idDestinoDeseado = Tayikistan|idDestinoDeseado = Togo|idDestinoDeseado = Turquía|idDestinoDeseado = Túnez|idDestinoDeseado = Ucrania|idDestinoDeseado = Uganda|idDestinoDeseado = Uruguay|idDestinoDeseado = Venezuela|idDestinoDeseado = Vietnam"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="false" class="select_attributes" compatibility="9.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="983" y="391">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value="cluster_label|id|idCarrera|idCat_inversion|idDestinoDeseado"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="9.3.001" expanded="true" height="103" name="Decision Tree" width="90" x="1117" y="187">
            <parameter key="criterion" value="gain_ratio"/>
            <parameter key="maximal_depth" value="20"/>
            <parameter key="apply_pruning" value="true"/>
            <parameter key="confidence" value="0.25"/>
            <parameter key="apply_prepruning" value="false"/>
            <parameter key="minimal_gain" value="0.01"/>
            <parameter key="minimal_leaf_size" value="2"/>
            <parameter key="minimal_size_for_split" value="4"/>
            <parameter key="number_of_prepruning_alternatives" value="3"/>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Text to Nominal" to_port="example set input"/>
          <connect from_op="Text to Nominal" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
          <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_op="Cluster Model Visualizer" to_port="model"/>
          <connect from_op="Clustering" from_port="clustered set" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Sort" to_port="example set input"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Cluster Model Visualizer" to_port="clustered data"/>
          <connect from_op="Cluster Model Visualizer" from_port="visualizer output" to_port="result 4"/>
          <connect from_op="Cluster Model Visualizer" from_port="model output" to_port="result 3"/>
          <connect from_op="Sort" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
          <connect from_op="Set Role" from_port="original" to_port="result 2"/>
          <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
          <connect from_op="Decision Tree" from_port="model" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
          <portSpacing port="sink_result 5" spacing="0"/>
        </process>
      </operator>
    </process>
    
    Hope this helps,

    Regards,

    Lionel



  • lionelderkrikorlionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,193 Unicorn
    Dear all,

    This thread is very interesting because it opens up a debate:
    First, for distance-based algorithms (like k-means), is it always relevant to one-hot encode the features of type "category" in RapidMiner?
    I ask because, although RapidMiner can handle features of type "category" directly, Auto Model one-hot encodes such features in its pre-processing step...
    Looking further into this pre-processing step in Auto Model, we see that if a feature of type "category" has more than 10 values, it is removed from the modelling step.
    By searching, I found that this corresponds to the "Max nominal values" parameter (= 10 by default) of the Remove Low Quality function of CLEANSE in Turbo Prep.
    My question is: is there any reason for this hard-coded value of 10 in Auto Model?
    Intuitively, I would say that this parameter should be related to the size of the initial dataset rather than hard-coded (11 possible values in a 10M-row dataset do not mean the same thing as 11 possible values in a 100-row dataset), but maybe there are other reasons (computation time, curse of dimensionality...).
    Moreover, I want to mention that with this strategy, in some cases (for example the current @WalterRioja 's dataset), all your features have the status "green" in Auto Model (thus in theory used for modelling), but in reality only a subset of these features is actually used for modelling (and thus only a subset of them appears in the built model). I think that may surprise the user...

    Once again, I just want to open the debate, always in the spirit of improving RapidMiner (and more generally data science) knowledge, and to try to make the RapidMiner software better than it already is...

    To conclude, have a nice day (or night ... :) )

    Regards,

    Lionel
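To make the cardinality rule discussed in the post above concrete, here is a minimal Python sketch of a "max nominal values" filter. It is not Auto Model's actual implementation: the function name `split_by_cardinality` and the sample values are hypothetical, and only the column names and the default of 10 come from this thread.

```python
def split_by_cardinality(rows, max_nominal_values=10):
    """Split column names into (kept, dropped) by number of distinct values.

    A column is dropped when it has MORE than `max_nominal_values` distinct
    values, mirroring the rule described in the post (assumed, not verified
    against the Auto Model source).
    """
    kept, dropped = [], []
    for col in rows[0].keys():
        n_distinct = len({row[col] for row in rows})
        if n_distinct <= max_nominal_values:
            kept.append(col)
        else:
            dropped.append(col)
    return kept, dropped


# Hypothetical data: 24 rows, three "category" columns with 3, 4 and 12 values.
rows = [
    {
        "idCarrera": f"career_{i % 3}",
        "idCat_inversion": f"cat_{i % 4}",
        "idDestinoDeseado": f"country_{i % 12}",
    }
    for i in range(24)
]

kept, dropped = split_by_cardinality(rows)
print("used for modelling:", kept)
print("silently removed:  ", dropped)
```

With a fixed cap, the 12-value column disappears no matter how many rows the dataset has, which is the behaviour the post questions.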


      




     



  • IngoRMIngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Yeah, the hard-coded 10 bugs me as well.  However, the problem with one-hot encoding is that it can easily make your feature space explode, and it is hard to predict beforehand what is going to happen.  AM aims at robust results in all cases, not necessarily the optimal results in some.  That is why we let you open up the process at the end, so you can make changes and see what they do for you...
    Hope this makes sense,
    Ingo
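As a rough illustration of how one-hot encoding widens the feature space, the Python sketch below encodes three categorical columns by hand and counts the resulting columns. The helper `one_hot_encode` and the data are hypothetical. The `attribute = value` naming only imitates the column names visible in the process XML earlier in the thread.

```python
def one_hot_encode(rows):
    """One-hot encode every column of a list of dict rows.

    Returns (names, encoded): one output column per (column, value) pair,
    named "column = value", and one 0/1 vector per input row.
    """
    columns = list(rows[0].keys())
    categories = {col: sorted({row[col] for row in rows}) for col in columns}
    names = [f"{col} = {value}" for col in columns for value in categories[col]]
    encoded = [
        [1 if row[col] == value else 0 for col in columns for value in categories[col]]
        for row in rows
    ]
    return names, encoded


# Hypothetical data: 200 rows, three columns with 3, 4 and 100 distinct values.
rows = [
    {
        "idCarrera": f"career_{i % 3}",
        "idCat_inversion": f"cat_{i % 4}",
        "idDestinoDeseado": f"country_{i % 100:03d}",
    }
    for i in range(200)
]

names, encoded = one_hot_encode(rows)
print(len(rows[0]), "original columns ->", len(names), "encoded columns")
```

Three input columns become 3 + 4 + 100 = 107 numeric columns here, and almost all of them come from the single high-cardinality attribute.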
  • WalterRiojaWalterRioja Member Posts: 8 Contributor I
    Hello everyone! Thanks for the support. I have a question: If I wanted to run an automodel to cluster my data (the same I've shared before) would I get an 'incomplete' wrong result? The fact is that all three of the items I need to process are of type "category" (they are IDs of other tables in my database).
    A second question would be, why when I run an automodel -without making any changes- I see negative values for some clusters. Why does this happen?

    Thank you all
  • IngoRMIngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    "If I wanted to run an automodel to cluster my data (the same I've shared before) would I get an 'incomplete' wrong result?"
    No, the results are not wrong.  These are just some of the millions of choices you need to make as a data scientist.  As I said before, what AM does works for most people / use cases, but may not be what you want in your case.  That can happen.  In situations where this is more likely, Auto Model exposes the relevant parameter to the user in the UI.  This is not the case here, but you can still open the process in Studio, make the desired change, and run it again to get the new results.

    "A second question would be, why when I run an automodel -without making any changes- I see negative values for some clusters. Why does this happen?"

    For clustering (or in fact all distance-based methods in machine learning) you should normalize the data before the ML algorithm is applied.  This prevents columns with a larger range of values from dominating the other columns.  The normalization we perform is a so-called z-standardization, and the resulting values have a mean of 0 and a standard deviation of 1.  Any value below its column's mean therefore becomes negative.  Hence the negative values...

    Hope this helps,
    Ingo
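The z-standardization described above can be sketched in a few lines of Python. This is only an illustration: the function name and the ages are made up, and the use of the population standard deviation (`pstdev`) is an assumption, since the thread does not say which variant the Normalize operator uses.

```python
from statistics import mean, pstdev


def z_standardize(values):
    """Return (z_values, mu, sigma) with z = (x - mu) / sigma.

    Uses the population standard deviation; this choice is an assumption.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values], mu, sigma


# Hypothetical "age" column.
ages = [5, 10, 15, 40, 80]
z, mu, sigma = z_standardize(ages)

for age, score in zip(ages, z):
    print(f"age {age:>2} -> z = {score:+.3f}")
```

Every age below the column mean gets a negative z-value, which is why cluster centroids and tree splits built on the normalized data show negative numbers.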
  • WalterRiojaWalterRioja Member Posts: 8 Contributor I
    @IngoRM about the second question: how could I see the rules of the clusters in a tree based not on the z-standardized values but on the original data (for example, age between 1 and 10 is cluster 1, between 11 and 12 is cluster 2, etc.)?
    Is this supported in automodel? When I run my data with AutoModel, the tree is shown based on those negative values I talked about before.

    Thanks!
  • IngoRMIngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Good idea!  The change is actually not that hard, so I will look into getting this into AM for one of the future releases.  If you want to try it yourself, you can open the clustering process from AM at the end and use the operator De-Normalize on the preprocessing model from the Normalize operator.  You can then apply this de-normalization model on the training data before the tree is built.  Below is a screenshot of the necessary changes.
    Stay tuned,
    Ingo
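The idea behind the de-normalization step can be sketched outside RapidMiner as well: keep the mean and standard deviation learned during normalization, then invert the transform so the tree sees the original units. The Python below is a hedged illustration of that arithmetic only. The function names and ages are hypothetical, and it does not reproduce the De-Normalize operator itself.

```python
from statistics import mean, pstdev


def fit_normalizer(values):
    """Learn the parameters of a z-standardization (population std assumed)."""
    return mean(values), pstdev(values)


def normalize(values, mu, sigma):
    """Forward transform: z = (x - mu) / sigma."""
    return [(v - mu) / sigma for v in values]


def denormalize(z_values, mu, sigma):
    """Inverse transform: x = z * sigma + mu."""
    return [z * sigma + mu for z in z_values]


# Hypothetical "age" column.
ages = [3, 7, 11, 12, 30, 45]
mu, sigma = fit_normalizer(ages)

z_ages = normalize(ages, mu, sigma)          # what the clustering sees
restored = denormalize(z_ages, mu, sigma)    # what the tree should see

# A split learned on normalized data maps back to a readable age threshold.
z_threshold = -0.5
age_threshold = denormalize([z_threshold], mu, sigma)[0]
print(f"z <= {z_threshold} corresponds to age <= {age_threshold:.1f}")
```

Building the tree on the restored values gives rules in the original units (ages) instead of z-scores.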


  • WalterRiojaWalterRioja Member Posts: 8 Contributor I
    @IngoRM that's exactly what I needed. Thank you very much!!
