RapidMiner

Extract decision tree from Bray-curtis heatmap dendrogram

jamie_slk (Newbie)

I am performing a microbiome study and have already generated (using another program) a heatmap with dendrograms that cluster samples by bacterial genus using Bray-Curtis dissimilarity, but I'd like to get the decision tree. I know RapidMiner has a decision tree model, but I assume it uses k-means, which is different from Bray-Curtis, and I want to preserve the Bray-Curtis clustering. Is it possible to load my dendrogram into RapidMiner and have it extract the Bray-Curtis decision tree? Thank you very much.

1 REPLY
RM Staff

Re: Extract decision tree from Bray-curtis heatmap dendrogram

Hi @jamie_slk,

 

If you are doing clustering analysis with microbiome data, can you please share some test data?

 

First, the 'tree' on a heatmap is most likely NOT a 'decision tree'. It is a dendrogram, i.e. a visualization of your hierarchical clustering model. If you can export the cluster labels from the other program, you can train a predictive model on them (e.g. a decision tree, random forest, or SVM) to find splits and decision rules that reproduce the clustering as closely as possible.
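To make the idea concrete, here is a minimal Python sketch. It is not part of the RapidMiner process. Given abundance rows and cluster labels exported from a clustering tool, it searches for the single threshold split with the lowest weighted Gini impurity, which is the kind of rule a decision tree learner applies recursively. The genus names, abundance values, and labels are invented for illustration, and `gini` and `best_split` are hypothetical helpers, not RapidMiner functions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, feature_names):
    """Return (feature, threshold, impurity) of the best single split.

    A row goes left when row[feature] <= threshold. Candidate thresholds
    are midpoints between consecutive distinct values of each feature.
    """
    n = len(rows)
    best = None
    for j, name in enumerate(feature_names):
        values = sorted({row[j] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [lab for row, lab in zip(rows, labels) if row[j] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[j] > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[2]:
                best = (name, threshold, impurity)
    return best

# Invented example: relative abundances of two genera in six samples,
# with cluster labels as they might come out of a clustering step.
feature_names = ["Bacteroides", "Prevotella"]
rows = [
    [0.70, 0.05], [0.65, 0.10], [0.60, 0.08],
    [0.10, 0.55], [0.15, 0.60], [0.12, 0.50],
]
labels = ["1", "1", "1", "2", "2", "2"]

feature, threshold, impurity = best_split(rows, labels, feature_names)
print(f"split on {feature} <= {threshold:.3f} (weighted Gini {impurity:.3f})")
```

A real tree learner repeats this search inside each resulting branch and adds stopping or pruning criteria, so the rules you get from RapidMiner will generally be deeper than this single split.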

 

Regarding the dissimilarity measure, would you consider using Jaccard instead of Bray-Curtis? The (quantitative) Jaccard index is computed as 2B/(1+B), where B is the Bray-Curtis dissimilarity [ref]. The Bray-Curtis and Jaccard indices are rank-order similar, but the Jaccard index is metric and should probably be preferred over the default Bray-Curtis, which is semimetric [ref]. RapidMiner core has an operator for hierarchical clustering (Agglomerative Clustering) that supports Jaccard similarity on numerical data.
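As a quick numerical check of the 2B/(1+B) relation, here is a small Python sketch. It assumes the quantitative form of the Jaccard dissimilarity, 1 - sum(min)/sum(max), also known as the Ruzicka index. The two abundance vectors are made up, and the function names are only illustrative. Because 2B/(1+B) increases monotonically with B, converting from one index to the other never changes the ordering of sample pairs, which is what "rank-order similar" means here.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den

def quantitative_jaccard(x, y):
    """Quantitative Jaccard dissimilarity: 1 - sum(min) / sum(max)."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    total = sum(max(a, b) for a, b in zip(x, y))
    return 1.0 - shared / total

# Made-up counts for four genera in two samples.
sample_a = [10, 0, 5, 3]
sample_b = [4, 6, 5, 0]

b = bray_curtis(sample_a, sample_b)
j_direct = quantitative_jaccard(sample_a, sample_b)
j_from_b = 2 * b / (1 + b)

print(f"Bray-Curtis B      = {b:.4f}")
print(f"Jaccard (direct)   = {j_direct:.4f}")
print(f"Jaccard (2B/(1+B)) = {j_from_b:.4f}")
```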

 

My process uses the peerj32 data from https://peerj.com/articles/32/#supplemental-information

[Attached screenshots: bacteria.PNG, decision-tree-rules.PNG, tree.PNG]

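To run it, you have to install the R Scripting extension and the Operator Toolbox extension from the Marketplace. The process XML also references an operator with the `converters:` prefix (Decision Tree to ExampleSet), so that extension is presumably needed as well.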

The process calls R to compute the Bray-Curtis dissimilarities and run the clustering:

library(vegan)  # provides vegdist()
dist.mat <- vegdist(dataframe, method = "bray", diag = TRUE, upper = TRUE)  # or method = "jaccard"
clust.res <- hclust(dist.mat)  # hierarchical clustering (default: complete linkage)
cluster.label <- cutree(clust.res, k = 4)  # cut the tree into four clusters
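
For readers without R at hand, the same three steps (pairwise Bray-Curtis, complete-linkage agglomeration, cutting into k clusters) can be sketched in plain Python. This is an illustrative reimplementation under simplifying assumptions, not what the process runs. The toy samples are invented, `complete_linkage_labels` is a hypothetical helper, and ties between equally distant pairs are broken by position, which may differ from R's hclust.

```python
from itertools import combinations

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))

def complete_linkage_labels(samples, k):
    """Agglomerate with complete linkage until k clusters remain.

    Returns one label per sample, numbered 1..k in order of first
    appearance in the input.
    """
    n = len(samples)
    dist = {(i, j): bray_curtis(samples[i], samples[j])
            for i, j in combinations(range(n), 2)}
    clusters = [[i] for i in range(n)]

    def linkage(c1, c2):
        # Complete linkage: the largest pairwise dissimilarity between members.
        return max(dist[min(i, j), max(i, j)] for i in c1 for j in c2)

    while len(clusters) > k:
        # Merge the pair of clusters with the smallest linkage distance.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a stays valid

    labels = [0] * n
    for label, members in enumerate(sorted(clusters, key=min), start=1):
        for i in members:
            labels[i] = label
    return labels

# Invented counts for three genera in five samples.
samples = [
    [10, 0, 5], [9, 1, 6],    # two similar samples
    [0, 12, 1], [1, 10, 0],   # another similar pair
    [5, 5, 5],                # something in between
]
print(complete_linkage_labels(samples, k=3))  # → [1, 1, 2, 2, 3]
```

The resulting label vector plays the same role as cluster.label above: it becomes the target column that the decision tree is trained on.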

Process code:

 

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="8.1.001" expanded="true" height="68" name="Retrieve peerj32_microbes" width="90" x="45" y="34">
        <parameter key="repository_entry" value="peerj32_microbes"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="8.1.001" expanded="true" height="82" name="Set Role" width="90" x="179" y="34">
        <parameter key="attribute_name" value="bacteria"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="agglomerative_clustering" compatibility="8.1.001" expanded="true" height="82" name="Clustering" width="90" x="313" y="34">
        <parameter key="measure_types" value="NumericalMeasures"/>
        <parameter key="numerical_measure" value="JaccardSimilarity"/>
        <description align="center" color="transparent" colored="false" width="126">use Jaccard similarity for hierarchical clustering</description>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="8.1.001" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="187">
        <parameter key="attribute_filter_type" value="value_type"/>
        <parameter key="value_type" value="numeric"/>
        <parameter key="include_special_attributes" value="true"/>
      </operator>
      <operator activated="true" class="r_scripting:execute_r" compatibility="8.1.000" expanded="true" height="82" name="Execute R" width="90" x="581" y="187">
        <parameter key="script" value="# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;library(vegan)&#10;rm_main = function(dataframe)&#10;{&#10;    # Bray-Curtis dissimilarities between rows (or use method=&quot;jaccard&quot;)&#10;    dist.mat &lt;- vegdist(dataframe, method=&quot;bray&quot;, diag=TRUE, upper=TRUE)&#10;    print(dist.mat)&#10;    # hierarchical clustering on the dissimilarity matrix&#10;    clust.res &lt;- hclust(dist.mat)&#10;    # cut the tree into four clusters&#10;    cluster.label &lt;- cutree(clust.res, k = 4)&#10;    data &lt;- as.data.frame(cluster.label)&#10;    return(data)&#10;}&#10;"/>
        <description align="center" color="transparent" colored="false" width="126">run R scripts for Bray-Curtis distances and clustering, return the clustering labels</description>
      </operator>
      <operator activated="true" class="operator_toolbox:merge" compatibility="0.9.000" expanded="true" height="103" name="Merge" width="90" x="715" y="187">
        <parameter key="handling_of_duplicate_attributes" value="keep_only_first"/>
      </operator>
      <operator activated="true" class="numerical_to_polynominal" compatibility="8.1.001" expanded="true" height="82" name="Numerical to Polynominal" width="90" x="849" y="187">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="cluster.label"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="8.1.001" expanded="true" height="82" name="Set Role (2)" width="90" x="983" y="187">
        <parameter key="attribute_name" value="cluster.label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="concurrency:parallel_decision_tree" compatibility="8.1.001" expanded="true" height="103" name="Decision Tree" width="90" x="1117" y="187"/>
      <operator activated="true" class="apply_model" compatibility="8.1.001" expanded="true" height="82" name="Apply Model" width="90" x="1251" y="187">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="performance_classification" compatibility="8.1.001" expanded="true" height="82" name="Performance" width="90" x="1385" y="238">
        <parameter key="classification_error" value="true"/>
        <parameter key="kendall_tau" value="true"/>
        <list key="class_weights"/>
      </operator>
      <operator activated="true" class="converters:dectree_2_example_set" compatibility="0.3.001" expanded="true" height="82" name="Decision Tree to ExampleSet" width="90" x="1385" y="85"/>
      <connect from_op="Retrieve peerj32_microbes" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Clustering" to_port="example set"/>
      <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
      <connect from_op="Clustering" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Execute R" to_port="input 1"/>
      <connect from_op="Select Attributes" from_port="original" to_op="Merge" to_port="example set 2"/>
      <connect from_op="Execute R" from_port="output 1" to_op="Merge" to_port="example set 1"/>
      <connect from_op="Merge" from_port="merged set" to_op="Numerical to Polynominal" to_port="example set input"/>
      <connect from_op="Numerical to Polynominal" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
      <connect from_op="Set Role (2)" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
      <connect from_op="Decision Tree" from_port="model" to_op="Apply Model" to_port="model"/>
      <connect from_op="Decision Tree" from_port="exampleSet" to_op="Apply Model" to_port="unlabelled data"/>
      <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
      <connect from_op="Apply Model" from_port="model" to_op="Decision Tree to ExampleSet" to_port="tree"/>
      <connect from_op="Performance" from_port="performance" to_port="result 4"/>
      <connect from_op="Performance" from_port="example set" to_port="result 5"/>
      <connect from_op="Decision Tree to ExampleSet" from_port="exa" to_port="result 2"/>
      <connect from_op="Decision Tree to ExampleSet" from_port="original tree" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="42"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="105"/>
      <portSpacing port="sink_result 5" spacing="0"/>
      <portSpacing port="sink_result 6" spacing="0"/>
    </process>
  </operator>
</process>

Cheers,

YY