How to Index Models with Old World Computing's Jackhammer Extension

Leonie_OWC (Member, KB Contributor)
edited January 9 in Knowledge Base
Hello,

In the weeks before Christmas last year, I demonstrated how to index collections using Jackhammer, Old World Computing's RapidMiner extension. Today I'm back with another knowledge base article on a related topic that goes a step further: indexing models. Below is a sample process that you will be able to execute with the Jackhammer extension installed; you shouldn't need to buy a license for it. You can find the extension on the marketplace.

The process uses the "Deals" sample data found in the RapidMiner samples repository. As you surely know, a model that is good on the whole data set can often be even better when it is trained separately for each value of a certain attribute, for example gender. While the single model yields fine results as it is, predictions can be more accurate when using two separate models, e.g. one for men and one for women. However, for large amounts of data with many different values, building and managing all these models by hand is very cumbersome. With the Jackhammer extension, you can simply index your model:

You feed one big example set into the Indexed Model operator and specify in the parameters one or several attributes you would like to group by. These are the index attributes. Inside the subprocess you build and train your model just as you would a normal model. The operator constructs one model for each distinct value of the index attribute or attributes you selected. So for our small example, we get one model for women and one for men.
The important thing here is that although the operator constructs many models, you receive only one indexed model. This makes it a powerful way to deal with cases where you have to predict the same aspect for many different kinds of one thing. When you apply the indexed model, the relevant model is chosen automatically for the data at hand. You only have to deal with a single indexed model and are spared the hassle of selecting the correct model from possibly hundreds, and your processes become tidier and better structured as well.
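
For readers who like to see the principle outside of RapidMiner, here is a minimal Python sketch of what "one indexed model wrapping many models" means. It is only an illustration of the idea described above, not how Jackhammer is implemented: the class names `IndexedModel` and `MajorityClassModel`, the column name `label`, and the toy rows are all made up for this example.

```python
from collections import Counter, defaultdict


class MajorityClassModel:
    """Hypothetical toy learner: always predicts the most frequent label."""

    def fit(self, rows, label):
        counts = Counter(row[label] for row in rows)
        self.prediction = counts.most_common(1)[0][0]
        return self

    def predict(self, row):
        return self.prediction


class IndexedModel:
    """Hypothetical wrapper: one sub-model per value of the index attributes."""

    def __init__(self, index_attributes, make_model):
        self.index_attributes = index_attributes
        self.make_model = make_model  # factory that returns a fresh learner
        self.models = {}

    def _key(self, row):
        # The index key is the tuple of values of all index attributes.
        return tuple(row[name] for name in self.index_attributes)

    def fit(self, rows, label):
        # Split the example set into one group per index key ...
        groups = defaultdict(list)
        for row in rows:
            groups[self._key(row)].append(row)
        # ... and train a separate sub-model on each group.
        self.models = {
            key: self.make_model().fit(group, label)
            for key, group in groups.items()
        }
        return self

    def predict(self, row):
        # Pick the sub-model that matches this row's index key.
        return self.models[self._key(row)].predict(row)


# Invented toy data; "label" is a placeholder column name.
training = [
    {"Gender": "male", "label": "yes"},
    {"Gender": "male", "label": "yes"},
    {"Gender": "male", "label": "no"},
    {"Gender": "female", "label": "no"},
    {"Gender": "female", "label": "no"},
    {"Gender": "female", "label": "yes"},
]

model = IndexedModel(["Gender"], MajorityClassModel).fit(training, "label")
print(len(model.models))                    # number of sub-models
print(model.predict({"Gender": "male"}))    # handled by the "male" sub-model
print(model.predict({"Gender": "female"}))  # handled by the "female" sub-model
```

The caller only ever holds `model`, and the lookup by index key happens inside `predict`. Note that this sketch raises a `KeyError` for an index value it has not seen during training; how the actual operator behaves in that case is not covered here.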

The example process shows the performance vector of an indexed model in comparison to that of a normal model. If you have any questions, please feel free to ask!

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002"
expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve"
compatibility="7.5.003" expanded="true" height="68" name="Retrieve
Deals" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Deals"/>
      </operator>
      <operator activated="true" class="multiply"
compatibility="7.5.003" expanded="true" height="103" name="Multiply"
width="90" x="179" y="34"/>
      <operator activated="true" class="rmx_toolkit:indexed_model"
compatibility="2.2.882" expanded="true" height="103" name="Indexed
Model" width="90" x="314" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Gender"/>
        <process expanded="true">
          <operator activated="true"
class="concurrency:cross_validation" compatibility="7.5.003"
expanded="true" height="145" name="Validation" width="90" x="112" y="34">
            <parameter key="number_of_folds" value="7"/>
            <parameter key="sampling_type" value="stratified sampling"/>
            <process expanded="true">
              <operator activated="true"
class="concurrency:parallel_decision_tree" compatibility="7.5.003"
expanded="true" height="82" name="Decision Tree" width="90" x="112" y="34">
                <parameter key="criterion" value="gini_index"/>
                <parameter key="maximal_depth" value="2"/>
              </operator>
              <connect from_port="training set" to_op="Decision Tree"
to_port="training set"/>
              <connect from_op="Decision Tree" from_port="model"
to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model"
compatibility="7.5.003" expanded="true" height="82" name="Apply Model"
width="90" x="45" y="34">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance"
compatibility="7.5.003" expanded="true" height="82" name="Performance"
width="90" x="179" y="34"/>
              <connect from_port="model" to_op="Apply Model"
to_port="model"/>
              <connect from_port="test set" to_op="Apply Model"
to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data"
to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance"
to_port="performance 1"/>
              <connect from_op="Performance" from_port="example set"
to_port="test set results"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="batch of example set" to_op="Validation"
to_port="example set"/>
          <connect from_op="Validation" from_port="model" to_port="model"/>
          <portSpacing port="source_batch of example set" spacing="0"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_loop 1" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_output collector 1" spacing="0"/>
          <portSpacing port="sink_loop 1" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false"
width="126">Split the dataset for different
genders.&lt;br&gt;&lt;br&gt;Build a composite model with individual
models for each gender.</description>
      </operator>
      <operator activated="true" class="concurrency:cross_validation"
compatibility="7.5.003" expanded="true" height="145" name="Validation
(2)" width="90" x="313" y="340">
        <parameter key="number_of_folds" value="7"/>
        <parameter key="sampling_type" value="stratified sampling"/>
        <process expanded="true">
          <operator activated="true"
class="concurrency:parallel_decision_tree" compatibility="7.5.003"
expanded="true" height="82" name="Decision Tree (2)" width="90" x="112"
y="34">
            <parameter key="criterion" value="gini_index"/>
            <parameter key="maximal_depth" value="2"/>
          </operator>
          <connect from_port="training set" to_op="Decision Tree (2)"
to_port="training set"/>
          <connect from_op="Decision Tree (2)" from_port="model"
to_port="model"/>
          <portSpacing port="source_training set" spacing="0"/>
          <portSpacing port="sink_model" spacing="0"/>
          <portSpacing port="sink_through 1" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="apply_model"
compatibility="7.5.003" expanded="true" height="82" name="Apply Model
(3)" width="90" x="45" y="34">
            <list key="application_parameters"/>
          </operator>
          <operator activated="true" class="performance"
compatibility="7.5.003" expanded="true" height="82" name="Performance
(3)" width="90" x="179" y="34"/>
          <connect from_port="model" to_op="Apply Model (3)"
to_port="model"/>
          <connect from_port="test set" to_op="Apply Model (3)"
to_port="unlabelled data"/>
          <connect from_op="Apply Model (3)" from_port="labelled data"
to_op="Performance (3)" to_port="labelled data"/>
          <connect from_op="Performance (3)" from_port="performance"
to_port="performance 1"/>
          <connect from_op="Performance (3)" from_port="example set"
to_port="test set results"/>
          <portSpacing port="source_model" spacing="0"/>
          <portSpacing port="source_test set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_test set results" spacing="0"/>
          <portSpacing port="sink_performance 1" spacing="0"/>
          <portSpacing port="sink_performance 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false"
width="126">for comparison: use the same settings to build a model for
the whole dataset.</description>
      </operator>
      <operator activated="true" class="retrieve"
compatibility="7.5.003" expanded="true" height="68" name="Retrieve
Deals-Testset" width="90" x="447" y="85">
        <parameter key="repository_entry"
value="//Samples/data/Deals-Testset"/>
      </operator>
      <operator activated="true" class="apply_model"
compatibility="7.5.003" expanded="true" height="82" name="Apply Model
(2)" width="90" x="582" y="34">
        <list key="application_parameters"/>
        <description align="center" color="transparent" colored="false"
width="126">Automatically apply the appropriate model depending on the
gender</description>
      </operator>
      <operator activated="true" class="performance_classification"
compatibility="7.5.003" expanded="true" height="82" name="Performance
Indexed Model" width="90" x="715" y="34">
        <list key="class_weights"/>
      </operator>
      <operator activated="true" class="retrieve"
compatibility="7.5.003" expanded="true" height="68" name="Retrieve
Deals-Testset (2)" width="90" x="447" y="391">
        <parameter key="repository_entry"
value="//Samples/data/Deals-Testset"/>
      </operator>
      <operator activated="true" class="apply_model"
compatibility="7.5.003" expanded="true" height="82" name="Apply Model
(4)" width="90" x="580" y="340">
        <list key="application_parameters"/>
      </operator>
      <operator activated="true" class="performance_classification"
compatibility="7.5.003" expanded="true" height="82" name="Performance
normal model" width="90" x="713" y="340">
        <list key="class_weights"/>
      </operator>
      <connect from_op="Retrieve Deals" from_port="output"
to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_op="Indexed
Model" to_port="example set"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Validation
(2)" to_port="example set"/>
      <connect from_op="Indexed Model" from_port="model" to_op="Apply
Model (2)" to_port="model"/>
      <connect from_op="Validation (2)" from_port="model" to_op="Apply
Model (4)" to_port="model"/>
      <connect from_op="Retrieve Deals-Testset" from_port="output"
to_op="Apply Model (2)" to_port="unlabelled data"/>
      <connect from_op="Apply Model (2)" from_port="labelled data"
to_op="Performance Indexed Model" to_port="labelled data"/>
      <connect from_op="Performance Indexed Model"
from_port="performance" to_port="result 1"/>
      <connect from_op="Retrieve Deals-Testset (2)" from_port="output"
to_op="Apply Model (4)" to_port="unlabelled data"/>
      <connect from_op="Apply Model (4)" from_port="labelled data"
to_op="Performance normal model" to_port="labelled data"/>
      <connect from_op="Performance normal model"
from_port="performance" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="6"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

