Caching Data within a Process

pschlunder (RM Research), edited December 2018 in Knowledge Base


Problem

When building processes, you sometimes want to store intermediate ExampleSets in the repository so that you don't have to re-run long-running processes over and over again. This is especially useful when processes depend on the results of other processes.

 

Idea

Create a library process that executes a given process only if its output isn't stored in the repository yet. Otherwise, it simply reads the output from the repository.

 

Solution

 

Before we start creating the process, we need to set up RapidMiner Studio to show the Context view. To do so, go to "View -> Show Panel" and select "Context".

Overview

We try to retrieve an ExampleSet from the repository. If that is not possible, we execute the process that creates it.

 

The process we are going to create is a library process. A library process is a standard process designed to be useful in a general use case. It can therefore be placed in a separate folder (e.g. named 'lib') and reused without changes whenever the problem it solves occurs again.

General-purpose processes like this often rely on Macros and the Context. A Macro is a variable that can be used as a placeholder, e.g. to fill in parameter values, while the Context sets up the environment for a process. In the Context (configured through the Context view) you can define input and output objects, which are accessible through the process input and output ports. Any Macro that should be settable from outside the process also needs to be defined in the Context.
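
As a small illustration, the two fragments below are excerpted (and indented) from the process XML at the end of this post. They show a Macro being declared in the Context and then referenced as %{repo_path} in an Operator parameter. The comments are my own reading of the post, not part of the exported process, and the layout attributes of the Operator are left out for brevity.

```xml
<!-- Fragment 1: the Macro is declared in the Context of the process -->
<context>
  <input/>
  <output/>
  <macros>
    <macro>
      <key>repo_path</key>
      <!-- test value from the post, used when the process is run on its own -->
      <value>../../data/temp/cleaned_data</value>
    </macro>
  </macros>
</context>

<!-- Fragment 2: an Operator parameter refers to the Macro as %{repo_path} -->
<operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" name="Retrieve">
  <parameter key="repository_entry" value="%{repo_path}"/>
</operator>
```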

Step-by-Step Walkthrough


Create a new process, copy in the XML provided at the end of this post, and save it under a name such as 'caching', e.g. in a 'lib' folder.

1. Let's check the Context.

[Figure: Context of the caching process]

In the Context view, two Macros are defined: 'repo_path', the repository location where the temporary ExampleSet is stored, and 'path_to_process', the path to the process that should be executed when no ExampleSet has been created yet. We're not using 'process_path' as a Macro name because it is a predefined Macro. You don't need to fill in values for this to work; I only set some so that I could test the process without embedding it in another one. Also, make sure to use relative locations, i.e. locations relative to the location of this library process.

 

2. Trying to retrieve the ExampleSet via Handle Exception.

[Figure: Handling the failure of not being able to retrieve data]

Handle Exception is a nested Operator. It executes the subprocess defined in the 'Try' section and, on failure, executes the 'Catch' section. When we try to retrieve our ExampleSet from the repository, loading the location defined by the 'repo_path' Macro throws an exception if that entry does not exist.

[Figure: Inside Handle Exception]

The Retrieve Operator has %{repo_path} set as the value of its repository entry parameter. The subsequent Print to Console Operator only prints a message stating that the ExampleSet was retrieved from the repository.

On the 'Catch' side, the Execute Process Operator calls the desired process in case no ExampleSet is available yet. Its process location parameter is therefore set to %{path_to_process}. Afterwards, Print to Console logs a message stating that the cache is being created, and the Store Operator saves the result, again using %{repo_path} as the value of its repository entry parameter.

Usage

To illustrate the usage let's have a look at a sample repository:

[Figure: Example Repository]

In this example we have two data sets stored in a 'data' folder; a 'process' folder with a preparation process named '01 Clean Data' and a follow-up process named '02 Build Model', which builds on the (cached) results; a 'lib' folder with our newly created 'caching' process; and a 'results' folder.

We'll use the caching library process at the start of the '02 Build Model' process. It will create an ExampleSet called 'cleaned_data' inside a folder named 'temp' located inside 'data'.

[Figure: Process using caching]

To use the caching library process, we drag and drop it into the '02 Build Model' process and set up the Macros that define the location of the ExampleSet to create and the process to execute in order to create it.

[Figure: Caching configuration]

To set up the Macros, first click on the newly added 'Execute caching' Operator and then on 'Edit List' for the macros parameter. A new window pops up in which you can set the Macros defined in the Context of the caching process.
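
In XML terms, the configured Operator inside '02 Build Model' might look roughly like the sketch below. It is not an export of the original process. The process location '../lib/caching' and both macro values are assumptions: they presume that 'data', 'process' and 'lib' are sibling folders, as the sample repository suggests, and that the macro values are written relative to the library process, as recommended in step 1. Layout attributes are omitted.

```xml
<!-- Sketch only: the paths and macro values below are assumptions based on the sample repository -->
<operator activated="true" class="productivity:execute_process" compatibility="8.0.001" expanded="true" name="Execute caching">
  <!-- location of the library process, relative to '02 Build Model' -->
  <parameter key="process_location" value="../lib/caching"/>
  <list key="macros">
    <!-- both values are meant relative to the location of the 'caching' process in 'lib' -->
    <parameter key="repo_path" value="../data/temp/cleaned_data"/>
    <parameter key="path_to_process" value="../process/01 Clean Data"/>
  </list>
</operator>
```

Each entry in the macros list overrides the Macro of the same name defined in the Context of the caching process, so the test values stored there are not used when the process is called this way.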

 

On the first execution of this process, the ExampleSet at 'repo_path' can't be retrieved, so the process located at 'path_to_process' is executed and its result is stored. Our example repository then looks like this:

[Figure: Example Repository after creation of the cached ExampleSet]

If we execute the '02 Build Model' process again, it no longer needs to create the ExampleSet but falls back on reading the cached one, and thus runs faster.

 

Find the XML code of the caching process here:

 

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
  <context>
    <input/>
    <output/>
    <macros>
      <macro>
        <key>repo_path</key>
        <value>../../data/temp/cleaned_data</value>
      </macro>
      <macro>
        <key>path_to_process</key>
        <value>../clean_data</value>
      </macro>
    </macros>
  </context>
  <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="handle_exception" compatibility="8.0.001" expanded="true" height="82" name="Handle Exception" width="90" x="179" y="187">
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve" width="90" x="112" y="34">
            <parameter key="repository_entry" value="%{repo_path}"/>
          </operator>
          <operator activated="true" class="print_to_console" compatibility="8.0.001" expanded="true" height="82" name="Print to Console" width="90" x="246" y="34">
            <parameter key="log_value" value="reading from cache"/>
          </operator>
          <connect from_op="Retrieve" from_port="output" to_op="Print to Console" to_port="through 1"/>
          <connect from_op="Print to Console" from_port="through 1" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
        <process expanded="true">
          <operator activated="true" class="productivity:execute_process" compatibility="8.0.001" expanded="true" height="82" name="Execute Process" width="90" x="45" y="34">
            <parameter key="process_location" value="%{path_to_process}"/>
            <parameter key="cache_process" value="false"/>
            <list key="macros"/>
            <description align="center" color="transparent" colored="false" width="126">process to be executed if repo entry is not available</description>
          </operator>
          <operator activated="true" class="print_to_console" compatibility="8.0.001" expanded="true" height="82" name="Print to Console (2)" width="90" x="179" y="34">
            <parameter key="log_value" value="creating cache"/>
          </operator>
          <operator activated="true" class="store" compatibility="8.0.001" expanded="true" height="68" name="Store" width="90" x="313" y="34">
            <parameter key="repository_entry" value="%{repo_path}"/>
            <description align="center" color="transparent" colored="false" width="126">save output of process to cache location</description>
          </operator>
          <connect from_port="in 1" to_op="Execute Process" to_port="input 1"/>
          <connect from_op="Execute Process" from_port="result 1" to_op="Print to Console (2)" to_port="through 1"/>
          <connect from_op="Print to Console (2)" from_port="through 1" to_op="Store" to_port="input"/>
          <connect from_op="Store" from_port="through" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">try to load data</description>
      </operator>
      <connect from_op="Handle Exception" from_port="out 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <description align="left" color="yellow" colored="false" height="298" resized="true" width="277" x="80" y="40">#1 path and filename of the temp. data set are set via a macro in the context (View -&amp;gt; Show Panel -&amp;gt; Context)&lt;br&gt;&lt;br&gt;#2 If no data can be found under the path of location #1 a process is executed. The path to the process is also defined in the context.</description>
    </process>
  </operator>
</process>

 
