Process Testing or Unit Testing in RapidMiner
What is Process testing?
The Process Testing extension streamlines testing RapidMiner processes for RapidMiner users and extension developers.
The Process Testing extension allows you to create process-based unit tests. A process can be run as a test, which automatically saves the results of the process as the expected results. Later, the process can be re-run to automatically compare the newly created results with the previously stored expected results. A test passes if the results are still the same and fails otherwise.
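To make the idea concrete, here is a minimal Python sketch of the same store-then-compare pattern. It is only an analogy and not how the extension is implemented: the function names, the JSON file format, and the `sample_process` stand-in are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def run_and_store_expected(process, expected_path):
    """Run the process once and save its output as the expected result."""
    result = process()
    Path(expected_path).write_text(json.dumps(result))
    return result


def run_as_test(process, expected_path):
    """Re-run the process and compare its output with the stored result."""
    expected = json.loads(Path(expected_path).read_text())
    actual = process()
    if actual == expected:
        return ("success", "")
    return ("failure", f"expected {len(expected)} items, got {len(actual)}")


def sample_process(size=100):
    # Hypothetical stand-in for a process: keep the first `size` rows.
    return list(range(1000))[:size]


with tempfile.TemporaryDirectory() as tmp:
    expected_file = Path(tmp) / "sampling-expected-port-0.json"
    run_and_store_expected(sample_process, expected_file)
    # Same process, same output: the test passes.
    unchanged = run_as_test(sample_process, expected_file)
    # Changed sample size: the output differs from the stored result.
    changed = run_as_test(lambda: sample_process(size=200), expected_file)
```

The stored file plays the role of the expected result, so any later change to the process or its input shows up as a failed comparison.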
Benefits and steps to implement process testing in RapidMiner:
- Process testing is a powerful way to monitor changes in the expected results of a process at each phase of an end-to-end data science workflow.
For example, if the input data of a process from an external data source changes, the performance matrix of the trained model may differ from the matrix of the previously trained model, which is stored as the expected result (benchmark or reference matrix).
The 'Execute Process Tests' operator in this extension helps detect these changes and flags such a process execution outcome as a failure. In addition, the RapidMiner Server scheduling feature can send automated email alerts about the execution status to a system admin or process developer.
Similarly, the same concept extends to detecting changes in the attribute weights of a given ExampleSet, in scoring results, and in ETL, data cleansing, feature engineering, and feature selection workflows, as well as to sending alerts if the champion machine learning model changes.
How do I use it?
1) Download the extension in RapidMiner Studio from the Extension >> Marketplace and restart RapidMiner Studio:
Download the extension
Once Studio has restarted, you will find a new menu option called Testing and two new operators in the Operator >> Extensions folder.
Testing - menu option in Studio
Process Testing operators
2) Select the RapidMiner process that you want to test. Save the process in the Repository, then navigate to Testing >> Run And Store As Expected Results. This runs the current process and stores the results of all output ports as expected results in the current repository location of the process.
Note: This step is mandatory in the Process Testing framework. It creates the expected results of the process, which serve as the reference for comparison whenever the same process is re-run.
In the following example, I have created a process named 01 sampling that samples the 'Titanic' data down to 100 examples. When the Run And Store As Expected Results option is executed, the result of the process at the output port is stored in the repository as shown below:
Sampling down data and storing the expected results of the output of the process
3) Use the Execute Process Tests operator to test this process and check whether the output results have changed.
(Save this process in the Repository as 'Execute Process')
Execute Process to test the 01 sampling process results
If there are no changes in the 01 sampling process, the output will be the same as the stored expected results '01 sampling-expected-port-0' and the test outcome will be a success, as shown below:
01 sample testing results
Now let's make some changes to the 01 sampling process and examine how Execute Process detects changes between the actual result of the process and the expected results of the process:
Change the sample size to 1000 and save the process
If we re-run Execute Process, the outcome is a failure since the expected result '01 sampling-expected-port-0' has 100 samples stored, but the updated 01 sampling process results in 1000 samples as shown below:
Outcome: Failure, with appropriate Error message
Further, we can extend this process to run batch testing. The 'Execute Process Tests' operator executes all processes under the selected repository location. The location can be a single process, a folder, or a repository. Subfolders are also executed recursively. The result contains the location, the outcome (success or failure), and the error message for each executed process.
Processes whose names contain the string literal "NOTEST" are not executed.
Batch testing of the processes
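The batch behaviour described above (recursive traversal, skipping "NOTEST" processes, and one result row per process) can be sketched in Python as follows. This is an illustrative model only: the repository is represented as a nested dictionary, a process is any callable, and a test is considered failed when the callable raises an exception. None of these names come from the extension itself.

```python
def execute_process_tests(location, node, results=None):
    """Walk a repository tree and test every process found under it."""
    if results is None:
        results = []
    if callable(node):
        # A single process: skip it if its name contains "NOTEST".
        name = location.rsplit("/", 1)[-1]
        if "NOTEST" in name:
            return results
        try:
            node()
            results.append((location, "success", ""))
        except Exception as exc:
            results.append((location, "failure", str(exc)))
        return results
    # A folder or repository: descend into every entry recursively.
    for child_name, child in node.items():
        execute_process_tests(f"{location}/{child_name}", child, results)
    return results


def passing_process():
    return None


def failing_process():
    raise AssertionError("expected 100 examples, got 1000")


# Hypothetical repository layout used only for this example.
repository = {
    "01 sampling": passing_process,
    "etl": {
        "02 cleansing": failing_process,
        "03 draft NOTEST": failing_process,
    },
}

results = execute_process_tests("repo", repository)
for location, outcome, error in results:
    print(location, outcome, error)
```

Each row mirrors the three columns mentioned above: location, outcome, and error message.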
4) Schedule this process to run on RapidMiner Server and send automatic email alerts when any process fails:
Send automatic email alerts when a process fails in Process testing
Here is a sample alert email, listing the failed processes, that is sent when the above process runs on schedule:
Sample email alert
- To execute Process Testing on RapidMiner Server, make sure to install the extension jar in the RapidMiner Server home directory and on the job agent on which you intend to run or schedule this process.
- To use the Send Mail operator, make sure to set up your email settings in Studio and on Server accordingly.
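For readers who think in code, the alerting step amounts to filtering the test results for failures and composing a message from them. The sketch below uses Python's standard `email.message.EmailMessage` to build (not send) such an alert. The result rows, the addresses, and the `build_failure_alert` helper are hypothetical, and in RapidMiner this is handled by the Send Mail operator rather than by code like this.

```python
from email.message import EmailMessage


def build_failure_alert(results, sender, recipient):
    """Compose an alert listing failed processes, or None if all passed."""
    failures = [row for row in results if row[1] == "failure"]
    if not failures:
        return None
    message = EmailMessage()
    message["Subject"] = f"Process tests: {len(failures)} failed"
    message["From"] = sender
    message["To"] = recipient
    lines = [f"{location}: {error}" for location, _, error in failures]
    message.set_content("\n".join(lines))
    return message


# Hypothetical result rows: (location, outcome, error message).
test_results = [
    ("repo/01 sampling", "success", ""),
    ("repo/etl/02 cleansing", "failure", "expected 100 examples, got 1000"),
]

alert = build_failure_alert(test_results, "tests@example.com", "admin@example.com")
no_alert = build_failure_alert(test_results[:1], "tests@example.com", "admin@example.com")
```

Returning `None` when everything passes keeps the schedule quiet unless there is something to report.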
Extension developers can also test whether an operator they have developed reports errors in a user-friendly way. The Process Testing extension has an operator named 'Expect User Error'. It is a nested operator, i.e. it has a Subprocess.
This operator first tries to execute the Subprocess. If the UserError specified by the i18n_key parameter occurs during the execution of the Subprocess (each i18n_key maps to a particular user error), the operator runs successfully and writes a log entry. If no UserError occurs, or if the UserError that occurs differs from the one specified by the i18n_key parameter, the operator itself throws a UserError that contains information about the reason for the failure.
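The logic of the three cases can be sketched in Python as shown below. This is a rough model of the described behaviour, not the operator's source: the `UserError` class, the `expect_user_error` function, and the keys `attribute_missing`, `unexpected_user_error`, and `no_user_error` are all made up for illustration.

```python
class UserError(Exception):
    """Hypothetical stand-in for a user error identified by an i18n key."""

    def __init__(self, i18n_key, message=""):
        super().__init__(message or i18n_key)
        self.i18n_key = i18n_key


def expect_user_error(subprocess, i18n_key, log):
    """Succeed only if the subprocess raises the expected UserError."""
    try:
        subprocess()
    except UserError as err:
        if err.i18n_key == i18n_key:
            # Expected error occurred: log it and finish successfully.
            log.append(f"expected user error '{i18n_key}' occurred")
            return
        # A different user error occurred: report the mismatch.
        raise UserError(
            "unexpected_user_error",
            f"expected '{i18n_key}' but got '{err.i18n_key}'",
        ) from err
    # The subprocess finished without any user error.
    raise UserError(
        "no_user_error",
        f"expected '{i18n_key}' but no user error occurred",
    )


def missing_attribute():
    # Hypothetical subprocess that fails with a known user error.
    raise UserError("attribute_missing")


log = []
expect_user_error(missing_attribute, "attribute_missing", log)
```

Calling `expect_user_error` with a different key, or with a subprocess that does not fail, raises a `UserError` explaining which of the two failure cases applied.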
There is a sample tutorial process in the help section of this operator to demonstrate the functionality of the 'Expect User Error' operator:
Expect User Error
You can find all data and processes associated with this post in the Community Repository inside RapidMiner Studio.
Hope you find this article useful. Feel free to post comments, feedback, and questions about Process Testing.