I created an operator chain to clean up my training data and would now like to apply the exact same chain to my test data.
How can I do this in the same process without copying the entire chain just to feed the test set through it?
There are several ways to handle this situation, but perhaps the easiest thing to do would be to save your first process as "data ETL" or something similar.
Then create a separate process for your test data. In that process, load the test data (from files, a database connection, or however you normally do it) and then call the original ETL process from your repository using the "Execute Process" operator. As long as the test data starts in the same raw format as your original data, this will work fine. You can also use that same ETL process in the future to transform unlabeled data.
Under this approach, you only have to maintain one version of your ETL process, so if you add to it or update it in the future, you don't need to worry about replicating those changes elsewhere. The "Execute Process" operator will always retrieve the most current version of that process to apply.
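If it helps to see the idea outside the RapidMiner GUI, here is a rough analogy in plain Python. It is not RapidMiner code: the function `data_etl`, its two cleanup steps, and the sample rows are all made up for illustration. The point is only that the cleanup chain lives in one place and both data sets are passed through it, much like calling a saved process via "Execute Process".

```python
def data_etl(rows):
    """Hypothetical cleanup chain, standing in for the saved "data ETL" process."""
    cleaned = []
    for row in rows:
        # Step 1 (assumed for illustration): trim whitespace from string values.
        trimmed = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        # Step 2 (assumed for illustration): drop rows with missing or empty values.
        if any(v is None or v == "" for v in trimmed.values()):
            continue
        cleaned.append(trimmed)
    return cleaned


# Made-up raw data in the same format for both sets.
train_raw = [{"name": " Ann ", "age": 31}, {"name": "", "age": 40}]
test_raw = [{"name": "Bob  ", "age": 25}, {"name": "Cy", "age": None}]

# The same chain is applied to both sets; nothing is copied.
train_clean = data_etl(train_raw)
test_clean = data_etl(test_raw)
```

Any later change to `data_etl` automatically applies to both the training and test data, which is the same maintenance benefit you get from keeping a single ETL process in the repository.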