When testing, I read data from a CSV. I'd like to limit the samples to a set of categories that is dynamically generated from a training set.
The training set might only have 20 categories but the test set could have 200. I only want to test on the 20.
The rest of the samples should be filtered out.

I read in the training set and extract the category list.
I remove duplicates so that I have a unique list of categories.
This is what I want to filter my test set on.

I save the list to a file for later lookup if needed.
Now I'd like to read in the test data, filter on that list of categories, and press on with testing.

How would I do such a thing? 


I realized I could solve this by taking the unique list of categories and performing an inner join (operator) with the test set, using the category column as the key attribute. That removes all the unwanted samples. Easy!
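
For anyone doing the same thing in code rather than with a join operator, here is a minimal pandas sketch of the idea. The column name `category` and the toy data are assumptions for illustration; in practice the two frames would come from the training and test CSVs.

```python
import pandas as pd

# Hypothetical toy data standing in for the training and test CSVs.
train = pd.DataFrame({"category": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]})
test = pd.DataFrame(
    {"category": ["a", "d", "c", "e", "b"], "value": [10, 20, 30, 40, 50]}
)

# Unique categories seen in training, kept as a one-column frame
# so it can serve as the lookup table for the join.
categories = train[["category"]].drop_duplicates()

# Inner join on the category column: only test rows whose category
# also appears in the training list survive.
filtered = test.merge(categories, on="category", how="inner")

# Equivalent filter without a join, using a membership test.
filtered_isin = test[test["category"].isin(categories["category"])]
```

Because the category list has no duplicates, the inner join cannot multiply test rows, so both approaches should keep the same samples.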
