scoring and storing very large datasets in RM - any hint?
The Stream Database operator is one of the essential operators for handling very large datasets in RM (if not the most important one).
I ran the following experiment: using this operator, I sampled a large dataset stored in a database, so that the sample fits in main memory and can be handled there to learn a model (data preprocessing included). So far so good: I built and evaluated the model and was happy with its performance, so I saved it. Then I applied the saved model to the whole dataset, which was naturally accessed via the same Stream Database operator, intending to save the result in a new table of the database.
The process failed, with the suggestion to materialise the dataset in memory first (!!), which is no solution given the size of the dataset.
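To make concrete what I am after, here is a rough sketch of the chunked score-and-append pattern in Python/pandas terms (purely illustrative: the database file, table and column names are invented, and the scikit-learn model merely stands in for the model learned in RM):

    import sqlite3
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    con = sqlite3.connect("mydata.db")   # hypothetical database

    # Step 1: pull a random sample that fits in main memory and learn on it
    # (ORDER BY RANDOM() is SQLite syntax; other DBMSs sample differently).
    train = pd.read_sql_query(
        "SELECT x1, x2, y FROM big_table ORDER BY RANDOM() LIMIT 100000", con)
    model = LogisticRegression().fit(train[["x1", "x2"]], train["y"])

    # Step 2: score the full table chunk by chunk, so it never has to be
    # materialised as a whole, appending each scored chunk to a new table.
    for chunk in pd.read_sql_query("SELECT x1, x2 FROM big_table", con,
                                   chunksize=50000):
        chunk["prediction"] = model.predict(chunk[["x1", "x2"]])
        chunk.to_sql("scored_table", con, if_exists="append", index=False)

If there is a combination of RM operators that realises this kind of loop (scoring batch by batch and appending the results to a database table), that is exactly the hint I am looking for.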
Although it is obvious to me how to implement this in an established data mining suite such as SPSS Clementine/Modeler or SAS Enterprise Miner, I cannot see an approach for scoring and storing the whole (large) dataset with RM. I assume it should be possible. Many thanks to anyone willing to share their experience or provide a useful hint.