SOLVED: How to efficiently handle a large number of classes in models
Hi,
I have been using RapidMiner and Analytics for quite some time, and the product is really great. Congratulations. After a lot of ETL, I am starting to use models.
My first serious model consisted of recreating a deterministic model for examples with a high number of classes (thousands) and millions of examples. Performance and efficiency are key for the implementation.
Therefore, what I did was create a model using a tree algorithm, setting my own weights. After fine-tuning the default parameters, building the model worked really well (as long as you don't try to display it, since rendering the tree takes forever). Still, I could translate it into rules and verify that the result was correct.
The problem came when applying the model with the "Apply Model" operator. This operator also creates a confidence (probability) attribute for each class, which in my case results in an explosion of data. I admit that so many classes are probably not that common, but I cannot imagine my case is that unusual, so I suppose there must be some way to handle this.
I actually recreated the model with "old-fashioned" programming in RapidMiner (it's a kind of b-tree look-up mechanism), and I could get 50-70 "predictions" (or mappings) per second on my quad-core machine, standalone. I would expect the model mechanism to give me at least a 5x improvement on that.
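To illustrate the kind of look-up I mean, here is a minimal Python sketch (the key ranges and classes are made up, not my real rules):

import bisect

# Sorted lower bounds of key ranges, each range mapping to one class.
bounds = [0, 100, 250, 900]
labels = ["A", "B", "C", "D"]

def lookup(key):
    # Binary search for the range containing `key`, O(log n) per look-up.
    i = bisect.bisect_right(bounds, key) - 1
    return labels[max(i, 0)]

print(lookup(120))  # -> "B"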
Thankful for any insight...
Julio
Answers
Does the "data explosion" cause any actual problems, e.g. in terms of memory consumption? If not, you can simply remove the confidence attributes by using a Select Attributes operator with the following settings:
regular expression = confidence, include special attributes, invert selection. This will clean up your dataset.
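If it helps, here is a rough pandas equivalent of that step (a sketch only; the column names assume Apply Model's usual confidence(<class>) naming, and the data is made up):

import pandas as pd

# Scored table as it might come out of Apply Model (illustrative names/values).
scored = pd.DataFrame({
    "prediction(label)": ["A", "B"],
    "confidence(A)": [0.9, 0.2],
    "confidence(B)": [0.1, 0.8],
})

# Keep every column that does NOT start with "confidence"
# (the pandas analogue of regex match + invert selection).
cleaned = scored.loc[:, ~scored.columns.str.startswith("confidence")]
print(cleaned.columns.tolist())  # -> ['prediction(label)']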
As a further remark, let me add that to verify the decision tree is doing a good job you do not need to manually create rules; you can simply use cross-validation (X-Validation in RapidMiner). If you are not familiar with this concept, a quick Google or Wikipedia search will give you a good overview.
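For example, with scikit-learn standing in for X-Validation (the dataset and learner here are just placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation: train on 9 folds, test on the held-out fold.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print(scores.mean())  # average accuracy across the 10 folds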
Best regards,
Marius
Thank you for the answer.
Indeed, given the relatively large number of classes (thousands), this does become a memory problem in practice. I understand that I could of course throttle the number of entries (which would be hundreds of thousands) going through the model, but I was wondering if there are more efficient options. (I do understand the rationale, but the approach of the Apply Model operator with a large number of classes is not "elegant", whatever that word means in data analytics... :-))
I infer from your answer that there isn't. How about an Apply Model without confidence attributes?
FYI, the reason I programmed things myself was to see what the performance would be without using the model.
I will still check performance with the model, but this data explosion is really a deal-breaker (for this specific context).
Thank you!
Julio
Unfortunately there is no way to prevent RapidMiner's models from creating the confidence attributes, but I see your point that for your specific use case this is not very handy.
However, I can't believe that you can achieve acceptable accuracy with one single model for so many classes. Without knowing anything about the underlying concepts of your data it is hard to give further help, but maybe it is possible to combine some of the classes to reduce the number of possible outcomes and create a kind of hierarchical model: a first model predicts one of the combined classes, and a second model then digs deeper to identify the original classes within that combined class, as sketched below.
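A rough scikit-learn sketch of that two-stage idea (the grouping, data, and learners are all invented for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 20, size=1000)          # 20 "fine" classes
group_of = {c: c // 5 for c in range(20)}   # combine them into 4 coarse groups
y_group = np.array([group_of[c] for c in y])

# Stage 1: one model predicts the coarse group.
top = DecisionTreeClassifier().fit(X, y_group)

# Stage 2: one model per group, trained only on that group's examples.
sub = {g: DecisionTreeClassifier().fit(X[y_group == g], y[y_group == g])
       for g in np.unique(y_group)}

def predict(x):
    g = top.predict(x.reshape(1, -1))[0]        # coarse group first...
    return sub[g].predict(x.reshape(1, -1))[0]  # ...then the fine class

print(predict(X[0]))

This way each sub-model only has to distinguish a handful of classes, and applying it only produces confidences for the classes within one group.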
Best regards,
Marius
The point is that I have a set of rules that, by definition, always determine the correct result 100% of the time. I also understand your point that this is not much of a prediction model...
Thanks again!
Julio