RapidMiner Studio Forum: SOLVED: How to handle efficiently large number of ...


10-01-2013 04:41 AM

Hi,

I have been using RapidMiner and Analytics for quite some time, and the product is really great. Congratulations! After a lot of ETL, I am starting to use models.

My first serious model consisted of recreating a deterministic model for examples with a high number of classes (thousands) and millions of examples. Performance and efficiency are key for the implementation.

Therefore, what I did was create a model using a tree algorithm, setting my own weights. After fine-tuning the default parameters, the creation of the model worked really well (as long as you don't try to display the model, as rendering the tree takes forever). Still, I could translate it into rules and verify that the result was fine.

The problem came when applying the model using the "Apply Model" operator. This operator also creates a confidence (probability) attribute for each class, which in my case results in an explosion of data. I admit that so many classes are probably not that common, but I cannot imagine my case is that unusual. So I suppose there must be some way to handle this.

I actually recreated the model with "old-fashioned" programming inside RapidMiner (it's a kind of b-tree look-up mechanism), and I could get up to 50-70 "predictions" (or mappings) per second on my quad-core machine, standalone. I would expect the model mechanism to give me at least a 5x increase on that.
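For reference, a deterministic look-up table like the one Julio describes can be sketched in Python; `bisect` over a sorted key list stands in for the b-tree, and the keys and class labels below are purely illustrative:

```python
import bisect

# Hypothetical deterministic mapping: sorted keys -> class labels,
# standing in for the b-tree look-up described above.
keys = ["alpha", "bravo", "charlie", "delta"]   # must stay sorted
labels = ["class_17", "class_4", "class_4", "class_9"]

def predict(key):
    """Return the class for an exactly matching key, else None."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return labels[i]
    return None
```

Unlike a learned model, this returns exactly one class per example and never materializes per-class confidences, which is why its memory footprint stays flat.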

Thankful for any insight...

Julio


4 REPLIES


10-02-2013 09:00 AM

Hi Julio,

Does the "data explosion" cause any problems, e.g. in terms of memory consumption? If not, you can simply remove the confidence attributes with a Select Attributes operator using the following settings:

regular expression = confidence, include special attributes, invert selection. This will clean your dataset.
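Outside Studio, the same cleanup can be sketched with pandas (the column names below are illustrative; RapidMiner names the generated attributes `confidence(<class>)`):

```python
import pandas as pd

# Scored output roughly as "Apply Model" would produce it
# (column names are illustrative):
scored = pd.DataFrame({
    "prediction": ["c1", "c2"],
    "confidence(c1)": [0.9, 0.2],
    "confidence(c2)": [0.1, 0.8],
    "amount": [10.0, 12.5],
})

# Equivalent of Select Attributes with regular expression = "confidence",
# include special attributes, invert selection: drop every column whose
# name starts with "confidence".
cleaned = scored.loc[:, ~scored.columns.str.startswith("confidence")]
```

Note this only shrinks the result after scoring; the confidence columns still exist briefly while the model is applied.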

As a further remark: to verify that the decision tree is doing a good job you do not need to manually create rules; you can simply use cross-validation (the X-Validation operator in RapidMiner). If you are not familiar with the concept, a quick Google or Wikipedia search will give you a good overview.
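The idea behind X-Validation can be sketched with scikit-learn's counterpart on a toy dataset (a sketch only; Julio's real data and tree settings would differ):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 10-fold cross validation, the scikit-learn counterpart of
# RapidMiner's X-Validation operator, on a small toy dataset.
X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())  # average accuracy across the 10 folds
```

Each of the 10 folds is held out once for testing while the tree is trained on the other nine, so the averaged accuracy estimates performance on unseen data without any hand-written rules.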

Best regards,

Marius



10-02-2013 09:10 AM

Hi Marius,

Thank you for the answer.

Indeed, given the relatively large number of classes (thousands), this does become a memory problem in practice. I understand that I could throttle the number of entries (which would be hundreds of thousands) going through the model, but I was wondering if there are more efficient options. (I do understand the rationale, but the Apply Model approach with a large number of classes does not look very "elegant", whatever that word means in data analytics... :-))

I gather from your answer that there isn't. How about an Apply Model without confidence attributes?

FYI, the reason I programmed things myself was to see what the performance would be without using the model.

I will still check performance with the model, but this data explosion is really a deal-breaker (for this specific context).
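The throttling Julio mentions can be sketched as batched scoring: apply the model to a fixed-size slice at a time and drop the confidence columns immediately, so they never accumulate across hundreds of thousands of rows. Everything here is a hypothetical stand-in (`apply_model` represents whatever produces the scored frame); it is not a RapidMiner API:

```python
import pandas as pd

def score_in_batches(examples, apply_model, batch_size=10_000):
    """Apply a model in fixed-size batches, keeping only non-confidence
    columns so per-class confidences never pile up in memory.
    `apply_model` is a hypothetical callable returning a scored DataFrame."""
    parts = []
    for start in range(0, len(examples), batch_size):
        batch = examples.iloc[start:start + batch_size]
        scored = apply_model(batch)
        # Drop the confidence columns batch by batch, before concatenating.
        parts.append(scored.loc[:, ~scored.columns.str.startswith("confidence")])
    return pd.concat(parts, ignore_index=True)
```

Peak memory is then bounded by one batch's worth of confidence columns instead of the full dataset's.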

Thank you!

Julio



10-10-2013 02:43 AM

Hi Julio,

Unfortunately there is no way to prevent the models in RapidMiner from creating the confidence attributes, but I get your point that for your specific use case it is not very handy.

However, I can't believe that you can get acceptable accuracy with one single model for so many classes. Without knowing anything about the underlying concepts in your data it is hard to give more specific help, but maybe it is possible to combine some of the classes to reduce the number of possible outcomes and create a kind of hierarchical model: a first model predicts one of the combined classes, and a second model then digs deeper to identify the original class within that combined class?
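The two-stage idea can be sketched with scikit-learn on a toy dataset (the grouping of classes is purely illustrative; in practice the groups would come from domain knowledge about which classes belong together):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Illustrative grouping: collapse the 10 digit classes into 2 coarse groups.
coarse = y // 5            # classes 0-4 -> group 0, classes 5-9 -> group 1

# Stage 1: one model predicts the coarse group.
stage1 = DecisionTreeClassifier(random_state=0).fit(X, coarse)

# Stage 2: one model per group, predicting the original class inside it.
stage2 = {}
for g in np.unique(coarse):
    mask = coarse == g
    stage2[g] = DecisionTreeClassifier(random_state=0).fit(X[mask], y[mask])

def predict(x):
    """Route through the coarse model, then the group's fine model."""
    g = stage1.predict(x.reshape(1, -1))[0]
    return stage2[g].predict(x.reshape(1, -1))[0]
```

A side benefit for Julio's memory problem: each stage only produces confidences for its own (much smaller) set of classes, instead of one column per original class.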

Best regards,

Marius



10-10-2013 02:51 AM