Thoughts about memory consumption and FeatureSelection...
Hi everybody,
I'm running the 32-bit version of RapidMiner 4.6 and am trying to do a forward feature selection on a data set with 100 examples and 2000 features. After 5 hours RapidMiner had used 1.4 GB of RAM and finished with an Out of Memory error :-(
Searching the forum, I found several posts about memory consumption, suggesting that it might be a bad idea to do feature selection on such a large data set. I then tried a rough calculation of the memory needed:
100 examples * 2000 features * 8 bytes = 1.6 MB
For the first generation the FeatureSelection algorithm will create 2000 individuals. If each one holds the full data set, that makes 2000 * 1.6 MB = 3.2 GB, so no wonder I run out of memory.
But then I realized that this is true for a backward feature selection, but not for a forward feature selection!
Forward selection starts with a single attribute, so all the individuals of the first generation only need
100 examples * 1 feature * 8 bytes * 2000 individuals = 1.6 MB!
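The rough estimates above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it assumes 8 bytes per value (a 64-bit double) and ignores any per-object overhead, and the function and variable names are made up for illustration.

```python
BYTES_PER_VALUE = 8  # assumption: every value is stored as a 64-bit double


def dataset_bytes(n_examples, n_features, bytes_per_value=BYTES_PER_VALUE):
    """Raw size of a dense numeric table, ignoring per-object overhead."""
    return n_examples * n_features * bytes_per_value


n_examples, n_features = 100, 2000
n_individuals = n_features  # first generation: one individual per feature

# One copy of the complete data set.
full_copy = dataset_bytes(n_examples, n_features)

# Forward selection, first generation: each individual holds one feature.
forward_gen1 = n_individuals * dataset_bytes(n_examples, 1)

# Worst case: every individual carries a full copy of the data set.
full_copy_per_individual = n_individuals * full_copy

print(f"one full copy:            {full_copy / 1e6:.1f} MB")
print(f"forward, 1 feature each:  {forward_gen1 / 1e6:.1f} MB")
print(f"full copy per individual: {full_copy_per_individual / 1e9:.1f} GB")
```

The gap between the last two numbers is the whole question: the same 2000 individuals need either about 1.6 MB or about 3.2 GB, depending on whether each one stores only its own attribute or the complete data set.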
So now I'm back to square one. Why does forward feature selection need so much memory?
My only guess is that the individuals nevertheless each get a full copy of the data set, although this is not necessary.
If this is true, the code urgently needs a revision.
Maybe someone can comment on this?
Many thanks,
Axel
Answers
Unfortunately, this part of RapidMiner is quite old. It follows the nice generalization idea of mapping everything to population-based operations, but that has the disadvantage of being quite inefficient.
We have an extension, not yet public, that provides efficient implementations of forward and backward selection. We are going to add a few more valuable operators before publishing it, but if you are interested, we could probably give you a pre-release version...
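To illustrate why a direct implementation can be much cheaper than a population of data-set copies, here is a minimal sketch of greedy forward selection in plain Python. It is not RapidMiner's operator and not the extension mentioned above; the function names, the toy scoring function and the toy data are all assumptions made for the example. The point is only that each candidate is a tuple of column indices into one shared table, so no candidate holds its own copy of the data.

```python
def neg_squared_error(selected_columns, target):
    """Toy score (assumption): predict the target as the row-wise sum of the
    selected columns and return the negative squared error (higher is better)."""
    predictions = [sum(values) for values in zip(*selected_columns)]
    return -sum((p - t) ** 2 for p, t in zip(predictions, target))


def forward_selection(columns, target, score):
    """Greedy forward selection over a list of feature columns.

    Candidates are tuples of indices into `columns`; the column lists
    themselves are only referenced, never copied.
    """
    selected = ()
    best_score = float("-inf")
    remaining = set(range(len(columns)))
    while remaining:
        # One candidate per remaining feature: current subset plus that feature.
        scored = [
            (score([columns[i] for i in selected + (j,)], target), j)
            for j in sorted(remaining)
        ]
        round_score, best_j = max(scored, key=lambda pair: pair[0])
        if round_score <= best_score:
            break  # no remaining feature improves the score
        selected += (best_j,)
        best_score = round_score
        remaining.remove(best_j)
    return selected, best_score


# Toy data (assumption): the target is exactly column 0 plus column 2.
columns = [[1, 2, 3], [0, 0, 0], [1, 0, 1]]
target = [2, 2, 4]
result = forward_selection(columns, target, neg_squared_error)
print(result)
```

In this form the extra memory per candidate is a handful of integers, regardless of how many examples the data set has.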
Greetings,
Sebastian
Your new implementation of feature selection sounds very interesting.
Of course I would like to try it, if possible.
How would I get it?
Axel
P.S. Sorry for the delay. I was on a short holiday :-)
No problem at all. I hope you had a good time while we were working.
For further information about the plugin, could you please write an email to contact@rapid-i.com?
Greetings,
Sebastian