"Is weight by Information Gain the right operator for me?"

mohammadreza Member Posts: 23 Contributor II
edited June 2019 in Help
Hi all,

I am using the operator "Weight by Information Gain" to select the most predictive attributes from a data set with 218,000 attributes and 60,000 examples. (This is the example set I got from RapidMiner text processing.)

I have been waiting for 4 days so far and the process is still running on a PC with 32 GB of RAM. I am afraid this is not the right operator for my problem. Could you please tell me whether I have done something wrong?

BTW, as far as I understand, the computational complexity of calculating information gain is roughly proportional to "number of attributes" * "number of examples", which in my case is 218,000 * 60,000 calculations. Do you think this is not tractable on a PC? If so, I would appreciate it if you could propose an alternative solution.

Thanks in advance
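
For scale, the estimate in the question can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 8-byte dense storage is an assumption made for illustration, not a statement about how RapidMiner stores the data, and text-processing output is usually sparse, so the memory figure is an upper bound.

```python
# Back-of-the-envelope sizing for the data set described in the question.
n_attributes = 218_000
n_examples = 60_000

# One visit per (attribute, example) cell for a single pass over the data.
cells = n_attributes * n_examples

# Assumption: every cell is stored densely as an 8-byte double.
dense_bytes = cells * 8
dense_gib = dense_bytes / 2**30

print(f"cells to scan: {cells:,}")
print(f"dense size:    {dense_gib:.1f} GiB")
```

That is about 13 billion cells and, under the dense assumption, on the order of 100 GB, which is well above the 32 GB of RAM mentioned. So memory pressure, rather than the raw operation count alone, may be part of the problem.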
Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,515 RM Data Scientist
    Hi!

    200,000 attributes is really a lot. Even in text mining you usually have fewer.

    You might want to batch it: work on one subset of the attributes at a time, write the weights to a file, and use them afterwards. Working on a sample of the examples might also be a good solution. Don't forget to use Materialize Data after Select Attributes.

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
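
Outside RapidMiner, the batching idea from this answer can be sketched roughly as follows. This is a minimal illustration in plain Python, not the implementation of the Weight by Information Gain operator: the function names, the toy data, and the CSV layout are all hypothetical, and attributes are treated as nominal values.

```python
import csv
import math
import os
import tempfile
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(column, labels):
    """Information gain of one nominal attribute with respect to the label."""
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def weights_in_batches(columns, labels, batch_size, out_path):
    """Score attributes one batch at a time and write the weights to a CSV file."""
    names = list(columns)
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["attribute", "weight"])
        for start in range(0, len(names), batch_size):
            # In a real setting, only this batch of columns would be loaded.
            for name in names[start:start + batch_size]:
                writer.writerow([name, information_gain(columns[name], labels)])


def top_attributes(path, k):
    """Read the stored weights back and return the k highest-weighted names."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    rows.sort(key=lambda row: float(row["weight"]), reverse=True)
    return [row["attribute"] for row in rows[:k]]


# Toy data (hypothetical): three binary term attributes, four examples.
labels = ["spam", "spam", "ham", "ham"]
columns = {"free": [1, 1, 0, 0], "hello": [1, 0, 1, 0], "the": [1, 1, 1, 1]}

weights_path = os.path.join(tempfile.gettempdir(), "ig_weights_demo.csv")
weights_in_batches(columns, labels, batch_size=2, out_path=weights_path)
print(top_attributes(weights_path, 1))
```

Here all columns sit in one dictionary for brevity. The point of batching is that only one subset of attributes needs to be in memory at a time, while the weights file accumulates the results for a later selection step.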
  • mohammadreza Member Posts: 23 Contributor II
    Thanks for the answer Martin,

    Could you please explain what materialized data is?

    Thanks again
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,515 RM Data Scientist
    Hi

    In RapidMiner, an example set is usually held in memory only once. If you select attributes, you do not delete the others; you just deselect them. To get a real copy in memory, you need to use the Materialize Data operator.
    This is usually not needed. But in this special case you want to be sure to have an example set without those attributes, so I would recommend using it.

    Cheers,
    Martin
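
The difference between deselecting attributes and materializing a copy can be illustrated with a toy Python analogy. The class and method names below are hypothetical and only mimic the behaviour described in this answer; this is not RapidMiner's actual data structure.

```python
class ExampleSetView:
    """Toy stand-in for an example set that hides attributes without deleting them."""

    def __init__(self, columns, selected=None):
        self._columns = columns  # shared reference to the underlying data
        self.selected = list(selected if selected is not None else columns)

    def select(self, names):
        # Cheap: only the list of visible names changes; the data is still shared.
        return ExampleSetView(self._columns, names)

    def materialize(self):
        # Copy only the visible columns into a fresh, independent table.
        fresh = {name: list(self._columns[name]) for name in self.selected}
        return ExampleSetView(fresh)

    def stored_attributes(self):
        # Everything physically held by this object, visible or not.
        return sorted(self._columns)


full = ExampleSetView({"free": [1, 1, 0], "hello": [1, 0, 1], "the": [1, 1, 1]})
view = full.select(["free"])
materialized = view.materialize()

print(view.selected, view.stored_attributes())
print(materialized.selected, materialized.stored_attributes())
```

In this analogy the view shows one attribute but still holds all three underneath, while the materialized copy holds only the selected one, so the full table can be released once nothing else refers to it.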
  • mohammadreza Member Posts: 23 Contributor II
    Thank you so much, Martin. That's a very technical and wise point that I was not aware of. If I understood you correctly, this trick should resolve many of the runtime errors caused by a lack of main memory. Am I right? Another question on my mind: if I train a model on non-materialized data obtained from a feature selection operator, will the resulting model contain the "unselected" features too?

    Thanks in advance,
