"Feature selection output"

fosgene Member Posts: 9 Contributor II
edited May 2019 in Help

I have been following your great work with RM since I started using it six months ago, and I have to say that I am really enthusiastic about this free learning environment.
I am working with linear forward feature selection and I have a problem: I have noticed that the resulting features are always presented in the same order as in the imported dataset. Is there a way to output the selected features in the order in which they were selected by the FeatureSelection operator? It would be very useful for immediately understanding each feature's contribution to the total accuracy.
Thank you for your help!



  • haddock Member Posts: 849 Maven
    If you run "samples\05_Features\10_Forward Selection.xml" and then press the "Attributes" tab you can see exactly this, because you can click on the "weight" column to sort it in ascending or descending order. Or was there something else you were looking for?
  • fosgene Member Posts: 9 Contributor II
    I'm sorry, but I need something different.
    In the same example you cited, in the ProcessLog tab, you can see how root_mean_squared_error decreases depending on the best attribute subsets selected (set the x-Axis to "generation" and the y-Axis to "performance"). What I would like to know is which subset is associated with a specific generation. For example, if you set "maximum_number_of_generations=1", and then increase it by one at each run, you will find that the first subset includes only "a2", the second subset ("maximum_number_of_generations=2") is "a2,a1", and so on...
    This means that "a2" alone classifies with an r_m_s_e of 152.73, while the pair "a2,a1" achieves 101.996.
    In a more complex problem, such as one with features representing medical parameters, knowing which feature subsets are involved, the order in which features are selected, and their respective accuracies is fundamental to obtaining the best prediction.
    So, any idea how to do this? Manually increasing "maximum_number_of_generations" is out of the question for a dataset of 100 features ;)

  • steffen Member Posts: 347 Maven

    I guess you are looking for something like this (the only thing I did was add an additional value to the ProcessLog). Is this correct?

    <operator name="Root" class="Process" expanded="yes">
        <operator name="Input" class="ExampleSource">
            <parameter key="attributes" value="../data/polynomial.aml"/>
        </operator>
        <operator name="FS" class="FeatureSelection" expanded="yes">
            <operator name="XValidation" class="XValidation" expanded="yes">
                <parameter key="sampling_type" value="shuffled sampling"/>
                <operator name="NearestNeighbors" class="NearestNeighbors">
                    <parameter key="k" value="5"/>
                </operator>
                <operator name="ApplierChain" class="OperatorChain" expanded="yes">
                    <operator name="Applier" class="ModelApplier">
                        <list key="application_parameters"/>
                    </operator>
                    <operator name="Performance" class="Performance"/>
                </operator>
            </operator>
        </operator>
        <operator name="ProcessLog" class="ProcessLog">
            <list key="log">
                <parameter key="generation" value="operator.FS.value.generation"/>
                <parameter key="performance" value="operator.FS.value.performance"/>
                <parameter key="feature_names" value="operator.FS.value.feature_names"/>
            </list>
        </operator>
    </operator>
    Oh and before you ask:
    According to the description of "feature_names", this is the list of features in the current iteration. However, I don't know WHY some values are missing and others are repeated. I guess this is a bug, but I do not know enough about the whole value mechanism in RapidMiner.



  • fosgene Member Posts: 9 Contributor II
    It is not a bug; it is exactly what I was looking for. I have just checked: in version 4.3, which I was using, "feature_names" was not implemented yet (I imagine you are using version 4.4).
    Every dot represents the subset selected (the best one) after the respective generation. Because FS keeps the previous best subset and tries adding each of the remaining features (one at a time) to it, whenever there is an improvement it outputs the previous best subset with the new feature added (= the new best subset). In other words, it grows the best feature subset by one feature per generation, testing its performance each time, until a stopping criterion is reached.
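    To make the loop concrete, here is a minimal Python sketch of the greedy forward-selection procedure described above. The feature names, the additive "usefulness" scores, and the `toy_score` function are all illustrative stand-ins, not RapidMiner's actual internals; a real run would score each candidate subset with cross-validated model performance.

    ```python
    def forward_selection(features, score, max_generations=None):
        """Grow the best subset one feature per generation; log each step."""
        selected = []
        remaining = list(features)
        best_score = float("-inf")
        log = []  # (generation, subset, score) -- what the ProcessLog would show
        generation = 0
        while remaining and (max_generations is None or generation < max_generations):
            # Try adding each remaining feature to the current best subset.
            candidates = [(score(selected + [f]), f) for f in remaining]
            cand_score, cand_feature = max(candidates)
            if cand_score <= best_score:
                break  # stopping criterion: no candidate improves the score
            selected.append(cand_feature)
            remaining.remove(cand_feature)
            best_score = cand_score
            generation += 1
            log.append((generation, list(selected), best_score))
        return log

    # Toy scorer: pretend each feature has a fixed, additive usefulness
    # (higher is better; in RapidMiner it would be model performance).
    usefulness = {"a1": 0.3, "a2": 0.5, "a3": 0.1, "a4": -0.2}
    toy_score = lambda subset: sum(usefulness[f] for f in subset)

    for gen, subset, s in forward_selection(usefulness, toy_score):
        print(gen, subset, round(s, 2))
    # prints:
    # 1 ['a2'] 0.5
    # 2 ['a2', 'a1'] 0.8
    # 3 ['a2', 'a1', 'a3'] 0.9
    ```

    Note that calling `forward_selection(usefulness, toy_score, max_generations=1)` reproduces the manual experiment above: the first generation selects only the single best feature, the second generation extends that subset by one more feature, and so on.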

