
"Feature selection output"

fosgene Member Posts: 9 Contributor II
edited May 2019 in Help
Hi,

I have been following your great work with RM since I started using it six months ago, and I have to say that I am really enthusiastic about this free learning environment.
I am working with linear forward feature selection and have run into a problem: the resulting features are always presented in the order in which they appear in the imported dataset. Is there a way to output the selected features in the order in which the FeatureSelection operator selected them? That would make it much easier to see at a glance how much each feature contributes to the overall accuracy.
Thank you for your help!

Fosgene

Answers

  • haddock Member Posts: 849 Maven
    If you run "samples\05_Features\10_Forward Selection.xml" and then press the "Attributes" tab you can see exactly this, because you can click on the "weight" column to sort it in ascending or descending order. Or was there something else you were looking for?
  • fosgene Member Posts: 9 Contributor II
    I'm sorry, but I need something different.
    In the same example you cited, the ProcessLog tab shows how root_mean_squared_error decreases as better attribute subsets are selected (set the x-axis to "generation" and the y-axis to "performance"). What I would like to know is which subset is associated with a specific generation. For example, if you set "maximum_number_of_generations=1" and then increase it by one on each run, you will find that the first subset contains only "a2", the second subset ("maximum_number_of_generations=2") is "a2,a1", and so on.
    This means that "a2" alone achieves a root_mean_squared_error of 152.73, while the pair "a2,a1" achieves 101.996.
    In a more complex problem, for example with features representing medical parameters, knowing which feature subsets are involved, the order in which features were added, and their respective accuracies is fundamental to obtaining the best prediction.
    So, any idea how to do it? Manually increasing "maximum_number_of_generations" is out of the question for a dataset of 100 features ;)

    Thanks!
  • steffen Member Posts: 347 Maven
    Hello

    I guess you are looking for something like this. (The only thing I did was add an additional value to ProcessLog.) Is this correct?

    <operator name="Root" class="Process" expanded="yes">
        <operator name="Input" class="ExampleSource">
            <parameter key="attributes" value="../data/polynomial.aml"/>
        </operator>
        <operator name="FS" class="FeatureSelection" expanded="yes">
            <operator name="XValidation" class="XValidation" expanded="yes">
                <parameter key="sampling_type" value="shuffled sampling"/>
                <operator name="NearestNeighbors" class="NearestNeighbors">
                    <parameter key="k" value="5"/>
                </operator>
                <operator name="ApplierChain" class="OperatorChain" expanded="yes">
                    <operator name="Applier" class="ModelApplier">
                        <list key="application_parameters">
                        </list>
                    </operator>
                    <operator name="Performance" class="Performance">
                    </operator>
                </operator>
            </operator>
            <operator name="ProcessLog" class="ProcessLog">
                <list key="log">
                  <parameter key="generation" value="operator.FS.value.generation"/>
                  <parameter key="performance" value="operator.FS.value.performance"/>
                  <parameter key="feature_names" value="operator.FS.value.feature_names"/>
                </list>
            </operator>
        </operator>
    </operator>
    Oh and before you ask:
    According to the description of "feature_names", this is the list of features of the current iteration. However, I don't know WHY some values are missing and others are repeated. I guess this is a bug, but I do not know enough about the whole value thing in RapidMiner.

    regards,

    Steffen

  • fosgene Member Posts: 9 Contributor II
    It is not a bug. It is exactly what I was looking for. I have just checked: in version 4.3, which I was using, "feature_names" was not implemented yet (I imagine you are using version 4.4).
    Every dot represents the best subset selected after the respective generation. FS keeps the previous best subset and tries adding each of the remaining features, one at a time, to it. If there is an improvement, it outputs the previous best subset with the new feature added (the new best subset). In other words, it builds the best feature subset by growing it one feature at a time and testing its performance each time, until a stopping criterion is reached.

    Regards,

    Fosgene
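
For readers who want to see the procedure described above outside RapidMiner, here is a minimal Python sketch of greedy forward selection that records one row per generation (generation, error, feature names), similar in spirit to the three values logged by ProcessLog in the process above. It is not RapidMiner's implementation: the function `forward_selection`, the `toy_error` scoring function, and the gain values are hypothetical stand-ins chosen for illustration. In practice the error would come from something like a cross-validated learner.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# One log row: (generation, error, names of the best subset so far).
LogRow = Tuple[int, float, List[str]]


def forward_selection(
    features: Sequence[str],
    evaluate: Callable[[List[str]], float],
    max_generations: Optional[int] = None,
) -> List[LogRow]:
    """Greedy forward selection that minimises the error returned by `evaluate`."""
    selected: List[str] = []
    remaining = list(features)
    best_error = float("inf")
    log: List[LogRow] = []
    generation = 0

    while remaining and (max_generations is None or generation < max_generations):
        # Try adding each remaining feature, one at a time, to the current best subset.
        candidates = [(evaluate(selected + [f]), f) for f in remaining]
        error, feature = min(candidates)
        if error >= best_error:
            break  # stopping criterion (assumed): no candidate improves the error
        generation += 1
        selected.append(feature)
        remaining.remove(feature)
        best_error = error
        # Store a copy so later generations do not modify earlier rows.
        log.append((generation, error, list(selected)))

    return log


# Hypothetical error function: each feature lowers the error by a fixed amount.
# These numbers are made up for the example and are not taken from the thread.
GAIN = {"a1": 40.0, "a2": 50.0, "a3": 5.0, "a4": 0.0, "a5": 0.0}


def toy_error(subset: List[str]) -> float:
    return 200.0 - sum(GAIN[f] for f in subset)


for generation, error, names in forward_selection(list(GAIN), toy_error):
    print(generation, error, ",".join(names))
```

Each printed row lists the features in the order in which they were added, which is the information the original question asked for. Passing `max_generations=1`, then `2`, and so on reproduces the manual experiment described earlier in the thread, but a single run without a limit already yields every intermediate subset.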