"Feature selection stability validation"

MarlaBot The Friendly RapidMiner Dog Bot Administrator, Moderator, Employee, Member Posts: 57 Community Manager
edited May 2019 in Help
A RapidMiner user wants to know the answer to this question: Are there any tutorials or best practices for feature selection stability validation?

Answers

  • ozgeozyazar Member Posts: 21 Maven
    Hi! I need to figure out how to apply the feature selection stability validation process. It is really important for the application part of my thesis. Has anyone worked with this process?

    Sincerely, 

    özge 
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hi @ozgeozyazar

    Can you check the link below and see if it is helpful?

    https://rapidminer.com/blog/multi-objective-optimization-feature-selection/
    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,287 RM Data Scientist
    The Feature Selection extension can validate the selection via the Jaccard index. Is that what you are referring to?

    BR,
    Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • ozgeozyazar Member Posts: 21 Maven
    Hi @mschmitz !
    Nearly, it is. But as far as I know, the Feature Selection Stability Validation operator uses the Kuncheva index. I would like to use this operator but cannot find any worked example. Could you please point me to a tutorial or example that explains how to use the process?

  • Maerkli Member Posts: 84 Guru
    Since you are looking for an example, this could perhaps help: the book "RapidMiner Data Mining Use Cases" by Markus Hofmann and Ralf Klinkenberg, Chapter XVI. There is an example of a weighting operator placed inside the Feature Selection Stability Validation. It concerns an application in Neutrino Astronomy.
    Maerkli
     

  • ozgeozyazar Member Posts: 21 Maven
    Hi @Maerkli

    Unfortunately, I have no way to get hold of the book immediately. Actually, some resources indicate that the operator works the same way as X-Validation. The problem is, I cannot figure out which operator/model I should apply inside the stability operator. If there is an example that answers that, could you please help me?

    Regards, 
  • Maerkli Member Posts: 84 Guru
    Sorry for the late answer, I took some days off. Some features used for this Neutrino experiment are no longer supported, so I don't know if it makes sense to send the XML files. But here are some passages of the explanation given in Chapter 16:
    16.3.6 Feature Selection Stability
    When running a feature selection algorithm, not only the selection of attributes itself is
    important, but also the stability of the selection has to be taken into account. The stability indicates how much the choice of a good attribute set is independent of the particular sample of examples. If the subsets of features chosen on the basis of different samples are very different, the choice is not stable. The difference of feature sets can be expressed by statistical indices.
    Fortunately, an operator for the evaluation of the feature selection is also included in
    the Feature Selection extension for RapidMiner. The operator itself is named
    Feature Selection Stability Validation.
    This operator is somewhat similar to a usual cross validation. It performs an attribute
    weighting on a predefined number of subsets and outputs two stability measures. Detailed options as well as the stability measures will be explained later in this section.
    In order to reliably estimate the stability of a feature selection, one should loop over the
    number of attributes selected in a specific algorithm. For the problem at hand, the process again commences with two Read AML operators that are appended to form a single set of examples. This single example set is then connected to the input port of a Loop Parameters operator. The settings of this operator are rather simple, and are depicted in Figure 16.10.
    The Feature Selection Stability Validation (FSSV) is placed inside the Loop
    Parameters operator accompanied by a simple Log operator (see Figure 16.11). The
    two output ports of the FSSV are connected to the input ports of the Log operators. A
    Log operator stores any selected quantity. For the problem at hand, these are the Jaccard index [13] and Kuncheva's index [14]. The Jaccard index S(Fa; Fb) computes the ratio of the intersection and the union of two feature subsets, Fa and Fb:

    S(Fa, Fb) = |Fa ∩ Fb| / |Fa ∪ Fb|

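    Not RapidMiner code, but both indices are simple to state in Python. A minimal sketch (the attribute names are made up for illustration); Kuncheva's index additionally needs the total number of features, because it corrects the overlap for what would be expected by chance:

    ```python
    # Sketch of the two stability indices logged by the Feature Selection
    # Stability Validation operator, computed on plain Python sets of
    # attribute names. Not the extension's actual implementation.

    def jaccard_index(fa, fb):
        """|Fa ∩ Fb| / |Fa ∪ Fb| -- 1.0 means identical feature subsets."""
        fa, fb = set(fa), set(fb)
        if not fa and not fb:
            return 1.0  # two empty selections are trivially identical
        return len(fa & fb) / len(fa | fb)

    def kuncheva_index(fa, fb, n_total):
        """Kuncheva's consistency index for two subsets of equal size k
        drawn from n_total features; corrects for chance overlap."""
        fa, fb = set(fa), set(fb)
        k = len(fa)
        assert len(fb) == k, "Kuncheva's index assumes equal subset sizes"
        if k == 0 or k == n_total:
            return 1.0  # degenerate cases: the overlap is forced
        r = len(fa & fb)                 # size of the intersection
        expected = k * k / n_total       # overlap expected by chance
        return (r - expected) / (k - expected)

    a = {"att1", "att2", "att3"}
    b = {"att2", "att3", "att4"}
    print(jaccard_index(a, b))       # 2/4 = 0.5
    print(kuncheva_index(a, b, 10))  # (2 - 0.9) / (3 - 0.9) ≈ 0.524
    ```

    Note that the Jaccard index stays positive even for random selections, while Kuncheva's index can go negative when the overlap is worse than chance, which is why the chapter logs both.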
    The settings for the Log operator are depicted in Figure 16.12. It consists of two fields,
    the first one being column name. Entries can be added and removed using the Add Entry and Remove Entry buttons, respectively. The entry for column name can be basically anything.
    It is helpful to document the meaning of the logged values by a mnemonic name.
    The second field offers a drop-down menu from which any operator of the process can
    be selected. Whether a certain value that is computed during the process or a process
    parameter shall be logged, is selected from the drop-down menu in the third panel. The
    fourth field offers the selection of output values or process parameters, respectively, for the selected operator.
    An operator for attribute weighting is placed inside the FSSV. For the problem at hand,
    Select by MRMR/Cfs is used. However, any other feature selection algorithm can be
    used as well.
    As can be seen, the process for selecting features in a statistically valid and stable manner is quite complex. However, it is also very effective. Here, for a number of attributes between 30 and 40, both stability measures, the Jaccard index and Kuncheva's index, lie well above 0.9. Both indices reach their maximum of 1.0 if only one attribute is selected. This indicates that there is one single attribute for the separation of signal and background that is selected under all circumstances. Since other attributes also enhance the learning performance, about 30 more attributes are selected. This substantially decreases the original number of dimensions.
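    The chapter describes the FSSV as similar to a cross validation: a weighting scheme runs on several subsamples and the pairwise index of the resulting subsets is averaged. A hypothetical Python analogue of that idea (not the extension's actual code; a top-k-by-variance selector stands in for Select by MRMR/CFS, and all data is synthetic):

    ```python
    # Hypothetical analogue of the FSSV idea: select the top-k features on
    # several random subsamples and average the pairwise Jaccard index of
    # the resulting subsets. The variance-based selector merely stands in
    # for a real weighting operator such as Select by MRMR/CFS.
    import random
    from itertools import combinations
    from statistics import variance

    def top_k_by_variance(rows, k):
        """Indices of the k columns with the highest sample variance."""
        n_cols = len(rows[0])
        scores = [variance(row[j] for row in rows) for j in range(n_cols)]
        return set(sorted(range(n_cols), key=lambda j: -scores[j])[:k])

    def jaccard(fa, fb):
        return len(fa & fb) / len(fa | fb)

    def selection_stability(rows, k, n_subsets=10, ratio=0.7, seed=42):
        """Mean pairwise Jaccard index of subsets selected on subsamples."""
        rng = random.Random(seed)
        subsets = [
            top_k_by_variance(rng.sample(rows, int(ratio * len(rows))), k)
            for _ in range(n_subsets)
        ]
        pairs = list(combinations(subsets, 2))
        return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

    # Toy data: 100 examples, 8 attributes with clearly different spreads,
    # so the selection should come out stable (stability close to 1.0).
    rng = random.Random(0)
    data = [[rng.gauss(0, j + 1) for j in range(8)] for _ in range(100)]
    print(selection_stability(data, k=3))
    ```

    Looping this over several values of k mirrors what the Loop Parameters operator does around the FSSV in the chapter's process.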


    Link for the companion site:

    I hope that you can do something with that.
    Bonne soirée,
    Maerkli
