"Bug in Feature Generation: side effects"

steffen Member Posts: 347 Maven
edited May 2019 in Help
Hello RapidMiner Team

I am using the latest CVS version and tried to implement a z-transformation, that is, calculating the mean and standard deviation from the input ExampleSet and then applying a series of RM operators by calling them from within my code. While trying some preprocessing steps before my operator, I stumbled over strange behaviour of the FeatureGeneration operator, which I also use. I then reproduced the code as a process using only built-in RM operators, and the strange things happened again. Two notes regarding the following setups:
  • The "useless" renaming is there because (originally) I wanted to use an attribute name containing a "(" within FeatureGeneration (confidence...)
  • In the following setups I used the data set described by golf.aml, delivered with the RM distribution.
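
For reference, a minimal Python sketch of the z-transformation described above. It assumes that the Temperature column of golf.aml holds the 14 values of the classic golf (weather) data set and that the population standard deviation is meant; under those assumptions it reproduces the constants 73.571 and 6.3326 that appear in the setups. All variable names are illustrative only.

```python
import math

# Assumption: Temperature values of the classic golf data set (14 examples).
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]

n = len(temperature)
mean = sum(temperature) / n
# Population standard deviation (divide by n, not n - 1).
std = math.sqrt(sum((v - mean) ** 2 for v in temperature) / n)

# z-transformation: subtract the mean, then divide by the standard deviation.
z_scores = [(v - mean) / std for v in temperature]

print(round(mean, 3), round(std, 4))
```
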
1. Here is my basic setup...which works!

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="golf.aml"/>
    </operator>
    <operator name="Temperature->ijon" class="ChangeAttributeName">
        <parameter key="new_name" value="ijon"/>
        <parameter key="old_name" value="Temperature"/>
    </operator>
    <operator name="apply_ztrans" class="FeatureGeneration">
        <list key="functions">
          <parameter key="tichy" value="/(-(ijon,const[73.571]()),const[6.3326]())"/>
        </list>
        <parameter key="keep_all" value="true"/>
    </operator>
    <operator name="skip_ijon" class="FeatureNameFilter">
        <parameter key="filter_special_features" value="true"/>
        <parameter key="skip_features_with_name" value="ijon"/>
    </operator>
    <operator name="tichy->Temperature" class="ChangeAttributeName">
        <parameter key="new_name" value="Temperature"/>
        <parameter key="old_name" value="tichy"/>
    </operator>
</operator>
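
The function string in the setup above uses a prefix notation, with constants written as const[value](). A hedged sketch of how such a string could be assembled from precomputed statistics; the helper name ztrans_expression is hypothetical and not part of RapidMiner:

```python
def ztrans_expression(attribute, mean, std):
    """Build (attribute - mean) / std in the prefix syntax used by the process above."""
    return f"/(-({attribute},const[{mean}]()),const[{std}]())"


# Same attribute name and constants as in the basic setup.
expression = ztrans_expression("ijon", 73.571, 6.3326)
print(expression)
```
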

2. When I accidentally used the wrong attribute name (Temperature instead of ijon) within FeatureGeneration, no error message appeared, but the result was wrong.

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="golf.aml"/>
    </operator>
    <operator name="Temperature->ijon" class="ChangeAttributeName">
        <parameter key="new_name" value="ijon"/>
        <parameter key="old_name" value="Temperature"/>
    </operator>
    <operator name="apply_ztrans" class="FeatureGeneration">
        <list key="functions">
          <parameter key="tichy" value="/(-(Temperature,const[73.571]()),const[6.3326]())"/>
        </list>
        <parameter key="keep_all" value="true"/>
    </operator>
    <operator name="skip_ijon" class="FeatureNameFilter">
        <parameter key="filter_special_features" value="true"/>
        <parameter key="skip_features_with_name" value="ijon"/>
    </operator>
    <operator name="tichy->Temperature" class="ChangeAttributeName">
        <parameter key="new_name" value="Temperature"/>
        <parameter key="old_name" value="tichy"/>
    </operator>
</operator>

3. Using the correct names, but applying the Sorting operator beforehand, causes the same result as in step 2.

<operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="golf.aml"/>
    </operator>
    <operator name="sort_temperature" class="Sorting">
        <parameter key="attribute_name" value="Temperature"/>
    </operator>
    <operator name="Temperature->ijon" class="ChangeAttributeName">
        <parameter key="new_name" value="ijon"/>
        <parameter key="old_name" value="Temperature"/>
    </operator>
    <operator name="apply_ztrans" class="FeatureGeneration">
        <list key="functions">
          <parameter key="tichy" value="/(-(ijon,const[73.571]()),const[6.3326]())"/>
        </list>
        <parameter key="keep_all" value="true"/>
    </operator>
    <operator name="skip_ijon" class="FeatureNameFilter">
        <parameter key="filter_special_features" value="true"/>
        <parameter key="skip_features_with_name" value="ijon"/>
    </operator>
    <operator name="tichy->Temperature" class="ChangeAttributeName">
        <parameter key="new_name" value="Temperature"/>
        <parameter key="old_name" value="tichy"/>
    </operator>
</operator>

At this point I came to the conclusion, that the problem must lurk deeply in the RapidMiner entrails ...

I hope this error description is of some help.

greetings

Steffen

Answers

  • TobiasMalbrecht Moderator, Employee, Member Posts: 294 RM Product Management
    Hi Steffen,

    Wow, what a wonderfully detailed bug report. I must admit I only skimmed it since I do not have much time today, but I will take a closer look on Monday, if nobody else has done so by then ...  ;)

    Regards,
    Tobias
  • steffen Member Posts: 347 Maven
    Hello RapidMiner-Team

    I just checked out the 4.2 release and it seems that this bug is still there. I will open a ticket now, because I guess that makes it easier to keep track of such things given the huge amount of work you have to do. I thought about it before, but I didn't want to be annoying  ;)

    Besides this ... keep up the good work!

    greetings

    Steffen
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hey,

    thanks for the reminder. We indeed missed this, sorry.

    Cheers,
    Ingo
  • Legacy User Member Posts: 0 Newbie
    Hi Steffen, Hi Tobias, Hi All,

    I have found a bug in FeatureGeneration too (maybe the same one?), but strangely the same kind of bug also occurs in AttributeConstructionLoader.

    The basic idea of my experiment is to merge two lexical matrices in text mining: I have 10 documents in ".doc" format and 13 in "pdf". I use a "TextInput" subtree for each, but I then have to merge two example sets with different rows and different attributes.

    I have tried "ExampleSetMerge/Join/Cartesian"; none of them is satisfactory. Now I have tried AttributeConstructionLoader and FeatureGeneration, both using the "keep all=true" and "filepath= true" options, but I get this message:
    "The function name 'const' must be used with empty arguments".

    Here is my experiment :

    <operator name="Deux_repertoires" class="Process" expanded="yes">
        <description text="analyse du premier repertoire pour mise en forme#ylt#br#ygt#analyse du deuxieme repertoire#ylt#br#ygt#analyse_croisee, voir avec #yquot#attribute construction loader#yquot#"/>
        <parameter key="encoding" value="win-1250"/>
        <operator name="Country" class="OperatorChain" expanded="yes">
            <operator name="ExampleSource" class="ExampleSource">
                <parameter key="attributes" value="D:\users\default\project\base_doc\fichiers_croises\dummy\file_doc.aml"/>
            </operator>
            <operator name="IdTagging" class="IdTagging">
            </operator>
            <operator name="FeatureGeneration" class="FeatureGeneration">
                <parameter key="filename" value="D:\users\default\project\base_doc\fichiers_croises\dummy\attributs_html.att"/>
                <list key="functions">
                </list>
                <parameter key="keep_all" value="true"/>
            </operator>
            <operator name="ExampleSource (2)" class="ExampleSource">
                <parameter key="attributes" value="D:\users\default\project\base_doc\fichiers_croises\dummy\file_html.aml"/>
            </operator>
            <operator name="FeatureGeneration (2)" class="FeatureGeneration">
                <parameter key="filename" value="D:\users\default\project\base_doc\fichiers_croises\dummy\file_cross.att"/>
                <list key="functions">
                </list>
                <parameter key="keep_all" value="true"/>
            </operator>
        </operator>
        <operator name="Elements" class="OperatorChain" activated="no" expanded="yes">
        </operator>
        <operator name="Croisement" class="OperatorChain" activated="no" expanded="yes">
        </operator>
    </operator>


    Is this the behaviour Steffen has been talking about?
    Cheers,
      Jean-Charles.
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi Steffen, Hi Jean-Charles,

    so, back again to feature generation. First, some comments on Steffen's Report:

    "At this point I came to the conclusion, that the problem must lurk deeply in the RapidMiner entrails ..."
    Yes, it does. Very deeply. We have two different structures for the data we handle (actually only one data structure plus a view structure): first, the ExampleTable, which actually holds the data, and second, the ExampleSets, which define views on the underlying tables. All operators work on the ExampleSets with one exception: the feature generation operators work directly on the tables, both for performance reasons and to easily share newly generated attributes among views without the need for re-creation. This is useful, for example, for the evolutionary feature construction approaches.
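
The split between table and view can be illustrated with a toy Python model. This is only a sketch under strong assumptions: the classes ExampleTable and ExampleSetView and the function generate_on_table are hypothetical stand-ins, not RapidMiner's real classes, and the sketch does not claim to reproduce the exact cause of the wrong results. It merely shows how a name lookup that bypasses the view can disagree with the view after a rename.

```python
class ExampleTable:
    """Toy stand-in for the table that actually holds the data."""

    def __init__(self, columns):
        self.columns = dict(columns)  # column name -> list of values


class ExampleSetView:
    """Toy stand-in for an ExampleSet: attribute names mapped onto table columns."""

    def __init__(self, table):
        self.table = table
        self.attributes = {name: name for name in table.columns}

    def rename(self, old_name, new_name):
        # Renaming touches only the view; the table column keeps its name.
        self.attributes[new_name] = self.attributes.pop(old_name)

    def values(self, name):
        return self.table.columns[self.attributes[name]]


def generate_on_table(view, new_name, source_column, func):
    """Generate a column by resolving source_column against the table, bypassing the view."""
    table = view.table
    table.columns[new_name] = [func(v) for v in table.columns[source_column]]
    view.attributes[new_name] = new_name


table = ExampleTable({"Temperature": [85.0, 80.0, 83.0]})
view = ExampleSetView(table)
view.rename("Temperature", "ijon")

# The view no longer knows "Temperature" ...
visible_before = sorted(view.attributes)

# ... but table-level generation still accepts the old name without complaint.
generate_on_table(view, "tichy", "Temperature", lambda v: (v - 73.571) / 6.3326)
visible_after = sorted(view.attributes)

print(visible_before, visible_after)
```

In this toy model the old name "Temperature" is silently accepted at table level although the view only exposes "ijon", which mirrors the missing error message in setup 2.
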

    However, changing the underlying table columns without "notifying" the view columns (attributes) might lead to some strange behaviour. For that reason, one simply has to copy the attribute (I kept the renaming), as in the following process. Then it works with both attribute names in the construction:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="C:\Dokumente und Einstellungen\Mierswa\Eigene Dateien\rm_workspace\sample\data\golf.aml"/>
        </operator>
        <operator name="AttributeCopy" class="AttributeCopy">
            <parameter key="attribute_name" value="Temperature"/>
            <parameter key="new_name" value="ijon"/>
        </operator>
        <operator name="Temperature->ijon" class="ChangeAttributeName" activated="no">
            <parameter key="new_name" value="ijon"/>
            <parameter key="old_name" value="Temperature"/>
        </operator>
        <operator name="apply_ztrans" class="FeatureGeneration">
            <list key="functions">
              <parameter key="tichy" value="/(-(Temperature,const[73.571]()),const[6.3326]())"/>
            </list>
            <parameter key="keep_all" value="true"/>
        </operator>
        <operator name="skip_ijon" class="FeatureNameFilter">
            <parameter key="filter_special_features" value="true"/>
            <parameter key="skip_features_with_name" value="ijon"/>
        </operator>
        <operator name="skip_Temperature" class="FeatureNameFilter">
            <parameter key="filter_special_features" value="true"/>
            <parameter key="skip_features_with_name" value="Temperature"/>
        </operator>
        <operator name="tichy->Temperature" class="ChangeAttributeName">
            <parameter key="new_name" value="Temperature"/>
            <parameter key="old_name" value="tichy"/>
        </operator>
    </operator>

    About the attribute construction loading: please use the operator "AttributeConstructionLoader" instead. The file parameter of the "FeatureGeneration" operator is more or less deprecated (unfortunately, we cannot mark parameters as such) and is only kept for backwards compatibility.


    However, just a small comment on the whole feature generation stuff: we will revise the feature generation algorithms until the next release anyway in order to ease the generation process and allow more generation types.

    Cheers,
    Ingo
  • steffensteffen Member Posts: 347 Maven
    Hello Ingo

    Thank you for the workaround!

    "However, just a small comment on the whole feature generation stuff: we will revise the feature generation algorithms until the next release anyway in order to ease the generation process and allow more generation types."
    This would be nice. Did you consider using a language like JavaScript for user-defined functions? Something like the "Modified Java Script Value" step in Pentaho Kettle? Besides "click-it-together" functions, it would be nice to have something powerful for users with a stronger programming background.

    greetings

    Steffen
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi again,

    We have actually also thought about a scripting engine for user-defined functions, which Java 6 should support anyway (at least for JavaScript).


    For the more "traditional" mathematical functions, we are currently evaluating JEP:

    http://www.singularsys.com/jep/index.html

    which would fit really nicely into RapidMiner.


    Any thoughts about this?

    Cheers,
    Ingo