Bringing arbitrary mathematical functions to RapidMiner for generating data sets

likeasir001 Member Posts: 2 Contributor I
edited November 2018 in Help

I have just started writing a thesis at my university in which I am supposed to analyze the t-test. My first assignment is to get more familiar with the T-Test operator implemented in RapidMiner and to understand how it actually works.

I should probably mention right away that my knowledge of statistics and hypothesis testing is still rather limited, because I am studying mechanical engineering and statistics is not really a big part of our curriculum.

So what I would like to do right now is:

1) generate a data set using a mathematical test function of my choice
2) add noise to that previously created data set
3.1) build estimation models using the already implemented learning functions like linear regression/polynomial regression etc. and use cross-validation for performance evaluation
3.2) also import the mathematical function that was used before to generate the data for performance measurement
4) perform a t-test using the performance results provided by the different cross-validation operators

So basically what I want to do is generate data from a mathematical function, add some noise to that data, and then see how good the estimation performance turns out to be when the same function that generated the data is used for performance evaluation.

Let me explain step by step and point out where I need help:

1) generate a data set using a mathematical test function of my choice

I know that I could also do this using Excel and then import the Excel sheet into RapidMiner, but I would like to know if there is a way to directly import/implement a mathematical function.

For example the Rosenbrock function which is F(x,y) = (a-x)²+b*(y-x²)²

or the Three-hump camel function F(x,y) = 2x²-1.05*x^4+(x^6/6)+x*y+y²

I found the operator "Generate Data by User Specification", but unfortunately this operator creates exactly one example. Looping it didn't really work, because I could not find a way to collect all the examples generated by the looping operator into one big Excel sheet.

The standard "Generate Data" operator lets me choose from a range of preset functions. I thought about tweaking the Java code of that operator to replace one of the presets with the function I want, but I am not that familiar with Java, and I don't know how I would have to change the program so that it lets me set two different value ranges for the two variables x and y. The Generate Data operator only allows one range to be set for all attributes.
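To make step 1 concrete outside of RapidMiner, here is a minimal Python sketch of the two test functions with a separate value range for x and for y. The defaults a = 1 and b = 100, the ranges, the number of examples and the helper name `generate_examples` are assumptions chosen for illustration; none of them come from the post or from RapidMiner.

```python
import random


def rosenbrock(x, y, a=1.0, b=100.0):
    """Rosenbrock function F(x, y) = (a - x)^2 + b * (y - x^2)^2.

    a = 1 and b = 100 are commonly used values, assumed here.
    """
    return (a - x) ** 2 + b * (y - x ** 2) ** 2


def three_hump_camel(x, y):
    """Three-hump camel function as written in the post."""
    return 2 * x ** 2 - 1.05 * x ** 4 + x ** 6 / 6 + x * y + y ** 2


def generate_examples(func, x_range, y_range, n, seed=0):
    """Draw n (x, y) points uniformly, each variable from its own range.

    Returns a list of (x, y, label) rows, where label = func(x, y).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(*x_range)
        y = rng.uniform(*y_range)
        rows.append((x, y, func(x, y)))
    return rows


# Hypothetical ranges: x in [-2, 2], y in [-1, 3].
examples = generate_examples(rosenbrock, (-2.0, 2.0), (-1.0, 3.0), 200)
```

The resulting rows have the same shape as a three-column example set (x, y, label), so they could be written to a file and imported the same way as the Excel sheet mentioned above.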

2) adding noise to the previously generated data

Here I am planning on using the "Add Noise" operator, so that should not be a problem once I have my data set.
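For reference, this is roughly what adding noise to the label amounts to, as a short Python sketch. It assumes zero-mean Gaussian noise with a fixed standard deviation; the function name `add_gaussian_noise` and the value of `sigma` are hypothetical, and the Add Noise operator may be configured differently.

```python
import random


def add_gaussian_noise(values, sigma, seed=0):
    """Return a copy of values with zero-mean Gaussian noise added.

    sigma is the standard deviation of the noise (an assumed setting).
    """
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]


# Noise-free labels of a toy function, then a noisy copy.
clean = [(x / 10.0) ** 2 for x in range(20)]
noisy = add_gaussian_noise(clean, sigma=0.1)
```

Fixing the seed keeps the noisy data set reproducible, which helps when several models are compared on the same data later.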

3.1) performance evaluation using already existing regression operators etc.

This should also cause no trouble, because here I would only use operators that already exist within RapidMiner.

3.2) performance evaluation using the function that was originally used to generate the data set

This is the second part where I need some help. I know that there is a function called "import model" where I can import for example an xml file which contains my previously used function as a model, but how exactly can I generate such a model-xml file in RapidMiner? Is there some sort of tool or operator that directly "converts" a mathematical function into an equivalent model?

4) performing a t-test

Here I might also need help, but that depends on the outcome of the previous steps, so it doesn't make much sense to cover it right now.
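As a rough idea of what step 4 computes, the sketch below evaluates a paired t statistic on per-fold errors of two models. This is the textbook paired t-test, assumed here for illustration; it is not necessarily the calculation the RapidMiner T-Test operator performs, and the fold errors are made-up numbers.

```python
import math
import statistics


def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic for two equally long lists of per-fold errors.

    Returns (t, degrees_of_freedom) with t = mean(d) / (stdev(d) / sqrt(n)),
    where d holds the fold-wise differences.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both models need one error value per fold")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (std_diff / math.sqrt(n)), n - 1


# Made-up per-fold errors of two hypothetical models.
fold_rmse_model_a = [1.0, 2.0, 3.0, 4.0, 5.0]
fold_rmse_model_b = [2.0, 2.0, 4.0, 4.0, 6.0]
t_value, dof = paired_t_statistic(fold_rmse_model_a, fold_rmse_model_b)
```

The t value is then compared against the critical value of the t distribution for the given degrees of freedom and the chosen significance level.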

I would really appreciate some help, and I hope my attempt to explain what I am trying to do was comprehensible enough.


  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn

    The 'generate attributes' operator is the one you want to create arbitrary functions. Start with an example set containing x and y attributes and create new attributes how you want. For an example, you could copy this for some pointers.


    You could also add noise as well as calculate other goodness of fit measures using this operator. In fact, one of the advanced videos I recently completed fits an optimum function to some real data using an evolutionary approach. It calculates a global error for a function compared to the data and minimises this by trying different parameters for the function. The heart of this process is 'generate attributes'.
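The idea of minimising a global error by trying different function parameters can be illustrated with a small Python sketch. This is not the process from the video; it is a simple accept-if-better random mutation loop, assumed here as a stand-in for an evolutionary approach. The parametrised Rosenbrock model, the start values, the step size and all names are hypothetical.

```python
import random


def model(x, y, a, b):
    """Parametrised function to fit: (a - x)^2 + b * (y - x^2)^2."""
    return (a - x) ** 2 + b * (y - x ** 2) ** 2


def global_error(params, data):
    """Sum of squared differences between the model and the data labels."""
    return sum((model(x, y, *params) - target) ** 2 for x, y, target in data)


def evolve(data, start, steps=2000, sigma=0.5, seed=0):
    """Mutate the parameters with Gaussian steps and keep improvements only."""
    rng = random.Random(seed)
    best = list(start)
    best_err = global_error(best, data)
    for _ in range(steps):
        candidate = [p + rng.gauss(0.0, sigma) for p in best]
        err = global_error(candidate, data)
        if err < best_err:  # accept the mutation only if the error drops
            best, best_err = candidate, err
    return best, best_err


# Toy data generated with a = 1, b = 100; the search starts elsewhere.
data_rng = random.Random(42)
data = []
for _ in range(50):
    x, y = data_rng.uniform(-2.0, 2.0), data_rng.uniform(-1.0, 3.0)
    data.append((x, y, model(x, y, 1.0, 100.0)))

start_params = [0.0, 50.0]
start_err = global_error(start_params, data)
fitted_params, final_err = evolve(data, start_params)
```

Because a mutation is only kept when it lowers the error, the final error can never exceed the starting error; how close the parameters get to the true values depends on the step size and the number of steps.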

  • likeasir001 Member Posts: 2 Contributor I
    Okay thank you very much for your reply, I will check out your video and get back if I need any further help.

    Edit: Well I just found out that your videos seem to be part of an online course that unfortunately is not free.

    So I managed to use the "Generate Attributes" operator and now have the example set I need.

    It basically has three columns now: X, Y and "function", whereby function could be any mathematical expression, for example (x+y)^2

    Now I have added noise to that data, and the next step would be to perform a cross-validation for performance evaluation. But instead of learning a new function using existing operators like "Linear Regression" etc., I would like the X-Validation operator to use the mathematical function that I previously created. For example, it would evaluate the aforementioned function (x+y)^2 on the values X and Y from my example set and compare the result with the "function" attribute from my example set.

    The X-Validation Operator asks for a model as the output of the Training section so I am looking for a way to transform a mathematical function of my choice into a "model" so that it can be used within the X-Validation operator.
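Since a fixed function has nothing to learn, evaluating it "under cross-validation" reduces to scoring it on each test fold. The Python sketch below shows that idea with (x+y)^2 as the known function. It does not build a RapidMiner model object; the fold assignment, the noise level and all names are assumptions for illustration.

```python
import math
import random


def known_function(x, y):
    """The function that generated the data, here (x + y)^2."""
    return (x + y) ** 2


def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices and split them into k disjoint test folds."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def rmse_per_fold(rows, func, k=10, seed=0):
    """RMSE of func on every test fold; no training step is involved."""
    result = []
    for fold in k_fold_indices(len(rows), k, seed):
        squared = [(func(rows[i][0], rows[i][1]) - rows[i][2]) ** 2 for i in fold]
        result.append(math.sqrt(sum(squared) / len(squared)))
    return result


# Noisy example set with columns x, y and the noisy label.
rng = random.Random(1)
rows = []
for _ in range(100):
    x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    rows.append((x, y, known_function(x, y) + rng.gauss(0.0, 0.1)))

fold_rmse = rmse_per_fold(rows, known_function, k=10)
```

Using the same seed for the fold split as for the learned models gives one error value per fold for each of them, which is the paired input a t-test needs.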

    P.S.: I am probably not going to use special values like pi or e or anything like that (at least for now), if that is somehow relevant.