# Bringing arbitrary mathematical functions to RapidMiner for generating data sets

likeasir001

Hello,

I have just started writing a thesis at my university where I am supposed to make an analysis of the t-test and my first assignment is to get a bit more familiar with the t-Test Operator which is implemented in RapidMiner and how it actually works.

I should probably mention right away that my knowledge in statistics and hypothesis testing is still rather limited at this point because I am studying mechanical engineering and statistics are not really a big part of our curriculum.

So what I would like to do right now is:

1) generate a data set using a mathematical test function of my choice

2) add noise on that previously created data set

3) build estimation models using the already implemented learning functions like linear regression/polynomial regression etc. and use cross-validation for performance evaluation

3.2) also import the mathematical function that was used before to generate the data for performance measurement

4) perform a t-test using the performance results provided by the different cross-validation operators

So basically what I want to do is to generate data from a mathematical function, add some noise onto that data and then see how well the estimation performance turns out to be if I use the same function that was used to generate the data for performance evaluation.

Let me explain step by step and point out where I need help:

1) generate a data set using a mathematical test function of my choice

I know that I could also do this using Excel and then import the Excel sheet into RapidMiner, but I would like to know if there is a way to directly import/implement a mathematical function.

For example the Rosenbrock function, which is F(x,y) = (a-x)² + b*(y-x²)²

or the Three-hump camel function, F(x,y) = 2x² - 1.05*x⁴ + x⁶/6 + x*y + y²

I found the operator "Generate Data by User Specification", but unfortunately this operator creates exactly one example, and looping it didn't really work because I could not find a way to combine all the examples generated by the looping operator into one big Excel sheet.

The standard "generate data" operator lets me choose from a range of preset functions. I thought about tweaking the Java code of that operator to replace one of the presets with the function I want, but unfortunately I am not that familiar with Java, and I don't know how I would have to change the code so that I can set two different value ranges for the two variables x and y. The generate data operator only allows setting one range for all attributes.
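Outside RapidMiner, the data-generation step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any RapidMiner operator works: the helper `generate_examples`, the column name `label`, the sampling ranges, and the conventional Rosenbrock parameters a = 1 and b = 100 are all assumptions chosen for the example.

```python
import random


def rosenbrock(x, y, a=1.0, b=100.0):
    """Rosenbrock function F(x, y) = (a - x)^2 + b * (y - x^2)^2.

    a = 1 and b = 100 are the commonly used values (an assumption here).
    """
    return (a - x) ** 2 + b * (y - x ** 2) ** 2


def three_hump_camel(x, y):
    """Three-hump camel function F(x, y) = 2x^2 - 1.05x^4 + x^6/6 + xy + y^2."""
    return 2 * x ** 2 - 1.05 * x ** 4 + x ** 6 / 6 + x * y + y ** 2


def generate_examples(func, x_range, y_range, n, seed=0):
    """Draw n uniform (x, y) points, each variable from its own range,
    and store the function value as the label."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(*x_range)
        y = rng.uniform(*y_range)
        rows.append({"x": x, "y": y, "label": func(x, y)})
    return rows


# Hypothetical ranges: x in [-2, 2], y in [-1, 3].
examples = generate_examples(rosenbrock, x_range=(-2.0, 2.0), y_range=(-1.0, 3.0), n=200)
```

Writing `examples` to a CSV file would give one table that could then be imported into RapidMiner, with x and y sampled from different ranges.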

2) adding noise on the previously generated data

Here I am planning on using the "add noise" operator so that should not be a problem once I have my data set.
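For reference, what "adding noise" to a generated label means can be sketched as follows. This assumes zero-mean Gaussian noise with a fixed standard deviation; the "add noise" operator has its own parameters, and this sketch does not claim to reproduce them. The function name `add_gaussian_noise` and the value of `sigma` are made up for the example.

```python
import random


def add_gaussian_noise(values, sigma, seed=0):
    """Return a copy of values with independent N(0, sigma^2) noise added."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]


# Hypothetical clean signal: f(x) = x^2 sampled on [0, 1].
clean = [(i / 10.0) ** 2 for i in range(11)]
noisy = add_gaussian_noise(clean, sigma=0.05)
```

Fixing the seed keeps the noisy data set reproducible, which matters when several models are later compared on the same data.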

3.1) performance evaluation using already existing regression operators etc.

This should not cause any trouble either, because here I would only use operators that already exist within RapidMiner.

3.2) performance evaluation using the function that was originally used to generate the data set

This is the second part where I need some help. I know that there is a function called "import model" where I can import for example an xml file which contains my previously used function as a model, but how exactly can I generate such a model-xml file in RapidMiner? Is there some sort of tool or operator that directly "converts" a mathematical function into an equivalent model?

4) performing a t-test

Here I might also need help, but that depends on the outcome of the previous steps, so it doesn't make much sense to cover it right now.
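As a rough picture of what this step would compute, here is a minimal sketch of a paired t statistic over per-fold errors of two models. It assumes both models were evaluated on the same folds; whether the t-Test operator in RapidMiner uses this paired variant or an unpaired one is not established here. The function name and the input values are hypothetical.

```python
import math
import statistics


def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic for two equally long lists of per-fold errors.

    Returns (t, degrees_of_freedom). The per-fold differences must not
    all be identical, otherwise their standard deviation is zero.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Made-up per-fold errors, purely to show the call.
t_value, dof = paired_t_statistic([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0])
```

The resulting t value is compared against a t distribution with `dof` degrees of freedom to decide whether the difference between the two models is significant.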

I would really appreciate some help, and I hope my attempt to explain what I am trying to do was clear enough.

## Answers

The 'Generate Attributes' operator is the one you want for creating arbitrary functions. Start with an example set containing x and y attributes and create new attributes however you want. For some pointers, you could copy this example:

http://rapidminernotes.blogspot.co.uk/2014/08/mandelbrot.html

You could also add noise as well as calculate other goodness of fit measures using this operator. In fact, one of the advanced videos I recently completed fits an optimum function to some real data using an evolutionary approach. It calculates a global error for a function compared to the data and minimises this by trying different parameters for the function. The heart of this process is 'generate attributes'.
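The idea of minimising a global error by trying different parameters can be sketched outside RapidMiner like this. It is not the process from the video: it is a simple mutate-and-keep-if-better loop, and the linear function family, the squared-error measure, the step size, and all names are assumptions made for the illustration.

```python
import random


def model(x, params):
    """Hypothetical function family: a * x + b."""
    a, b = params
    return a * x + b


def global_error(params, data):
    """Sum of squared differences between the function and the data."""
    return sum((model(x, params) - y) ** 2 for x, y in data)


def evolve(data, start, steps=2000, sigma=0.1, seed=0):
    """Mutate the parameters with Gaussian noise and keep a candidate
    only if it lowers the global error."""
    rng = random.Random(seed)
    best = list(start)
    best_err = global_error(best, data)
    for _ in range(steps):
        candidate = [p + rng.gauss(0.0, sigma) for p in best]
        err = global_error(candidate, data)
        if err < best_err:
            best, best_err = candidate, err
    return best, best_err


# Synthetic data from y = 2x + 1, so the target parameters are known.
data = [(x, 2.0 * x + 1.0) for x in range(10)]
params, err = evolve(data, start=[0.0, 0.0])
```

In RapidMiner terms, `global_error` corresponds to the error column that 'Generate Attributes' would calculate, and the loop stands in for whatever optimisation operator tries the parameter values.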

Andrew

likeasir001

Edit: Well, I just found out that your videos seem to be part of an online course that unfortunately is not free.

So I managed to use the "Generate Attributes" operator and now have the example set I need.

It basically has three columns now: X, Y and "function", whereby function could be any mathematical expression, for example (x+y)^2

I have now added noise to that data, and the next step would be to perform a cross-validation for performance evaluation. But instead of learning a new function with existing operators like "Linear Regression", I would like the X-Validation operator to use the mathematical function that I previously created. For example, it would evaluate the aforementioned function (x+y)^2 on the values X and Y from my example set and compare the result with the "function" attribute from my example set.

The X-Validation Operator asks for a model as the output of the Training section so I am looking for a way to transform a mathematical function of my choice into a "model" so that it can be used within the X-Validation operator.
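Conceptually, what is being asked for is a "model" whose training step does nothing and whose prediction is the known formula. A minimal Python sketch of that idea follows; it does not use RapidMiner's API, and the class `FixedFunctionModel`, the helper `k_fold_rmse`, the noise level, and the fold count are all assumptions for the illustration.

```python
import math
import random


class FixedFunctionModel:
    """A 'model' that ignores training data and applies a known formula."""

    def __init__(self, func):
        self.func = func

    def fit(self, rows):
        return self  # nothing to learn

    def predict(self, row):
        return self.func(row["x"], row["y"])


def k_fold_rmse(rows, model, k=10, seed=0):
    """Split rows into k folds and return the RMSE on each held-out fold."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, test_fold in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model.fit(train)
        squared = [(model.predict(r) - r["function"]) ** 2 for r in test_fold]
        scores.append(math.sqrt(sum(squared) / len(squared)))
    return scores


# Hypothetical noisy example set for (x + y)^2.
rng = random.Random(1)
rows = []
for _ in range(100):
    x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    rows.append({"x": x, "y": y, "function": (x + y) ** 2 + rng.gauss(0.0, 0.1)})

scores = k_fold_rmse(rows, FixedFunctionModel(lambda x, y: (x + y) ** 2))
```

Because the fixed function learns nothing, the folds only partition the data; the per-fold scores are still useful as the baseline to compare against learned models in a later t-test.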

P.S.: I am probably not going to use special values like pi or e or anything like that (at least for now), if that is somehow relevant.