# How to generate data points for each row of data based on a frequency and average value?

Member Posts: 4 Contributor I
edited November 2018 in Help

Suppose you have rows of data in the following form:

| City  | Population | Average Income |
|-------|-----------:|---------------:|
| CityA |    100,000 |         60,000 |
| CityB |    300,000 |         40,000 |
| CityC |     40,000 |         70,000 |

I would like to generate one row per data point, with values drawn from a given (typically normal) distribution.

Using the example above, we would generate 100,000 + 300,000 + 40,000 = 440,000 rows, each containing a hypothetical income drawn from the given (typically normal) income distribution of the city in question.

• Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Solution Accepted

Hi @caryknoop,

If you can't find an operator in RapidMiner that does what you want,

here is a process that uses the Execute Python operator (it requires a Python environment installed on your computer).

You just have to set the standard deviation for each city in the code:

`<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">  <context>    <input/>    <output/>    <macros/>  </context>  <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">    <process expanded="true">      <operator activated="true" class="read_excel" compatibility="8.0.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">        <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Generate_Income.xlsx"/>        <parameter key="imported_cell_range" value="A1:D4"/>        <parameter key="first_row_as_names" value="false"/>        <list key="annotations">          <parameter key="0" value="Name"/>        </list>        <list key="data_set_meta_data_information">          <parameter key="0" value="City.true.polynominal.attribute"/>          <parameter key="1" value="Population.true.integer.attribute"/>          <parameter key="2" value="Average.true.integer.attribute"/>          <parameter key="3" value="Income.true.attribute_value.attribute"/>        </list>      </operator>      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Calculate Income" width="90" x="179" y="34">        <parameter key="script" value="from numpy.random import normal&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10;  # set the standard deviation associated to the towns : &#10;  #the first element is std dev from CityA, the second from City B etc.&#10;  std_deviation =  [1,2,3]&#10;&#10;&#10;  data['pop_cum'] = data['Population'] &#10;  &#10;  for i in range(1,len(data)) :&#10;  &#10;    data.loc[i,'pop_cum'] = data.loc[i-1,'pop_cum'] +  data.loc[i,'Population']&#10;    &#10;    &#10;  for j in range(0,int(data.loc[0,'Population'])):&#10;   &#10;    data.loc[j,'Income'] = 
normal(data.loc[0,'Average'],std_deviation[0])&#10;&#10;  try:&#10;    &#10;    for i in range(1,len(data)) : &#10;    &#10;      for j in range(int(data.loc[i-1,'pop_cum']),int(data.loc[i,'pop_cum'])):&#10;    &#10;        data.loc[j,'Income'] =  normal(data.loc[i,'Average'],std_deviation[i])&#10;&#10;  except ValueError:&#10;    &#10;    del data['pop_cum']&#10;    exit&#10;&#10;    # connect 1 output port to see the results&#10;  return data"/>      </operator>      <connect from_op="Read Excel" from_port="output" to_op="Calculate Income" to_port="input 1"/>      <connect from_op="Calculate Income" from_port="output 1" to_port="result 1"/>      <portSpacing port="source_input 1" spacing="0"/>      <portSpacing port="sink_result 1" spacing="0"/>      <portSpacing port="sink_result 2" spacing="0"/>    </process>  </operator></process>`

Regards,

Lionel

NB: I used the attribute names from your example.

NB2: An example Excel file is attached.
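
The script embedded in the process above fills the table one cell at a time, which can be slow for 440,000 rows. As a rough alternative, here is a minimal vectorized sketch of the same idea. It assumes the input has the columns `City`, `Population` and `Average` used in the process; the helper name `expand_to_individuals`, the `std_by_city` mapping and the standard deviation values are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd


def expand_to_individuals(data, std_by_city, seed=None):
    """Return one row per inhabitant with a simulated income.

    data        : DataFrame with columns City, Population, Average
    std_by_city : assumed mapping City -> standard deviation of income
    """
    rng = np.random.default_rng(seed)
    counts = data["Population"].astype(int).to_numpy()

    # Repeat each city's label, mean and std dev once per inhabitant.
    cities = np.repeat(data["City"].to_numpy(), counts)
    means = np.repeat(data["Average"].to_numpy(dtype=float), counts)
    stds = np.repeat(
        data["City"].map(std_by_city).to_numpy(dtype=float), counts
    )

    # Draw all incomes in one call; mean and std dev apply element-wise.
    incomes = rng.normal(loc=means, scale=stds)
    return pd.DataFrame({"City": cities, "Income": incomes})


def rm_main(data):
    # Entry point expected by the Execute Python operator.
    # These standard deviations are placeholders, not real figures.
    std_by_city = {"CityA": 15000, "CityB": 10000, "CityC": 20000}
    return expand_to_individuals(data, std_by_city)
```

With the three example cities, `rm_main` would return 440,000 rows with a `City` and an `Income` column. A city missing from `std_by_city` gets a missing standard deviation, so it is worth checking that every city has an entry before running it.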

• Member Posts: 270 Unicorn

To my knowledge, this cannot be done in RM. RM does not have any random number generators. Of course, it is trivial to do in R (or Python) and you can do it inside RM using the R extension.

• Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

Hi @caryknoop,

I think it can be done using the Execute Script operator (scripted in Groovy, a Java-based language):

Here is a resource by @mschmitz about generating example sets:

How to Create Example Sets Using Groovy Script

Regards,

Lionel

• RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

I believe you can use the Generate Data by User Specification operator to do this. It has an editor like the one in the Generate Attributes operator, where you can create a function based on a standard deviation or something similar.

• Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

Actually, I think "Generate Data" can be used to do what you want, along with "Generate Gaussian". You simply specify the number of examples you want, and then the mean and standard deviation.

Brian T.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
• Member Posts: 4 Contributor I

I would love to use 'Generate Data' for this, but since 'Generate Data' has no input ports, I cannot see how it could work from input rows.