How to generate data points for each row of data based on a frequency and average value?
Suppose you have rows of data in the following form:
City     Population    Average Income
--------------------------------------------------
CityA    100,000       60,000
CityB    300,000       40,000
CityC    40,000        70,000
I would like to generate rows with data points based on a given (typically normal) distribution.
Thus, using the above example, we would generate 100,000 + 300,000 + 40,000 = 440,000 rows, each containing a hypothetical income drawn from the given (typically normal) income distribution of the city in question.
Best Answer
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi @caryknoop,
If you didn't find an operator in RapidMiner that does what you want,
here is a process that uses the Execute Python operator (this requires a Python environment installed on your computer).
You just have to set the standard deviation for each town in the code:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="8.0.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
<parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Generate_Income.xlsx"/>
<parameter key="imported_cell_range" value="A1:D4"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="City.true.polynominal.attribute"/>
<parameter key="1" value="Population.true.integer.attribute"/>
<parameter key="2" value="Average.true.integer.attribute"/>
<parameter key="3" value="Income.true.attribute_value.attribute"/>
</list>
</operator>
<operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Calculate Income" width="90" x="179" y="34">
<parameter key="script" value="import numpy as np&#10;&#10;# rm_main is a mandatory function;&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;    # set the standard deviation associated with each town:&#10;    # the first element is the std dev of CityA, the second of CityB, etc.&#10;    std_deviation = [1, 2, 3]&#10;&#10;    # repeat each city's row once per inhabitant&#10;    counts = data['Population'].astype(int).values&#10;    result = data.loc[data.index.repeat(counts)].reset_index(drop=True)&#10;&#10;    # draw one income per generated row from that city's normal distribution&#10;    means = result['Average'].astype(float).values&#10;    stds = np.repeat(std_deviation, counts)&#10;    result['Income'] = np.random.normal(means, stds)&#10;&#10;    # connect 1 output port to see the results&#10;    return result"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Calculate Income" to_port="input 1"/>
<connect from_op="Calculate Income" from_port="output 1" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Regards,
Lionel
NB: I used the attribute names from your example.
NB2: An example Excel file is attached.
Answers
To my knowledge, this cannot be done in RM. RM does not have any random number generators. Of course, it is trivial to do in R (or Python) and you can do it inside RM using the R extension.
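For anyone who wants to try the plain Python route mentioned above, here is a minimal sketch using pandas and NumPy. The function name `generate_incomes`, the `std_by_city` mapping and the standard deviation values are all made up for illustration, since the question does not give any spread for the incomes. It assumes a table with the columns City, Population and Average, as in the example.

```python
import numpy as np
import pandas as pd


def generate_incomes(cities, std_by_city, seed=None):
    """Return one row per inhabitant with a normally distributed income.

    `cities` needs the columns City, Population and Average (mean income).
    `std_by_city` maps each city name to an assumed standard deviation.
    """
    rng = np.random.default_rng(seed)
    counts = cities["Population"].to_numpy(dtype=int)
    means = cities["Average"].to_numpy(dtype=float)
    stds = cities["City"].map(std_by_city).to_numpy(dtype=float)

    # Repeat each city's parameters once per inhabitant, then draw all
    # incomes in a single vectorized call.
    return pd.DataFrame({
        "City": np.repeat(cities["City"].to_numpy(), counts),
        "Income": rng.normal(np.repeat(means, counts), np.repeat(stds, counts)),
    })


cities = pd.DataFrame({
    "City": ["CityA", "CityB", "CityC"],
    "Population": [100_000, 300_000, 40_000],
    "Average": [60_000, 40_000, 70_000],
})
# Hypothetical standard deviations; the question does not give any.
std_by_city = {"CityA": 15_000, "CityB": 10_000, "CityC": 20_000}

incomes = generate_incomes(cities, std_by_city, seed=0)
print(len(incomes))  # 440000 rows, one per inhabitant
print(incomes.groupby("City")["Income"].mean().round())
```

The per-city means printed at the end should land close to the averages in the input table, which is a quick sanity check. Note that a normal distribution can produce negative incomes when the standard deviation is large relative to the mean, so a different distribution may suit real income data better.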
Hi @caryknoop,
I think it can be done using the Execute Script operator (using the Java language):
Here is a resource from @mschmitz about generating example sets:
How to Create Example Sets Using Groovy Script
I hope it helps.
Regards,
Lionel
I believe you can use the Generate Data by User Specification operator to do this. It has an editor like the one in the Generate Attributes operator, where you can create a function based on a standard deviation or something similar.
Actually, I think "Generate Data" can be used to do what you want, along with "Generate Gaussian". You simply specify the number of examples you want and then the mean and standard deviation.
I would love to use "Generate Data" for this, but given that "Generate Data" has no input connections, I cannot see how this could work based on input rows.