
How to generate data points for each row of data based on a frequency and average value?

caryknoop Member Posts: 4 Contributor I
edited December 2018 in Help

Suppose you have rows of data in the following form:

 

City     Population   Average Income
------------------------------------
CityA       100,000           60,000
CityB       300,000           40,000
CityC        40,000           70,000

 

I would like to generate one row per inhabitant, each holding a data point drawn from a given (typically normal) distribution.

Using the example above, that means generating 100,000 + 300,000 + 40,000 = 440,000 rows, each containing a hypothetical income drawn from the (typically normal) income distribution of the city in question, centered on that city's average income.

 

Best Answer

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Solution Accepted

    Hi @caryknoop,

     

    If you didn't find an operator in RapidMiner that does what you want, here is a process that uses the Execute Python operator (it requires a Python installation on your computer).

    You just have to set the standard deviation of each city in the code:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="read_excel" compatibility="8.0.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
    <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Generate_Income.xlsx"/>
    <parameter key="imported_cell_range" value="A1:D4"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations">
    <parameter key="0" value="Name"/>
    </list>
    <list key="data_set_meta_data_information">
    <parameter key="0" value="City.true.polynominal.attribute"/>
    <parameter key="1" value="Population.true.integer.attribute"/>
    <parameter key="2" value="Average.true.integer.attribute"/>
    <parameter key="3" value="Income.true.attribute_value.attribute"/>
    </list>
    </operator>
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Calculate Income" width="90" x="179" y="34">
    <parameter key="script" value="import numpy as np&#10;&#10;# rm_main is a mandatory function,&#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10;    # set the standard deviation of each city, in row order:&#10;    # the first element is for CityA, the second for CityB, etc.&#10;    std_deviation = [1, 2, 3]&#10;&#10;    # repeat the row of each city once per inhabitant&#10;    counts = data['Population'].astype(int).values&#10;    result = data.loc[data.index.repeat(counts)].reset_index(drop=True)&#10;&#10;    # per-row mean and standard deviation&#10;    mean = result['Average'].astype(float).values&#10;    std = np.repeat(std_deviation, counts)&#10;&#10;    # draw one income per row from the normal distribution of its city&#10;    result['Income'] = np.random.normal(mean, std)&#10;&#10;    # connect 1 output port to see the results&#10;    return result"/>
    </operator>
    <connect from_op="Read Excel" from_port="output" to_op="Calculate Income" to_port="input 1"/>
    <connect from_op="Calculate Income" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    Regards,

     

    Lionel

     

    NB: I used the attribute names from your example.

    NB2: An example Excel file is attached.
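For anyone who wants to try the idea outside RapidMiner first, here is a minimal standalone sketch of the same approach: repeat each city once per inhabitant, then draw one income per row. The column names `City`, `Population` and `Average` follow the process above. The `StdDev` column, its values, the function name `expand_with_incomes` and the seed are assumptions made up for illustration, since the question gives no standard deviations.

```python
import numpy as np
import pandas as pd


def expand_with_incomes(cities, std_col="StdDev", seed=None):
    """Return one row per inhabitant with a simulated income.

    `cities` needs the columns City, Population, Average and `std_col`.
    """
    rng = np.random.default_rng(seed)
    counts = cities["Population"].astype(int).to_numpy()
    # Repeat each city's label, mean and std dev once per inhabitant.
    city = np.repeat(cities["City"].to_numpy(), counts)
    mean = np.repeat(cities["Average"].to_numpy(dtype=float), counts)
    std = np.repeat(cities[std_col].to_numpy(dtype=float), counts)
    # One vectorized draw: element k uses mean[k] and std[k].
    income = rng.normal(mean, std)
    return pd.DataFrame({"City": city, "Income": income})


cities = pd.DataFrame({
    "City": ["CityA", "CityB", "CityC"],
    "Population": [100_000, 300_000, 40_000],
    "Average": [60_000, 40_000, 70_000],
    "StdDev": [15_000, 10_000, 20_000],  # hypothetical values
})

rows = expand_with_incomes(cities, seed=42)
print(len(rows))
print(rows.groupby("City")["Income"].mean().round(-2))
```

Keeping the standard deviation as a column next to the average avoids having to keep a hard-coded list in the same order as the input rows. Note that a normal distribution can produce negative incomes when the standard deviation is large relative to the mean.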

     

Answers

  • earmijo Member Posts: 270 Unicorn

    To my knowledge, this cannot be done in RM. RM does not have any random number generators. Of course, it is trivial to do in R (or Python) and you can do it inside RM using the R extension.

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @caryknoop,

     

    I think it can also be done with the Execute Script operator (which runs Groovy/Java code).

    Here is a resource from @mschmitz about generating an example set:

     

    How to Create Example Sets Using Groovy Script

     

    I hope this helps.

     

    Regards,

     

    Lionel

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    I believe you can use the Generate Data by User Specification operator to do this. It has an editor like the one in the Generate Attributes operator, where you can create a function based on a standard deviation or something similar.

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Actually, I think "Generate Data" can be used to do what you want, along with "Generate Gaussian". You simply specify the number of examples you want, then the mean and standard deviation.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • caryknoop Member Posts: 4 Contributor I

    I would love to use 'Generate Data' for this, but since 'Generate Data' has no input connections, I cannot see how it could work based on input rows.
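One way to think about the missing input port is to run a "generate N examples with this mean and standard deviation" step once per input row and then append the results. Whether and how that can be wired up with RapidMiner operators is not confirmed in this thread, so the Python sketch below only illustrates the logic. The `StdDev` column, the function name `generate_per_row`, the seed and the scaled-down populations are assumptions chosen for the demo.

```python
import numpy as np
import pandas as pd


def generate_per_row(cities, seed=0):
    """Generate one block of incomes per input row, then stack the blocks."""
    rng = np.random.default_rng(seed)
    blocks = []
    for row in cities.itertuples(index=False):
        # One "generate N values with a given mean and std dev" call per city.
        incomes = rng.normal(loc=row.Average, scale=row.StdDev,
                             size=int(row.Population))
        blocks.append(pd.DataFrame({"City": row.City, "Income": incomes}))
    # Append the per-city blocks into a single table.
    return pd.concat(blocks, ignore_index=True)


# Populations scaled down by 1,000 to keep the demo small (assumption).
small_cities = pd.DataFrame({
    "City": ["CityA", "CityB", "CityC"],
    "Population": [100, 300, 40],
    "Average": [60_000, 40_000, 70_000],
    "StdDev": [15_000, 10_000, 20_000],  # hypothetical values
})

per_row = generate_per_row(small_cities)
print(per_row["City"].value_counts())
```

The loop makes the dependency on the input rows explicit: each row supplies its own count, mean and standard deviation to the generator.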
