Generating a data set for testing

pettudor Member Posts: 2 Contributor I
edited December 2018 in Help

Hello,

 

I'm a computer engineering student and new to data science. What I want is fairly simple in concept, but I haven't found the right operators for it yet, or maybe I have and don't know how to use them. So here we go:

 

1. I have 22 attributes. The first 2 are just strings; the other 20 should be real numbers that vary from 0.2 to 2.8, depending on the attribute.

2. Is there a way to generate values that depend on what was generated before? An example explains it better: say one example has a value of 1.4 generated for attribute 1, which is 0.4 above the average for that attribute. Then the next one, attribute 2, should generate 0.9 (0.5, which is the average for that attribute, plus the difference from the one before, 0.4, so 0.5 + 0.4), making the generation pseudo-random.

 

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
<operator activated="true" class="generate_data_user_specification" compatibility="8.1.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="313" y="136">
<list key="attribute_values">
<parameter key="ID" value="NOMINAL"/>
<parameter key="Name" value="NOMINAL"/>
<parameter key="P1" value="REAL"/>
<parameter key="P2" value="REAL"/>
<parameter key="P3" value="REAL"/>
<parameter key="P4" value="REAL"/>
<parameter key="P5" value="REAL"/>
<parameter key="P6" value="REAL"/>
<parameter key="P7" value="REAL"/>
<parameter key="P8" value="REAL"/>
<parameter key="P9" value="REAL"/>
<parameter key="P10" value="REAL"/>
<parameter key="P11" value="REAL"/>
<parameter key="P12" value="REAL"/>
<parameter key="P13" value="REAL"/>
<parameter key="P14" value="REAL"/>
<parameter key="P15" value="REAL"/>
<parameter key="P16" value="REAL"/>
<parameter key="P17" value="REAL"/>
<parameter key="P18" value="REAL"/>
<parameter key="P19" value="REAL"/>
<parameter key="P20" value="REAL"/>
</list>
<list key="set_additional_roles">
<parameter key="ID" value="id"/>
<parameter key="Name" value="label"/>
</list>
</operator>
</process>

I am definitely doing something wrong :(

Best Answer

  • kypexin Posts: 280 Unicorn
    Solution Accepted

    Hi @pettudor

     

     


    2. Is there a way to generate with dependency on what was generated before? I need an example to explain better:

    let's say we have one example with attribute 1 that generated 1.4, that's 0.4 above average for that specific attribute


    I am a bit confused by the description.

     

    The answer to the first part is yes: there is an operator, 'Generate Attributes', that lets you construct new attributes based on existing ones, and that's pretty easy. You can even do some aggregations, so that new attributes are based not only on existing values but also on aggregated values such as mean, median, sum, etc.
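
    As a rough illustration outside RapidMiner, the dependency described in the question could be sketched in plain Python like this. The function name `generate_dependent`, the uniform draw for the first attribute, and the seed are assumptions made for the example; they are not RapidMiner operators or parameters.

```python
import random
from statistics import mean


def generate_dependent(first_values, second_mean):
    """Build a second attribute that carries over each example's
    deviation from the mean of the first attribute."""
    first_mean = mean(first_values)
    return [second_mean + (value - first_mean) for value in first_values]


# Assumption: the first attribute is drawn uniformly from [0.2, 2.8].
rng = random.Random(1992)
p1 = [rng.uniform(0.2, 2.8) for _ in range(100)]

# Assumption: the second attribute is centred on 0.5, as in the question.
p2 = generate_dependent(p1, second_mean=0.5)
```

    With a first attribute averaging 1.0, a value of 1.4 is 0.4 above average, so the second attribute comes out at about 0.5 + 0.4 = 0.9, which is the case described in the question.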

     

    The second part, though, is confusing. You say the first attribute would have a value of 1.4 for a certain example, but what exactly is this value based on? You need either to generate the first attribute pseudo-randomly, or to base its values on already existing data.

     

    Could you please clarify?

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,434 Community Manager

    Hi @pettudor, welcome to the community. First I want to say CONGRATULATIONS - you're the first "newbie" I have seen in a long while who actually read the directions and posted their XML process with their first post. :) :) :)

     

    So, back to your question. I'm not sure whether you have 22 attributes from your own data set or you want to create 22 attributes from random data. If it's the former, just use the "Add Data" wizard in the Repository panel and go through the steps:

     

    Screen Shot 2018-04-04 at 8.50.26 AM.png

     

    If you want to create random data, use the "Generate Data" operator rather than "Generate Data by User Specification":

     

    Screen Shot 2018-04-04 at 8.52.14 AM.png

     

    The default for this is to create six attributes: five "regular" attributes of real numbers, and one "label" attribute with real numbers:

    Screen Shot 2018-04-04 at 8.54.50 AM.png
    Screen Shot 2018-04-04 at 8.53.38 AM.png
    Screen Shot 2018-04-04 at 8.53.45 AM.png

     

    You can then modify these with other operators to make them strings, integers, etc.:

     

    Screen Shot 2018-04-04 at 8.59.19 AM.png

     

    Let me know if that makes sense.

     

    Scott

     

  • pettudor Member Posts: 2 Contributor I

    So, after generating one attribute with 100 random examples, I just used the Generate Attributes operator, gave it a dependency formula, and Bob's your uncle: I have what I want.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
    <operator activated="true" class="generate_data" compatibility="8.1.001" expanded="true" height="68" name="Generate Data" width="90" x="313" y="238">
    <parameter key="target_function" value="random"/>
    <parameter key="number_examples" value="100"/>
    <parameter key="number_of_attributes" value="1"/>
    <parameter key="attributes_lower_bound" value="1.8"/>
    <parameter key="attributes_upper_bound" value="2.3"/>
    <parameter key="gaussian_standard_deviation" value="10.0"/>
    <parameter key="largest_radius" value="10.0"/>
    <parameter key="use_local_random_seed" value="false"/>
    <parameter key="local_random_seed" value="1992"/>
    <parameter key="datamanagement" value="double_array"/>
    <parameter key="data_management" value="auto"/>
    </operator>
    </process>
    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
    <operator activated="true" class="generate_attributes" compatibility="8.1.001" expanded="true" height="82" name="Generate Attributes" width="90" x="514" y="238">
    <list key="function_descriptions">
    <parameter key="P2" value="sum((att1-2),1.9)"/>
    </list>
    <parameter key="keep_all" value="true"/>
    </operator>
    </process>
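
    For readers without RapidMiner at hand, the two operators above can be approximated with a short Python sketch. It assumes the attribute values are drawn uniformly between the lower and upper bounds, and the helper names `generate_att1` and `derive_p2` are made up for the example. The constants 1.8, 2.3, 2 and 1.9 are taken from the process XML.

```python
import random


def generate_att1(n, lower=1.8, upper=2.3, seed=None):
    """Draw n values for att1, assumed uniform between the two bounds."""
    rng = random.Random(seed)
    return [rng.uniform(lower, upper) for _ in range(n)]


def derive_p2(att1_values, reference=2.0, p2_base=1.9):
    """Mirror the formula sum((att1-2),1.9): shift P2's base value
    by how far att1 lies from the reference value."""
    return [(value - reference) + p2_base for value in att1_values]


att1 = generate_att1(100, seed=1992)  # the seed is an arbitrary choice here
p2 = derive_p2(att1)
```

    Note that the formula subtracts 2, while the midpoint of the bounds 1.8 and 2.3 is 2.05, so under the uniform assumption P2 would average slightly above 1.9.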

    Added the code. Such an easy task in retrospect.

    I must thank you all for your patience in reading this mess of a post. Have a great day.
