Function description

Thiru Member Posts: 100 Guru
Dear all,

I have a data set in which the age of the subject is given as an attribute, and the values are given
in days, weeks, months, or years.

e.g. 3 days, 8 weeks, 10 months

I want to convert that attribute into a number of days so that I can group the rows by number of days. I tried
to use the functions 'finds' and 'parse', but without success. Can someone help me with this? Thank you.

Regards,
Thiru

Answers

  • kayman Member Posts: 662 Unicorn
    It's in the Generate Attributes operator.
    The idea is that you 'regenerate' your existing attribute: you reuse the existing attribute name but generate new content for it.
    The Generate Attributes operator contains all the search, replace, splice, trim, and other functions you will need.
  • Thiru Member Posts: 100 Guru
    @kayman
    Thanks for your reply. I was only referring to the function descriptions in the Generate Attributes operator.
    I couldn't get the syntax of the function right. Could I have some help with it?
  • kayman Member Posts: 662 Unicorn
    Yeah, it takes some getting used to.
    Regular expressions are your friend here, but they can be frightening if you're not used to them.

    Try something like the process below:
    (start a new process, copy the XML, open View -> Show Panel -> XML, paste, click the green tick in the top corner to validate and store, then go back to the process window)

    What it does is create a new field (you can also overwrite your existing field), use a regular expression to remove everything that's not a digit (using \D), and then parse the result into a number.

    For weeks you can safely multiply by 7; for months it's not so straightforward, so I just took an average of 30 days.
    Finally, I used the Generate Aggregation operator to sum them all up.

    Note that in reality you can combine all of the above into a single expression in Generate Attributes, but it can become a bit unreadable then.
    <?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="UTF-8"/>
        <process expanded="true">
          <operator activated="true" class="utility:create_exampleset" compatibility="9.6.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
            <parameter key="generator_type" value="attribute functions"/>
            <parameter key="number_of_examples" value="1"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions">
              <parameter key="MyDays" value="&quot;3 days&quot;"/>
              <parameter key="MyWeeks" value="&quot;8 weeks&quot;"/>
              <parameter key="MyMonths" value="&quot;10 months&quot;"/>
            </list>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="column_separator" value=","/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.6.000" expanded="true" height="82" name="Generate Attributes" width="90" x="246" y="34">
            <list key="function_descriptions">
              <parameter key="MyParsedDays" value="parse(replaceAll([MyDays],&quot;\\D&quot;,&quot;&quot;))"/>
              <parameter key="MyParsedWeeks" value="parse(replaceAll([MyWeeks],&quot;\\D&quot;,&quot;&quot;))*7"/>
              <parameter key="MyParsedMonths" value="parse(replaceAll([MyMonths],&quot;\\D&quot;,&quot;&quot;))*30"/>
            </list>
            <parameter key="keep_all" value="true"/>
          </operator>
          <operator activated="true" class="generate_aggregation" compatibility="9.6.000" expanded="true" height="82" name="Generate Aggregation" width="90" x="380" y="34">
            <parameter key="attribute_name" value="TotalDays"/>
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value="MyParsedDays|MyParsedMonths|MyParsedWeeks"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="aggregation_function" value="sum"/>
            <parameter key="concatenation_separator" value="|"/>
            <parameter key="keep_all" value="true"/>
            <parameter key="ignore_missings" value="true"/>
            <parameter key="ignore_missing_attributes" value="false"/>
          </operator>
          <connect from_op="Create ExampleSet" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Generate Aggregation" to_port="example set input"/>
          <connect from_op="Generate Aggregation" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
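
    A rough sketch of the single-expression variant mentioned in the note above. It is not taken from the original process: it reuses the attribute names MyDays, MyWeeks, MyMonths and TotalDays from it, replaces the separate Generate Aggregation step, and has not been run as written.

    ```xml
    <!-- Fragment only: a Generate Attributes operator that would take the place of the
         Generate Attributes + Generate Aggregation pair in the process above. -->
    <!-- Each term strips the non-digits, parses the rest as a number, and scales it to days
         (weeks * 7, months * 30 as an average); the three terms are then added up. -->
    <operator activated="true" class="generate_attributes" compatibility="9.6.000" expanded="true" name="Generate Attributes">
      <list key="function_descriptions">
        <parameter key="TotalDays" value="parse(replaceAll([MyDays],&quot;\\D&quot;,&quot;&quot;)) + parse(replaceAll([MyWeeks],&quot;\\D&quot;,&quot;&quot;))*7 + parse(replaceAll([MyMonths],&quot;\\D&quot;,&quot;&quot;))*30"/>
      </list>
      <parameter key="keep_all" value="true"/>
    </operator>
    ```

    For the sample values (3 days, 8 weeks, 10 months) this should come to 3 + 56 + 300 = 359 days.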
    
    

     
  • Thiru Member Posts: 100 Guru
    @kayman, thanks. This is an improvement.
    But it only works if we treat those values as three different attributes.
    Here, all of these row values belong to a single attribute, 'age'. I think this will need a
    different function?
  • kayman Member Posts: 662 Unicorn
    Ah, that wasn't clear to me. In the end it just means slightly more complex find-and-replace logic.

    Something like this:

    input: 3 days ,   8 weeks ,  10 months
    output: 359 (3 + 8*7 + 10*30)

    <?xml version="1.0" encoding="UTF-8"?><process version="9.6.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.6.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="utility:create_exampleset" compatibility="9.6.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="179" y="34">
            <parameter key="generator_type" value="attribute functions"/>
            <parameter key="number_of_examples" value="1"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions">
              <parameter key="myField" value="&quot;3 days ,   8 weeks ,  10 months&quot;"/>
            </list>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="column_separator" value=","/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.6.000" expanded="true" height="82" name="Generate Attributes" width="90" x="313" y="34">
            <list key="function_descriptions">
              <parameter key="days" value="parse(replaceAll(myField,&quot;(\\d+) days?[ ,]+(\\d+) weeks?[ ,]+(\\d+) months?&quot;,&quot;$1&quot;))"/>
              <parameter key="days" value="days + (parse(replaceAll(myField,&quot;(\\d+) days?[ ,]+(\\d+) weeks?[ ,]+(\\d+) months?&quot;,&quot;$2&quot;))*7)"/>
              <parameter key="days" value="days + (parse(replaceAll(myField,&quot;(\\d+) days?[ ,]+(\\d+) weeks?[ ,]+(\\d+) months?&quot;,&quot;$3&quot;))*30)"/>
            </list>
            <parameter key="keep_all" value="true"/>
          </operator>
          <connect from_op="Create ExampleSet" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    


  • Thiru Member Posts: 100 Guru
    @kayman, thanks for your reply. I'm sorry that I didn't describe the data clearly.
    All of these are different rows of a single attribute. That is, 3 days can be one row, 8 weeks another row, 10 months
    another row, and 6 years yet another one. There are many rows like that.
    I'm enclosing a sample of that attribute, "Age pet". Kindly have a look at it.
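
One possible way to handle the one-value-per-row layout described above, sketched as a Generate Attributes fragment. It assumes every row holds a single whole number followed by a unit name containing "day", "week", "month" or "year". It reuses the 30-days-per-month average from the earlier reply and adds 365 days per year as a further approximation. The attribute names AgeNumber and AgeDays are made up for illustration, [Age pet] should be adjusted to the real attribute name, and the fragment has not been run against the enclosed sample.

```xml
<!-- Fragment only: place this operator in a process after the data has been loaded. -->
<!-- AgeNumber: the digits of [Age pet], parsed as a number. -->
<!-- AgeDays: AgeNumber scaled by whichever unit name appears in the text;
     rows given in plain days fall through to the last branch unchanged. -->
<operator activated="true" class="generate_attributes" compatibility="9.6.000" expanded="true" name="Generate Attributes">
  <list key="function_descriptions">
    <parameter key="AgeNumber" value="parse(replaceAll([Age pet],&quot;\\D&quot;,&quot;&quot;))"/>
    <parameter key="AgeDays" value="if(contains(lower([Age pet]),&quot;year&quot;),AgeNumber*365,if(contains(lower([Age pet]),&quot;month&quot;),AgeNumber*30,if(contains(lower([Age pet]),&quot;week&quot;),AgeNumber*7,AgeNumber)))"/>
  </list>
  <parameter key="keep_all" value="true"/>
</operator>
```

Under these assumptions, 3 days would stay 3, 8 weeks would become 56, 10 months 300, and 6 years 2190. The rows can then be grouped on AgeDays.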