Discretize values from "Generate Attributes"

rgk Member Posts: 3 Contributor I
Hi,

I am currently designing a small process to analyse some date/time data and have run into the following problems:

1. I have two columns containing calendar dates in an Excel file. After importing (which works fine), the first step generates a new attribute with "Generate Attributes" by calculating the date difference between the two columns (function date_diff()). The results seem fine: the result set shows the correct values in days (I divided the function result by 1000/60/60/24), and the meta data view shows the new attribute (role=regular and type=real). In the next step, I would like to discretize the new column using "Discretize by user specification". However, the newly generated attribute is not shown in the attribute drop-down box of the discretize operator (neither for single nor for subset). I have already tried some debugging, e.g. generating new integer, real or numeric attributes using basic operators. That worked fine, and discretization was successful. The problem seems to occur only when I use the date_diff function in "Generate Attributes". Any ideas?

2. I would like to generate an empty attribute using "Generate Empty Attribute", then calculate a value and store it in the empty attribute. Is there an operator for that? Note: In this case, I do not want to use "Generate Attributes", but rather calculate and store the result in an existing attribute.

Thanks for your help!

Cheers
Ralf



Answers

  • fras Member Posts: 93 Contributor II
    Hi Ralf,

    Metadata propagation does not work perfectly. If you know an attribute must be there, you
    sometimes have to type its name in manually or copy/paste it from the result view (right-click on
    the attribute name).
    Concerning "Generate Attributes": I do not quite get the point, but you can also use it to
    overwrite any attribute that already exists.
    For further posts: feel free to paste the XML of your process to better illustrate your needs.

    -Frank
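
    A minimal sketch of the overwrite described above, not a tested process: it assumes the example set already contains a numeric attribute named DATE_DIFF holding a value in days. The operator name and the division by 7 (days to weeks) are purely illustrative assumptions. Using an existing attribute name as the key is what should make "Generate Attributes" replace the values instead of adding a new column.

    ```xml
    <!-- Hypothetical fragment: overwrites the existing attribute DATE_DIFF in place. -->
    <!-- Assumes DATE_DIFF is already present and numeric; the name and formula are illustrative. -->
    <operator activated="true" class="generate_attributes" compatibility="5.3.015" expanded="true" name="Overwrite DATE_DIFF">
      <list key="function_descriptions">
        <!-- key = name of the existing attribute, value = expression for its new content -->
        <parameter key="DATE_DIFF" value="DATE_DIFF/7"/>
      </list>
    </operator>
    ```

    The fragment would still need to be wired into a process with <connect> elements on the "example set input" and "example set output" ports.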
  • rgk Member Posts: 3 Contributor I
    Hi Frank,

    thanks for your input on metadata propagation. It actually helped me to at least work around the problem by "tricking" RapidMiner (of course, a correct solution is still missing). Apparently, RapidMiner is not strongly typed, so you can manually override the selection and get a running process even if RapidMiner reports a problem. Here we go...

    1. The problem was like this: take an Excel sheet with (an id column and) two columns holding dates (start date and end date). Together they form a time interval in days.
    2. Import into RapidMiner, calculate the time interval using date_diff(), and store the result in a new column (DATE_DIFF).
    3. Copy the result using "Generate Copy" into a 4th column called SLICE_DATE_DIFF, then use discretize to slice the intervals into groups.

    Problem: RapidMiner does not recognize the result columns DATE_DIFF and SLICE_DATE_DIFF in the discretize operator. It simply does not show them when using the attribute filter type "single" (or "subset"), although the input data type (numeric) should be correct.

    Workaround: it is possible to enter the name of the attribute in the discretize operator manually ("SLICE_DATE_DIFF"). RapidMiner then shows the warning "Attribute filter does not match any attributes". Still, the process runs and produces an example set that contains both the calculated differences and the discretized values.

    While I now have (hopefully correct) results, it would of course be even more satisfying to come up with a formally correct and "problem-free" solution. Any suggestions are highly appreciated!

    I have included the XML code of my small example process below.

    Cheers
    Ralf

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.015">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="5.3.015" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
            <parameter key="excel_file" value="C:\Desktop\Sampledata_slice_dates.xls"/>
            <parameter key="imported_cell_range" value="A1:C7"/>
            <list key="annotations">
              <parameter key="0" value="Name"/>
            </list>
            <parameter key="date_format" value="dd.MM.yyyy"/>
            <parameter key="locale" value="German"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="ID.true.integer.id"/>
              <parameter key="1" value="Date_1.true.date.attribute"/>
              <parameter key="2" value="Date_2.true.date.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="5.3.015" expanded="true" height="76" name="Generate Attributes DATE_DIFF" width="90" x="179" y="75">
            <list key="function_descriptions">
              <parameter key="DATE_DIFF" value="date_diff(Date_1,Date_2)/1000/86400"/>
            </list>
          </operator>
          <operator activated="true" class="generate_copy" compatibility="5.3.015" expanded="true" height="76" name="Generate Copy SLICE_DATE" width="90" x="380" y="75">
            <parameter key="attribute_name" value="DATE_DIFF"/>
            <parameter key="new_name" value="SLICE_DATE_DIFF"/>
          </operator>
          <operator activated="true" class="discretize_by_user_specification" compatibility="5.3.015" expanded="true" height="94" name="Discretize SLICE_DATE" width="90" x="581" y="75">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="SLICE_DATE_DIFF"/>
            <parameter key="include_special_attributes" value="true"/>
            <list key="classes">
              <parameter key="0..1" value="1.0"/>
              <parameter key="2..4" value="4.0"/>
              <parameter key="5+" value="Infinity"/>
            </list>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Generate Attributes DATE_DIFF" to_port="example set input"/>
          <connect from_op="Generate Attributes DATE_DIFF" from_port="example set output" to_op="Generate Copy SLICE_DATE" to_port="example set input"/>
          <connect from_op="Generate Copy SLICE_DATE" from_port="example set output" to_op="Discretize SLICE_DATE" to_port="example set input"/>
          <connect from_op="Discretize SLICE_DATE" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>


  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Hi Ralf,

    have you tried to run the import wizard first? Usually after doing that, meta data is cached.

    Second "trick": you can use the two linked "rings" in the upper right of the process view. With them enabled, meta data is recalculated after an operator is executed. Then you can add a breakpoint right before the operator you want to edit.
    Usually this helps to get the meta data propagated.

    Best,

    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • rgk Member Posts: 3 Contributor I
    Hi Martin,

    thanks for the input. I did not know about the two rings/chain symbol. However, in this case it did not seem to help. I first tried running the import wizard, which did not help, then activated the chain icon and restarted the process. This did not help either. The created attributes were still not available, both when the discretization step was already in the process and when I added it after running the process.

    But since I know now that I can simply enter the field names in the discretization operator, it's ok for me. At the end of the day, what counts is that the software can do what I want it to.

    Cheers
    Ralf