Options

"[SOLVED] Rapidminer / Excel Missing Value editing"

dramaticlookdramaticlook Member Posts: 5 Contributor II
edited June 2019 in Help
Hi all,
Im learning how to use Rapidminer for a project. Im stuck at some point. I have a dataset as follows: There are countries. For each country Im keeping track of some values (medals lets say) for years 1990-2012. As an example:
   

- Country       Year   Gold   Silver    Bronze
-----------------------------------------
       USA      1990    10      5         7
       .....
       USA      2012    12      3         8
       Spain    1990    8       12        9
       ...
       Spain    1992    7       ?         8
       ....
       Spain    2012    4       11        12
       ...GOES ON...

What I want to do is to replace the missing values. For example Spain has a missing value in 1992 for Silver Medals. I want to find the average for Silver data available for Spain and replace the missing value with that. How can I do this? If the present modules in Rapidminer not able to do this, is there some kind of macro etc? I can also use Excel to preprocess the data (but how)???.
Tagged:

Answers

  • Options
    MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hey, easy exercise with RapidMiner ;)

    Did you manage to load the data, and does RapidMiner correctly recognize the missing values? Then you can simply use the Replace Missing Values operator and define "average" as the replacement strategy.

    Best, Marius
  • Options
    dramaticlookdramaticlook Member Posts: 5 Contributor II
    Hey thanks for replying I have loaded the data and I can clearly see the missing vals. I used Replace Missing Attributes however its not what I exactly want. I want to use the average for that country's data instead of average of the whole attribute.
  • Options
    RLorrigRLorrig Member Posts: 4 Contributor I
    I'd suggest to sort your data by the country attribute and then use the "filter example range" operator to separate your data by countrys. Afterswards you can use the replace missing attributes operator. As long as you havn't got too many countrys it's not that much work ;)
  • Options
    dramaticlookdramaticlook Member Posts: 5 Contributor II
    ;DActually what I presented here was a small example. I actually have a large dataset with 35 financial attributes and two of them lists the country codes and years in a similar structure. There are more than 150 country codes.
  • Options
    MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Heya, I would iterate all countries with the Loop Values operator. In each iteration, it creates a macro, which you can use to filter the example set in the inner process with Filter Examples and the attribute_value filter. Then you can apply the Replace Missing Values operator on the filtered data, Append the output of Loop Values and you're done. There should be some examples on the usage of Loop Values here in the forums.

    Happy Mining!

    ~Marius
  • Options
    dramaticlookdramaticlook Member Posts: 5 Contributor II
    Hi, I just had a chance to apply your advice. I can loop through the country codes. However I can not use the attribute filter thing. Do I have to create a filter for attribute values of country codes? I mean there is a string parameter when I try to use Filter Examples inside the looper with the option attribute_value_filter. It reasons to type something like CountryCode=USA there and then use the replace missing values in the next operator. However I am going to need to create around 150 operators manually. How can I automate it?
    The first layer
    image
    The second layer
    image
  • Options
    Nils_WoehlerNils_Woehler Member Posts: 463 Maven
    Hi,

    please post your process setup like it is described in Marius signature.

    Best,
    Nils
  • Options
    dramaticlookdramaticlook Member Posts: 5 Contributor II
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
       <process expanded="true" height="431" width="815">
         <operator activated="true" class="read_csv" compatibility="5.2.008" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
           <parameter key="csv_file" value="/Users/cantelk/Desktop/2012-FALL/CS553/project-data-cvs/oecd-reorg-dates.csv"/>
           <parameter key="column_separators" value=","/>
           <parameter key="date_format" value="yyyy-MM-dd"/>
           <parameter key="first_row_as_names" value="false"/>
           <list key="annotations">
             <parameter key="0" value="Name"/>
           </list>
           <parameter key="encoding" value="US-ASCII"/>
           <list key="data_set_meta_data_information">
             <parameter key="0" value="CountryCode.true.text.attribute"/>
             <parameter key="1" value="Year.true.date.attribute"/>
             <parameter key="2" value="OECD membership.true.binominal.attribute"/>
             <parameter key="3" value="Adjusted net national income (annual % growth).true.real.attribute"/>
             <parameter key="4" value="Adjusted net savings  excluding particulate emission damage (current US$).true.real.attribute"/>
             <parameter key="5" value="Adjusted net savings  including particulate emission damage (current US$).true.real.attribute"/>
             <parameter key="6" value="Agriculture  value added (annual % growth).true.real.attribute"/>
             <parameter key="7" value="Exports of goods and services (annual % growth).true.real.attribute"/>
             <parameter key="8" value="Final consumption expenditure  etc\. (annual % growth).true.real.attribute"/>
             <parameter key="9" value="GDP growth (annual %).true.real.attribute"/>
             <parameter key="10" value="GDP per capita growth (annual %).true.real.attribute"/>
             <parameter key="11" value="General government final consumption expenditure (annual % growth).true.real.attribute"/>
             <parameter key="12" value="GNI growth (annual %).true.real.attribute"/>
             <parameter key="13" value="GNI per capita growth (annual %).true.real.attribute"/>
             <parameter key="14" value="Gross capital formation (annual % growth).true.real.attribute"/>
             <parameter key="15" value="Gross fixed capital formation (annual % growth).true.real.attribute"/>
             <parameter key="16" value="Household final consumption expenditure (annual % growth).true.real.attribute"/>
             <parameter key="17" value="Household final consumption expenditure per capita growth (annual %).true.real.attribute"/>
             <parameter key="18" value="Household final consumption expenditure  etc\. (annual % growth).true.real.attribute"/>
             <parameter key="19" value="Imports of goods and services (annual % growth).true.real.attribute"/>
             <parameter key="20" value="Industry  value added (annual % growth).true.real.attribute"/>
             <parameter key="21" value="Manufacturing  value added (annual % growth).true.real.attribute"/>
             <parameter key="22" value="Services  etc\.  value added (annual % growth).true.real.attribute"/>
             <parameter key="23" value="Daily newspapers (per 1000 people).true.real.attribute"/>
             <parameter key="24" value="Fixed broadband Internet subscribers.true.integer.attribute"/>
             <parameter key="25" value="Internet users.true.real.attribute"/>
             <parameter key="26" value="Mobile cellular subscriptions.true.integer.attribute"/>
             <parameter key="27" value="Secure Internet servers.true.integer.attribute"/>
             <parameter key="28" value="Telephone lines.true.integer.attribute"/>
             <parameter key="29" value="Scientific and technical journal articles.true.real.attribute"/>
             <parameter key="30" value="Researchers in R&amp;D (per million people).true.real.attribute"/>
             <parameter key="31" value="Research and development expenditure (% of GDP).true.real.attribute"/>
             <parameter key="32" value="Patent applications  residents.true.integer.attribute"/>
             <parameter key="33" value="Patent applications  nonresidents.true.integer.attribute"/>
             <parameter key="34" value="High-technology exports (current US$).true.integer.attribute"/>
             <parameter key="35" value="High-technology exports (% of manufactured exports).true.real.attribute"/>
           </list>
         </operator>
         <operator activated="true" class="multiply" compatibility="5.2.008" expanded="true" height="94" name="Multiply (2)" width="90" x="246" y="30"/>
         <operator activated="true" class="set_role" compatibility="5.2.008" expanded="true" height="76" name="Set Role" width="90" x="380" y="165">
           <parameter key="name" value="OECD membership"/>
           <parameter key="target_role" value="label"/>
           <list key="set_additional_roles"/>
         </operator>
         <operator activated="true" class="select_attributes" compatibility="5.2.008" expanded="true" height="76" name="Select Attributes (2)" width="90" x="514" y="165">
           <parameter key="attribute_filter_type" value="subset"/>
           <parameter key="attributes" value="|Telephone lines|Services  etc.  value added (annual % growth)|Secure Internet servers|Scientific and technical journal articles|Researchers in R&amp;D (per million people)|Research and development expenditure (% of GDP)|Patent applications  residents|Patent applications  nonresidents|OECD membership|Mobile cellular subscriptions|Manufacturing  value added (annual % growth)|Internet users|Industry  value added (annual % growth)|Imports of goods and services (annual % growth)|Household final consumption expenditure per capita growth (annual %)|Household final consumption expenditure (annual % growth)|Household final consumption expenditure  etc. (annual % growth)|High-technology exports (current US$)|High-technology exports (% of manufactured exports)|Gross fixed capital formation (annual % growth)|Gross capital formation (annual % growth)|General government final consumption expenditure (annual % growth)|GNI per capita growth (annual %)|GNI growth (annual %)|GDP per capita growth (annual %)|GDP growth (annual %)|Fixed broadband Internet subscribers|Final consumption expenditure  etc. (annual % growth)|Exports of goods and services (annual % growth)|Daily newspapers (per 1000 people)|Agriculture  value added (annual % growth)|Adjusted net savings  including particulate emission damage (current US$)|Adjusted net savings  excluding particulate emission damage (current US$)|Adjusted net national income (annual % growth)"/>
         </operator>
         <operator activated="true" class="parallel:decision_tree_parallel" compatibility="5.1.000" expanded="true" height="76" name="Decision Tree" width="90" x="648" y="165">
           <parameter key="criterion" value="accuracy"/>
         </operator>
         <operator activated="true" class="loop_values" compatibility="5.2.008" expanded="true" height="76" name="Loop Values" width="90" x="380" y="30">
           <parameter key="attribute" value="CountryCode"/>
           <process expanded="true" height="429" width="769">
             <operator activated="true" class="filter_examples" compatibility="5.2.008" expanded="true" height="76" name="Filter Examples" width="90" x="112" y="30">
               <parameter key="condition_class" value="attribute_value_filter"/>
               <parameter key="parameter_string" value="CountryCode=AFG"/>
             </operator>
             <operator activated="true" class="select_attributes" compatibility="5.2.008" expanded="true" height="76" name="Select Attributes" width="90" x="246" y="30">
               <parameter key="attribute_filter_type" value="subset"/>
               <parameter key="attributes" value="|Telephone lines|Services  etc.  value added (annual % growth)|Secure Internet servers|Scientific and technical journal articles|Researchers in R&amp;D (per million people)|Research and development expenditure (% of GDP)|Patent applications  residents|Patent applications  nonresidents|OECD membership|Mobile cellular subscriptions|Manufacturing  value added (annual % growth)|Internet users|Industry  value added (annual % growth)|Imports of goods and services (annual % growth)|Household final consumption expenditure per capita growth (annual %)|Household final consumption expenditure (annual % growth)|Household final consumption expenditure  etc. (annual % growth)|High-technology exports (current US$)|High-technology exports (% of manufactured exports)|Gross fixed capital formation (annual % growth)|Gross capital formation (annual % growth)|General government final consumption expenditure (annual % growth)|GNI per capita growth (annual %)|GNI growth (annual %)|GDP per capita growth (annual %)|GDP growth (annual %)|Fixed broadband Internet subscribers|Final consumption expenditure  etc. (annual % growth)|Exports of goods and services (annual % growth)|Daily newspapers (per 1000 people)|CountryCode|Agriculture  value added (annual % growth)|Adjusted net savings  including particulate emission damage (current US$)|Adjusted net savings  excluding particulate emission damage (current US$)|Adjusted net national income (annual % growth)"/>
             </operator>
             <operator activated="true" class="replace_missing_values" compatibility="5.2.008" expanded="true" height="94" name="Replace Missing Values" width="90" x="380" y="30">
               <parameter key="attribute_filter_type" value="value_type"/>
               <parameter key="block_type" value="value_series"/>
               <list key="columns"/>
             </operator>
             <connect from_port="example set" to_op="Filter Examples" to_port="example set input"/>
             <connect from_op="Filter Examples" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
             <connect from_op="Select Attributes" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
             <connect from_op="Replace Missing Values" from_port="example set output" to_port="out 1"/>
             <portSpacing port="source_example set" spacing="0"/>
             <portSpacing port="sink_out 1" spacing="0"/>
             <portSpacing port="sink_out 2" spacing="0"/>
           </process>
         </operator>
         <connect from_op="Read CSV" from_port="output" to_op="Multiply (2)" to_port="input"/>
         <connect from_op="Multiply (2)" from_port="output 1" to_op="Loop Values" to_port="example set"/>
         <connect from_op="Multiply (2)" from_port="output 2" to_op="Set Role" to_port="example set input"/>
         <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes (2)" to_port="example set input"/>
         <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
         <connect from_op="Decision Tree" from_port="model" to_port="result 2"/>
         <connect from_op="Loop Values" from_port="out 1" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>
    here is my process. I have an extra decision tree which you can omit. The problem is related to the loop values process which I use to replace the missing values with the average.
  • Options
    Nils_WoehlerNils_Woehler Member Posts: 463 Maven
    You can use the macro provided by Loop Values and set the parameter of Filter Examples like this: 'CountryCode=%{loop_value}'

    Best,
    Nils
Sign In or Register to comment.