[SOLVED] Set operations with mismatched ID attributes

Jester87Jester87 Member Posts: 10 Contributor II
edited November 2018 in Help
Here's my situation:

I am trying to create a time series based upon collection of data from about 90 different servers. I am attempting to track changes in inventory levels where inventory is stored as a list of words (eg. Apple, Apple, Pear, Apple, Kiwi, Mango). The basic idea is as follows:

Process:

1. Grab information using Web API key - > Save as xml.
2. Use the "Process Documents," "Cut Document" and "Extract Information" operators to create a word list (inventory list and count)
3. Use "Word List to Data" operator to create an example set.

Result

1. An example set with "word" as ID attribute and total as a regular attribute.

Here is an example of the kind of data that I expect to get after I use "Process Documents":

Parse #1 - (time 1)

Apples, 3
Oranges, 2
Bananas, 2

Parse #2 - (time 2)

Apples, 1
Oranges, 5
Bananas, 3
Kiwis, 7

(I have made it to this point successfully)

In order to track changes in inventory, I need to merge the results. As you can see, the problem with merging these two example sets is that the ID ("word") attribute column does not match. Parse #2 contains 7 Kiwi ("word") entries that parse #1 did not have. My problem is that I don't know how many different items there are, and therefore suspect that I will need to keep adding to my base ID attribute when I try to merge the parsed data into one spreadsheet or database.

One way to do this is to isolate the new ID attributes and run some set operations. I envision it would go something like this:

1. Load base ID attribute & new ID attribute
2. Use a "Set Minus" operator to find new ID attribute entries
3. Use a "Join" operator to merge Base ID attribute and new ID attribute entries
3a. Replace Base ID attribute
4. Use a "Join" operator to merge Base ID attribute and new ID attribute entries
4a. Replace new ID attribute
5. Use "Replace Missing Values" operator to make sure both new and old data align (e.g. Parse #1 had zero kiwi)
6. Use an "Append" operator to merge the data that should now have identical ID attribute.

I think this should work, but it seems convoluted. Do you understand what I am trying to do? Is there an easier way to do this?

Your help is much appreciated!

Answers

  • MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    If I understand you correctly, your final data set should look like this:

    ID, T1, T2
    Apples, 3, 1
    Oranges, 2, 5
    Bananas, 2, 3
    Kiwis, 0, 7

    That can be done with the Join operator (mode: outer join), followed by a replace missing values. Probably you have to rename the attributes of one of the examplesets beforehand.

    Please let me know if I understood you correctly and if this works. If not, please define "merge the example sets" and give an example of the desired output.

    Best regards,
    Marius
  • Jester87Jester87 Member Posts: 10 Contributor II
    Marius wrote:

    If I understand you correctly, your final data set should look like this:

    ID, T1, T2
    Apples, 3, 1
    Oranges, 2, 5
    Bananas, 2, 3
    Kiwis, 0, 7

    That can be done with the Join operator (mode: outer join), followed by a replace missing values. Probably you have to rename the attributes of one of the examplesets beforehand.

    Please let me know if I understood you correctly and if this works. If not, please define "merge the example sets" and give an example of the desired output.

    Best regards,
    Marius
    You have the right idea, but doesn't the key attribute (ID role) need to be identical for each example set that is being joined?

    In my particular case, my second example set added Kiwi as part of the key attribute. Since the first example set doesn't include Kiwi, don't I need to merge the key attributes of the example sets before I can use the join operator?
  • MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Ask less, experiment more - RapidMiner is a more powerful wizard than you might think :)

    But seriously, just give things a try - in this case, the join operator does the job quite well, as you can also see in the attached process. Please note that I deactivated the option remove_duplicate_attributes.

    Best regards,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.006">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.006" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data" compatibility="5.3.006" expanded="true" height="60" name="Generate Data" width="90" x="112" y="30">
            <parameter key="number_examples" value="10"/>
            <parameter key="number_of_attributes" value="1"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="5.3.006" expanded="true" height="76" name="Generate ID" width="90" x="313" y="30"/>
          <operator activated="true" class="generate_data" compatibility="5.3.006" expanded="true" height="60" name="Generate Data (2)" width="90" x="112" y="120">
            <parameter key="number_examples" value="10"/>
            <parameter key="number_of_attributes" value="1"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="5.3.006" expanded="true" height="76" name="Generate ID (2)" width="90" x="313" y="120">
            <parameter key="offset" value="5"/>
          </operator>
          <operator activated="true" class="join" compatibility="5.3.006" expanded="true" height="76" name="Join" width="90" x="447" y="75">
            <parameter key="remove_double_attributes" value="false"/>
            <parameter key="join_type" value="outer"/>
            <list key="key_attributes"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Join" to_port="left"/>
          <connect from_op="Generate Data (2)" from_port="output" to_op="Generate ID (2)" to_port="example set input"/>
          <connect from_op="Generate ID (2)" from_port="example set output" to_op="Join" to_port="right"/>
          <connect from_op="Join" from_port="join" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • Jester87Jester87 Member Posts: 10 Contributor II
    Marius wrote:

    Ask less, experiment more - RapidMiner is a more powerful wizard than you might think :)

    But seriously, just give things a try - in this case, the join operator does the job quite well, as you can also see in the attached process. Please note that I deactivated the option remove_duplicate_attributes.

    Best regards,
    Marius
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.006">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.006" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data" compatibility="5.3.006" expanded="true" height="60" name="Generate Data" width="90" x="112" y="30">
            <parameter key="number_examples" value="10"/>
            <parameter key="number_of_attributes" value="1"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="5.3.006" expanded="true" height="76" name="Generate ID" width="90" x="313" y="30"/>
          <operator activated="true" class="generate_data" compatibility="5.3.006" expanded="true" height="60" name="Generate Data (2)" width="90" x="112" y="120">
            <parameter key="number_examples" value="10"/>
            <parameter key="number_of_attributes" value="1"/>
          </operator>
          <operator activated="true" class="generate_id" compatibility="5.3.006" expanded="true" height="76" name="Generate ID (2)" width="90" x="313" y="120">
            <parameter key="offset" value="5"/>
          </operator>
          <operator activated="true" class="join" compatibility="5.3.006" expanded="true" height="76" name="Join" width="90" x="447" y="75">
            <parameter key="remove_double_attributes" value="false"/>
            <parameter key="join_type" value="outer"/>
            <list key="key_attributes"/>
          </operator>
          <connect from_op="Generate Data" from_port="output" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Join" to_port="left"/>
          <connect from_op="Generate Data (2)" from_port="output" to_op="Generate ID (2)" to_port="example set input"/>
          <connect from_op="Generate ID (2)" from_port="example set output" to_op="Join" to_port="right"/>
          <connect from_op="Join" from_port="join" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Thanks, this worked like a charm. I'm not quite sure what caused my join operator to malfunction the first time I tried it, but now it seems to work no problems!
Sign In or Register to comment.