"How to store (within-process) feature-selected attributes?"

alimay · May 2011

Hi. My question is simple, and even though I think a lot of people must have faced it, I could not find a solution on the web. Here it is:

I have a file which, after importing, I split into two using the split operator. On the first partition I do feature selection and model construction, and I keep the second partition as a test set. Now the thing is: I need a way to remember/store the feature-selected attributes so that I can select them with the Select Attributes operator when I apply my model to the test set, so that I do not get this error:

May 1, 2011 6:03:32 AM WARNING: SimpleDistribution: The number of regular attributes of the given example set does not fit the number of attributes of the training example set, training: 50, application: 3045

So what I need is something like a "group attributes" operator that will group the attributes coming from the feature selection operator I use (and give the group a user-defined name like "grouped attributes"). Later, when I use the Select Attributes operator on the test set, it should display this "grouped attributes" entry in addition to all regular attributes and the class attribute, so that I can select it. Otherwise I need to note all the attributes coming from the feature selection and select them one by one, which is a terrible experience if you're trying to select the best few hundred attributes.

Is there already an operator doing this? I could not find one.
I hope my question is clear. Thank you in advance for your answers.

haddock · May 2011

Hi there,

In similar circumstances I use the attribute weights that the thinning operator produces, either directly or read back from file, and then apply them to another dataset. Should work.

Hope so!

alimay · May 2011

Hi!

Thanks for the answer. However, it does not solve my problem, as I split the data (into two partitions) within the process, not before it. And since Read Constructions on the second partition is executed before Write Constructions (which comes after the feature selection on the first partition), that does not work for me. And unfortunately I am not sure I fully understood what you meant by "directly". How can I use the attributes selected from the first partition directly to filter the attributes on the second partition?

haddock · May 2011

Hi there,

Here's a simplified example of what I had in mind, hope it helps!

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.006">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.006" expanded="true" name="Root">
    <process expanded="true" height="604" width="614">
      <operator activated="true" class="retrieve" compatibility="5.1.006" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
        <parameter key="repository_entry" value="../../data/Polynomial"/>
      </operator>
      <operator activated="true" class="split_data" compatibility="5.1.006" expanded="true" height="94" name="Split Data" width="90" x="179" y="120">
        <enumeration key="partitions">
          <parameter key="ratio" value="0.7"/>
          <parameter key="ratio" value="0.3"/>
        </enumeration>
      </operator>
      <operator activated="true" class="weight_by_relief" compatibility="5.1.006" expanded="true" height="76" name="Relief" width="90" x="313" y="30"/>
      <operator activated="true" class="select_by_weights" compatibility="5.1.006" expanded="true" height="94" name="AttributeWeightSelection" width="90" x="447" y="30">
        <parameter key="weight" value="0.5"/>
        <parameter key="use_absolute_weights" value="false"/>
      </operator>
      <operator activated="true" class="select_by_weights" compatibility="5.1.006" expanded="true" height="94" name="AttributeWeightSelection (2)" width="90" x="514" y="165">
        <parameter key="weight" value="0.5"/>
        <parameter key="use_absolute_weights" value="false"/>
      </operator>
      <connect from_op="Retrieve" from_port="output" to_op="Split Data" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 1" to_op="Relief" to_port="example set"/>
      <connect from_op="Split Data" from_port="partition 2" to_op="AttributeWeightSelection (2)" to_port="example set input"/>
      <connect from_op="Relief" from_port="weights" to_op="AttributeWeightSelection" to_port="weights"/>
      <connect from_op="Relief" from_port="example set" to_op="AttributeWeightSelection" to_port="example set input"/>
      <connect from_op="AttributeWeightSelection" from_port="example set output" to_port="result 1"/>
      <connect from_op="AttributeWeightSelection" from_port="weights" to_op="AttributeWeightSelection (2)" to_port="weights"/>
      <connect from_op="AttributeWeightSelection (2)" from_port="example set output" to_port="result 2"/>
      <connect from_op="AttributeWeightSelection (2)" from_port="weights" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>
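
For readers who do not have RapidMiner at hand, the data flow of the process above can be sketched in plain Python: weights are computed on partition 1 only, and the same weights then filter the attributes of both partitions. This is only an illustration under stated assumptions. The function names, the toy data, and the weighting scheme (a normalized gap between class means, swapped in for Relief) are all hypothetical, and the "greater or equal" comparison against the 0.5 threshold is assumed rather than read from the process.

```python
def weight_attributes(rows, attributes, label):
    """Hypothetical stand-in for Relief: gap between class means, scaled to [0, 1]."""
    raw = {}
    for name in attributes:
        pos = [r[name] for r in rows if r[label] == 1]
        neg = [r[name] for r in rows if r[label] == 0]
        raw[name] = abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    top = max(raw.values()) or 1.0  # avoid dividing by zero if all gaps are 0
    return {name: w / top for name, w in raw.items()}


def select_by_weights(rows, weights, threshold, label):
    """Keep only attributes whose weight is >= threshold (assumed relation)."""
    keep = [name for name, w in weights.items() if w >= threshold]
    filtered = [{k: r[k] for k in keep + [label]} for r in rows]
    return filtered, keep


# Toy data (made up): a1 and a3 depend on the label, a2 is constant.
attributes = ["a1", "a2", "a3"]
rows = [
    {"a1": 2.0 * (i % 2), "a2": 1.0, "a3": 1.2 * (i % 2) + 1.0, "label": i % 2}
    for i in range(10)
]

# 70/30 split, taken in order for simplicity.
train, test = rows[:7], rows[7:]

# Weights come from the training partition only ...
weights = weight_attributes(train, attributes, "label")

# ... and the same weights filter both partitions.
train_selected, selected = select_by_weights(train, weights, 0.5, "label")
test_selected, _ = select_by_weights(test, weights, 0.5, "label")

print(selected)
print(test_selected[0])
```

Because both partitions are filtered with the same weights object, they end up with the same attribute set, which is what avoids the "number of regular attributes ... does not fit" warning when the model is applied.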

alimay · May 2011

Ah, now I understand what you mean, and I think this really is what I need. Thanks a lot.

Only a little concern: I don't know whether it is okay (or maybe even the norm) to pass the attribute weights with which the model was trained to the test set. As far as I understand from what you supplied, we pass the attributes along with their weights and apply the model to these attributes, with the weights coming from the feature selection. Is this a good/normal thing? Data mining/machine learning is a new topic to me, so I am not really confident in these sorts of situations; that's why I ask. Forgive if it's a nonsense question..

Sincere thanks..
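
One way to look at this concern, as a rough sketch: selecting by weights only decides which attributes are kept, whereas scaling by weights would change the values themselves. The snippet below contrasts the two on a single made-up test example. The function names and numbers are hypothetical, and the description of the selection step as a pure filter is my understanding of the operator rather than something confirmed in this thread.

```python
def keep_by_weight(example, weights, threshold):
    """Filter only: drop attributes below the threshold, leave values untouched."""
    return {k: v for k, v in example.items() if weights[k] >= threshold}


def scale_by_weight(example, weights):
    """Contrast case: multiply every value by its weight."""
    return {k: v * weights[k] for k, v in example.items()}


# Weights as they might come out of feature selection on the training partition.
train_weights = {"a1": 1.0, "a2": 0.0, "a3": 0.75}
test_example = {"a1": 4.0, "a2": 7.0, "a3": 2.0}

filtered = keep_by_weight(test_example, train_weights, 0.5)
scaled = scale_by_weight(test_example, train_weights)

print(filtered)  # a1 and a3 survive with their original values
print(scaled)    # every value is changed by its weight
```

In the filter case the test set contributes nothing to the weights and its values are not altered; it simply ends up with the attributes the model was trained on.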

haddock · May 2011

"Forgive if it's a nonsense question.."

NO, this is not a nonsense question - it is just smart. There is a sort of 'lie' that underpins data mining in an unspoken way, namely that there is a 'right' answer. There isn't, and in fact there never can be. All swans are white, yep, until you go to Western Australia and find a black one, or to X where their colour is Y ....

On the other hand there are dumb answers, "all swans are not fish"...

I'm afraid that until someone provides a post-Wittgenstein notion of meaning we are just left with..

"Whatever floats your boat..."

Or am I missing something? If only..
