# Applying attribute elimination to original data

christopher_sch

I have a large dataset that has been tokenized. Many of the token attributes capture identical information, so I need to eliminate some variables that have 100% correlation.

Because the dataset is large, I'd like to perform "Remove Correlated Attributes" on a sample, rather than the original, then apply the results from the sample back to the original (eliminating about 1,000 attributes from the original in the process).

What's the best way to do this? I've been messing around with the "Work on Subset" operator, but it seems to only want to pull the sample back without applying the attribute removal to the original.

Thanks for any insight.

## Answers

Hello @christopher_sch, welcome to the community. It seems to me that you should try optimizing via Feature Selection. The feature selection operators come with some nice tutorials on how to do this.

Scott

If I understood your question correctly, you want to:

1) take a sample of the entire dataset.

2) find variables that are highly correlated

3) drop them

4) save the names of the variables that survived step 3

5) load the entire dataset

6) take only variables in step 4

I think you can do that with a combination of "Remove Correlated Attributes" and "Data to Weights".

In the example below, I split the sample dataset Sonar in two:

a) first 50 obs

b) obs 51 to 208

I use the first 50 obs to find correlated attributes (correlation > 0.7) and drop one of each pair. I save the weights of the variables that remained in the dataset (using Data to Weights). I then use these weights to filter the second part of the dataset.
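For readers who want to check the logic outside RapidMiner, the same idea can be sketched in Python with pandas. This is not the RapidMiner process itself. The function name `surviving_columns`, the synthetic data, and the greedy "keep the first attribute of each correlated group" rule are assumptions made for illustration, and "Remove Correlated Attributes" may choose differently depending on its attribute order setting.

```python
import numpy as np
import pandas as pd


def surviving_columns(sample: pd.DataFrame, threshold: float = 0.7) -> list:
    """Greedily keep columns whose |correlation| with every kept column is <= threshold."""
    # Constant columns produce NaN correlations; treat them as uncorrelated (assumption).
    corr = sample.corr().abs().fillna(0.0)
    kept = []
    for col in corr.columns:
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return kept


# Synthetic stand-in for a 208-row dataset (hypothetical data, not Sonar itself).
rng = np.random.default_rng(0)
a = rng.normal(size=208)
b = rng.normal(size=208)
full = pd.DataFrame({"a": a, "a_copy": a, "b": b, "a_scaled": 2 * a})

sample = full.iloc[:50]   # a) first 50 observations
rest = full.iloc[50:]     # b) observations 51 to 208

kept = surviving_columns(sample, threshold=0.7)  # decided on the sample only
reduced = rest[kept]                             # applied to the other part
```

The list of surviving column names plays the role that the weights play in the RapidMiner version: it is computed once on the small part and then used purely as a column filter on the rest.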

The program of course could be split into two processes:

1) Find the weights and save them

2) Apply weights to entire dataset.
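The two-process split can be sketched the same way in Python with pandas, again as an illustration rather than the RapidMiner process. The helper names `save_kept_columns` and `apply_kept_columns`, the JSON file, and the synthetic data are all hypothetical; the saved list of column names stands in for the stored weights.

```python
import json
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def save_kept_columns(sample: pd.DataFrame, threshold: float, path: Path) -> list:
    """Stage 1: decide on a sample which columns survive, then persist their names."""
    corr = sample.corr().abs().fillna(0.0)  # NaN (constant column) treated as uncorrelated
    kept = []
    for col in corr.columns:
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    path.write_text(json.dumps(kept))
    return kept


def apply_kept_columns(full: pd.DataFrame, path: Path) -> pd.DataFrame:
    """Stage 2: load the saved names and keep only those columns of the full data."""
    kept = json.loads(path.read_text())
    return full[kept]


# Hypothetical data: "x_dup" duplicates "x", so it is 100% correlated with it.
rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = rng.normal(size=1000)
full = pd.DataFrame({"x": x, "x_dup": x, "y": y})

with tempfile.TemporaryDirectory() as tmp:
    names_file = Path(tmp) / "kept_columns.json"
    save_kept_columns(full.sample(n=200, random_state=0), 0.99, names_file)
    reduced = apply_kept_columns(full, names_file)
```

Because the second stage only selects columns by name, it never has to compute a correlation matrix on the full dataset, which is the point of working from a sample.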

Program Below