Applying an operation to a large example set

mikeb Member Posts: 5 Contributor II
edited November 2018 in Help
Hi,
I have an example set with 10,000 examples and 3,800 attributes.  These are document file names and the TF-IDF values for 3,800 terms in those documents.  I want to raise each TF-IDF value to the power of 0.75.  Is there a simple, fast way to do this?

What I have tried is looping over the attributes and, for each one, generating a new attribute equal to the TF-IDF value raised to the power of 0.75.  I then loop over the resulting collection and use the Recall, Join, and Remember operators to join each example set in the collection to the previous ones as I iterate.  The problem is that this slows down and eventually stalls or crashes as the iterations increase and the joined example set grows.  So I am wondering whether there is a more efficient way to do the (seemingly) simple thing of applying one operation to every value in the example set.

I should also mention that I looked at the Generate Function Set operator.  This looks like what I want, except that the specific operation I want to do is not included as one of the choices in that operator.

Thanks in advance for your help.

Answers

  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello mikeb

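    Groovy is the answer. Use the Script operator with this code:

    // Fetch the example set delivered to the Script operator's input port.
    ExampleSet exampleSet = operator.getInput(ExampleSet.class);

    // getAttributes() iterates over the regular attributes only, so special
    // attributes (ids, labels) are left untouched.
    for (Attribute attribute : exampleSet.getAttributes()) {
        // Skip anything that is not numeric, such as a file-name column.
        if (!attribute.isNumerical()) continue;
        String name = attribute.getName();
        for (Example example : exampleSet) {
            // Overwrite the value in place with value^0.75.
            example[name] = (example[name]) ** 0.75;
        }
    }

    return exampleSet;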
    I ran an experiment with 10,000 examples by 3,800 attributes and it took 2 minutes on my laptop. Obviously others' results may vary :)

    regards

    Andrew
  • mikeb Member Posts: 5 Contributor II
    Hi awchisholm,
    Thanks!  I think that will work for me.
    mikeb
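
To show why the in-place approach from the answer avoids the slowdown described in the question, here is a minimal plain-Java sketch of the same idea outside RapidMiner: every value is overwritten in a single pass, so no intermediate example sets are created or joined. The class name PowerTransform, the method raiseInPlace, and the small double[][] matrix standing in for the example set are illustrative assumptions, not part of RapidMiner's API.

```java
import java.util.Arrays;

public class PowerTransform {

    // Raises every cell to the given exponent, overwriting the values in place.
    // No copy of the matrix is made, so memory use does not grow with the
    // number of columns processed.
    public static void raiseInPlace(double[][] values, double exponent) {
        for (double[] row : values) {
            for (int j = 0; j < row.length; j++) {
                row[j] = Math.pow(row[j], exponent);
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a tiny TF-IDF table: 2 documents, 3 terms.
        double[][] tfidf = {
            {0.0, 1.0, 16.0},
            {81.0, 0.0, 256.0}
        };
        raiseInPlace(tfidf, 0.75);
        System.out.println(Arrays.deepToString(tfidf));
    }
}
```

For a table the size of the one in the question, this amounts to 38 million Math.pow calls over a single array. Zero TF-IDF values stay zero, since Math.pow returns zero for a zero base and a positive exponent.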