Combine documents + weighting

simon_knoll Member Posts: 40 Contributor II
edited June 2019 in Help
Hello dear RM Team,
it would be a cool feature if the Combine Documents operator had the capability to weight incoming documents (so that the terms of one document count more than those of others).

all the best,
simon

Answers

  • simon_knoll Member Posts: 40 Contributor II
    I wrote a quick implementation of this in the Combine Documents operator source code, which seems to be working. Any comments?
    /*
    *  RapidMiner
    *
    *  Copyright (C) 2001-2009 by Rapid-I and the contributors
    *
    *  Complete list of developers available at our web site:
    *
    *      http://rapid-i.com
    *
    *  This program is free software: you can redistribute it and/or modify
    *  it under the terms of the GNU Affero General Public License as published by
    *  the Free Software Foundation, either version 3 of the License, or
    *  (at your option) any later version.
    *
    *  This program is distributed in the hope that it will be useful,
    *  but WITHOUT ANY WARRANTY; without even the implied warranty of
    *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    *  GNU Affero General Public License for more details.
    *
    *  You should have received a copy of the GNU Affero General Public License
    *  along with this program.  If not, see http://www.gnu.org/licenses/.
    */
    package com.rapidminer.operator.text.io.transformer;

    import java.util.ArrayList;
    import java.util.List;

    import com.rapidminer.operator.Operator;
    import com.rapidminer.operator.OperatorDescription;
    import com.rapidminer.operator.OperatorException;
    import com.rapidminer.operator.Value;
    import com.rapidminer.operator.ports.InputPortExtender;
    import com.rapidminer.operator.ports.OutputPort;
    import com.rapidminer.operator.text.Document;
    import com.rapidminer.operator.text.Token;

    /**
    * This operator combines several documents by appending their content to a new
    * document. The meta data will be added from all documents but the values of
    * the first documents will be overwritten by the values of the following.
    *
    * @author Tobias Malbrecht, Sebastian Land
    */
    public class CombineDocumentsOperator extends Operator {

        private InputPortExtender documentInputPorts = new InputPortExtender(
                "documents", getInputPorts());

        private OutputPort documentOutput = getOutputPorts().createPort("document");

        public CombineDocumentsOperator(OperatorDescription description) {
            super(description);
            documentInputPorts.start();
            getTransformer().addGenerationRule(documentOutput, Document.class);
        }

        @Override
        public void doWork() throws OperatorException {
            List<Document> documents = documentInputPorts.getData(true);

            List<Token> tokens = new ArrayList<Token>();
            Document result = new Document(tokens);
            // Within this loop I inspect the label of each document. If it follows the
            // pattern <label>_weight_<weight>, I parse <weight> as a float and multiply
            // every token's weight by it.
            for (Document document : documents) {
                String label = (String) document.getMetaDataValue("label");
                // A document without a label is appended unweighted instead of
                // failing with a NullPointerException.
                String[] splitted = (label == null) ? new String[0] : label.split("_weight_");
                if (splitted.length > 1) {
                    float weight = Float.parseFloat(splitted[1]);
                    for (Token token : document.getTokenSequence()) {
                        tokens.add(new Token(token.getToken(), token.getWeight() * weight));
                    }
                } else {
                    tokens.addAll(document.getTokenSequence());
                }

                // Cosmetic only: strip the weight suffix from the label.
                if (splitted.length > 0) {
                    document.addMetaData("label", splitted[0],
                            document.getMetaDataType("label"));
                }

                result.addMetaData(document);
            }
            documentOutput.deliver(result);
        }
    }
  • fischer Member Posts: 439 Maven
    Hi,

    we have thought about this and think it is a good idea in general. However, assuming that you have something like "label_weight_0.7" in the annotations looks a bit weird. We should at least have a weight meta data or something similar that does not require this parsing operation. How are you constructing this string in your case?

    Best,
    Simon
  • simon_knoll Member Posts: 40 Contributor II
    Hi Simon,
    Doing the weighting within the label was the easiest way for me to integrate it into my program.
    Which string are you talking about?

    If you mean the string for the label, then it goes like this.

    First, a bit of context:
    I want to cluster web services, and for that I have documents related to each service. As not every document has the same importance, I have to weight them.

    Now, how I build the label name:
    The prefix is always the service ID, followed by "_weight_" and then a weight value such as 0.5,
    e.g.: SMSService01_weight_0.5

    all the best,
    simon
  • fischer Member Posts: 439 Maven
    Hi Simon,

    thanks for clarifying this. Actually, I was wondering which operator you are using to construct these strings. Is it an RM operator or your own implementation?

    Do you agree that this concatenation of strings is not the most elegant solution if we want to incorporate it into the release?

    Best,
    Simon
  • simon_knoll Member Posts: 40 Contributor II
    Hi Simon,

    The string is not constructed by a RapidMiner operator but by my own code, where I set the label names of the Create Document operators.

    But I agree with you that for a release there should be a more elegant and general way. Maybe a meta data field that can be set for every document, as you mentioned in your previous post.

    This was just a quick and dirty piece of code that fit into my own implementation. If this goes into a release, I would also appreciate being able to handle it via meta data, for instance.

    all the best,
    Simon

  • fischer Member Posts: 439 Maven
    Hi,

    if you change that so we have an additional meta data field "weight" which always contains a number, I would copy that to the next release. What do you think?

    Best,
    Simon
  • simon_knoll Member Posts: 40 Contributor II
    Hi Simon,
    sorry for the late answer. I would appreciate it if this feature made it into the next release.

    When will the next release happen?

    all the best
    simon
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    we will include weighting in the next major release of the Text Extension. There are many ongoing changes besides this, so it might take some time.

    Greetings,
      Sebastian
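
For readers following the thread, here is a minimal sketch of the direction discussed above: take the weight from a dedicated "weight" meta data field, and fall back to the "<label>_weight_<weight>" label suffix only if that field is absent. It is not the RapidMiner implementation. The classes WeightedToken and Doc and the methods resolveWeight and combine are hypothetical stand-ins invented for illustration, so the example runs without the Text Extension on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WeightedCombineSketch {

    /** Hypothetical stand-in for a weighted token; NOT the RapidMiner Token class. */
    static final class WeightedToken {
        final String text;
        final float weight;

        WeightedToken(String text, float weight) {
            this.text = text;
            this.weight = weight;
        }
    }

    /** Hypothetical stand-in for a document: a token list plus free-form meta data. */
    static final class Doc {
        final List<WeightedToken> tokens = new ArrayList<>();
        final Map<String, Object> metaData = new HashMap<>();
    }

    private static final String SUFFIX_MARKER = "_weight_";

    /**
     * Resolves the document weight: a numeric "weight" meta data field wins,
     * then a "<label>_weight_<weight>" label suffix, otherwise 1.0 (unweighted).
     */
    static float resolveWeight(Doc doc) {
        Object explicit = doc.metaData.get("weight");
        if (explicit instanceof Number) {
            return ((Number) explicit).floatValue();
        }
        Object label = doc.metaData.get("label");
        if (label instanceof String) {
            String s = (String) label;
            int idx = s.lastIndexOf(SUFFIX_MARKER);
            if (idx >= 0) {
                try {
                    return Float.parseFloat(s.substring(idx + SUFFIX_MARKER.length()));
                } catch (NumberFormatException e) {
                    // Malformed suffix: treat the document as unweighted.
                }
            }
        }
        return 1.0f;
    }

    /** Appends all tokens in input order, scaling each by its document's weight. */
    static List<WeightedToken> combine(List<Doc> docs) {
        List<WeightedToken> combined = new ArrayList<>();
        for (Doc doc : docs) {
            float weight = resolveWeight(doc);
            for (WeightedToken token : doc.tokens) {
                combined.add(new WeightedToken(token.text, token.weight * weight));
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        Doc viaField = new Doc();
        viaField.tokens.add(new WeightedToken("sms", 1.0f));
        viaField.metaData.put("weight", 0.5);

        Doc viaLabel = new Doc();
        viaLabel.tokens.add(new WeightedToken("gateway", 2.0f));
        viaLabel.metaData.put("label", "SMSService01_weight_0.5");

        for (WeightedToken token : combine(Arrays.asList(viaField, viaLabel))) {
            System.out.println(token.text + " " + token.weight);
        }
    }
}
```

Keeping the weight lookup in one small method means the label-parsing fallback could be dropped later without touching the combining loop, which is roughly the separation the numeric "weight" field proposal is aiming for.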