
strange behavior of replace tokens operator

simon_knoll Member Posts: 40 Contributor II
edited November 2018 in Help
Hello all,

I have a workflow containing a Create Document operator and a Process Documents operator. The Process Documents operator contains a Tokenize and a Replace Tokens operator. The Replace Tokens operator has the following rules:

replace est with Eastern_Time
replace dup with duplicate
replace hello with hallo

The vector creation parameter of Process Documents is set to Term Occurrences.

The text of the Create Document operator is:

est
dup
hello

The created word vector now contains
Eastern_Time
duplicate
hallo

And now comes the strange thing: Eastern_Time and duplicate have an occurrence of 0 and hallo has an occurrence of 1.

I expected a vector where each of the terms has an occurrence of 1.

If I exchange the Process Documents operator with the Process Documents from Files operator and write the words

est
dup
hello

in a text file, I get the expected behavior with a vector containing

Eastern_Time
duplicate
hallo

and every term has an occurrence of 1.

Is this a bug?
Am I doing something wrong?

All the best,
Simon

PS: here is the workflow with the Create Document operator:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.0">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.0.0" expanded="true" name="Process">
    <process expanded="true" height="811" width="435">
      <operator activated="true" class="text:create_document" compatibility="5.0.6" expanded="true" height="60" name="Create Document (8)" width="90" x="45" y="30">
        <parameter key="text" value="est&#10;dup&#10;hello"/>
        <parameter key="label_value" value="jmol"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.0.6" expanded="true" height="94" name="Process Documents (3)" width="90" x="315" y="30">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <parameter key="datamanagement" value="double_array"/>
        <process expanded="true" height="811" width="1068">
          <operator activated="true" class="text:tokenize" compatibility="5.0.7" expanded="true" height="60" name="Tokenize" width="90" x="246" y="30"/>
          <operator activated="true" class="text:replace_tokens" compatibility="5.0.6" expanded="true" height="60" name="Replace Tokens" width="90" x="514" y="30">
            <list key="replace_dictionary">
              <parameter key="est" value="Eastern_Time"/>
              <parameter key="dup" value="duplicate"/>
              <parameter key="hello" value="hallo"/>
            </list>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Replace Tokens" to_port="document"/>
          <connect from_op="Replace Tokens" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document (8)" from_port="output" to_op="Process Documents (3)" to_port="documents 1"/>
      <connect from_op="Process Documents (3)" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="90"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    thanks for this detailed report. I have found the problem: the documents delivered to the input ports were passed directly to the inner process. Since each document passes through the inner process twice, they were tokenized and replaced two times. Set a breakpoint before the Tokenize operator to see this effect (see also the sketch at the end of this thread).
    I have corrected this; it will be delivered with the next regular update.

    Greetings,
      Sebastian
  • simon_knoll Member Posts: 40 Contributor II
    Hi Sebastian,
    I was searching for a workaround today and tried this within the DocumentTextInputOperator:
    @Override
    protected Iterator<Document> getTextObjects() {
        List<Document> documents = documentInput.getData(true);
        // build fresh Document objects so the originals are not modified a second time
        ArrayList<Document> the_documents = new ArrayList<Document>();
        for (Document document : documents) {
            // recreate the document from its raw text and carry over the meta data
            Document myDocument = new Document(document.getText());
            myDocument.addMetaData(document);
            the_documents.add(myDocument);
        }
        return the_documents.iterator();
    }
    At first sight it worked, but I think I might be messing something up by doing it that way. Do you have advice for a hotfix? I need this feature really urgently :)

    All the best,
    Simon
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    try using this:
    @Override
    protected Iterator<Document> getTextObjects() {
        List<Document> documents = documentInput.getData(true);
        List<Document> clonedDocuments = new ArrayList<Document>(documents.size());
        for (Document document : documents) {
            // clone each incoming document so the inner process works on copies
            // and the originals delivered to the input port stay untouched
            clonedDocuments.add(new Document(document.getTokenSequence(), document));
        }
        return clonedDocuments.iterator();
    }
    Think about becoming an enterprise customer, then you would already have a new release :)

    Greetings,
      Sebastian
  • simon_knoll Member Posts: 40 Contributor II
    Hi Sebastian,
    thank you!!!
    I'll give it a try.

    All the best, Simon
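
Below is a minimal standalone sketch of the double-processing effect described in the first answer above. It is plain Java, not the RapidMiner Text Processing API, and it assumes for illustration that the tokenizer splits on non-letter characters and that Replace Tokens substitutes matching substrings inside each token; both points are assumptions, not documented behavior. Under these assumptions, a second tokenize/replace pass over the already-processed tokens yields terms that no longer match the first-pass results: Eastern_Time is split at the underscore and duplicate has its inner "dup" replaced again, while hallo stays unchanged, which is consistent with the reported occurrences of 0, 0 and 1.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not RapidMiner code: shows what happens when the same
// tokens are run through a tokenize + replace pass twice instead of once.
public class DoublePassSketch {

    // the replace dictionary from the workflow above
    static final Map<String, String> DICT = new LinkedHashMap<>();
    static {
        DICT.put("est", "Eastern_Time");
        DICT.put("dup", "duplicate");
        DICT.put("hello", "hallo");
    }

    // one pass: split every token on non-letter characters, then replace
    // every dictionary key occurring inside the resulting pieces
    static List<String> pass(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            for (String piece : token.split("[^A-Za-z]+")) {
                if (piece.isEmpty()) {
                    continue;
                }
                for (Map.Entry<String, String> rule : DICT.entrySet()) {
                    piece = piece.replace(rule.getKey(), rule.getValue());
                }
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("est", "dup", "hello");
        List<String> once = pass(raw);    // [Eastern_Time, duplicate, hallo]
        List<String> twice = pass(once);  // [Eastern, Time, duplicatelicate, hallo]
        System.out.println("after one pass:   " + once);
        System.out.println("after two passes: " + twice);
    }
}

The actual fix, as noted in the answers, is to hand the inner process copies of the incoming documents instead of the originals, so that a second traversal cannot change them.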