Integration of Process Documents operator

Karahedra Member Posts: 6 Contributor II
edited November 2018 in Help
Hello,
I'm trying to use a Process Documents operator in a Java application. The preceding operators deliver data to its input ports correctly, but I can't obtain any processed data from it.

Here is the code I used to initialize and run it:

processer = OperatorService.createOperator("process_documents");
processer.setEnabled(true);
processer.setParameter("vector_creation", "Term Occurrences");
processer.setParameter("create_word_vector", "true");
processer.setParameter("add_meta_information", "true");
processer.setParameter("keep_text", "false");

// this section is repeated several times in the actual code
filter.getOutputPorts().getPortByIndex(0).connectTo(processer.getInputPorts().getPortByIndex(incounter));
filter.doWork();
processer.getInputPorts().getPortByIndex(incounter).receive(filter.getOutputPorts().getPortByIndex(0).getAnyDataOrNull());
// end of loop

processer.getOutputPorts().getPortByIndex(1).connectTo(transformer.getInputPorts().getPortByIndex(0));
processer.doWork();
       
I'd appreciate any help that reduces my enormous newbieness. Thanks to everyone.
Andrea

Answers

  • haddock Member Posts: 849 Maven
    Greets Andrea,

    The answer depends on what you are trying to do. If you want to make an operator that you can integrate with the RM IDE, then you need to either take a look at the source of an existing extension or buy the white paper. In that case you will see that you need to explicitly deliver output to the ports, like this...
    @Override
    public void doWork() throws OperatorException {
        H_data input = hDataInput.getData();
        decode(input);
        hDataOutput.deliver(hDataInput.getData());
    }
    If, on the other hand, you want to embed RM in another application, then you can barbarise the code as you see fit.

    Good luck!
  • Karahedra Member Posts: 6 Contributor II
    Hello,
    I'm trying to embed RM in another application, but the operator I'm using (Process Documents, from the Text Processing extension) doesn't seem to work correctly: instead of delivering a word vector built from the documents I feed it to the output port, it delivers only an empty vector.
    I think I'm missing some initialization step, so it doesn't actually process anything, but I can't figure out which one...
    Barbaric code is something I produce with a decent bit of enthusiasm, but when my abuses stop giving acceptable results I tend to go back to the experts begging for some advice :)
  • haddock Member Posts: 849 Maven
    Hi again,

    As I remember it, the 'process documents' operator needs inner operators to do the dirty work, like tokenizing and stemming. If you make a process in RM with no inner operators, you guessed it... zippo comes back. For example, if I run the following, I get some data back (just by connecting the inner input directly to the inner output, so the documents simply pass through).
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
        <process expanded="true" height="399" width="886">
          <operator activated="true" class="read_database" compatibility="5.0.8" expanded="true" height="60" name="Read Database" width="90" x="63" y="24">
            <parameter key="connection" value="DellBoy"/>
            <parameter key="query" value="SELECT &quot;Content&quot;, &quot;Link&quot;&#13;&#10;FROM &quot;RSS&quot; where &quot;Content&quot; is not NULL"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.0.8" expanded="true" height="76" name="Set Role" width="90" x="246" y="30">
            <parameter key="name" value="Link"/>
            <parameter key="target_role" value="id"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.0.5" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="30">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <list key="specify_weights"/>
            <process expanded="true" height="399" width="886">
              <connect from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Database" from_port="output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    whereas when the inner input is not connected for pass-through, nothing comes back, i.e. when it is like this...
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
        <process expanded="true" height="399" width="886">
          <operator activated="true" class="read_database" compatibility="5.0.8" expanded="true" height="60" name="Read Database" width="90" x="63" y="24">
            <parameter key="connection" value="DellBoy"/>
            <parameter key="query" value="SELECT &quot;Content&quot;, &quot;Link&quot;&#13;&#10;FROM &quot;RSS&quot; where &quot;Content&quot; is not NULL"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="5.0.8" expanded="true" height="76" name="Set Role" width="90" x="246" y="30">
            <parameter key="name" value="Link"/>
            <parameter key="target_role" value="id"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.0.5" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="30">
            <parameter key="vector_creation" value="Term Occurrences"/>
            <list key="specify_weights"/>
            <process expanded="true" height="399" width="886">
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Database" from_port="output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    Which I guess means you might be better off calling the inner operator directly?

  • Karahedra Member Posts: 6 Contributor II
    Yes, I think you're right, and that must be the step I was missing, but I haven't found a way to access the inner operators or to do any kind of wiring inside Process Documents through Java code...
    Again, thanks for the assistance and for the quick answers.
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    to add operators to so-called super operators, use the following code fragment:
    // superOperator is the operator chain (e.g. the Process Documents operator);
    // operator is the inner operator to add to its first subprocess
    OperatorChain superOperator;
    superOperator.getSubprocess(0).addOperator(operator);
    In any case, I would suggest taking a look at the API documentation, where this is described.


    Greetings,
      Sebastian
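haddock's two process files above differ only in whether the inner <process> of Process Documents from Data contains a <connect> element. As a rough diagnostic, a saved process file can be scanned for nested operators whose inner subprocess has no connections at all. The sketch below uses only the JDK's XML parser; the class name InnerWiringCheck, its methods, and the trimmed sample XML are made up for illustration and are not part of RapidMiner. It only detects a completely unwired subprocess, so a subprocess that is wired incorrectly would still pass.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of RapidMiner: inspects process XML only.
public class InnerWiringCheck {

    // Names of operators whose inner <process> has no direct <connect> child.
    static List<String> findUnwiredChains(String processXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(processXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse process XML", e);
        }
        List<String> unwired = new ArrayList<>();
        NodeList operators = doc.getElementsByTagName("operator");
        for (int i = 0; i < operators.getLength(); i++) {
            Element op = (Element) operators.item(i);
            Element inner = directChild(op, "process");
            // Operators without an inner <process> are not chains; skip them.
            if (inner != null && directChild(inner, "connect") == null) {
                unwired.add(op.getAttribute("name"));
            }
        }
        return unwired;
    }

    // First direct child element with the given tag name, or null if none.
    static Element directChild(Element parent, String tag) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && tag.equals(n.getNodeName())) {
                return (Element) n;
            }
        }
        return null;
    }

    // Trimmed stand-in for the two process files in the thread (assumption:
    // attributes not needed for the check are omitted).
    static String sampleProcess(boolean wired) {
        String innerConnect = wired
                ? "<connect from_port='document' to_port='document 1'/>"
                : "";
        return "<process version='5.0'>"
                + "<operator class='process' name='Process'>"
                + "<process>"
                + "<operator class='text:process_document_from_data'"
                + " name='Process Documents from Data'>"
                + "<process>" + innerConnect + "</process>"
                + "</operator>"
                + "<connect from_op='Process Documents from Data'"
                + " from_port='example set' to_port='result 1'/>"
                + "</process>"
                + "</operator>"
                + "</process>";
    }

    public static void main(String[] args) {
        System.out.println(findUnwiredChains(sampleProcess(true)));  // prints []
        System.out.println(findUnwiredChains(sampleProcess(false))); // prints [Process Documents from Data]
    }
}
```

When embedding RM, the same condition applies to an operator built in Java: a freshly created Process Documents operator starts with an empty subprocess, which matches the second process file rather than the first.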