Get the start and end index of every word in a sentence

Teja_Varanasi Member Posts: 17 Contributor II
Hi, I am trying to generate two new columns that give the start and end index of every word in a sentence. There is only one sentence.
Ex:      This is an apple
         |  | |  |  |   |
index:   0  3 5  8  11  15

Word  | Start | End
This  | 0     | 3
is    | 5     | 6
an    | 8     | 9
apple | 11    | 15


Can anyone please help me here?

Answers

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi!

    You could use the Split operator with a space (or a more complex regular expression for separating the words) to put each word into its own attribute.
    Then use Loop Attributes to process each attribute and determine its start and end position from the word length. Inside the loop, you can use macros to keep track of the current position.

    Here is a solution:
    <?xml version="1.0" encoding="UTF-8"?><process version="9.10.013">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.10.013" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="-1"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="utility:create_exampleset" compatibility="9.10.013" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
            <parameter key="generator_type" value="comma separated text"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="input_csv_text" value="text&#10;This is an apple."/>
            <parameter key="column_separator" value=","/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="split" compatibility="9.10.013" expanded="true" height="82" name="Split" width="90" x="246" y="34">
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="split_pattern" value=" "/>
            <parameter key="split_mode" value="ordered_split"/>
          </operator>
          <operator activated="true" class="set_macro" compatibility="9.10.013" expanded="true" height="82" name="Set Macro" width="90" x="380" y="34">
            <parameter key="macro" value="counter"/>
            <parameter key="value" value="0"/>
          </operator>
          <operator activated="true" class="concurrency:loop_attributes" compatibility="9.10.013" expanded="true" height="103" name="Loop Attributes" width="90" x="514" y="34">
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="attribute_name_macro" value="attr"/>
            <parameter key="reuse_results" value="false"/>
            <parameter key="enable_parallel_execution" value="false"/>
            <process expanded="true">
              <operator activated="true" class="extract_macro" compatibility="9.10.013" expanded="true" height="68" name="Extract Macro" width="90" x="45" y="34">
                <parameter key="macro" value="word"/>
                <parameter key="macro_type" value="data_value"/>
                <parameter key="statistics" value="average"/>
                <parameter key="attribute_name" value="%{attr}"/>
                <parameter key="example_index" value="1"/>
                <list key="additional_macros"/>
              </operator>
              <operator activated="true" class="generate_macro" compatibility="9.10.013" expanded="true" height="82" name="Calculate end" width="90" x="179" y="34">
                <list key="function_descriptions">
                  <parameter key="end" value="eval(%{counter}) + length(%{word}) - 1"/>
                </list>
              </operator>
              <operator activated="true" class="utility:create_exampleset" compatibility="9.10.013" expanded="true" height="68" name="Create ExampleSet (2)" width="90" x="313" y="136">
                <parameter key="generator_type" value="comma separated text"/>
                <parameter key="number_of_examples" value="100"/>
                <parameter key="use_stepsize" value="false"/>
                <list key="function_descriptions"/>
                <parameter key="add_id_attribute" value="false"/>
                <list key="numeric_series_configuration"/>
                <list key="date_series_configuration"/>
                <list key="date_series_configuration (interval)"/>
                <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
                <parameter key="time_zone" value="SYSTEM"/>
                <parameter key="input_csv_text" value="Word,Start,End&#10;%{word},%{counter},%{end}"/>
                <parameter key="column_separator" value=","/>
                <parameter key="parse_all_as_nominal" value="false"/>
                <parameter key="decimal_point_character" value="."/>
                <parameter key="trim_attribute_names" value="true"/>
              </operator>
              <operator activated="true" class="generate_macro" compatibility="9.10.013" expanded="true" height="82" name="Count to the next word" width="90" x="514" y="136">
                <list key="function_descriptions">
                  <parameter key="counter" value="eval(%{end}) + 2"/>
                </list>
              </operator>
              <connect from_port="input 1" to_op="Extract Macro" to_port="example set"/>
              <connect from_op="Extract Macro" from_port="example set" to_op="Calculate end" to_port="through 1"/>
              <connect from_op="Calculate end" from_port="through 1" to_port="output 1"/>
              <connect from_op="Create ExampleSet (2)" from_port="output" to_op="Count to the next word" to_port="through 1"/>
              <connect from_op="Count to the next word" from_port="through 1" to_port="output 2"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
              <portSpacing port="sink_output 3" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="9.10.013" expanded="true" height="82" name="Append" width="90" x="648" y="85">
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
            <parameter key="merge_type" value="all"/>
          </operator>
          <connect from_op="Create ExampleSet" from_port="output" to_op="Split" to_port="example set input"/>
          <connect from_op="Split" from_port="example set output" to_op="Set Macro" to_port="through 1"/>
          <connect from_op="Set Macro" from_port="through 1" to_op="Loop Attributes" to_port="input 1"/>
          <connect from_op="Loop Attributes" from_port="output 2" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>



    This is a very limited solution. It expects the separator to be exactly one character long. Before passing data into this process, remove any characters that you don't want to count (e.g. the dot at the end of the sentence).
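
For comparison, here is a minimal Python sketch of one way around both limitations. It is not part of the RapidMiner process above. It assumes that a "word" is a run of letters, digits, or underscores (the regular expression `\w+`), so punctuation and repeated spaces are skipped instead of counted. The function name `word_spans` is made up for this illustration.

```python
import re


def word_spans(sentence):
    """Return (word, start, end) tuples with 0-based, inclusive end indices."""
    spans = []
    # \w+ matches runs of letters, digits, and underscores, so punctuation
    # and any amount of whitespace between words are skipped, not counted.
    for match in re.finditer(r"\w+", sentence):
        # match.end() is exclusive; subtract 1 to get the inclusive end index.
        spans.append((match.group(), match.start(), match.end() - 1))
    return spans


for word, start, end in word_spans("This is an apple."):
    print(word, start, end)
```

Because the positions come from the regular expression matches rather than from a running counter, the trailing dot does not change the end index of "apple", and extra spaces only shift the words that follow them.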

    You should be able to take it from here.

    Regards,

    Balázs
  • Teja_Varanasi Member Posts: 17 Contributor II
    Hi, thank you, your solution is awesome. What I am actually doing is sending the example set to the NLP tagger, and I want to get the start and end index of each word there. That is proving difficult. Can you please help with that?
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi,

    I don't understand your question. Which kind of system is this, what input does it expect? 

    You already have the solution for getting the start and end index of the words. Do you need to pass those? 

    Regards,
    Balázs
  • Teja_Varanasi Member Posts: 17 Contributor II
    I am working with an NER model, so I used the NLP tagger. I want to get the NLP tagger result together with the start and end index of each word.
  • rdesai Employee, RMResearcher, Member Posts: 15 RM Research
    Hi Teja, unfortunately the NLP Tagger doesn't have that functionality implemented inside the operator, but if you like, you can write a short Python script calling a function that retrieves word indices. Hope that helps!
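  • Mia_Smith Member Posts: 6 Contributor I
    sentence = "This is an apple"
    start = 0  # 0-based index where the current word begins
    for word in sentence.split():
        # Inclusive end index of the current word.
        end = start + len(word) - 1
        print(word, start, end)
        # Skip the single space that follows the word.
        start = end + 2

    This will give you the start and end indices for each word, as long as the words are separated by exactly one space.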



