Is it possible to extract definitions of the same concept from different papers?

AndreriwAndreriw Member Posts: 4 Contributor I
edited August 2021 in Help
I was thinking of something like this:

1. Start from Word-like documents.
2. Search for a keyword "X".
3. Extract the text with the main words associated with the keyword.
4. Is there a template for this here?

Thank you

Answers

  • rfuentealbarfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    It is possible. There is an extension named "Text Mining" that allows you to read Word documents.

    Also in this thread, I created a building block for such a task:

    Import a Word document to Rapidminer — RapidMiner Community

    The rest is a matter of playing with Transform Cases, Tokenize, and a bit more searching, but it's doable. I believe there is a template for basic NLP, though as far as I know it isn't complete.
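    Outside RapidMiner, the Transform Cases + Tokenize steps boil down to something like this plain-Python sketch (the `tokenize` function name is just illustrative, not part of any extension):

```python
import re

def tokenize(text):
    # Transform Cases: lowercase the text; Tokenize: split into word tokens.
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("Extract definitions of the SAME concept."))
```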
  • MarcoBarradasMarcoBarradas Administrator, Employee, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
    Hi @Andreriw,

    Adding to what Rodrigo shared, you can use this basic process to continue your analysis.

    You can also check the new CoreNLP extension; besides the entities it extracts out of the box, you can extend it with your own entity definitions.

    https://youtu.be/XvGUFOU1vcI


    <?xml version="1.0" encoding="UTF-8"?><process version="9.10.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.10.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="-1"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="utility:create_exampleset" compatibility="9.10.000" expanded="true" height="68" name="Texts" width="90" x="45" y="34">
            <parameter key="generator_type" value="comma separated text"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="input_csv_text" value="text&#10;Among the best-known birds are the birds of prey, such as hawks, eagles, ospreys, falcons, and owls. They have hooked beaks, strong talons or claws on their feet, and keen eyesight and hearing. ----. Ospreys and many eagles eat fish, falcons eat mostly insects, and owls eat everything from insects to fish and mammals.&#10;The jaguar is sometimes called El Tigre by South and Central Americans. ----. Both names convey the awe and reverence this largest New World cat inspires. Their gold coat spangled with black rosettes was said to be the stars of night. In the Mayan religion, the sun took the form of a jaguar when traveling through the underworld at night.&#10;Monarch butterflies travel long distances to stay warm. They fly up to 3,000 miles to the same winter roosts, sometimes to the exact same trees. However, their life span is only a few months. ----. Their great-great-grandchildren return south the following fall."/>
            <parameter key="column_separator" value=";"/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="9.10.000" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
            <parameter key="attribute_filter_type" value="all"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
          </operator>
          <operator activated="true" class="remember" compatibility="9.10.000" expanded="true" height="68" name="Remember" width="90" x="380" y="34">
            <parameter key="name" value="DS"/>
            <parameter key="io_object" value="ExampleSet"/>
            <parameter key="store_which" value="1"/>
            <parameter key="remove_from_process" value="true"/>
          </operator>
          <operator activated="true" class="utility:create_exampleset" compatibility="9.10.000" expanded="true" height="68" name="Key Words" width="90" x="45" y="187">
            <parameter key="generator_type" value="comma separated text"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="input_csv_text" value="word&#10;Birds&#10;Jaguar&#10;Monarch Butterflies"/>
            <parameter key="column_separator" value=";"/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="loop_examples" compatibility="9.10.000" expanded="true" height="103" name="Loop Examples" width="90" x="313" y="187">
            <parameter key="iteration_macro" value="example"/>
            <process expanded="true">
              <operator activated="true" class="extract_macro" compatibility="9.10.000" expanded="true" height="68" name="Extract Macro" width="90" x="45" y="34">
                <parameter key="macro" value="word"/>
                <parameter key="macro_type" value="data_value"/>
                <parameter key="statistics" value="average"/>
                <parameter key="attribute_name" value="word"/>
                <parameter key="example_index" value="%{example}"/>
                <list key="additional_macros"/>
              </operator>
              <operator activated="true" class="recall" compatibility="9.10.000" expanded="true" height="68" name="Recall" width="90" x="179" y="85">
                <parameter key="name" value="DS"/>
                <parameter key="io_object" value="ExampleSet"/>
                <parameter key="remove_from_store" value="false"/>
              </operator>
              <operator activated="true" class="text:process_document_from_data" compatibility="9.3.001" expanded="true" height="82" name="Process Documents from Data" width="90" x="380" y="34">
                <parameter key="create_word_vector" value="false"/>
                <parameter key="vector_creation" value="TF-IDF"/>
                <parameter key="add_meta_information" value="true"/>
                <parameter key="keep_text" value="true"/>
                <parameter key="prune_method" value="none"/>
                <parameter key="prune_below_percent" value="3.0"/>
                <parameter key="prune_above_percent" value="30.0"/>
                <parameter key="prune_below_rank" value="0.05"/>
                <parameter key="prune_above_rank" value="0.95"/>
                <parameter key="datamanagement" value="double_sparse_array"/>
                <parameter key="data_management" value="auto"/>
                <parameter key="select_attributes_and_weights" value="false"/>
                <list key="specify_weights"/>
                <process expanded="true">
                  <operator activated="true" class="text:tokenize" compatibility="9.3.001" expanded="true" height="68" name="Tokenize" width="90" x="246" y="34">
                    <parameter key="mode" value="linguistic sentences"/>
                    <parameter key="characters" value=".:"/>
                    <parameter key="language" value="English"/>
                    <parameter key="max_token_length" value="3"/>
                    <description align="center" color="yellow" colored="true" width="126">Tokenize each sentence</description>
                  </operator>
                  <operator activated="true" class="text:filter_tokens_by_content" compatibility="9.3.001" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="380" y="34">
                    <parameter key="condition" value="contains"/>
                    <parameter key="string" value="%{word}"/>
                    <parameter key="case_sensitive" value="false"/>
                    <parameter key="invert condition" value="false"/>
                    <description align="center" color="yellow" colored="true" width="126">Here you use the key word you are using as a filter</description>
                  </operator>
                  <operator activated="true" class="text:extract_length" compatibility="9.3.001" expanded="true" height="68" name="Extract Length" width="90" x="514" y="34">
                    <parameter key="metadata_key" value="document_length"/>
                  </operator>
                  <connect from_port="document" to_op="Tokenize" to_port="document"/>
                  <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
                  <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Extract Length" to_port="document"/>
                  <connect from_op="Extract Length" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_document" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
                <description align="center" color="yellow" colored="true" width="126">Inside we tokenize into sentences, filter the ones that contain our keyword, and extract the length (how many characters each row has)</description>
              </operator>
              <operator activated="true" class="generate_attributes" compatibility="9.10.000" expanded="true" height="82" name="Generate Attributes" width="90" x="581" y="34">
                <list key="function_descriptions">
                  <parameter key="keyword" value="%{word}"/>
                </list>
                <parameter key="keep_all" value="true"/>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="9.10.000" expanded="true" height="103" name="Filter Examples" width="90" x="715" y="34">
                <parameter key="parameter_expression" value=""/>
                <parameter key="condition_class" value="custom_filters"/>
                <parameter key="invert_filter" value="false"/>
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="document_length.ne.0"/>
                </list>
                <parameter key="filters_logic_and" value="true"/>
                <parameter key="filters_check_metadata" value="true"/>
                <description align="center" color="yellow" colored="true" width="126">We keep only the rows that have sentences with our keyword</description>
              </operator>
              <connect from_port="example set" to_op="Extract Macro" to_port="example set"/>
              <connect from_op="Recall" from_port="result" to_op="Process Documents from Data" to_port="example set"/>
              <connect from_op="Process Documents from Data" from_port="example set" to_op="Generate Attributes" to_port="example set input"/>
              <connect from_op="Generate Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
              <connect from_op="Filter Examples" from_port="example set output" to_port="output 1"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_example set" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
            <description align="center" color="yellow" colored="true" width="126">Loop through the keywords list</description>
          </operator>
          <connect from_op="Texts" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Remember" to_port="store"/>
          <connect from_op="Key Words" from_port="output" to_op="Loop Examples" to_port="example set"/>
          <connect from_op="Loop Examples" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
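
    For readers who want the gist of the process above without RapidMiner: it splits each text into sentences, keeps only the sentences containing the current keyword, and drops the rows where nothing matched. A rough plain-Python equivalent (the function name and the sentence-splitting regex are illustrative, not taken from the process):

```python
import re

def keyword_sentences(text, keyword):
    """Split a text into sentences and keep those containing the keyword,
    case-insensitively -- mirroring Tokenize (linguistic sentences)
    followed by Filter Tokens (by Content)."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if keyword.lower() in s.lower()]

texts = [
    "Among the best-known birds are the birds of prey. They have hooked beaks.",
    "Monarch butterflies travel long distances. Their life span is only a few months.",
]
keywords = ["Birds", "Monarch Butterflies"]

# Like the Loop Examples operator: one pass per keyword.
for kw in keywords:
    for text in texts:
        hits = keyword_sentences(text, kw)
        if hits:  # Filter Examples: keep rows with document_length != 0
            print(kw, "->", hits)
```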

