Text analysis of single words

BuggiaBuggia Member Posts: 4 Newbie
Hi everyone. I am struggling with a text analysis. I have completed all the steps to transform and tokenize my documents. Now I need to find the words that are "related" to other specific words. For example, I want to find, across all my documents, all the words that come after the words "I", "we" and "you".
I tried many different operators but I can't come up with a solution.
Thanks for your help.

Best Answers

  • Options
    BalazsBaranyBalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Solution Accepted
    Hi Buggia!

    You could try creating "term n-grams" with n = 2. This would give you all combinations of "I word", "we word" etc. Then you would filter for the terms with the prefixes you're interested in (I, we, ...) and extract the word after the separator (the generated n-grams join the two words with an underscore, as in "I_word").

    Here's an example process:
    <?xml version="1.0" encoding="UTF-8"?><process version="9.10.001">
      <operator activated="true" class="process" compatibility="9.10.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="-1"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="9.4.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
            <parameter key="text" value="This is my silly text with some combinations of &quot;I am&quot;, &quot;I will&quot;, &quot;I won't&quot;, &#10;&quot;we had&quot;, &quot;we have&quot; and &quot;we don't have&quot;. And again &quot;I am&quot;.&#10;"/>
            <parameter key="add label" value="false"/>
            <parameter key="label_type" value="nominal"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="9.4.000" expanded="true" height="103" name="Process Documents" width="90" x="179" y="34">
            <parameter key="create_word_vector" value="true"/>
            <parameter key="vector_creation" value="TF-IDF"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="keep_text" value="true"/>
            <parameter key="prune_method" value="none"/>
            <parameter key="prune_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_rank" value="0.05"/>
            <parameter key="prune_above_rank" value="0.95"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="9.4.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34">
                <parameter key="mode" value="non letters"/>
                <parameter key="characters" value=".:"/>
                <parameter key="language" value="English"/>
                <parameter key="max_token_length" value="3"/>
              </operator>
              <operator activated="true" class="text:generate_n_grams_terms" compatibility="9.4.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="246" y="34">
                <parameter key="max_length" value="2"/>
              </operator>
              <operator activated="true" class="text:filter_tokens_by_content" compatibility="9.4.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="380" y="34">
                <parameter key="condition" value="matches"/>
                <parameter key="regular_expression" value="^(I_|we_).+"/>
                <parameter key="case_sensitive" value="false"/>
                <parameter key="invert condition" value="false"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
              <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
              <connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
          <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>

  • Options
    BuggiaBuggia Member Posts: 4 Newbie
    Solution Accepted
    Hi BalazsBarany
    Thank you for your kind answer. Since I am not very familiar with coding languages, could you please explain it to me in terms of the "operators" involved in the process?
    Thank you again for your help.
  • Options
    BalazsBaranyBalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Solution Accepted
    Hi Buggia!

    The first operator just creates a document with an example text. Its output goes to "Process Documents". This is a container for additional operators to be executed inside.

    Tokenize splits the text into single tokens (words) at "word boundaries" such as spaces and punctuation.
    Generate n-Grams (Terms) joins each pair of neighbouring tokens into one term, separated by an underscore, for example "I_am". (There's also Generate n-Grams (Characters), which does the same for the characters inside the words.)
    Filter Tokens (by Content) keeps the generated "tokens" (the n-grams) that match a regular expression. Here I used ^(I_|we_).+ to match I or we as the word at the beginning of the token. These are the words you are searching for. To extend the regular expression, add your term inside the parentheses with the pipe | as the separator, for example ^(I_|we_|you_).+.

    And that's it. The wordlist output contains the combinations found in the text and their frequency.
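
    As a rough illustration outside RapidMiner, the same three steps can be sketched in Python. This is only a sketch under assumptions: the function name following_word_counts and the letters-only tokenizer are made up for this example, and the RapidMiner operators may treat edge cases differently.

```python
import re
from collections import Counter

# Example text from the process above.
SAMPLE = '''This is my silly text with some combinations of "I am", "I will", "I won't",
"we had", "we have" and "we don't have". And again "I am".'''


def following_word_counts(text, prefixes=("I", "we")):
    """Count the two-word terms whose first word is one of the prefixes."""
    # Step 1 (Tokenize, "non letters" mode): keep runs of letters only.
    tokens = re.findall(r"[A-Za-z]+", text)
    # Step 2 (Generate n-Grams (Terms), length 2): join neighbours with "_".
    bigrams = [f"{first}_{second}" for first, second in zip(tokens, tokens[1:])]
    # Step 3 (Filter Tokens (by Content)): same shape as ^(I_|we_).+,
    # matched case-insensitively like case_sensitive = false.
    alternatives = "|".join(re.escape(p) + "_" for p in prefixes)
    pattern = re.compile(rf"^({alternatives}).+", re.IGNORECASE)
    return Counter(b for b in bigrams if pattern.fullmatch(b))


counts = following_word_counts(SAMPLE)
# Split each kept term at the underscore to get the word after the prefix.
for term, n in counts.most_common():
    prefix, word = term.split("_", 1)
    print(prefix, word, n)
```

    Passing prefixes=("I", "we", "you") corresponds to adding you_ inside the parentheses of the regular expression.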

    BTW, every operator has extensive documentation in the Help tab in Studio.



  • Options
    BuggiaBuggia Member Posts: 4 Newbie
    It works. Thank you so much. You are amazing.
  • Options
    BuggiaBuggia Member Posts: 4 Newbie
    I am reopening this topic because I have another question about this process. The procedure works just fine and I managed to obtain my results. But I don't understand how the software assigns a certain value to a specific set of words. For example, the set "I_bought" is equal to 0.303 in row number 295 and 0.278 in row number 191. What do these numbers refer to?
  • Options
    BalazsBaranyBalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

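    The default way to create attributes in a text mining context is TF-IDF: Term Frequency, Inverse Document Frequency.
    Term Frequency: how often a word (token) occurs in a document.
    Inverse Document Frequency: a weight based on how many documents contain the word (token); the more documents it appears in, the lower the weight.
    Each row is one document, so the same token can get a different value in different rows.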

    You can select another method in the "vector creation" parameter of "Process Documents". For example, Term Occurrences just gives you the number of times the token occurs in each document.
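
    To make the difference concrete, here is a small Python sketch with made-up documents. It assumes the textbook formula (term count divided by document length, times the logarithm of the number of documents divided by the number of documents containing the term); RapidMiner's exact weighting and normalization may differ, so do not expect these numbers to match the ones in your table.

```python
import math
from collections import Counter

# Hypothetical documents, already reduced to filtered two-word terms.
docs = [
    ["I_bought", "I_like", "we_went"],
    ["I_bought", "we_had", "we_have", "we_went"],
    ["we_went", "I_like"],
]


def tf_idf(documents):
    """Return one {term: weight} dict per document (textbook TF-IDF)."""
    counts = [Counter(doc) for doc in documents]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(documents)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in c.items()
        })
    return weights


weights = tf_idf(docs)
# Term occurrences: the plain count, the same in both documents.
print(Counter(docs[0])["I_bought"], Counter(docs[1])["I_bought"])
# TF-IDF: differs per document because the documents differ in length.
print(weights[0]["I_bought"], weights[1]["I_bought"])
# A term found in every document gets weight 0 under this formula.
print(weights[0]["we_went"])
```

    In this sketch "I_bought" occurs once in each of the first two documents, yet its TF-IDF value is higher in the shorter one.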

    The word list output always contains the absolute numbers, which is why I recommended using it. There's an operator "WordList to Data" for converting the special table to a normal one, for example for further processing or for putting the contents into a database.
