Text analysis, word count

dyna Member Posts: 2 Contributor I
edited November 2018 in Help
Hello,

I am trying to count the occurrences of specific words in PDF files, which works fine in general: I use Create Document and Process Documents to build the list of words I am looking for, and Process Documents from Files to read in the PDF files. But I have two specific questions:

1. How can I count not only the exact word but all words starting with a given expression? For example, I want all words starting with "risk", so it should count "risk", "risks", "risky" and so on, all together.

2. How can I count two specific words in a row? For example, I want to count all occurrences of "liquidity risk", not "liquidity" or "risk" alone. Those occurrences of "risk" should then not be added to the first search for all words starting with "risk".

Thank you so much in advance for your help!!

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,

    For 1: have a look at Stem (Dictionary); that should help.

    For 2: I guess a simple replace dictionary would do the trick? Otherwise I would recommend using 2_grams.

    ~Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
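
To make the two counting rules concrete outside RapidMiner, here is a minimal Python sketch. It is not the Stem (Dictionary) or n-gram operator; the function names, the sample sentence and the `risk*` label are made up for illustration. It counts every word starting with a prefix, and counts the two-word phrase separately so that its "risk" is not added to the prefix count.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def count_prefix_and_phrase(text, prefix="risk", phrase=("liquidity", "risk")):
    """Count the exact two-word phrase and, separately, words starting with prefix."""
    tokens = tokenize(text)
    counts = Counter()
    i = 0
    while i < len(tokens):
        # Check the phrase first and skip both words when it matches,
        # so the second word is not counted again as a prefix match.
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == phrase:
            counts[" ".join(phrase)] += 1
            i += 2
            continue
        if tokens[i].startswith(prefix):
            counts[prefix + "*"] += 1
        i += 1
    return counts

sample = "Liquidity risk is one risk. Risky assets carry other risks."
result = count_prefix_and_phrase(sample)
print(result)
```

In this sketch the phrase is matched exactly; matching "liquidity risks" as well would need a prefix test on the second word.
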
  • dyna Member Posts: 2 Contributor I
    Hello,

    thank you very much for your quick and helpful response!

    Stem (Dictionary) really solved my first problem.

    But I am still having trouble searching for two words in a row, even with 2_grams. Here is a concrete example: I am looking for the term reputation.* risk.* (reputational risk, reputational risks, reputation risk, reputation risks). I create a document with the words "reputation" and "risk" and process it (tokenize, transform cases and generate n-grams). Then I process the PDF files (tokenize, transform cases, generate n-grams and stem (dictionary), so that everything starting with reputation is reduced to reputation and everything starting with risk is reduced to risk). The problem is that the output shows 0 counts for "reputation risk", although the counts for "reputation" and "risk" alone work.

    Do you have any suggestions how I can alter/fix the process so that it shows me the right number for "reputation.* risk.*"?

    Thank you so much for your efforts!

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist
    Hi,

    Sorry, I only have time for a quick note: have you had a look at Extract Information?

    ~Martin
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    What order do you have them in?

    tokenize, transform cases, generate n-grams and stem (dictionary)?

    Shouldn't it really be

    tokenize, transform cases, stem (dictionary) and generate n-grams, because you want to stem first and then create the n-grams?
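
A small Python sketch of why the order can matter. The `stem` function below is a toy prefix stemmer written for this example; whether RapidMiner's Stem (Dictionary) treats 2-gram tokens the same way is an assumption, not something confirmed in this thread.

```python
def stem(token, stems=("reputation", "risk")):
    """Toy dictionary stemmer: map any token starting with a stem to that stem."""
    for s in stems:
        if token.startswith(s):
            return s
    return token

def bigrams(tokens):
    """Join each pair of adjacent tokens with an underscore."""
    return [a + "_" + b for a, b in zip(tokens, tokens[1:])]

tokens = "reputational risks and reputation risk".split()

# Stem first, then build 2-grams: both spellings become "reputation_risk".
stem_then_ngrams = bigrams([stem(t) for t in tokens])

# Build 2-grams first, then stem: "reputational_risks" starts with
# "reputation", so the whole 2-gram is reduced to the single stem.
ngrams_then_stem = [stem(g) for g in bigrams(tokens)]

print(stem_then_ngrams.count("reputation_risk"))
print(ngrams_then_stem.count("reputation_risk"))
```

With this toy stemmer, stemming after n-gram generation leaves no "reputation_risk" token at all, which would look like the 0 counts described above.
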
  • rajbanokhan Member Posts: 29 Maven

    Hi,

    I have an issue with word counting. When I process the text, some of the words look like this:

    aaaaa
    aaaaaa
    aaaeee
    aaahh
    aachen
    aamcgol
    aanda
    aandm

    How can I remove them?

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hello @rajbanokhan - can you please post your XML process so we can see?  Please see "READ BEFORE POSTING" pane on the right hand side of your Reply window for instructions.

     

    Scott

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    It might be related to Pruning. Do you have any pruning turned on? That usually gets rid of weird stuff like that. Pruning is a parameter setting on the Process Documents from... operator.
  • rajbanokhan Member Posts: 29 Maven

    I don't know about pruning, but I use these steps:

    Tokenize Nonletters (Tokenize)
    Tokenize Linguistic (Tokenize)
    Filter Stopwords (English)
    Filter Tokens (by Length)
    Stem (Porter)
    Transform Cases

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Pruning is available via the Process Documents from Data operator. Your Tokenize, Stem, etc should all be inside that subprocess.
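
As a rough illustration of what pruning does, the Python sketch below drops tokens that occur in too few documents. The `min_docs` threshold and the sample documents are hypothetical; this does not reproduce the exact prune options of the Process Documents operators.

```python
from collections import Counter

def prune_by_document_frequency(docs, min_docs=2):
    """Keep only tokens that appear in at least min_docs documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    keep = {t for t, n in doc_freq.items() if n >= min_docs}
    return [[t for t in tokens if t in keep] for tokens in docs]

docs = [
    ["risk", "aaaaa", "bank"],
    ["risk", "bank", "aaahh"],
    ["risk", "aachen"],
]
pruned = prune_by_document_frequency(docs)
print(pruned)
```

Rare junk tokens such as "aaaaa" usually appear in a single document, so a document-frequency threshold removes them without a hand-made stop list.
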

  • rajbanokhan Member Posts: 29 Maven

    Yes, I applied the prune method and it works. Thank you so much!

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    Awesome! Now if you could Accept my suggestion as the solution I'd be super happy!
  • rajbanokhan Member Posts: 29 Maven

    Hi, I am doing text mining and I want the most frequent words to come first. For example, if

    banana occurs 10 times and car occurs 8 times, the output should be

    banana 10

    car         8

    How can I sort the counts from highest to lowest?

  • kayman Member Posts: 662 Unicorn

    If you use the WordList to Data operator, your word list becomes an example set, and then you can use the Sort and Filter operators.
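
The same idea in plain Python, as a minimal sketch: once the word list is a table of (word, count) pairs, sorting by the count column in descending order gives the ranking asked for. The token list here is invented to match the banana/car example.

```python
from collections import Counter

tokens = ["banana"] * 10 + ["car"] * 8 + ["apple"] * 3
word_counts = Counter(tokens)

# most_common() returns (word, count) pairs sorted from highest to lowest count.
ranked = word_counts.most_common()
for word, n in ranked:
    print(f"{word}\t{n}")
```
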

  • rajbanokhan Member Posts: 29 Maven

    Hi,

    I am using the Sort operator, but under "attribute name" the only options are

    label

    meta data date

    meta data file

    meta data path

    I selected the label option, but I do not get sorted data, that is, the most frequent values first.

     

  • rajbanokhan Member Posts: 29 Maven

    Hi,

    When I used WordList to Data followed by the Sort operator, the "attribute name" parameter did not show any options.

    That is why I only use Process Documents from Files and Sort, but Sort does not sort my data.

  • rajbanokhan Member Posts: 29 Maven

    Hi,

    I am doing text mining with the Process Documents from Files operator. When I run the process, it gives me a list of words, but I don't want the whole list. I only want to select my own words from it. Suppose I want the words cat, dog, mouse, table and chair. How can I get only these words from the list?

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

    Have a look at the "Filter Tokens Using Example Set" operator of the Operator Toolbox extension. This should do the trick.

     

    Cheers,

    Martin

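
For readers who want to see the logic of keeping only a chosen set of words, here is a minimal Python sketch. It is not the Operator Toolbox operator itself; the sample sentence is invented, and the `wanted` set stands in for the example set of allowed tokens.

```python
from collections import Counter

# The words to keep, taken from the question above.
wanted = {"cat", "dog", "mouse", "table", "chair"}

tokens = "the cat sat on the table while the dog chased a mouse past the cat".split()

# Count only tokens that are in the wanted set; everything else is dropped.
counts = Counter(t for t in tokens if t in wanted)
print(counts)
```

A wanted word that never occurs (here "chair") simply has a count of 0.
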
  • rajbanokhan Member Posts: 29 Maven

    Hi,

    Thank you for looking at my question.

    But I cannot find the operator.

    Is this the Filter Examples operator or Filter Tokens (by Content)? Can you guide me? Filter Examples is not working: it shows an empty result (no words are shown).

    I tried both one by one, and Filter Tokens (by Content) works. Thanks again.

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

    Did you install the Operator Toolbox extension?

     

    BR,

    Martin

  • rajbanokhan Member Posts: 29 Maven

    Hi sir,

    I installed the text mining extension, but I could not find this extension when searching the Marketplace.

  • rajbanokhan Member Posts: 29 Maven

    Hi, I did not find the Operator Toolbox extension.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @rajbanokhan, both of those extensions can be found in the Marketplace. If you open RapidMiner Studio, you should see a menu at the top called "Extensions". Choose the first item, "Marketplace (Updates and Extensions)...". Then search for "Text Processing" and "Operator Toolbox".


    Scott

     

  • rajbanokhan Member Posts: 29 Maven
    Hi,
    How can I count the total number of words in one document, then in the second, then the third, and so on?
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Hi @rajbanokhan - it is very easy to count tokens. See the attached process and image below.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.0.003" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
            <parameter key="text" value="hi how i find or count the total number of words in one document and then in second and then third and so on?"/>
          </operator>
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
          <operator activated="true" class="text:extract_token_number" compatibility="8.1.000" expanded="true" height="68" name="Extract Token Number" width="90" x="313" y="34"/>
          <connect from_op="Create Document" from_port="output" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Extract Token Number" to_port="document"/>
          <connect from_op="Extract Token Number" from_port="document" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    Scott

  • rajbanokhan Member Posts: 29 Maven
    Hi sir,
    Thank you for your response.
    I am using Process Documents from Files. It reads a single folder that contains seven PDF files. Can you tell me how to get the total word count for each file, or for all of them?
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Can you please post your XML?
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    The token number can be saved as part of the metadata output from Process Documents (see screenshot above) by selecting the "add meta information" option, which then adds token number as an attribute.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
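
As a language-neutral illustration of per-document token counts, here is a minimal Python sketch. It assumes the text has already been extracted from each file; the file names and contents are hypothetical, and the letters-only tokenizer is only an approximation of a non-letter Tokenize step.

```python
import re

def token_count(text):
    """Count tokens, where a token is a run of letters."""
    return len(re.findall(r"[A-Za-z]+", text))

# Hypothetical file names mapped to already-extracted text.
documents = {
    "a.pdf": "hi how i find or count the total number of words",
    "b.pdf": "one two three",
}

# One token count per document, plus the total over all documents.
per_document = {name: token_count(text) for name, text in documents.items()}
total = sum(per_document.values())
print(per_document, total)
```
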