Count UPPER CASE Tokens

nfridge1 Member Posts: 1 Contributor I
edited May 2020 in Help
I have a spreadsheet with a text column and a label column. I would like to represent the text values with some token metadata. I'm using "process documents", and inside "process documents" I'm tokenizing the text value. I would like to achieve the following:
1. Add an attribute to the exampleset which contains a count of the number of tokens which were UPPER CASE.
2. Add an attribute to the exampleset which is a count of the number of adjective tokens.
On point (2) I have made some progress by using "filter tokens by pos tag". This doesn't give me quite what I want, though. I want a count of the number of adjectives, not just a bag of words filtered to contain only adjectives.
On point (1) I have no idea how to proceed.
Thank you.

Best Answer


    Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    This is an interesting question.  There are probably multiple ways of doing this and I'd be interested to see what some of our regex wizards like @kayman have to say.
    But my first suggestion for #1 is to use the Generate Aggregation operator with the "count" function, using a regular expression to select only those attributes whose names are entirely uppercase, which would be: [A-Z]+
    (this could be modified if you want to allow numbers or other special characters as well).
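To make the pattern concrete outside RapidMiner, here is a minimal Python sketch of the same selection idea. The attribute names are made up for illustration, and it assumes the pattern must match the whole name, which is what `re.fullmatch` enforces here. The second pattern is just one possible way to also allow digits.

```python
import re

# Hypothetical attribute names, standing in for the columns of a word vector.
attribute_names = ["NASA", "great", "USB", "Fridge", "label", "OK2"]

UPPER = re.compile(r"[A-Z]+")               # uppercase letters only
UPPER_OR_DIGIT = re.compile(r"[A-Z0-9]+")   # variant that also allows digits


def select_names(names, pattern):
    """Return the names that match the pattern in full, not just partially."""
    return [name for name in names if pattern.fullmatch(name)]


upper_only = select_names(attribute_names, UPPER)
upper_or_digit = select_names(attribute_names, UPPER_OR_DIGIT)

print(upper_only)
print(upper_or_digit)
```

Note that a mixed-case name such as "Fridge" is rejected because the match has to cover the entire name.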
    For the 2nd one, once you have a dataset with just the adjective tokens, you can skip the regular expression filtering and just use Generate Aggregation directly to get the count.
    In both cases, this counts every matching token in the vocabulary, regardless of whether it actually appears in each individual document.
    If you want a count of only the ones that appeared in each document, you could set the word vector creation method in the Process Documents operator to binary term occurrences and then use the sum function inside Generate Aggregation instead. That gives the number of distinct matching tokens per document. Or use term occurrences as your word vector creation method, and the sum function will give you the actual count of such tokens. So you have several options.
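The difference between the two word vector options can be sketched in plain Python. This is only an illustration of the arithmetic, not RapidMiner's implementation: the whitespace tokenizer, the sample documents, and the helper names `term_occurrences` and `count_upper` are all assumptions made for the example.

```python
import re
from collections import Counter

UPPER = re.compile(r"[A-Z]+")  # tokens made up entirely of uppercase letters


def term_occurrences(text):
    """Count how often each token occurs (naive whitespace tokenization)."""
    return Counter(text.split())


def count_upper(text, binary=False):
    """Sum the uppercase-token columns for one document.

    binary=True mimics binary term occurrences (each token counts once);
    binary=False mimics term occurrences (each occurrence counts).
    """
    occurrences = term_occurrences(text)
    if binary:
        return sum(1 for token in occurrences if UPPER.fullmatch(token))
    return sum(n for token, n in occurrences.items() if UPPER.fullmatch(token))


# Made-up documents for illustration.
docs = [
    "NASA launched the USB probe and NASA cheered",
    "nothing shouted here",
]

for doc in docs:
    print(count_upper(doc), count_upper(doc, binary=True))
```

For the first document, the term occurrences sum counts "NASA" twice, while the binary sum counts it once, which is the distinction described above.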
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts