Can I use POS expressions (chunks) with the text mining operators?

kayman Member Posts: 662 Unicorn
edited November 2018 in Help

Hi there,

I know I can use the Filter Tokens (by POS Tags) operator to filter out single POS tags, but how would I generate chunks?

I am interested, for instance, in combinations of adjectives and nouns, or in noun sequences, but this does not seem to work for me.

Let's assume I have a dummy sentence like this one: "I have a broken computer, there is no picture, this thing sucks"

I would like to chunk this using, for instance, (JJ.* NN.*+)|(DT NN.*+)|NN.*+

-> so either an adjective followed by noun(s), or a determiner followed by a noun, or a simple noun phrase.

So, after some further processing, my output would become something like [broken computer], [no picture], [thing sucks].
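
To illustrate, here is roughly what I mean as a Python sketch with NLTK (the grammar is only my approximation of the expression above, and the exact chunks depend on how the tagger labels the words):

    # Rough NLTK sketch of the chunking I am after (not RapidMiner code)
    import nltk
    # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # first run only

    sentence = "I have a broken computer, there is no picture, this thing sucks"

    # NP = adjective(s) + noun(s), or determiner + noun(s), or a plain noun sequence
    grammar = r"""
      NP: {<JJ.*><NN.*>+}
          {<DT><NN.*>+}
          {<NN.*>+}
    """
    parser = nltk.RegexpParser(grammar)
    tree = parser.parse(nltk.pos_tag(nltk.word_tokenize(sentence)))

    # Collect the chunk texts; results vary with how e.g. "broken" and "sucks" get tagged
    print([" ".join(word for word, tag in subtree.leaves())
           for subtree in tree.subtrees(lambda t: t.label() == "NP")])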

But the operator seems to accept single POS tags only. Is this correct, or am I doing it completely wrong?

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    What is your tokenizer set to? Non letters? Have you tried setting it to linguistic sentences?

  • kayman Member Posts: 662 Unicorn

    Hi Thomas,

    As I was testing on single, handpicked sentences, I did not use a tokenizer yet. The POS operator works pretty well when selecting single POS tags (like JJ.*|NN.*) but seems unable to handle sequences (so, any JJ followed by NN).

    I can do the same with a Python operator (a rough sketch of what I use is below), so I am not really stuck if RM does not support it; it would just be nice to be able to do it with the standard operators. Maybe something for the next version?

    Or I may be having issues with the syntax; not too sure about that one either.
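
    For what it's worth, the kind of script I use in an Execute Python operator looks roughly like this (a simplified sketch: it assumes the Python Scripting extension, NLTK being available, and a text column named "Text", so adjust to your own setup):

    # Simplified sketch for the Execute Python operator (Python Scripting extension)
    # Assumes NLTK is installed and the incoming example set has a "Text" column
    import nltk

    GRAMMAR = r"NP: {<JJ.*><NN.*>+}"   # adjective(s) followed by noun(s)
    PARSER = nltk.RegexpParser(GRAMMAR)

    def extract_chunks(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(str(text)))
        tree = PARSER.parse(tagged)
        return "; ".join(" ".join(word for word, tag in subtree.leaves())
                         for subtree in tree.subtrees(lambda t: t.label() == "NP"))

    def rm_main(data):
        # data arrives as a pandas DataFrame; return it with an extra chunk column
        data["chunks"] = data["Text"].apply(extract_chunks)
        return data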

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    It should be able to do it based on your regex structure, but I think it needs to operate inside the Process Documents from Data operator with a Tokenize operator set to linguistic sentences.

    Try this:

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter" width="90" x="112" y="34">
            <parameter key="connection" value="ThomasOtt"/>
            <parameter key="query" value="#iphone"/>
            <parameter key="limit" value="10"/>
            <parameter key="language" value="en"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="7.4.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Text"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="7.4.000" expanded="true" height="82" name="Nominal to Text" width="90" x="380" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="Text"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="7.4.001" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="34">
            <parameter key="prune_method" value="percentual"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="7.4.001" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34">
                <parameter key="mode" value="linguistic sentences"/>
              </operator>
              <operator activated="true" class="text:filter_tokens_by_pos" compatibility="7.4.001" expanded="true" height="68" name="Filter Tokens (by POS Tags)" width="90" x="179" y="34">
                <parameter key="expression" value="JJ.*|NN.*"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by POS Tags)" to_port="document"/>
              <connect from_op="Filter Tokens (by POS Tags)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • kayman Member Posts: 662 Unicorn

    Hi Thomas,

    Maybe I was not clear enough. The example you show works, but it filters on either the JJ or the NN tag, not on a sequence of them. The OR worked for me as well (even without tokenizing on sentences), but I need more of an AND scenario.

    What I need to achieve is a filter on JJ, but only if it is followed by one or more NN (or other combinations).

    Assume I have the following sentences:

    "hello what do I need to be able to group multiple pos tokens? Can I use regular groups or is that too complex?"

    Using a chunk rule like <JJ><NN.*>+ in Python (see the sketch at the end of this post) would return

    ['multiple pos tokens', 'regular groups']

    Using the expression JJ.*|NN.* as in the RM example correctly returns

    ['able', 'group', 'multiple', 'pos', 'tokens', 'regular', 'groups', 'complex']

    So the option to group POS tags would provide much more powerful options, but given that an expression like JJ NN.* returns an empty match, I assume this is not possible.

    Hope this makes it clearer.
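
    For reference, the Python side of that comparison is only a few lines (a sketch with NLTK; the exact tags, and therefore the chunks, depend on the tagger):

    # Sketch of the <JJ><NN.*>+ chunk rule mentioned above, using NLTK
    import nltk

    text = ("hello what do I need to be able to group multiple pos tokens? "
            "Can I use regular groups or is that too complex?")

    parser = nltk.RegexpParser(r"CHUNK: {<JJ><NN.*>+}")   # adjective followed by noun(s)
    tree = parser.parse(nltk.pos_tag(nltk.word_tokenize(text)))

    # With the default tagger this should give something like
    # ['multiple pos tokens', 'regular groups']
    print([" ".join(word for word, tag in subtree.leaves())
           for subtree in tree.subtrees(lambda t: t.label() == "CHUNK")])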

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Hmm, in this case I'm stumped. Maybe @mschmitz has an idea. 

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Phew, this is rather a question for @hhomburg or @RalfKlinkenberg

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany