Regular Expressions II - Needle I need You.

listslists Member Posts: 39 Guru
edited November 2018 in Help

Dear reader, this is a continuation of this thread:

 

http://community.rapidminer.com/t5/RapidMiner-Studio/Regular-expressions/m-p/37750#M26008

 

I opened a new one- because the original thread is solved.

 

I try to grep a word of a text-value to generate a value with the Generate Attribute oparator.

 

replaceAll(haystack, "hay (.*) hay", "$1")

 

...doesn't work. I guess I need needle as result, but it never appears in $1-$3 probably because it's replaced.

Alternative methods like match() and find() only deliver true/false.

 

Is there an example expression and function which collects an arbitrary word of an attribute containing text, to provide a value for a new attribute (without text-extension)?

 

needle.png

 

Needle I need you - Song!

https://www.youtube.com/watch?v=rNS6D4hSQdA

 

 

 

Tagged:

Best Answer

  • Edin_KlapicEdin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist
    Solution Accepted

    @BalazsBarany: Thank you for pointing out the problem with multiple whitespaces - I included them in this answer

     

    Finally:

     

    Attached is the XML for the process. For those not having access to version 7.4. some screenshots and explanations can be found below.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
    <list key="attribute_values">
    <parameter key="haystack" value="&quot;SELECT * FROM&#13;&#10;myNeedleTable order by drDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM myNeedleTable)-33&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="single searchword" width="90" x="179" y="34">
    <parameter key="macro" value="searchword"/>
    <parameter key="value" value="needle"/>
    </operator>
    <operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="searched tablename regex" width="90" x="313" y="34">
    <parameter key="macro" value="regexTablename"/>
    <parameter key="value" value="(?di).*from\s+(\S+)\s+order.*"/>
    </operator>
    <operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="searched limit regex" width="90" x="447" y="34">
    <parameter key="macro" value="regexLimit"/>
    <parameter key="value" value="(?di).*limit\s+(\S+)\s+offset.*"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.4.000" expanded="true" height="82" name="Generate Attributes" width="90" x="581" y="34">
    <list key="function_descriptions">
    <parameter key="searchword extraction" value="if(matches(haystack,&quot;(?di).*&quot;+%{searchword}+&quot;.*&quot;),replaceAll(haystack,&quot;(?di).*(&quot;+%{searchword}+&quot;).*&quot;,&quot;$1&quot;),haystack)"/>
    <parameter key="test if searchword matches" value="matches(haystack,&quot;(?di).*needle.*&quot;)"/>
    <parameter key="tablename" value="replaceAll(haystack,%{regexTablename},&quot;$1&quot;)"/>
    <parameter key="limit" value="replaceAll(haystack,%{regexLimit},&quot;$1&quot;)"/>
    </list>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="single searchword" to_port="through 1"/>
    <connect from_op="single searchword" from_port="through 1" to_op="searched tablename regex" to_port="through 1"/>
    <connect from_op="searched tablename regex" from_port="through 1" to_op="searched limit regex" to_port="through 1"/>
    <connect from_op="searched limit regex" from_port="through 1" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>

    image.png

    As you can see the searchword extraction uses the matches() function inside an if clause. Thus if the RegEx matches the replaceAll() function is only then applied.

     

    Resulting ExampleSet:

    image.png

     

    The string which is tested (i.e. the content of the attribute haystack) is

    SELECT * FROM
    myNeedleTable order by drDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM myNeedleTable)-33

     

    The regex for the tablename is

    (?di).*from\s+(\S+)\s+order.*

     

    The regex for the limit is

    (?di).*limit\s+(\S+)\s+offset.*

     

    Important is the (?di) for dotall and case-insensitive mode for full replacement of the whole content.

     

    Afterwards you can use Extract Macro for getting a variable with the desired value.

     

    Hope this works for you :)

     

    Best regards,

    Edin

     

Answers

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Oh no, you started a new thread and didn't tag anyone. They probably won't see this right away. 

  • listslists Member Posts: 39 Guru

    Flexibility Requirement:

     

    Let's take another source text (real live example).

     source:

    SELECT * FROM
    myNeedleTable order by drDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM myNeedleTable)-33

     

    Now I want to retrieve myNeedleTable and later 30.

     regex:

    (?mi)(from)(.*)(order)

     

    ..shows a strange Result preview, though Result List shows the right Group Matches.

    That's why it replaces the matching string. If I invert the selection it does not match myNeedleTable.

     

    I think I have to turn 

    (?mi)(from)(.*)(order)

    ...into its opposite.

     

    (?!(from)(.*)(order))

    ...does not work.

  • Edin_KlapicEdin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist

    Tadaa :)

     

    image.png

     

    Or separately

    (?mi).*from\s?(\S*).*
    (?is).*limit\s?(\S*).*

     

  • listslists Member Posts: 39 Guru

     

    You are a Regex Guru.

     

    How often do you do regex- every day?

     

    (?mius)((?!(from)(.*)(order)).) left me one "F" to much.

     

    I will choose your solution. Thank you!

  • listslists Member Posts: 39 Guru

     

    I get other results then you...strange...

    strange.png

     

    Something funny:

     

    SELECT *  
    FROM My_other_table order by drDateDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM My_other_table)-33

    strange2.png

     

    I think since RM seems to have no real matching function native (without extension),

    I should consider if Regex is the right approach for my pipieline. I would need it as much flexible as possible. 

    Performance would also be a considerable point. The detour via replace seems a little bit 'overkill' too.

     

    One of the functions matches() or find() should return needle. That would be the real smart deal.  

  • BalazsBaranyBalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    In your first picture, you use \s?. This means "0 or 1 whitespace characters". The line feed is 1 or even 2 characters depending on your operating system.

     

    I always use \s+ so I'm independent of the number of whitespace characters. If it is syntactically possible to leave out the whitespace, you could also use \s*. 

     

    Regards,

     

    Balázs

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Just an FYI, the Generate Attributes operator does have contains() and matches() functions. You can also you use RegEx in there and create new columns (attributes).

  • listslists Member Posts: 39 Guru

    Thank you, Do you have an example where contains() or matches() returns needle (the searched word)?

    My search term isn't static so "(.*)".

  • listslists Member Posts: 39 Guru

    @Edin_Klapic 

     

    Wow thank you. That probably was a lot of work.

     

    As I said before, I changed my strategy/pipeline.

    I found initial attributing/tagging important stuff (with 'talking declarations') while data preparation, makes much more sense in my case, then extracting informations via replace() to generate attrbutes later in the pipeline. My data source is now "normalized" and optimized.

     

    I'm still not convinced that replacing should substitute a generic matching in most cases. 

     

    But I'm sure there will be other cases, when this example could be helpful. 

    So a great contribution to the community.

Sign In or Register to comment.