RapidMiner

Regular Expressions II - Needle I need You.

SOLVED
Regular Contributor

Regular Expressions II - Needle I need You.

[ Edited ]

Dear reader, this is a continuation of this thread:

 

http://community.rapidminer.com/t5/RapidMiner-Studio/Regular-expressions/m-p/37750#M26008

 

I opened a new one- because the original thread is solved.

 

I try to grep a word of a text-value to generate a value with the Generate Attribute oparator.

 

replaceAll(haystack, "hay (.*) hay", "$1")

 

...doesn't work. I guess I need needle as result, but it never appears in $1-$3 probably because it's replaced.

Alternative methods like match() and find() only deliver true/false.

 

Is there an example expression and function which collects an arbitrary word of an attribute containing text, to provide a value for a new attribute (without text-extension)?

 

needle.png

 

Needle I need you - Song!

https://www.youtube.com/watch?v=rNS6D4hSQdA

 

 

 

See more topics labeled with:

10 REPLIES
Community Manager

Re: Regular Expressions II - Needle I need You.

Oh no, you started a new thread and didn't tag anyone. They probably won't see this right away. 

Regards,
Thomas
LinkedIn: Thomas Ott
Blog: Neural Market Trends
Regular Contributor

Re: Regular Expressions II - Needle I need You.

[ Edited ]

Flexibility Requirement:

 

Let's take another source text (real live example).

 source:

SELECT * FROM
myNeedleTable order by drDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM myNeedleTable)-33

 

Now I want to retrieve myNeedleTable and later 30.

 regex:

(?mi)(from)(.*)(order)

 

..shows a strange Result preview, though Result List shows the right Group Matches.

That's why it replaces the matching string. If I invert the selection it does not match myNeedleTable.

 

I think I have to turn 

(?mi)(from)(.*)(order)

...into its opposite.

 

(?!(from)(.*)(order))

...does not work.

Highlighted
RMStaff

Re: Regular Expressions II - Needle I need You.

Tadaa Smiley Happy

 

image.png

 

Or separately

(?mi).*from\s?(\S*).*
(?is).*limit\s?(\S*).*

 

Regular Contributor

Re: Regular Expressions II - Needle I need You.

 

You are a Regex Guru.

 

How often do you do regex- every day?

 

(?mius)((?!(from)(.*)(order)).) left me one "F" to much.

 

I will choose your solution. Thank you!

Regular Contributor

Re: Regular Expressions II - Needle I need You.

[ Edited ]

 

I get other results then you...strange...

strange.png

 

Something funny:

 

SELECT *  
FROM My_other_table order by drDateDate  ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM My_other_table)-33

strange2.png

 

I think since RM seems to have no real matching function native (without extension),

I should consider if Regex is the right approach for my pipieline. I would need it as much flexible as possible. 

Performance would also be a considerable point. The detour via replace seems a little bit 'overkill' too.

 

One of the functions matches() or find() should return needle. That would be the real smart deal.  

Elite II

Re: Regular Expressions II - Needle I need You.

In your first picture, you use \s?. This means "0 or 1 whitespace characters". The line feed is 1 or even 2 characters depending on your operating system.

 

I always use \s+ so I'm independent of the number of whitespace characters. If it is syntactically possible to leave out the whitespace, you could also use \s*. 

 

Regards,

 

Balázs

--
Balázs Bárány
Data Scientist, Vienna
https://datascientist.at
Community Manager

Re: Regular Expressions II - Needle I need You.

Just an FYI, the Generate Attributes operator does have contains() and matches() functions. You can also you use RegEx in there and create new columns (attributes).

Regards,
Thomas
LinkedIn: Thomas Ott
Blog: Neural Market Trends
Regular Contributor

Re: Regular Expressions II - Needle I need You.

[ Edited ]

Thank you, Do you have an example where contains() or matches() returns needle (the searched word)?

My search term isn't static so "(.*)".

RMStaff

Re: Regular Expressions II - Needle I need You.

@BalazsBarany: Thank you for pointing out the problem with multiple whitespaces - I included them in this answer

 

Finally:

 

Attached is the XML for the process. For those not having access to version 7.4. some screenshots and explanations can be found below.

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
        <list key="attribute_values">
          <parameter key="haystack" value="&quot;SELECT * FROM&#13;&#10;myNeedleTable order by drDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM myNeedleTable)-33&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="single searchword" width="90" x="179" y="34">
        <parameter key="macro" value="searchword"/>
        <parameter key="value" value="needle"/>
      </operator>
      <operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="searched tablename regex" width="90" x="313" y="34">
        <parameter key="macro" value="regexTablename"/>
        <parameter key="value" value="(?di).*from\s+(\S+)\s+order.*"/>
      </operator>
      <operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="searched limit regex" width="90" x="447" y="34">
        <parameter key="macro" value="regexLimit"/>
        <parameter key="value" value="(?di).*limit\s+(\S+)\s+offset.*"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="7.4.000" expanded="true" height="82" name="Generate Attributes" width="90" x="581" y="34">
        <list key="function_descriptions">
          <parameter key="searchword extraction" value="if(matches(haystack,&quot;(?di).*&quot;+%{searchword}+&quot;.*&quot;),replaceAll(haystack,&quot;(?di).*(&quot;+%{searchword}+&quot;).*&quot;,&quot;$1&quot;),haystack)"/>
          <parameter key="test if searchword matches" value="matches(haystack,&quot;(?di).*needle.*&quot;)"/>
          <parameter key="tablename" value="replaceAll(haystack,%{regexTablename},&quot;$1&quot;)"/>
          <parameter key="limit" value="replaceAll(haystack,%{regexLimit},&quot;$1&quot;)"/>
        </list>
      </operator>
      <connect from_op="Generate Data by User Specification" from_port="output" to_op="single searchword" to_port="through 1"/>
      <connect from_op="single searchword" from_port="through 1" to_op="searched tablename regex" to_port="through 1"/>
      <connect from_op="searched tablename regex" from_port="through 1" to_op="searched limit regex" to_port="through 1"/>
      <connect from_op="searched limit regex" from_port="through 1" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

image.png

As you can see the searchword extraction uses the matches() function inside an if clause. Thus if the RegEx matches the replaceAll() function is only then applied.

 

Resulting ExampleSet:

image.png

 

The string which is tested (i.e. the content of the attribute haystack) is

SELECT * FROM
myNeedleTable order by drDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM myNeedleTable)-33

 

The regex for the tablename is

(?di).*from\s+(\S+)\s+order.*

 

The regex for the limit is

(?di).*limit\s+(\S+)\s+offset.*

 

Important is the (?di) for dotall and case-insensitive mode for full replacement of the whole content.

 

Afterwards you can use Extract Macro for getting a variable with the desired value.

 

Hope this works for you Smiley Happy

 

Best regards,

Edin