RapidMiner

Regular Expressions II - Needle I need You.

SOLVED
Regular Contributor

Regular Expressions II - Needle I need You.

[ Edited ]

Dear reader, this is a continuation of this thread:

 

http://community.rapidminer.com/t5/RapidMiner-Studio/Regular-expressions/m-p/37750#M26008

 

I opened a new one- because the original thread is solved.

 

I try to grep a word of a text-value to generate a value with the Generate Attribute oparator.

 

replaceAll(haystack, "hay (.*) hay", "$1")

 

...doesn't work. I guess I need needle as result, but it never appears in $1-$3 probably because it's replaced.

Alternative methods like match() and find() only deliver true/false.

 

Is there an example expression and function which collects an arbitrary word of an attribute containing text, to provide a value for a new attribute (without text-extension)?

 

needle.png

 

Needle I need you - Song!

https://www.youtube.com/watch?v=rNS6D4hSQdA

 

 

 

See more topics labeled with:

10 REPLIES
Community Manager

Re: Regular Expressions II - Needle I need You.

Oh no, you started a new thread and didn't tag anyone. They probably won't see this right away. 

Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Regular Contributor

Re: Regular Expressions II - Needle I need You.

[ Edited ]

Flexibility Requirement:

 

Let's take another source text (real live example).

 source:

SELECT * FROM
myNeedleTable order by drDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM myNeedleTable)-33

 

Now I want to retrieve myNeedleTable and later 30.

 regex:

(?mi)(from)(.*)(order)

 

..shows a strange Result preview, though Result List shows the right Group Matches.

That's why it replaces the matching string. If I invert the selection it does not match myNeedleTable.

 

I think I have to turn 

(?mi)(from)(.*)(order)

...into its opposite.

 

(?!(from)(.*)(order))

...does not work.

RMStaff

Re: Regular Expressions II - Needle I need You.

Tadaa Smiley Happy

 

image.png

 

Or separately

(?mi).*from\s?(\S*).*
(?is).*limit\s?(\S*).*

 

Regular Contributor

Re: Regular Expressions II - Needle I need You.

 

You are a Regex Guru.

 

How often do you do regex- every day?

 

(?mius)((?!(from)(.*)(order)).) left me one "F" to much.

 

I will choose your solution. Thank you!

Regular Contributor

Re: Regular Expressions II - Needle I need You.

[ Edited ]

 

I get other results then you...strange...

strange.png

 

Something funny:

 

SELECT *  
FROM My_other_table order by drDateDate  ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM My_other_table)-33

strange2.png

 

I think since RM seems to have no real matching function native (without extension),

I should consider if Regex is the right approach for my pipieline. I would need it as much flexible as possible. 

Performance would also be a considerable point. The detour via replace seems a little bit 'overkill' too.

 

One of the functions matches() or find() should return needle. That would be the real smart deal.  

Elite II

Re: Regular Expressions II - Needle I need You.

In your first picture, you use \s?. This means "0 or 1 whitespace characters". The line feed is 1 or even 2 characters depending on your operating system.

 

I always use \s+ so I'm independent of the number of whitespace characters. If it is syntactically possible to leave out the whitespace, you could also use \s*. 

 

Regards,

 

Balázs

--
Balázs Bárány
Data Scientist, Vienna
https://datascientist.at
Community Manager

Re: Regular Expressions II - Needle I need You.

Just an FYI, the Generate Attributes operator does have contains() and matches() functions. You can also you use RegEx in there and create new columns (attributes).

Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Regular Contributor

Re: Regular Expressions II - Needle I need You.

[ Edited ]

Thank you, Do you have an example where contains() or matches() returns needle (the searched word)?

My search term isn't static so "(.*)".

RMStaff

Re: Regular Expressions II - Needle I need You.

@BalazsBarany: Thank you for pointing out the problem with multiple whitespaces - I included them in this answer

 

Finally:

 

Attached is the XML for the process. For those not having access to version 7.4. some screenshots and explanations can be found below.

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
        <list key="attribute_values">
          <parameter key="haystack" value="&quot;SELECT * FROM&#13;&#10;myNeedleTable order by drDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM myNeedleTable)-33&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="single searchword" width="90" x="179" y="34">
        <parameter key="macro" value="searchword"/>
        <parameter key="value" value="needle"/>
      </operator>
      <operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="searched tablename regex" width="90" x="313" y="34">
        <parameter key="macro" value="regexTablename"/>
        <parameter key="value" value="(?di).*from\s+(\S+)\s+order.*"/>
      </operator>
      <operator activated="true" class="set_macro" compatibility="7.4.000" expanded="true" height="82" name="searched limit regex" width="90" x="447" y="34">
        <parameter key="macro" value="regexLimit"/>
        <parameter key="value" value="(?di).*limit\s+(\S+)\s+offset.*"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="7.4.000" expanded="true" height="82" name="Generate Attributes" width="90" x="581" y="34">
        <list key="function_descriptions">
          <parameter key="searchword extraction" value="if(matches(haystack,&quot;(?di).*&quot;+%{searchword}+&quot;.*&quot;),replaceAll(haystack,&quot;(?di).*(&quot;+%{searchword}+&quot;).*&quot;,&quot;$1&quot;),haystack)"/>
          <parameter key="test if searchword matches" value="matches(haystack,&quot;(?di).*needle.*&quot;)"/>
          <parameter key="tablename" value="replaceAll(haystack,%{regexTablename},&quot;$1&quot;)"/>
          <parameter key="limit" value="replaceAll(haystack,%{regexLimit},&quot;$1&quot;)"/>
        </list>
      </operator>
      <connect from_op="Generate Data by User Specification" from_port="output" to_op="single searchword" to_port="through 1"/>
      <connect from_op="single searchword" from_port="through 1" to_op="searched tablename regex" to_port="through 1"/>
      <connect from_op="searched tablename regex" from_port="through 1" to_op="searched limit regex" to_port="through 1"/>
      <connect from_op="searched limit regex" from_port="through 1" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

image.png

As you can see the searchword extraction uses the matches() function inside an if clause. Thus if the RegEx matches the replaceAll() function is only then applied.

 

Resulting ExampleSet:

image.png

 

The string which is tested (i.e. the content of the attribute haystack) is

SELECT * FROM
myNeedleTable order by drDate ASC LIMIT 30 OFFSET (SELECT COUNT(*) FROM myNeedleTable)-33

 

The regex for the tablename is

(?di).*from\s+(\S+)\s+order.*

 

The regex for the limit is

(?di).*limit\s+(\S+)\s+offset.*

 

Important is the (?di) for dotall and case-insensitive mode for full replacement of the whole content.

 

Afterwards you can use Extract Macro for getting a variable with the desired value.

 

Hope this works for you Smiley Happy

 

Best regards,

Edin