RapidMiner

Regular expressions?

SOLVED

Re: Regular expressions?

Agreed. I battle with regular expressions all the time, and what works in one place does not always work in RapidMiner, and vice versa. This is my go-to link for RegEx in general, but it does not always do the trick.

 

My current challenge (for anyone wanting a RegEx challenge!) is to find an expression that I can use in Split so that a nominal field is split at every nth occurrence of a character instead of at every occurrence. For example:

 

att1:       1,2,3,4,5,6,7,8,9,10,11,12

 

If I just use Split on RegEx [,], I of course get

 

att1_1     att1_2     att1_3     att1_4     ...
1          2          3          4          ...

 

But what I want is

att1_1      att1_2      att1_3
1,2,3,4     5,6,7,8     9,10,11,12

 

I am literally pulling my hair out on this one!  I'm happy to contribute to a community RapidMiner RegEx database any time.

 

Scott

Scott Genzer
Certified RapidMiner Analyst
Genzer Consulting

Re: Regular expressions?

A regular expression challenge? I'll bite ;-) Hopefully I can save a few hairs on your head.

 

It doesn't work directly with Split, but you can solve it by running a Replace first.

The Replace turns every fourth comma (the 4th, 8th, and so on) into ||, and the Split then splits on that marker.

Replace, search string: (([^,]+,){3}[^,]+),

Replace, replacement: $1||

Split, pattern: \|\|

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
        <list key="attribute_values">
          <parameter key="att1" value="&quot;1,2,3,4,5,6,7,8,9,10,11,12&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="replace" compatibility="7.4.000" expanded="true" height="82" name="Replace" width="90" x="246" y="34">
        <parameter key="replace_what" value="(([^,]+,){3}[^,]+),"/>
        <parameter key="replace_by" value="$1||"/>
      </operator>
      <operator activated="true" class="split" compatibility="7.4.000" expanded="true" height="82" name="Split" width="90" x="447" y="34">
        <parameter key="split_pattern" value="\|\|"/>
      </operator>
      <connect from_op="Generate Data by User Specification" from_port="output" to_op="Replace" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_op="Split" to_port="example set input"/>
      <connect from_op="Split" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
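
For anyone who wants to check the two patterns outside RapidMiner, here is a minimal sketch of the same replace-then-split idea in Python. It is not part of the process above: Python's `re` module is only a stand-in, and it writes the group reference as `\1` where RapidMiner writes `$1`.

```python
import re

value = "1,2,3,4,5,6,7,8,9,10,11,12"

# Step 1 (Replace): capture four comma-separated values and swap the comma
# that follows them for the "||" marker.
marked = re.sub(r"(([^,]+,){3}[^,]+),", r"\1||", value)

# Step 2 (Split): split on the literal marker.
parts = re.split(r"\|\|", marked)

print(marked)  # 1,2,3,4||5,6,7,8||9,10,11,12
print(parts)   # ['1,2,3,4', '5,6,7,8', '9,10,11,12']
```

The last group has no trailing comma, so the search pattern does not match it and it is left as it is.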

Greetings from Vienna,

Balázs

--
Balázs Bárány
Data Scientist, Vienna
https://datascientist.at

Re: Regular expressions?

Great Balazs!

 

Just for completeness: the reason this does not work with a single Split is that Split removes the text matched by the expression.

That is, even if you used a split pattern that matches each group of four values, the resulting attribute columns would be mostly empty, because the matched groups themselves are thrown away.
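
To illustrate the point, here is a small Python sketch (Python's `re.split` is only a stand-in for the Split operator, and the pattern and the name `group_of_four` are my own assumptions, not something from the process). The pattern matches each group of four values, and splitting on it leaves only empty strings.

```python
import re

value = "1,2,3,4,5,6,7,8,9,10,11,12"

# A pattern that matches each whole group of four values,
# plus the comma after it if there is one.
group_of_four = r"(?:[^,]+,){3}[^,]+,?"

# Splitting on it discards the matched groups; only the (empty) text
# between the matches is left.
pieces = re.split(group_of_four, value)

print(pieces)  # ['', '', '', '']
```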

 

Best,

Edin


Re: Regular expressions?

Balazs - that is very nice! Thank you! At my age all hairs on my head are valuable. Beverage of choice is on me if you're in the neighborhood.

 

Scott

 


Re: Regular expressions?

And yes, Edin, you're right: Split always removes the matched text. If I want to keep it, I have to do some wonky workarounds with Replace every time. Maybe a small improvement to Split would be a checkbox that keeps the matched text and splits directly before or after it?

 

Scott


Re: Regular expressions?

I already have the new RM version, but I noticed the help options a little late. My fault.

Actually, the example directly under the functions entry in the list of regex functions was enough to point me in the right direction on how it is implemented. The testing box of the Replace operator is also great. Thank you, RM team!

 

So now the question of all questions...

 

Is there a function which uses a regex and returns the needle in a haystack, not only true or false?

matches() & find() don't do this. I know there is a text-extension, but is there a native function?

 

Thank you very much to all.

And no, I will not read another regex tutorial. I have had too many of them over the last 20 years.

It's like alcohol. First it's funny, but then it drives you crazy.


 

Re: Regular expressions?

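If you're interested in the matched text itself, use the common regular expression match syntax with the Replace operator: match the whole value, capture the part you want in a group, and replace everything with that group.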

 

Value: haystack hay hay needle hay hay hay

 

Regex: .*(needle).*

Replacement: $1

 

This gives you the needle as your attribute value. You can also use the replaceAll() function in Generate Attributes.

 

Here's a sample process:

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
        <list key="attribute_values">
          <parameter key="haystack" value="&quot;haystack hay hay needle hay hay whatever&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="7.4.000" expanded="true" height="82" name="Generate Attributes" width="90" x="246" y="34">
        <list key="function_descriptions">
          <parameter key="original" value="haystack"/>
          <parameter key="needle with Generate Attributes" value="replaceAll(haystack, &quot;.*(needle).*&quot;, &quot;$1&quot;)"/>
        </list>
      </operator>
      <operator activated="true" class="replace" compatibility="7.4.000" expanded="true" height="82" name="Replace" width="90" x="447" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="haystack"/>
        <parameter key="replace_what" value=".*(needle).*"/>
        <parameter key="replace_by" value="$1"/>
      </operator>
      <connect from_op="Generate Data by User Specification" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Replace" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
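
The same trick can be tried outside RapidMiner. This is a minimal sketch in Python, used only as a stand-in: the replacement is written `\1` there instead of `$1`. It also shows what happens when the pattern does not match at all.

```python
import re

haystack = "haystack hay hay needle hay hay whatever"

# The pattern covers the whole value, so replacing the match with group 1
# leaves only the captured text.
needle = re.sub(r".*(needle).*", r"\1", haystack)
print(needle)  # needle

# If the pattern does not match, nothing is replaced and the value
# comes back unchanged.
unchanged = re.sub(r".*(needle).*", r"\1", "hay hay hay")
print(unchanged)  # hay hay hay
```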
--
Balázs Bárány
Data Scientist, Vienna
https://datascientist.at

Re: Regular expressions?


@BalazsBarany Thank you.

 

Actually I need a variable needle.

Something like

 replaceAll(haystack, "(hay)(.*)(hay)", "$1")

does not return the needle alone. Hm, where are the individual matches stored, and why should I use a replace?

I mean...

(hay)(.*)(hay)

...matches. But how do I access matching group 2 in RM to write a new value?

In PHP, for example, I get an array of all matching groups as the return value.

 


 

Needle I need you!

https://www.youtube.com/watch?v=rNS6D4hSQdA

Re: Regular expressions?

RapidMiner is not PHP. 

 

Here, you use $1, $2, $3 etc. in the replacement to refer to the first, second, third captured group.

 

If you need a variable needle, you can use macros (for the Replace operator) or the attribute value in Generate Attributes.
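
As a rough illustration of both points, here is a Python sketch (again only a stand-in for RapidMiner, with `\1` and `\2` in place of `$1` and `$2`). The `(\w+)` part and the helper `extract` are my own assumptions for the example, not RapidMiner functions.

```python
import re

haystack = "haystack hay hay needle hay hay whatever"

# The attempt from the question: only the matched span is replaced,
# so the text after the last "hay" survives next to group 1.
attempt = re.sub(r"(hay)(.*)(hay)", r"\1", haystack)
print(attempt)  # hay whatever

# Let the pattern cover the whole value and keep only group 2.
# " (\w+) " assumes the needle is a single word between spaces.
group2 = re.sub(r"(.*hay) (\w+) (hay.*)", r"\2", haystack)
print(group2)  # needle


def extract(value, needle):
    """Hypothetical helper: the needle text comes from a variable."""
    # re.escape keeps regex metacharacters in the needle literal.
    return re.sub(".*(" + re.escape(needle) + ").*", r"\1", value)


print(extract(haystack, "needle"))  # needle
```

In RapidMiner the variable part would come from a macro or an attribute value instead of a function argument.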

--
Balázs Bárány
Data Scientist, Vienna
https://datascientist.at