"Regular expressions?"

sgtrock · April 2011

I've been messing around with text processing for log analysis for almost 30 years. I've used a variety of languages in that time.

One of the more painful facts that I've had to learn is that nobody implements regular expressions in quite the same way. I'm tripping over this yet again with RapidMiner and it's becoming a real source of frustration for me. The manual is silent on the subject. (Given how critical this function is to data analysis, I found that lack to be disturbing to say the least!).

Does anyone out there know of a good resource for creating regex in RapidMiner? Searching the forum archives repeatedly turns up references to Java's regular expression documentation. That in turn refers back to _perl's_ regex documentation, with exceptions noted.

None of that documentation tells me how RapidMiner 5.0 actually interprets regexes. Attempting to find the right syntax is chewing up a great deal of my time. Is there a broad set of examples out there? Or a truly thorough discussion of how to properly define regexes within the GUI? I'd love to read something like that.

For example, take my current struggle. I'm wading through a long list of software that was entered in by several different people over the years. In addition, the rules for what was entered where and when changed as the database grew.

The first thing that I want to do is simply count the number of versions of software that's out there. Unfortunately, a lot of the old entries include the version number as part of the asset name, so I have to strip that out. With me so far?

Here's a typical example (all examples from the same attribute, Asset):

Illustrator
Illustrator CS
Illustrator CS2
Illustrator CS3
Illustrator CS4
Dreamweaver
Dreamweaver CS3
Dreamweaver CS4
Photoshop
Photoshop CS
Photoshop CS4

In this instance, I'd love to just look for CS and CS[2-4] and strip them off the entries.

And another:

Extra! v6.7
Extra! v9

And another:


Netware Client v4.9 SP1 (IP)
Netware Client v4.9 SP1 (IP/IPX)
Netware Client 4

There's a lot more where these came from. At the moment, I'm tackling this by going through the list, adding one entry at a time to a Map operator. It's a painful, trial and error process at this point.

The issue that I'm struggling with is that I can't predict what end result I'm going to get. For example, this works:

Illustrator CS[2-4]   :  Illustrator

but naturally doesn't eliminate the Illustrator CS entry. I had to create a separate Map entry for it. However, now I need to do the same for all the other programs in the Adobe Creative Suite.

This doesn't:

Extra!\sv*  :  Extra!

Neither does this:

Extra! v*  :  Extra!

Nor does this:

Netware Client *   :  Netware Client

Clearly, I'm doing something wrong. (Maybe I'm not holding my lower lip right?) I would love any guidance that anyone might have.

BalazsBarany · April 2017

A regular expression challenge? I'll bite ;-) Hopefully I can save a few hairs on your head.

It doesn't work directly with Split, but if you do a replace before, you can solve it.

The replace replaces the 4th, 8th etc. comma with ||, the split uses that for splitting.

Replace, search string: (([^,]+,){3}[^,]+),

Replace, replacement: $1||

Split, pattern: \|\|

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
        <list key="attribute_values">
          <parameter key="att1" value="&quot;1,2,3,4,5,6,7,8,9,10,11,12&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="replace" compatibility="7.4.000" expanded="true" height="82" name="Replace" width="90" x="246" y="34">
        <parameter key="replace_what" value="(([^,]+,){3}[^,]+),"/>
        <parameter key="replace_by" value="$1||"/>
      </operator>
      <operator activated="true" class="split" compatibility="7.4.000" expanded="true" height="82" name="Split" width="90" x="447" y="34">
        <parameter key="split_pattern" value="\|\|"/>
      </operator>
      <connect from_op="Generate Data by User Specification" from_port="output" to_op="Replace" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_op="Split" to_port="example set input"/>
      <connect from_op="Split" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
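
For readers who want to try the two patterns outside RapidMiner, here is a minimal sketch in plain Java. It assumes that RapidMiner's Replace and Split operators behave like Java's `String.replaceAll` and `String.split`, which seems plausible since RapidMiner is reported to use the Java regex engine, but that is an assumption. The class and method names are made up for illustration.

```java
import java.util.Arrays;

class SplitEveryFourthComma {
    // Hypothetical helper: marks every 4th comma with "||", then splits on that marker.
    static String[] splitEveryFourth(String value) {
        // Group 1 captures four comma-separated fields; the comma after them is replaced.
        String marked = value.replaceAll("(([^,]+,){3}[^,]+),", "$1||");
        // The pipe is a regex metacharacter, so it has to be escaped in the split pattern.
        return marked.split("\\|\\|");
    }

    public static void main(String[] args) {
        String att1 = "1,2,3,4,5,6,7,8,9,10,11,12";
        System.out.println(Arrays.toString(splitEveryFourth(att1)));
    }
}
```

The last group of four has no trailing comma, so the Replace pattern leaves it alone and it simply becomes the final piece of the split.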

Greetings from Vienna,

Balázs

haddock · April 2011

Hi there Sgtrock,

Ah, the dreaded regex... Powerful but toxic. I use a specific tool, just because it is so easy to get it wrong. Here's what I mean.

Extra!

Extra!\sv*

Options: ^ and $ match at line breaks

Match the characters “Extra!” literally «Extra!»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s»
Match the character “v” literally «v*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»

Created with RegexBuddy

And Extra! version 2

Extra! v*

Options: ^ and $ match at line breaks

Match the characters “Extra! ” literally «Extra! »
Match the character “v” literally «v*»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»

Created with RegexBuddy

And

Netware Client *

Options: ^ and $ match at line breaks

Match the characters “Netware Client” literally «Netware Client»
Match the character “ ” literally « *»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»

Created with RegexBuddy

Looks like it is those pesky *s ! Personally I wouldn't touch regex without a lower lip protector - life's too short!

Have fun..

8)

colo · April 2011

Hi sgtrock,

As far as I know, RapidMiner does indeed use the Java RegEx engine.

For the syntax of the relevant Pattern class see Java's API documentation: http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html - you could also take a look at their tutorial: http://download.oracle.com/javase/tutorial/essential/regex/index.html (another short and unofficial one: http://www.javaregex.com/tutorial.html).

If you want to use a supporting tool, I would recommend RegExBuddy (I think this is the tool haddock uses, but it's not free) or Expresso (free).

Your expression syntax using colons is new to me (I don't know many RegEx flavours), but I guess it is meant to invoke a replace operation. Another thing that looks suspicious to me is that you use the asterisk right after single characters, as in v*, which means a repetition of the letter 'v' (this should be the same in every regex flavour). If you want to match anything after the 'v', you forgot the dot: v.*

In your particular case I would suggest using the "Generate Extract" operator (if the version information has to be extracted and stored), the "Replace" operator to clean up the software name, or both. You can simply match the version information (it has to be the first capturing group, enclosed in round brackets) and replace the match by an empty string. For example, replace "(\sv.*)" by an empty string to get rid of the Extra! version numbers. You can also use $n to re-insert matching groups in the replace-by pattern (where n is the number of the matching group, used instead of the common \n).

Hope this helps a little bit...

Regards
Matthias
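
To make the difference between v* and v.* concrete, here is a small sketch in plain Java, the engine RapidMiner is said to use. The `stripVersion` helper and its two patterns are illustrative assumptions, not something taken from RapidMiner or from this thread; they are just one way to cover the sample asset names from the original post.

```java
class StripVersionSketch {
    // Hypothetical helper, not part of RapidMiner: strips a trailing version
    // suffix from an asset name using two assumed patterns.
    static String stripVersion(String asset) {
        return asset
                .replaceAll("\\s+v?\\d.*$", "")  // " v6.7", " v4.9 SP1 (IP)", " 4"
                .replaceAll("\\s+CS\\d?$", "");  // " CS", " CS2", " CS3", " CS4"
    }

    public static void main(String[] args) {
        // v* repeats only the letter v, so the digits after it survive.
        System.out.println("Extra! v6.7".replaceAll("Extra!\\sv*", "Extra!")); // Extra!6.7
        // v.* matches the v and everything after it.
        System.out.println("Extra! v6.7".replaceAll("(\\sv.*)", ""));          // Extra!

        String[] assets = {
            "Illustrator", "Illustrator CS", "Illustrator CS3",
            "Extra! v9", "Netware Client v4.9 SP1 (IP/IPX)", "Netware Client 4"
        };
        for (String asset : assets) {
            System.out.println(asset + " -> " + stripVersion(asset));
        }
    }
}
```

The first pattern would also cut off product names that legitimately contain a number after a space, so it is worth checking the results against the real list before trusting them.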

sgtrock · April 2011

@colo;

Thank you for the links. I'll check them out.

As you surmised, the colon that I used was sort of a fake separator that was meant to represent what I was doing in the Map operator. When you look at the "Edit List" function you will see that each entry requires two items; the text that will be replaced and the replacing text. I attempted to use the colon to show the division between the two items for each entry.

The expression, v*, is defined in different ways in different engines. Many shell scripting languages and at least some dynamic languages will treat that expression as the letter 'v' and any other character string in any length. It's been a LONG time since I used perl and I've never programmed in Java, so I had forgotten about the nuance that you point out. I'll give that a shot.
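
The two readings of v* can be compared side by side in Java, which happens to ship both a shell-style glob matcher and a regex engine. This is only a sketch: the class and method names are invented for illustration, and the glob matcher is the one from `java.nio.file`, not anything RapidMiner uses.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.regex.Pattern;

class GlobVersusRegex {
    // Shell-style reading: * stands for "any run of characters".
    static boolean globMatches(String glob, String name) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(name));
    }

    // Regex reading: * repeats the element in front of it zero or more times.
    static boolean regexMatches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(globMatches("v*", "v6.7"));   // glob: v followed by anything
        System.out.println(regexMatches("v*", "v6.7"));  // regex: only a run of v's
        System.out.println(regexMatches("v*", "vvv"));
        System.out.println(regexMatches("v.*", "v6.7")); // regex equivalent of the glob
    }
}
```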

@haddock;

Unfortunately, in this instance RegEx Buddy's suggestion for Extra! is wrong. That was the first thing that I tried.

Anyone else have any comments?

haddock · April 2011

Er no. RegexBuddy was interpreting your suggestion, and showing that it was not going to work.

sgtrock · April 2011

D'oh! Sorry, I misinterpreted what you said. :-[

haddock · April 2011

Nay probs, I normally talk gibberish

That said, regex gives great wiggle room in lots of RM scenarios, and Expresso or RegexBuddy allow you to play, and learn thereby.

lists · April 2017

I think a really cool thing would be to offer some useful examples in the RM documentation.

I've been stuck on this for two hours now (and I've been using regex for 20 years).

As the OP says, it's implemented a different way every time.

matches(Attribute,"^(.*)WTF(.*)/s") doesn't work... on to the next one...

Thomas_Ott · April 2017

In cases like this, I like to refer to this REGEX manual:

https://twitter.com/thepracticaldev/status/774327033757372416?lang=en

LOL.

Edin_Klapic · April 2017

Hi lists,

I highly recommend updating your RapidMiner since we introduced some help regarding RegEx in one of our latest versions.

For example for the Replace operator you have a button where you can test the RegEx and also have a short info on the syntax.

Best regards,

Edin

sgenzer · April 2017

Agreed. I battle with various RegEx expressions all the time, and what works in one place does not work in RapidMiner and vice versa. This is my go-to link for RegEx in general but it does not always do the trick.

My current challenge (for anyone wanting a RegEx challenge!) is to get an expression that I can use in Split so that a nominal field will split every n occurrences of a character, instead of every one. For example:

att1: 1,2,3,4,5,6,7,8,9,10,11,12

If I just use Split on RegEx [,], I of course get

att1_1 att1_2 att1_3 att1_4 ...

1 2 3 4 ...

But what I want is

att1_1 att1_2 att1_3

1,2,3,4 5,6,7,8 9,10,11,12

I am literally pulling my hair out on this one! I'm happy to contribute to a community RapidMiner RegEx database any time.

Scott

Edin_Klapic · April 2017

Great Balazs!

Just for completeness: the reason this does not work with a single Split is that Split removes the text that matches the expression.

That is, even if you used the correct split pattern, the resulting attribute columns would be mostly empty.

Best,

Edin
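
The effect can be reproduced with Java's `String.split`, used here as a stand-in for the Split operator (that the operator behaves the same way is an assumption based on the description above). The pattern below is a hypothetical "match a whole group of four" expression, and the class and method names are invented for illustration.

```java
import java.util.Arrays;

class SplitConsumesMatch {
    // Splits with limit -1 so that empty pieces are kept and stay visible.
    static String[] splitKeepingEmpties(String value, String regex) {
        return value.split(regex, -1);
    }

    public static void main(String[] args) {
        String att1 = "1,2,3,4,5,6,7,8,9,10,11,12";
        // The pattern matches each group of four, so split discards exactly
        // the text we wanted to keep and returns only the (empty) gaps.
        String[] pieces = splitKeepingEmpties(att1, "([^,]+,){3}[^,]+,?");
        System.out.println(pieces.length + " pieces: " + Arrays.toString(pieces));
    }
}
```

Everything the pattern matches is treated as a delimiter and thrown away, which is why the Replace step is needed to plant a disposable marker first.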

sgenzer · April 2017

Balazs - that is very nice! Thank you! At my age all hairs on head are valuable. Beverage of choice is on me if you're in the neighborhood.

Scott

sgenzer · April 2017

And yes, Edin, you're right - Split always removes the selection. If I want to keep it, I have to do some wonky workarounds with Replace every time. Maybe a small improvement with Split would be a checkbox so that you could keep the selection and split directly before/after it?

Scott

lists · April 2017

I already have the new RM-version...but I saw the 'help-options' a little late. My fault.

Actually the example directly under the functions item in the list for regex-functions was enough, to give me the right direction, how it is implemented. The testing box of the replace operator is also great. Thank you RM-team!

So now the question of all questions...

Is there a function which uses regex and returns a needle in a haystack...not only true or false?

matches() & find() don't do this. I know there is a text-extension, but is there a native function?

Thank you very much to all.

And no, I won't read another regex tutorial...I've had too many of them over the last 20 years.

It's like alcohol. First it's funny, but then it drives you crazy.


BalazsBarany · April 2017

If you're interested in a match, use the common regular expression match syntax with the Replace operator.

Value: haystack hay hay needle hay hay hay

Regex: .*(needle).*

Replacement: $1

Will give you the needle in your attribute value. You can also use the replaceAll() function in Generate Attributes.

Here's a sample process:

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
        <list key="attribute_values">
          <parameter key="haystack" value="&quot;haystack hay hay needle hay hay whatever&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="7.4.000" expanded="true" height="82" name="Generate Attributes" width="90" x="246" y="34">
        <list key="function_descriptions">
          <parameter key="original" value="haystack"/>
          <parameter key="needle with Generate Attributes" value="replaceAll(haystack, &quot;.*(needle).*&quot;, &quot;$1&quot;)"/>
        </list>
      </operator>
      <operator activated="true" class="replace" compatibility="7.4.000" expanded="true" height="82" name="Replace" width="90" x="447" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="haystack"/>
        <parameter key="replace_what" value=".*(needle).*"/>
        <parameter key="replace_by" value="$1"/>
      </operator>
      <connect from_op="Generate Data by User Specification" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Replace" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
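
The same match-by-replace trick can be checked in plain Java, assuming the Replace operator and replaceAll() in Generate Attributes follow Java's `String.replaceAll` semantics. The class and method names are hypothetical.

```java
class NeedleByReplace {
    // Replaces the whole value by capturing group 1, i.e. by the needle itself.
    static String extractNeedle(String haystack) {
        return haystack.replaceAll(".*(needle).*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(extractNeedle("haystack hay hay needle hay hay hay"));
        // When the pattern does not match, nothing is replaced and the
        // original value comes back unchanged.
        System.out.println(extractNeedle("haystack hay hay hay"));
    }
}
```

Because an unmatched value is returned as is, the result only tells you the needle was found if it differs from the input.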


lists · April 2017

@BalazsBarany Thank you.

Actually I need a variable needle.

Something like

 replaceAll(haystack, "(hay)(.*)(hay)", "$1")

does not return the needle alone... Hm, where are the individual returns stored, and why should I use a replace?

I mean...

(hay)(.*)(hay)

...matches. But how to access matching group 2 in RM to write a new value?

In PHP, for example, I get an array of all matching groups as the return value.

Needle I need you!

https://www.youtube.com/watch?v=rNS6D4hSQdA

BalazsBarany · April 2017

RapidMiner is not PHP.

Here, you use $1, $2, $3 etc. to refer to the matched elements.

If you need a variable needle, you can use macros (for the Replace operator) or the attribute value in Generate Attributes.

lists · April 2017

I thought so...but needle is not in $1-$3.
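
It may help to see what each group actually holds for (hay)(.*)(hay). The sketch below uses Java's `Pattern` and `Matcher` directly, on the assumption that RapidMiner's $1, $2, $3 correspond to the same capturing groups; the class and method names are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupContents {
    // Returns group 0 (the whole match) followed by the capturing groups of
    // the first match, or an empty array if the pattern is not found.
    static String[] groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("(hay)(.*)(hay)", "haystack hay hay needle hay hay hay");
        for (int i = 1; i < g.length; i++) {
            System.out.println("$" + i + " = [" + g[i] + "]");
        }
    }
}
```

The first (hay) matches the start of "haystack" and the greedy (.*) runs up to the very last "hay", so the needle ends up buried inside $2 together with everything around it.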

Edin_Klapic · April 2017

Hi,

how about the following:

.*(hay) ([^h]\S*) (hay).*

Explanation:

The second capturing group (reflected by $2) starts with a character which is not h and uses only non-whitespace (blanks) characters. It therefore only matches "needle".

A shorter version would be .* ([^h]\S*) .* which can be reflected by $1.

Hope this helps,

Edin
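
Both patterns can be tried in plain Java, again assuming RapidMiner's Replace follows `String.replaceAll`. The class name, method names and the second test value are made up for illustration.

```java
class VariableNeedle {
    // Longer form: the needle is the word between two "hay" words, in group 2.
    static String betweenHay(String haystack) {
        return haystack.replaceAll(".*(hay) ([^h]\\S*) (hay).*", "$2");
    }

    // Shorter form: any space-delimited word that does not start with h, in group 1.
    static String firstNonH(String haystack) {
        return haystack.replaceAll(".* ([^h]\\S*) .*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(betweenHay("haystack hay hay needle hay hay hay"));
        System.out.println(firstNonH("haystack hay hay needle hay hay hay"));
        // The needle is not hard-coded, so a different word works too.
        System.out.println(betweenHay("hay hay pin hay hay"));
    }
}
```

Both patterns rely on the needle not starting with the letter h, so a needle such as "hook" would not be found this way.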

lists · April 2017

@Edin_Klapic

That's interesting.

This way, replacing makes sense, though it's not real matching and it's somehow not so flexible if I have different, longer texts. Anyway, now I understand the box "Replacement (for preview only)".

Thank you very much, Edin, I will play with it.

Meanwhile I started a new thread here...

http://community.rapidminer.com/t5/RapidMiner-Studio/Regular-Expressions-II-Needle-I-need-You/m-p/37749#M26007
