RapidMiner

Regular expressions?

SOLVED
Regular Contributor

I've been messing around with text processing for log analysis for almost 30 years.  I've used a variety of languages in that time.

One of the more painful facts that I've had to learn is that nobody implements regular expressions in quite the same way.  I'm tripping over this yet again with RapidMiner and it's becoming a real source of frustration for me.  The manual is silent on the subject.  (Given how critical this function is to data analysis, I found that lack to be disturbing to say the least!).

Does anyone out there know of a good resource for creating regex in RapidMiner?  Searching the forum archives repeatedly turns up references to Java's regular expression documentation.  That in turn refers back to _perl's_ regex documentation, with exceptions noted.

None of that documentation tells me how RapidMiner 5.0 actually interprets regexes.  Attempting to find the right syntax is chewing up a great deal of my time.  Is there a broad set of examples out there?  Or a truly thorough discussion of how to properly define regexes within the GUI?  I'd love to read something like that.

For example, take my current struggle.  I'm wading through a long list of software that was entered in by several different people over the years.  In addition, the rules for what was entered where and when changed as the database grew. 

The first thing that I want to do is simply count the number of versions of software that's out there.  Unfortunately, a lot of the old entries include the version number as part of the asset name, so I have to strip that out.  With me so far?

Here's a typical example (all examples from the same attribute, Asset):
Illustrator
Illustrator CS
Illustrator CS2
Illustrator CS3
Illustrator CS4
Dreamweaver
Dreamweaver CS3
Dreamweaver CS4
Photoshop
Photoshop CS
Photoshop CS4


In this instance, I'd love to just look for CS and CS[2-4] and strip them off the entries.
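
A minimal sketch of that idea in plain Java, assuming (as the replies in this thread suggest) that RapidMiner passes patterns to java.util.regex. The class and method names are made up for illustration, and the code runs outside RapidMiner. Putting `?` after the character class makes the digit optional, so one pattern covers both "CS" and "CS2" through "CS4":

```java
import java.util.regex.Pattern;

class SuiteSuffix {
    // Whitespace, then "CS", then an optional digit 2-4, anchored at the end of the value.
    private static final Pattern CS_SUFFIX = Pattern.compile("\\s+CS[2-4]?$");

    // Hypothetical helper: returns the asset name with any trailing CS marker removed.
    static String stripSuite(String asset) {
        return CS_SUFFIX.matcher(asset).replaceAll("");
    }

    public static void main(String[] args) {
        String[] assets = {"Illustrator", "Illustrator CS", "Illustrator CS3",
                           "Dreamweaver CS4", "Photoshop CS"};
        for (String asset : assets) {
            System.out.println(asset + " -> " + stripSuite(asset));
        }
    }
}
```

Because the program name is not part of the pattern, the same expression applies to every product in the suite. In a Java string literal each backslash is doubled; in a GUI text field it would presumably be typed once (`\s+CS[2-4]?$`).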

And another:
Extra! v6.7
Extra! v9


And another:

Netware Client v4.9 SP1 (IP)
Netware Client v4.9 SP1 (IP/IPX)
Netware Client 4


There's a lot more where these came from.  At the moment, I'm tackling this by going through the list, adding one entry at a time to a Map operator.  It's a painful, trial and error process at this point.

The issue that I'm struggling with is that I can't predict what end result I'm going to get.  For example, this works:

Illustrator CS[2-4]   :  Illustrator

but naturally doesn't eliminate the Illustrator CS entry.  I had to create a separate Map entry for it.  However, now I need to do the same for all the other programs in the Adobe Creative Suite.

This doesn't:
Extra!\sv*  :  Extra!


Neither does this:
Extra! v*  :  Extra!


Nor does this:
Netware Client *   :  Netware Client


Clearly, I'm doing something wrong.  (Maybe I'm not holding my lower lip right?  Smiley Very Happy)  I would love any guidance that anyone might have.
22 REPLIES
Regular Contributor

Re: Regular expressions?

Hi there Sgtrock,

Ah, the dreaded regex... Powerful but toxic. I use a specific tool, just because it is so easy to get it wrong. Here's what I mean.

Extra!

Extra!\sv*

Options: ^ and $ match at line breaks

Match the characters “Extra!” literally «Extra!»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s»
Match the character “v” literally «v*»
  Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»


Created with RegexBuddy


And Extra! version 2

Extra! v*

Options: ^ and $ match at line breaks

Match the characters “Extra! ” literally «Extra! »
Match the character “v” literally «v*»
  Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»


Created with RegexBuddy


And

Netware Client *

Options: ^ and $ match at line breaks

Match the characters “Netware Client” literally «Netware Client»
Match the character “ ” literally « *»
  Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»


Created with RegexBuddy


Looks like it is those pesky *s ! Personally I wouldn't touch regex without a lower lip protector - life's too short!

Have fun..

8)

Regular Contributor

Re: Regular expressions?

Hi sgtrock,

As far as I know, RapidMiner does indeed use the Java RegEx engine.

For the syntax of the relevant Pattern class see Java's API documentation: http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html - you could also take a look at their tutorial: http://download.oracle.com/javase/tutorial/essential/regex/index.html (another short and unofficial one: http://www.javaregex.com/tutorial.html).

If you want to use a supporting tool, I would recommend RegExBuddy (I think this is the tool haddock uses, but it's not free) or Expresso (free).

Your expression syntax using colons is new to me (I don't know many RegEx flavours), but I guess it is meant to invoke a replace operation. Another thing that seems suspicious to me is that you are using the asterisk right after single characters, as in v*, which means a repetition of the letter 'v' (this should be the same for every regex flavour). If you want to match anything after the 'v', you forgot the dot: v.*
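
To make the difference concrete, here is a small sketch in plain Java (java.util.regex, not RapidMiner itself) that prints what each pattern actually matches. The helper name firstMatch is invented for this example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StarVersusDotStar {
    // Returns the first substring of input matched by regex, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String extra = "Extra! v6.7";
        String netware = "Netware Client v4.9 SP1 (IP)";

        // v* means "zero or more letters v", so the match stops right after the v.
        System.out.println("[" + firstMatch("Extra! v*", extra) + "]");         // [Extra! v]
        // v.* means "a v followed by any characters", so the version is included.
        System.out.println("[" + firstMatch("Extra! v.*", extra) + "]");        // [Extra! v6.7]
        // " *" means "zero or more spaces", so only the name and one space match.
        System.out.println("[" + firstMatch("Netware Client *", netware) + "]"); // [Netware Client ]
    }
}
```

In all three cases the original patterns do match something, just not the version text that was supposed to be removed. How the Map operator applies a pattern (partial match or whole value) is not established here, so this only illustrates the regex semantics.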

In your particular case I would suggest using either the "Generate Extract" operator (if the version information has to be extracted and stored) and/or the "Replace" operator to clean up the software name. You can simply match the version information (it has to be the first capturing group, enclosed in round brackets) and replace the match with an empty string. For example, replace "(\sv.*)" with an empty string to get rid of the Extra! version numbers. You can also use $n to re-insert matching groups in the replace-by pattern (where n is the number of the matching group, used instead of the common \n).
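
The replacement described above can be tried outside RapidMiner with Java's own String.replaceAll, which uses the same regex engine and the same $n syntax. This is only a sketch: the class and method names are hypothetical, and the Replace operator itself is not involved.

```java
class VersionStripper {
    // Removes a whitespace character, a "v", and everything after it.
    static String stripVersion(String asset) {
        return asset.replaceAll("(\\sv.*)", "");
    }

    // Re-inserts captured groups with $1 and $2 instead of discarding them.
    static String labelVersion(String asset) {
        return asset.replaceAll("^(.*?)\\sv(.*)$", "$1 (version $2)");
    }

    public static void main(String[] args) {
        System.out.println(stripVersion("Extra! v6.7"));                  // Extra!
        System.out.println(stripVersion("Netware Client v4.9 SP1 (IP)")); // Netware Client
        System.out.println(stripVersion("Netware Client 4"));             // Netware Client 4
        System.out.println(labelVersion("Extra! v6.7"));                  // Extra! (version 6.7)
    }
}
```

Note that "Netware Client 4" is left alone because it has no "v" before the number, so it would need its own rule. The pattern also cuts at any " v" in a name; requiring a digit after the v (`\sv\d.*`) would be a stricter variant if that turns out to be a problem.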

Hope this helps a little bit...

Regards
Matthias
Regular Contributor

Re: Regular expressions?

@colo;

Thank you for the links.  I'll check them out.  

As you surmised, the colon that I used was sort of a fake separator that was meant to represent what I was doing in the Map operator.  When you look at the "Edit List" function you will see that each entry requires two items;  the text that will be replaced and the replacing text.  I attempted to use the colon to show the division between the two items for each entry.

The expression, v*, is defined in different ways in different engines.  Many shell scripting languages and at least some dynamic languages will treat that expression as the letter 'v' and any other character string in any length.  It's been a LONG time since I used perl and I've never programmed in Java, so I had forgotten about the nuance that you point out.  I'll give that a shot.

@haddock;

Unfortunately, in this instance RegEx Buddy's suggestion for Extra! is wrong.  That was the first thing that I tried.  Smiley Sad

Anyone else have any comments?
Regular Contributor

Re: Regular expressions?

Er no. RegexBuddy was interpreting your suggestion, and showing that it was not going to work.
Regular Contributor

Re: Regular expressions?

D'oh! Sorry, I misinterpreted what you said. :-[
Regular Contributor

Re: Regular expressions?

Nay probs, I normally talk gibberish  Smiley Very Happy  That said, regex gives great wiggle room in lots of RM scenarios,  and Expresso or RegexBuddy allow you to play, and learn thereby.
Regular Contributor

Re: Regular expressions?

I think a really cool thing would be to offer some useful examples in the RM documentation.

I've been stuck on this for two hours now (and I've been using regex for 20 years).

As the OP says, it's implemented a different way every time.

matches(Attribute,"^(.*)WTF(.*)/s") ... doesn't work. On to the next one... Man Embarassed
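
One possible reason, sketched in plain Java under the assumption that the pattern goes to java.util.regex and has to match the whole value: Java has no /.../s delimiter syntax, so a trailing "/s" is read as a literal slash followed by the letter s. The "dot matches newline" option is written as the inline flag (?s) at the start of the pattern instead. The class and method names below are invented for the example.

```java
import java.util.regex.Pattern;

class DotAllFlag {
    // True only if the regex matches the entire input (Pattern.matches semantics).
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String text = "line one\nWTF here\nline three";

        // "/s" is literal text here, and "." does not cross line breaks by default.
        System.out.println(wholeMatch("^(.*)WTF(.*)/s", text));       // false
        // (?s) switches on DOTALL, so "." also matches line terminators.
        System.out.println(wholeMatch("(?s)^(.*)WTF(.*)", text));     // true
        // The original pattern only matches values that really end in "/s".
        System.out.println(wholeMatch("^(.*)WTF(.*)/s", "aWTFb/s"));  // true
    }
}
```

Whether RapidMiner's matches() function behaves exactly like Pattern.matches is an assumption here, so it is worth checking against your own data.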

Community Manager

Re: Regular expressions?

In cases like this, I like to refer to this REGEX manual:

https://twitter.com/thepracticaldev/status/774327033757372416?lang=en

LOL.

Regards,
T-Bone
Twitter: @neuralmarket
RMStaff

Re: Regular expressions?

Hi lists,

I highly recommend updating your RapidMiner since we introduced some help regarding RegEx in one of our latest versions.

For example for the Replace operator you have a button where you can test the RegEx and also have a short info on the syntax.

Best regards,

Edin
