"Regular expressions?"
I've been messing around with text processing for log analysis for almost 30 years. I've used a variety of languages in that time.
One of the more painful facts that I've had to learn is that nobody implements regular expressions in quite the same way. I'm tripping over this yet again with RapidMiner and it's becoming a real source of frustration for me. The manual is silent on the subject. (Given how critical this function is to data analysis, I found that lack to be disturbing to say the least!).
Does anyone out there know of a good resource for creating regex in RapidMiner? Searching the forum archives repeatedly turns up references to Java's regular expression documentation. That in turn refers back to _perl's_ regex documentation, with exceptions noted.
None of that documentation tells me how RapidMiner 5.0 actually interprets regexes. Attempting to find the right syntax is chewing up a great deal of my time. Is there a broad set of examples out there? Or a truly thorough discussion of how to properly define regexes within the GUI? I'd love to read something like that.
For example, take my current struggle. I'm wading through a long list of software that was entered in by several different people over the years. In addition, the rules for what was entered where and when changed as the database grew.
The first thing that I want to do is simply count the number of versions of software that's out there. Unfortunately, a lot of the old entries include the version number as part of the asset name, so I have to strip that out. With me so far?
Here's a typical example (all examples from the same attribute, Asset):
Illustrator
Illustrator CS
Illustrator CS2
Illustrator CS3
Illustrator CS4
Dreamweaver
Dreamweaver CS3
Dreamweaver CS4
Photoshop
Photoshop CS
Photoshop CS4
In this instance, I'd love to just look for CS and CS[2-4] and strip them off the entries.
And another:
Extra! v6.7
Extra! v9
And another:
Netware Client v4.9 SP1 (IP)
Netware Client v4.9 SP1 (IP/IPX)
Netware Client 4
There's a lot more where these came from. At the moment, I'm tackling this by going through the list, adding one entry at a time to a Map operator. It's a painful, trial and error process at this point.
The issue that I'm struggling with is that I can't predict what end result I'm going to get. For example, this works:
Illustrator CS[2-4] : Illustrator
but naturally doesn't eliminate the Illustrator CS entry. I had to create a separate Map entry for it. However, now I need to do the same for all the other programs in the Adobe Creative Suite.
This doesn't:
Extra!\sv* : Extra!
Neither does this:
Extra! v* : Extra!
Nor does this:
Netware Client * : Netware Client
Clearly, I'm doing something wrong. (Maybe I'm not holding my lower lip right?) I would love any guidance that anyone might have.
Best Answer
BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
A regular expression challenge? I'll bite ;-) Hopefully I can save a few hairs on your head.
It doesn't work directly with Split, but if you do a replace before, you can solve it.
The replace replaces the 4th, 8th etc. comma with ||, the split uses that for splitting.
Replace, search string: (([^,]+,){3}[^,]+),
Replace, replacement: $1||
Split, pattern: \|\|
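The same replace-then-split idea can be tried outside RapidMiner. Below is a minimal sketch in plain Java, assuming (as reported elsewhere in this thread) that RapidMiner hands these patterns to the Java regex engine. The class and method names are made up for illustration and are not part of RapidMiner.

```java
import java.util.Arrays;

class SplitEveryFourth {
    // Hypothetical helper: mark every 4th comma with "||", then split on that marker.
    static String[] splitEveryFourthComma(String value) {
        // Same search string and replacement as in the Replace operator above.
        String marked = value.replaceAll("(([^,]+,){3}[^,]+),", "$1||");
        // Same pattern as in the Split operator; the pipes must be escaped.
        return marked.split("\\|\\|");
    }

    public static void main(String[] args) {
        for (String part : splitEveryFourthComma("1,2,3,4,5,6,7,8,9,10,11,12")) {
            System.out.println(part);
        }
    }
}
```

The last group has no trailing comma, so it is left alone by the replace and simply becomes the final piece of the split.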
<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
<list key="attribute_values">
<parameter key="att1" value="&quot;1,2,3,4,5,6,7,8,9,10,11,12&quot;"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="replace" compatibility="7.4.000" expanded="true" height="82" name="Replace" width="90" x="246" y="34">
<parameter key="replace_what" value="(([^,]+,){3}[^,]+),"/>
<parameter key="replace_by" value="$1||"/>
</operator>
<operator activated="true" class="split" compatibility="7.4.000" expanded="true" height="82" name="Split" width="90" x="447" y="34">
<parameter key="split_pattern" value="\|\|"/>
</operator>
<connect from_op="Generate Data by User Specification" from_port="output" to_op="Replace" to_port="example set input"/>
<connect from_op="Replace" from_port="example set output" to_op="Split" to_port="example set input"/>
<connect from_op="Split" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Greetings from Vienna,
Balázs
Answers
Ah, the dreaded regex... Powerful but toxic. I use a specific tool, just because it is so easy to get it wrong. Here's what I mean.
Extra! And Extra! version 2 And Looks like it is those pesky *s ! Personally I wouldn't touch regex without a lower lip protector - life's too short!
Have fun..
8)
As far as I know, RapidMiner does indeed use the Java RegEx engine.
For the syntax of the relevant Pattern class see Java's API documentation: http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html - you could also take a look at their tutorial: http://download.oracle.com/javase/tutorial/essential/regex/index.html (another short and unofficial one: http://www.javaregex.com/tutorial.html).
If you want to use a supporting tool, I would recommend RegExBuddy (I think this is the tool haddock uses, but it's not free) or Expresso (free).
Your expression syntax using colons is new to me (don't know many RegEx flavours), but I guess this shall invoke a replace operation. Another thing that seems suspicious to me is that you are using the asterisk right after single characters as in v* which means a repetition of the letter 'v' (should be the same for every regex flavour). If you want to match anything after the 'v' you forgot the dot: v.*
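To make the difference between v* and v.* concrete, here is a small sketch in plain Java. It assumes the Map operator's regex mode behaves like Java's String.replaceAll, which is an assumption and not something confirmed in this thread. The class name is invented for the example.

```java
class StarVersusDotStar {
    // Thin wrapper so both patterns go through the same call.
    static String replace(String input, String regex, String replacement) {
        return input.replaceAll(regex, replacement);
    }

    public static void main(String[] args) {
        // "v*" means "zero or more letters v", so only "Extra! v" is matched
        // and the digits are left behind.
        System.out.println(replace("Extra! v6.7", "Extra!\\sv*", "Extra!"));   // Extra!6.7

        // "v.*" means "a v followed by anything", so the whole version is matched.
        System.out.println(replace("Extra! v6.7", "Extra!\\sv.*", "Extra!"));  // Extra!
    }
}
```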
In your special case I would suggest to use either the "Generate Extract" operator (if the version information has to be extracted and stored) and/or the "Replace" operator to clean up the software name. You can simply match the version information (it has to be the first capturing group, enclosed in round brackets) and replace the match by an empty string. For example replace "(\sv.*)" by an empty string to get rid of the Extra! version numbers. You can also use $n to re-insert matching groups in the replace-by pattern (where n is the number of the matching group, used instead of the common \n).
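A rough Java sketch of that suggestion, again assuming the Replace operator behaves like Java's String.replaceAll. The helper names are hypothetical, and the second pattern is only there to illustrate $n group references.

```java
class VersionStripper {
    // Remove a whitespace character followed by "v" and everything after it.
    static String stripVersion(String asset) {
        return asset.replaceAll("(\\sv.*)", "");
    }

    // Illustrates $1 and $2: put the version first, the name second.
    static String versionFirst(String asset) {
        return asset.replaceAll("(.*)\\s(v.*)", "$2 | $1");
    }

    public static void main(String[] args) {
        System.out.println(stripVersion("Extra! v6.7"));
        System.out.println(stripVersion("Netware Client v4.9 SP1 (IP)"));
        System.out.println(stripVersion("Netware Client 4")); // no " v", so unchanged
        System.out.println(versionFirst("Extra! v6.7"));
    }
}
```

Note that "Netware Client 4" is not touched by this pattern, because there is no "v" in front of the version number; entries like that would need their own rule.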
Hope this helps a little bit...
Regards
Matthias
Thank you for the links. I'll check them out.
As you surmised, the colon that I used was sort of a fake separator that was meant to represent what I was doing in the Map operator. When you look at the "Edit List" function you will see that each entry requires two items; the text that will be replaced and the replacing text. I attempted to use the colon to show the division between the two items for each entry.
The expression, v*, is defined in different ways in different engines. Many shell scripting languages and at least some dynamic languages will treat that expression as the letter 'v' and any other character string in any length. It's been a LONG time since I used perl and I've never programmed in Java, so I had forgotten about the nuance that you point out. I'll give that a shot.
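The two readings of v* can be compared side by side in Java, which happens to ship both a shell-style glob matcher and a regex engine. This is only a sketch to illustrate the distinction; the class and method names are made up, and nothing here is RapidMiner-specific.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class GlobVersusRegex {
    // Shell-style glob: "*" stands for any run of characters.
    static boolean globMatches(String glob, String name) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(name));
    }

    // Regex: "*" repeats the preceding element zero or more times.
    // String.matches requires the whole text to match.
    static boolean regexMatches(String regex, String text) {
        return text.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(globMatches("v*", "v6.7"));   // glob reading
        System.out.println(regexMatches("v*", "v6.7"));  // regex reading of the same text
        System.out.println(regexMatches("v.*", "v6.7")); // regex equivalent of the glob
    }
}
```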
@haddock;
Unfortunately, in this instance RegEx Buddy's suggestion for Extra! is wrong. That was the first thing that I tried.
Anyone else have any comments?
I think a real cool thing would be to offer some useful examples in the RM-documentation.
I've been stuck on it for two hours now (and I've been using regex for 20 years).
As the OP says, it's implemented a different way every time.
matches(Attribute,"^(.*)WTF(.*)/s") ... doesn't work. On to the next one...
In cases like this, I like to refer to this REGEX manual:
https://twitter.com/thepracticaldev/status/774327033757372416?lang=en
LOL.
Hi lists,
I highly recommend updating your RapidMiner since we introduced some help regarding RegEx in one of our latest versions.
For example for the Replace operator you have a button where you can test the RegEx and also have a short info on the syntax.
Best regards,
Edin
agreed. I battle with various RegEx expressions all the time, and what works in one place does not work in RapidMiner and vice versa. This is my go-to link for RegEx in general but it does not always do the trick.
My current challenge (for anyone wanting a RegEx challenge!) is to get an expression that I can use in Split so that a nominal field will split every n occurrences of a character, instead of every one. For example:
att1: 1,2,3,4,5,6,7,8,9,10,11,12
If I just use Split on RegEx [,], I of course get
att1_1 att1_2 att1_3 att1_4 ...
1 2 3 4 ...
But what I want is
att1_1 att1_2 att1_3
1,2,3,4 5,6,7,8 9,10,11,12
I am literally pulling my hair out on this one! I'm happy to contribute to a community RapidMiner RegEx database any time.
Scott
Great Balazs!
Just for completeness: the reason why this does not work with a single Split is that Split removes the text that matches the expression.
I.e. even if you used the correct split pattern, the resulting attribute columns are mostly empty
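The effect can be reproduced with Java's own String.split, which also throws away whatever the pattern matched. This is a sketch under the assumption that RapidMiner's Split behaves comparably; the class name is invented.

```java
class SplitConsumesMatch {
    // Split directly on the "four fields plus comma" pattern, without a prior Replace.
    static String[] naiveSplit(String value) {
        return value.split("(([^,]+,){3}[^,]+),");
    }

    public static void main(String[] args) {
        String[] parts = naiveSplit("1,2,3,4,5,6,7,8,9,10,11,12");
        // The pattern matches "1,2,3,4," and "5,6,7,8," as delimiters,
        // so those values are consumed and only empty pieces remain in their place.
        for (int i = 0; i < parts.length; i++) {
            System.out.println(i + ": [" + parts[i] + "]");
        }
    }
}
```

Only the final group survives, because it is the only text that was not part of a match.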
Best,
Edin
Balazs - that is very nice! Thank you! At my age all hairs on head are valuable. Beverage of choice is on me if you're in the neighborhood.
Scott
and yes Edin you're right - Split always removes the selection. If I want to keep it, I have to do some wonky workarounds with Replace every time. Maybe a small improvement with Split would be a checkbox so that you could keep the selection and split directly before/after it?
Scott
I already have the new RM-version...but I saw the 'help-options' a little late. My fault.
Actually the example directly under the functions item in the list for regex-functions was enough, to give me the right direction, how it is implemented. The testing box of the replace operator is also great. Thank you RM-team!
So now the question of all questions...
Is there a function which uses regex and returns a needle in a haystack... not only true or false?
matches() & find() don't do this. I know there is a text extension, but is there a native function?
Thank you very much to all.
And no, I won't read another regex tutorial... I've had too much of them over the last 20 years.
It's like alcohol. First it's funny, but then it drives you crazy.
If you're interested in a match, use the common regular expression match syntax with the Replace operator.
Value: haystack hay hay needle hay hay hay
Regex: .*(needle).*
Replacement: $1
Will give you the needle in your attribute value. You can also use the replaceAll() function in Generate Attributes.
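In plain Java terms, the trick looks roughly like this. It is a sketch assuming Java's replaceAll semantics, and the class name is hypothetical.

```java
class NeedleByReplace {
    // Match the whole value, capture the needle, and keep only the capture.
    static String extract(String value) {
        return value.replaceAll(".*(needle).*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(extract("haystack hay hay needle hay hay hay"));
        // When the pattern does not match, nothing is replaced
        // and the original value comes back unchanged.
        System.out.println(extract("hay hay hay"));
    }
}
```

The leading and trailing .* are what make this work: they swallow everything around the needle, so the replacement overwrites the entire value with just the captured group.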
Here's a sample process:
@BalazsBarany Thank you.
Actually I need a variable needle.
Something like
doesn't return the needle alone... hm, where are the individual returns stored, and why should I use a replace?
I mean...
...matches. But how to access matching group 2 in RM to write a new value?
In PHP for example, I get an array of all matching groups as return.
Needle I need you!
https://www.youtube.com/watch?v=rNS6D4hSQdA
RapidMiner is not PHP.
Here, you use $1, $2, $3 etc. to refer to the matched elements.
If you need a variable needle, you can use macros (for the Replace operator) or the attribute value in Generate Attributes.
I thought so...but needle is not in $1-$3.
Hi,
how about the following:
Explanation:
The second capturing group (reflected by $2) starts with a character which is not h and otherwise contains only non-whitespace characters (no blanks). It therefore only matches "needle".
A shorter version would be .* ([^h]\S*) .* which can be reflected by $1.
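The shorter version can be checked in plain Java like this. It is a sketch only, assuming the same behaviour as Java's replaceAll; the class name is made up.

```java
class VariableNeedle {
    // Keep the word that is surrounded by blanks and does not start with "h".
    static String extract(String value) {
        return value.replaceAll(".* ([^h]\\S*) .*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(extract("haystack hay hay needle hay hay hay"));
    }
}
```

Because the first .* is greedy, a value containing several such words would yield the last one, and a needle at the very start or end of the value (with no blank on one side) would not be matched at all.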
Hope this helps,
Edin
@Edin_Klapic
That's interesting.
Used this way, replacing makes sense, though it's not real matching and it's somehow not so flexible if I have different, longer text. Anyway, now I understand the box "Replacement (for preview only)".
Thank you very much Edin, I will play with it.
Meanwhile I started a new thread here...
http://community.rapidminer.com/t5/RapidMiner-Studio/Regular-Expressions-II-Needle-I-need-You/m-p/37749#M26007