"Regular expressions?"

sgtrock Member Posts: 17 Contributor II
edited May 2019 in Help
I've been messing around with text processing for log analysis for almost 30 years.  I've used a variety of languages in that time.

One of the more painful facts that I've had to learn is that nobody implements regular expressions in quite the same way.  I'm tripping over this yet again with RapidMiner and it's becoming a real source of frustration for me.  The manual is silent on the subject.  (Given how critical this function is to data analysis, I found that lack to be disturbing to say the least!).

Does anyone out there know of a good resource for creating regex in RapidMiner?  Searching the forum archives repeatedly turns up references to Java's regular expression documentation.  That in turn refers back to _perl's_ regex documentation, with exceptions noted.

None of that documentation tells me how RapidMiner 5.0 actually interprets regexes.  Attempting to find the right syntax is chewing up a great deal of my time.  Is there a broad set of examples out there?  Or a truly thorough discussion of how to properly define regexes within the GUI?  I'd love to read something like that.

For example, take my current struggle.  I'm wading through a long list of software that was entered in by several different people over the years.  In addition, the rules for what was entered where and when changed as the database grew. 

The first thing that I want to do is simply count the number of versions of software that's out there.  Unfortunately, a lot of the old entries include the version number as part of the asset name, so I have to strip that out.  With me so far?

Here's a typical example (all examples from the same attribute, Asset):
Illustrator
Illustrator CS
Illustrator CS2
Illustrator CS3
Illustrator CS4
Dreamweaver
Dreamweaver CS3
Dreamweaver CS4
Photoshop
Photoshop CS
Photoshop CS4
In this instance, I'd love to just look for CS and CS[2-4] and strip them off the entries.

And another:
Extra! v6.7
Extra! v9
And another:

Netware Client v4.9 SP1 (IP)
Netware Client v4.9 SP1 (IP/IPX)
Netware Client 4
There's a lot more where these came from.  At the moment, I'm tackling this by going through the list, adding one entry at a time to a Map operator.  It's a painful, trial and error process at this point.

The issue that I'm struggling with is that I can't predict what end result I'm going to get.  For example, this works:
Illustrator CS[2-4]   :  Illustrator
but naturally doesn't eliminate the Illustrator CS entry.  I had to create a separate Map entry for it.  However, now I need to do the same for all the other programs in the Adobe Creative Suite.

This doesn't:
Extra!\sv*  :  Extra!
Neither does this:
Extra! v*  :  Extra!
Nor does this:
Netware Client *   :  Netware Client
Clearly, I'm doing something wrong.  (Maybe I'm not holding my lower lip right?  :D)  I would love any guidance that anyone might have.
Best Answer

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Solution Accepted

    A regular expression challenge? I'll bite ;-) Hopefully I can save a few hairs on your head.

     

    It doesn't work directly with Split, but if you run a Replace first, you can solve it.

    The Replace turns the 4th, 8th, etc. comma into ||, and the Split then uses || as its pattern.

    Replace, search string: (([^,]+,){3}[^,]+),

    Replace, replacement: $1||

    Split, pattern: \|\|

     

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
            <list key="attribute_values">
              <parameter key="att1" value="&quot;1,2,3,4,5,6,7,8,9,10,11,12&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="replace" compatibility="7.4.000" expanded="true" height="82" name="Replace" width="90" x="246" y="34">
            <parameter key="replace_what" value="(([^,]+,){3}[^,]+),"/>
            <parameter key="replace_by" value="$1||"/>
          </operator>
          <operator activated="true" class="split" compatibility="7.4.000" expanded="true" height="82" name="Split" width="90" x="447" y="34">
            <parameter key="split_pattern" value="\|\|"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Replace" to_port="example set input"/>
          <connect from_op="Replace" from_port="example set output" to_op="Split" to_port="example set input"/>
          <connect from_op="Split" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    Greetings from Vienna,

    Balázs
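
For readers who want to try the replace-then-split idea outside RapidMiner, here is a minimal sketch in plain Java. It assumes that RapidMiner's Replace and Split operators behave like Java's `String.replaceAll` and `String.split`, which the thread suggests but does not confirm; the class and method names are made up for illustration.

```java
import java.util.Arrays;

final class SplitEveryNthComma {
    // Step 1: turn every 4th comma into "||". Step 2: split on the literal "||".
    static String[] splitInGroupsOfFour(String value) {
        // Group 1 captures four comma-separated items; the comma after them
        // sits outside the group, so "$1||" drops it and appends the marker.
        String marked = value.replaceAll("(([^,]+,){3}[^,]+),", "$1||");
        // '|' is a regex metacharacter, so it has to be escaped when splitting.
        return marked.split("\\|\\|");
    }

    public static void main(String[] args) {
        String[] parts = splitInGroupsOfFour("1,2,3,4,5,6,7,8,9,10,11,12");
        System.out.println(Arrays.toString(parts)); // [1,2,3,4, 5,6,7,8, 9,10,11,12]
    }
}
```

The last group keeps its items because it has no trailing comma, so the search pattern never matches there.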

Answers

  • haddock Member Posts: 849 Maven
    Hi there Sgtrock,

    Ah, the dreaded regex... Powerful but toxic. I use a specific tool, just because it is so easy to get it wrong. Here's what I mean.

    Extra!
    Extra!\sv*

    Options: ^ and $ match at line breaks

    Match the characters “Extra!” literally «Extra!»
    Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s»
    Match the character “v” literally «v*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»


    Created with RegexBuddy
    And Extra! version 2
    Extra! v*

    Options: ^ and $ match at line breaks

    Match the characters “Extra! ” literally «Extra! »
    Match the character “v” literally «v*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»


    Created with RegexBuddy
    And
    Netware Client *

    Options: ^ and $ match at line breaks

    Match the characters “Netware Client” literally «Netware Client»
    Match the character “ ” literally « *»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»


    Created with RegexBuddy
    Looks like it is those pesky *s ! Personally I wouldn't touch regex without a lower lip protector - life's too short!

    Have fun..

    8)
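
The RegexBuddy breakdown above can be reproduced with a few lines of Java. This is only a sketch: `firstMatch` is a hypothetical helper, and it shows what the Java regex engine matches, not necessarily what the Map operator does with the result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierScope {
    // Returns the first substring of input matched by regex, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // '*' repeats only the 'v', so the version digits are never consumed.
        System.out.println("[" + firstMatch("Extra!\\sv*", "Extra! v6.7") + "]");   // [Extra! v]
        // Here '*' repeats only the trailing space.
        System.out.println("[" + firstMatch("Netware Client *", "Netware Client v4.9 SP1 (IP)") + "]"); // [Netware Client ]
        // '.' means "any character", so ".*" consumes the rest of the value.
        System.out.println("[" + firstMatch("Extra!\\sv.*", "Extra! v6.7") + "]");  // [Extra! v6.7]
    }
}
```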

  • colo Member Posts: 236 Maven
    Hi sgtrock,

    as far as I know RapidMiner indeed uses the Java RegEx engine.

    For the syntax of the relevant Pattern class see Java's API-documentation: http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html - you could also take a look at their tutorial: http://download.oracle.com/javase/tutorial/essential/regex/index.html (another short and unofficial one: http://www.javaregex.com/tutorial.html).

    If you want to use a supporting tool, I would recommend RegExBuddy (I think this is the tool haddock uses, but it's not free) or Expresso (free).

    Your expression syntax using colons is new to me (I don't know many RegEx flavours), but I guess it is meant to invoke a replace operation. Another thing that seems suspicious to me is that you are using the asterisk right after single characters, as in v*, which means a repetition of the letter 'v' (this should be the same in every regex flavour). If you want to match anything after the 'v', you forgot the dot: v.*

    In your special case I would suggest using either the "Generate Extract" operator (if the version information has to be extracted and stored) and/or the "Replace" operator to clean up the software name. You can simply match the version information (it has to be the first capturing group, enclosed in round brackets) and replace the match by an empty string. For example, replace "(\sv.*)" by an empty string to get rid of the Extra! version numbers. You can also use $n to re-insert matching groups in the replace-by pattern (where n is the number of the matching group, used instead of the common \n).

    Hope this helps a little bit...

    Regards
    Matthias
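
As a rough illustration of the cleanup suggested above, the following Java sketch strips version suffixes from the asset names quoted in the original question and counts the distinct names that remain. The two patterns are assumptions chosen for these sample values (the second one is a slightly stricter variant of `\sv.*` that requires a digit after the `v`), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class AssetNameCleaner {
    // Removes a trailing " CS"/" CS2".." suffix and a " v<digit>..." version tail.
    static String clean(String asset) {
        return asset
            .replaceAll("\\sCS\\d?$", "")    // " CS", " CS2", " CS3", " CS4"
            .replaceAll("\\sv\\d.*$", "");   // " v6.7", " v9", " v4.9 SP1 (IP)"
    }

    // Collects the cleaned names in sorted order, without duplicates.
    static Set<String> distinctNames(List<String> assets) {
        Set<String> names = new TreeSet<>();
        for (String asset : assets) {
            names.add(clean(asset));
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> assets = Arrays.asList(
            "Illustrator", "Illustrator CS", "Illustrator CS3", "Dreamweaver CS4",
            "Photoshop", "Extra! v6.7", "Extra! v9",
            "Netware Client v4.9 SP1 (IP)", "Netware Client 4");
        // "Netware Client 4" has no "v" prefix, so neither pattern touches it.
        System.out.println(distinctNames(assets));
        // [Dreamweaver, Extra!, Illustrator, Netware Client, Netware Client 4, Photoshop]
    }
}
```

Entries such as "Netware Client 4" show that a bare trailing number still needs its own rule; whether to strip it depends on the data.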
  • sgtrock Member Posts: 17 Contributor II
    @colo;

    Thank you for the links.  I'll check them out.  

    As you surmised, the colon that I used was sort of a fake separator that was meant to represent what I was doing in the Map operator.  When you look at the "Edit List" function you will see that each entry requires two items;  the text that will be replaced and the replacing text.  I attempted to use the colon to show the division between the two items for each entry.

    The expression, v*, is defined in different ways in different engines.  Many shell scripting languages and at least some dynamic languages will treat that expression as the letter 'v' and any other character string in any length.  It's been a LONG time since I used perl and I've never programmed in Java, so I had forgotten about the nuance that you point out.  I'll give that a shot.
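
The difference described here is the one between shell-style globs and regular expressions. A small Java sketch can show both readings of `v*` side by side; it uses the JDK's glob support in `java.nio.file` purely as a stand-in for shell wildcard matching, and the helper names are made up.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

final class GlobVersusRegex {
    // Shell-style reading: '*' stands for "any run of characters".
    static boolean globMatches(String glob, String name) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(name));
    }

    // Regex reading: '*' repeats the preceding element zero or more times.
    static boolean regexMatches(String regex, String text) {
        return text.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(globMatches("v*", "v6.7"));    // true
        System.out.println(regexMatches("v*", "v6.7"));   // false: only runs of 'v' match
        System.out.println(regexMatches("v*", "vvv"));    // true
        System.out.println(regexMatches("v.*", "v6.7"));  // true: ".*" is the regex spelling of glob '*'
    }
}
```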

    @haddock;

    Unfortunately, in this instance RegEx Buddy's suggestion for Extra! is wrong.  That was the first thing that I tried.  :(

    Anyone else have any comments?
  • haddock Member Posts: 849 Maven
    Er no. RegexBuddy was interpreting your suggestion, and showing that it was not going to work.
  • sgtrock Member Posts: 17 Contributor II
    D'oh! Sorry, I misinterpreted what you said. :-[
  • haddock Member Posts: 849 Maven
    Nay probs, I normally talk gibberish  :D  That said, regex gives great wiggle room in lots of RM scenarios,  and Expresso or RegexBuddy allow you to play, and learn thereby.
  • lists Member Posts: 39 Maven

    I think a really cool thing would be to offer some useful examples in the RM documentation.

    I've been stuck on this for two hours now (and I've been using regex for 20 years).

    As the OP says, it's implemented differently every time.

    matches(Attribute,"^(.*)WTF(.*)/s") ... doesn't work. On to the next one...
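
One likely reason the expression above fails: in the Java flavour there are no `/.../s` delimiters, so a trailing `/s` is read as a literal slash followed by the letter s, and the "dot matches newline" flag is written inline as `(?s)`. The sketch below assumes that RapidMiner's `matches()` behaves like Java's `String.matches`, which requires the whole value to match; that assumption is not confirmed in this thread.

```java
final class WholeStringMatch {
    // Thin wrapper around String.matches, which must match the entire text.
    static boolean matches(String text, String regex) {
        return text.matches(regex);
    }

    public static void main(String[] args) {
        // "/s" is taken literally, so the value would have to end in "/s".
        System.out.println(matches("abc WTF def", "^(.*)WTF(.*)/s"));    // false
        System.out.println(matches("abc WTF def/s", "^(.*)WTF(.*)/s"));  // true
        // Without the Perl-style suffix the pattern works as intended.
        System.out.println(matches("abc WTF def", ".*WTF.*"));           // true
        // By default '.' does not match line breaks; (?s) switches that on.
        System.out.println(matches("abc\nWTF\ndef", ".*WTF.*"));         // false
        System.out.println(matches("abc\nWTF\ndef", "(?s).*WTF.*"));     // true
    }
}
```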

     

     

     

     

     

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    In cases like this, I like to refer to this REGEX manual:

     

    https://twitter.com/thepracticaldev/status/774327033757372416?lang=en

     

    LOL.

  • Edin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist

    Hi lists,

     

    I highly recommend updating your RapidMiner since we introduced some help regarding RegEx in one of our latest versions.

    For example for the Replace operator you have a button where you can test the RegEx and also have a short info on the syntax.

     

    Best regards,

    Edin

     


     

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Agreed.  I battle with various RegEx expressions all the time, and what works in one place does not work in RapidMiner and vice versa.  This is my go-to link for RegEx in general but it does not always do the trick.

    My current challenge (for anyone wanting a RegEx challenge!) is to get an expression that I can use in Split so that a nominal field will split every n occurrences of a character, instead of every one.  For example:

     

    att1:       1,2,3,4,5,6,7,8,9,10,11,12

     

    If I just use Split on RegEx [,], I of course get

    att1_1    att1_2    att1_3    att1_4    ...
    1         2         3         4         ...

    But what I want is

    att1_1     att1_2     att1_3
    1,2,3,4    5,6,7,8    9,10,11,12

     

    I am literally pulling my hair out on this one!  I'm happy to contribute to a community RapidMiner RegEx database any time.

     

    Scott

  • Edin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist

    Great, Balazs!

    Just for completeness: the reason this does not work with a single Split is that Split removes the text that matches the expression.

    I.e., even if you used the correct split pattern, the resulting attribute columns would be mostly empty :)

     

    Best,

    Edin
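
The effect described above can be seen with Java's `String.split`, which discards whatever the pattern matches. The pattern below is just one plausible single-step attempt at "four items per piece", not a pattern taken from the thread, and the sketch assumes the Split operator treats matches the same way Java does.

```java
import java.util.Arrays;

final class SplitDropsMatches {
    // Naive attempt: use "four items plus their commas" as the split pattern.
    static String[] naiveSplit(String value) {
        // Everything the pattern matches is treated as a delimiter and thrown
        // away, including the items themselves.
        return value.split("([^,]+,){4}");
    }

    public static void main(String[] args) {
        String[] parts = naiveSplit("1,2,3,4,5,6,7,8,9,10,11,12");
        // Two empty strings, then the only text that was never matched.
        System.out.println(Arrays.toString(parts)); // [, , 9,10,11,12]
    }
}
```

This is why the accepted approach marks the split points with Replace first and only then splits on the marker.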

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Balazs - that is very nice!  Thank you!  At my age all hairs on head are valuable.  Beverage of choice is on me if you're in the neighborhood. :)

     

    Scott

     

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    and yes Edin you're right - Split always removes the selection.  If I want to keep it, I have to do some wonky workarounds with Replace every time.  Maybe a small improvement with Split would be a checkbox so that you could keep the selection and split directly before/after it?

     

    Scott

  • lists Member Posts: 39 Maven

    I already have the new RM version... but I saw the 'help options' a little late. My fault.

    Actually, the example directly under the function's entry in the list of regex functions was enough to point me in the right direction on how it is implemented. The testing box of the Replace operator is also great. Thank you, RM team!

    So now the question of all questions...

    Is there a function which uses regex and returns a needle in a haystack, not only true or false?

    matches() & find() don't do this. I know there is a text extension, but is there a native function?

    Thank you very much to all.

    And no, I won't read another regex tutorial... I've had too many of them over the last 20 years.

    It's like alcohol. First it's funny, but then it drives you crazy.

    :heart:

     

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    If you're interested in a match, use the common regular expression match syntax with the Replace operator.

     

    Value: haystack hay hay needle hay hay hay

     

    Regex: .*(needle).*

    Replacement: $1

     

    Will give you the needle in your attribute value. You can also use the replaceAll() function in Generate Attributes.

     

    Here's a sample process:

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
            <list key="attribute_values">
              <parameter key="haystack" value="&quot;haystack hay hay needle hay hay whatever&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="7.4.000" expanded="true" height="82" name="Generate Attributes" width="90" x="246" y="34">
            <list key="function_descriptions">
              <parameter key="original" value="haystack"/>
              <parameter key="needle with Generate Attributes" value="replaceAll(haystack, &quot;.*(needle).*&quot;, &quot;$1&quot;)"/>
            </list>
          </operator>
          <operator activated="true" class="replace" compatibility="7.4.000" expanded="true" height="82" name="Replace" width="90" x="447" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="haystack"/>
            <parameter key="replace_what" value=".*(needle).*"/>
            <parameter key="replace_by" value="$1"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Replace" to_port="example set input"/>
          <connect from_op="Replace" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
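
The same "replace the whole value by the captured group" trick, written as a plain Java sketch. It assumes the RapidMiner `replaceAll()` function mirrors Java's `String.replaceAll`; the `extract` helper is hypothetical.

```java
final class NeedleExtraction {
    // Wraps the needle pattern in ".*( ... ).*" so the match covers the whole
    // value; replacing that match with "$1" leaves only the captured group.
    static String extract(String haystack, String needleRegex) {
        return haystack.replaceAll(".*(" + needleRegex + ").*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(extract("haystack hay hay needle hay hay hay", "needle")); // needle
        // Caveat: if the pattern does not match, the value comes back unchanged.
        System.out.println(extract("only hay here", "needle"));                       // only hay here
    }
}
```

The second call is worth keeping in mind: a replace-based extraction cannot signal "no match", so unmatched rows keep their original text.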
  • lists Member Posts: 39 Maven

    @BalazsBarany Thank you.

    Actually, I need a variable needle.

    Something like

    replaceAll(haystack, "(hay)(.*)(hay)", "$1")

    does not return the needle alone... Hm, where are the individual matches stored, and why should I use a replace?

    I mean...

    (hay)(.*)(hay)

    ...matches. But how do I access matching group 2 in RM to write a new value?

    In PHP, for example, I get an array of all matching groups as the return value.

     


     

    Needle I need you!

    https://www.youtube.com/watch?v=rNS6D4hSQdA

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    RapidMiner is not PHP. 

     

    Here, you use $1, $2, $3 etc. to refer to the matched elements.

     

    If you need a variable needle, you can use macros (for the Replace operator) or the attribute value in Generate Attributes.
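
To see what `$1`, `$2` and `$3` actually hold for the pattern `(hay)(.*)(hay)`, the groups can be listed with Java's `Matcher`. This is a sketch in plain Java, not RapidMiner code; `groups` is a hypothetical helper.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupInspector {
    // Returns groups 0..n of the first match, or an empty array if nothing matches.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("(hay)(.*)(hay)", "haystack hay hay needle hay hay whatever");
        System.out.println("$1=[" + g[1] + "]"); // $1=[hay]
        // The greedy ".*" runs from the first "hay" (inside "haystack") up to
        // the last "hay", so group 2 holds far more than the needle.
        System.out.println("$2=[" + g[2] + "]"); // $2=[stack hay hay needle hay ]
        System.out.println("$3=[" + g[3] + "]"); // $3=[hay]
    }
}
```

So the groups are filled, but with this pattern none of them isolates the needle; the pattern has to describe the needle's surroundings more tightly.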

  • lists Member Posts: 39 Maven

    I thought so...but needle is not in $1-$3.

  • Edin_Klapic Moderator, Employee, RMResearcher, Member Posts: 299 RM Data Scientist

    Hi,

     

    how about the following:

     

    .*(hay) ([^h]\S*) (hay).*

    Explanation:

    The second capturing group (reflected by $2) starts with a character which is not h and uses only non-whitespace (blanks) characters. It therefore only matches "needle".

    A shorter version would be .* ([^h]\S*) .* which can be reflected by $1.

     

    Hope this helps,

    Edin
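
A quick way to check the pattern suggested above is to run it through Java's `String.replaceAll`, again assuming RapidMiner's Replace behaves the same way. The `needle` helper and the extra sample values are made up for illustration.

```java
final class NeedleBetweenHay {
    // Replaces the whole value with group 2: a token between two "hay" words
    // that does not start with 'h' and contains no whitespace.
    static String needle(String haystack) {
        return haystack.replaceAll(".*(hay) ([^h]\\S*) (hay).*", "$2");
    }

    public static void main(String[] args) {
        System.out.println(needle("haystack hay hay needle hay hay whatever")); // needle
        System.out.println(needle("haystack hay hay pin hay hay hay"));         // pin
        // Limitation: a needle that starts with 'h' is rejected by [^h],
        // so nothing matches and the value is returned unchanged.
        System.out.println(needle("hay hammer hay"));                           // hay hammer hay
    }
}
```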

     


  • lists Member Posts: 39 Maven

    @Edin_Klapic

    That's interesting.

    This way, replacing makes sense, though it's not real matching, and it's somewhat less flexible if I have different, longer text. Anyway, now I understand the box "Replacement (for preview only)".

    Thank you very much, Edin, I will play with it.

     

    Meanwhile I started a new thread here...

     

    http://community.rapidminer.com/t5/RapidMiner-Studio/Regular-Expressions-II-Needle-I-need-You/m-p/37749#M26007

     

     

     

     

     
