function behaviour with replaceAll()

kayman Member Posts: 662 Unicorn
edited November 2018 in Help

When using the replaceAll function, it seems some nested functions are ignored while others seem to work fine.

 

As an example :

 

replaceAll(lower([myField]),"^(.)",upper("$1")) just returns the string as is, whereas the expected behaviour would be to get the first character back in upper case. No error is thrown; the upper (and also lower) function is simply ignored when it is applied to the regex result.

 

replaceAll([myField],"^(.)",concat("-","$1","-")) nicely returns a concatenated field, as expected. So here the function works nicely with the regex match.
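One possible explanation can be sketched in plain Java. This is only a sketch under two assumptions that are not confirmed in this thread: that the expression parser evaluates the inner function first, and that the result is then handed to something like Java's String.replaceAll. The class and method names are made up for illustration.

```java
public class ReplacementEvaluationOrder {

    // Mimics replaceAll(lower(field), "^(.)", upper("$1")).
    static String upperAttempt(String field) {
        // The inner call runs before any matching happens. "$1" upper-cased
        // is still "$1", because '$' and '1' have no upper-case form.
        String replacement = "$1".toUpperCase();
        return field.toLowerCase().replaceAll("^(.)", replacement);
    }

    // Mimics replaceAll(field, "^(.)", concat("-", "$1", "-")).
    static String concatAttempt(String field) {
        // The inner call yields the literal text "-$1-", which is still a
        // valid replacement pattern containing a group reference.
        String replacement = "-" + "$1" + "-";
        return field.replaceAll("^(.)", replacement);
    }

    public static void main(String[] args) {
        System.out.println(upperAttempt("Hello"));  // prints "hello"
        System.out.println(concatAttempt("Hello")); // prints "-H-ello"
    }
}
```

If that assumption holds, upper and lower are not really ignored: they run on the two characters "$1" rather than on the matched text, so they have nothing to change. It could also explain why a function that returns a number, such as length, does not produce a usable replacement.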

 

Any idea why?

 

(PS: I'm aware I can get the wanted result with other functions, but that would only work for this simplified example, as my actual regex is a bit more complex.)

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hi @kayman - that is one nasty RegEx you are building there in Generate Attributes :)

     

    I have no idea why you would see that strange behavior BUT if it were me, I would build that expression in three Generate Attributes operators rather than in one nasty formula:

     

    att1   lower([myField])

    att2   upper("$1")             [--- not sure what this does...upper case $? ---]

    att3   replaceAll(att1,"^(.)",att2)

     

    Scott

     

     

     

  • kayman Member Posts: 662 Unicorn

    Trust me, that's not a nasty one :-)

     

    What I want to achieve is to camel case some uppercased content in a given string, so the attribute flow will not work.

    Example :

     

    Attr : This is a STRING

    Should become

    Attr : This is a String

     

    Now, if there were only a few words like this I could deal with replacing them one by one, but there are loads of them. So in essence I want to be able to convert certain defined uppercased words to lower (or camel) case, but definitely not all of them.

     

    Getting the words in question is fairly simple, that would be something like

     

    replaceAll([myAttr],"(WORD1|WORD2|WORD3)", [replaceWithLogic])

    The $1 reference is simply my matched word, so for instance WORD1.

     

    Where the replace logic could be something like concat(prefix("$1"),lower(suffix("$1",len("$1")-1)))

    In other words: take the first character and leave it as is, and convert everything else to lower case. This should work in theory, but the operator happily ignores everything and just returns the value as is.

     

    So instead of getting expected Word1, it produces WORD1WORD1.

     

    In some regex flavours (Perl-style, for example) you could also use something like (W)(ORD1) and replace it with $1\L$2 to get Word1, but this syntax is not supported in the normal regex replacements either.
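For comparison, here is a minimal sketch of how the same "keep the first character, lower-case the rest" logic could be applied per match in plain Java, outside of Generate Attributes. It is not a RapidMiner expression, and the class name, method name and word list are hypothetical. The point is that the case conversion has to run on the matched text itself, inside the matching loop.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CamelCaseMatches {

    // wordsRegex is an alternation of non-empty words, e.g. "TEST|WORDS".
    static String camelCase(String text, String wordsRegex) {
        Matcher m = Pattern.compile("\\b(" + wordsRegex + ")\\b").matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String word = m.group(1);
            // Keep the first character, lower-case the remainder.
            String fixed = word.substring(0, 1) + word.substring(1).toLowerCase();
            // quoteReplacement guards against '$' or '\' in the matched text.
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "This is a TEST to camelcase some WORDS";
        // prints "This is a Test to camelcase some Words"
        System.out.println(camelCase(input, "TEST|WORDS"));
    }
}
```

Words that are not in the alternation are left untouched, which matches the requirement of converting only a defined set of uppercased words.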

     

    The behaviour is not consistent: some functions deal correctly with the matched group ($x), others do not. But it doesn't fail as such either, the group just does not get handled.

     

    Hope this makes some sense...

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    yes it makes sense. Huh. I have a feeling there is a much easier way to do this but my mind is a blank. One thing that I hope(?) you know is that the RegEx engine is different in Java than in JavaScript. I accidentally discovered this a while back when I was using online RegEx builders and then found RapidMiner being wonky. @Telcontar120 got me onto this book "Regular Expressions in 10 Minutes" by Ben Forta which does a good job for me.

     

    Can you post a sample of the data set and your process so I can play with it? I love this kind of stuff.


    Scott

     

     

     

  • kayman Member Posts: 662 Unicorn

    There are alternatives indeed, but they can end up pretty hard to maintain. The problem with my content is that terms are sometimes in upper case and other times in camel case, and I need them to be consistent in the end. What I do now is use a replace by document flow, like below:

     

    section,from,to

    Generic,\\bSPORT\\b,Sport
    Generic,\\bCINEMA\\b,Cinema
    Generic,\\bGAME\\b,Game

    and many more...
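As a rough illustration of what such a from/to table does, here is a minimal sketch in plain Java that applies each rule in turn. The class and method names are hypothetical, and this is not how the RapidMiner flow is implemented internally; it only mirrors the idea of the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReplaceByTable {

    // Applies every (regex -> replacement) rule to the text, in insertion order.
    static String applyAll(String text, Map<String, String> rules) {
        String result = text;
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            result = result.replaceAll(rule.getKey(), rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the rules in the order they were added.
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("\\bSPORT\\b", "Sport");
        rules.put("\\bCINEMA\\b", "Cinema");
        rules.put("\\bGAME\\b", "Game");

        // prints "Sport and Cinema, not SPORTS"
        System.out.println(applyAll("SPORT and CINEMA, not SPORTS", rules));
    }
}
```

The word boundaries keep SPORTS untouched, but every new term needs its own row, which is the maintenance cost of this approach.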

     

    This works pretty well and is a reasonable alternative. Some things could be done more easily in a single expression (OK, the regex can become scary then), but that doesn't really work as expected. I know and understand there are differences between the various regex flavours, but that is not the real issue in this scenario, since there is a match and the regex itself is valid. The problem is that the operator handles the matched group differently depending on the function used.

     

    I'm not sure in what format the result is returned behind the scenes. My guess is that the operator recognizes that there is a result, but not what format it is in.

     

    Below is a sample. It is a bit simple, but it shows that the expression at least works for some functions.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.6.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="179" y="34">
    <list key="attribute_values">
    <parameter key="Attr" value="&quot;This is a TEST to camelcase some WORDS&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.6.001" expanded="true" height="82" name="Generate Attributes" width="90" x="313" y="34">
    <list key="function_descriptions">
    <parameter key="attr2" value="replaceAll(Attr,&quot;\\b(TEST|WORDS)\\b&quot;,concat(&quot; - &quot;,&quot;$1&quot;,&quot; - &quot;))"/>
    <parameter key="attr3" value="replaceAll(Attr,&quot;\\b(TEST|WORDS)\\b&quot;,lower(&quot;$1&quot;))"/>
    </list>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <description align="center" color="yellow" colored="false" height="87" resized="true" width="516" x="188" y="169">doesn't work : replaceAll(Attr,&amp;quot;\\b(TEST|WORDS)\\b&amp;quot;,length(&amp;quot;$1&amp;quot;))</description>
    </process>
    </operator>
    </process>