[SOLVED] Splitting nominal attribute values by unparenthesized commas

tennenrishin Member Posts: 177 Contributor II
edited November 2018 in Help
Hi

I would like to split a nominal attribute into multiple attributes. The nominal values need to be split at every comma except those that are inside parentheses. This is the same way one would split a function argument list into its arguments (which may themselves contain function calls). For example, a,f(b,c),d should be split into a, f(b,c) and d.

Does anyone have any ideas for what regex I could use to match those commas, or any other way to perform this split?

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi, you can't model recursive structures such as arbitrarily deep nested parentheses with regular expressions. If the nesting has a fixed maximum depth, however, it is possible to write a regular expression for it.
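For the "any other way" part of the question, one option that avoids the depth limit entirely is a plain character scan that tracks parenthesis depth. The following is a minimal Python sketch, assuming the values can be processed outside the regex engine (for example in a scripting step); the function name `split_top_level` is made up for illustration.

```python
def split_top_level(text, sep=",", open_ch="(", close_ch=")"):
    """Split text on sep wherever the parenthesis depth is zero."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            # Tolerate a stray closing parenthesis instead of going negative.
            depth = max(depth - 1, 0)
        if ch == sep and depth == 0:
            # Top-level separator: close the current piece.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


print(split_top_level("a,f(b,g(c,d)),e"))  # ['a', 'f(b,g(c,d))', 'e']
```

Because the depth is just a counter, this handles any nesting depth in a single pass over the string.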
  • tennenrishin Member Posts: 177 Contributor II
    Thank you.

    I'm thinking of...
    1. Replacing all commas and parentheses within substrings that match \([^\(\)]*\) with respective special tokens.
    2. Repeating step 1 until there are no more parentheses (or simply max_depth times).
    3. Splitting by commas.
    4. Replacing the special tokens with their original characters.

    But step 1 requires a capability to search and replace within all substrings that match some given regex. Is there a way to do this?

    Any help appreciated.
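The four steps above can be sketched in Python, where `re.subn` accepts a function as the replacement and so provides the "search and replace within every match" capability that step 1 needs. This is only an illustration under assumptions: the placeholder tokens are control characters that are assumed never to occur in the data, and the names `mask`, `unmask` and `split_by_masking` are hypothetical.

```python
import re

# Hypothetical placeholder tokens; assumed not to occur in the real values.
TOKENS = {",": "\x01", "(": "\x02", ")": "\x03"}

# An innermost group: parentheses with no other parentheses inside.
INNERMOST = re.compile(r"\([^()]*\)")


def mask(match):
    # Step 1: replace each comma and parenthesis inside one innermost group.
    return "".join(TOKENS.get(ch, ch) for ch in match.group(0))


def unmask(text):
    # Step 4: put the original characters back.
    for original, token in TOKENS.items():
        text = text.replace(token, original)
    return text


def split_by_masking(value):
    # Step 2: repeat step 1 until no balanced pair of parentheses is left.
    while True:
        masked, count = INNERMOST.subn(mask, value)
        if count == 0:
            break
        value = masked
    # Step 3: the only literal commas left are the top-level ones.
    return [unmask(piece) for piece in value.split(",")]


print(split_by_masking("a,f(b,g(c,d)),e"))  # ['a', 'f(b,g(c,d))', 'e']
```

Each pass masks one more level of nesting, so the loop runs once per level plus one final pass that finds nothing, and no maximum depth has to be fixed in advance.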
  • tennenrishin Member Posts: 177 Contributor II
    Disregarding my last post, I'm now trying this regex:
    ,(?!([^\(\)]*\(([^\(\)]*\(([^\(\)]*\([^\(\)]*\))*[^\(\)]*\))*[^\(\)]*\))*[^\(\)]*\))
    with the assumption that nesting does not exceed a depth of 3 levels.

    It seems to be working but of course it is not easy to test comprehensively. Can you spot any obvious mistakes? Is it unnecessarily complicated?
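One way to test a pattern like this more systematically is to compare it against a simple depth-counting splitter on many randomly generated inputs. The sketch below does that in Python, assuming Python's `re` module treats the pattern the same way as the engine used in the process (it relies only on character classes, groups and negative lookahead). All helper names are made up for illustration. `finditer` is used rather than `re.split` because Python's `re.split` also returns the contents of capturing groups.

```python
import random
import re

# The pattern from the post, copied verbatim.
PATTERN = re.compile(
    r",(?!([^\(\)]*\(([^\(\)]*\(([^\(\)]*\([^\(\)]*\))*[^\(\)]*\))*[^\(\)]*\))*[^\(\)]*\))"
)


def split_with_regex(text):
    # Cut the text at every position where the pattern matches a comma.
    parts, start = [], 0
    for m in PATTERN.finditer(text):
        parts.append(text[start:m.start()])
        start = m.end()
    parts.append(text[start:])
    return parts


def split_by_depth(text):
    # Reference implementation: split on commas at parenthesis depth zero.
    parts, current, depth = [], "", 0
    for ch in text:
        depth += (ch == "(") - (ch == ")")
        if ch == "," and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    parts.append(current)
    return parts


def random_expr(rng, depth):
    # Random comma-separated list; items are names or calls nested up to `depth`.
    items = []
    for _ in range(rng.randint(1, 3)):
        item = rng.choice("abc")
        if depth > 0 and rng.random() < 0.5:
            item += "(" + random_expr(rng, depth - 1) + ")"
        items.append(item)
    return ",".join(items)


def find_mismatches(max_depth, trials=500, seed=0):
    # Return every generated string on which the two splitters disagree.
    rng = random.Random(seed)
    samples = (random_expr(rng, max_depth) for _ in range(trials))
    return [s for s in samples if split_with_regex(s) != split_by_depth(s)]


print(find_mismatches(3))                      # []
print(split_with_regex("f(a,g(h(i(j(x)))))"))  # ['f(a', 'g(h(i(j(x)))))']
```

The second call shows what happens beyond the assumed depth: a comma inside parentheses is wrongly treated as a split point when a parenthesized group nested more than three levels deep follows it inside the same enclosing parentheses.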

    Here is a more readable version:

    ,(?!
      (
        [^\(\)]*
        \(
        (
          [^\(\)]*
          \(
          (
            [^\(\)]*
            \(
            [^\(\)]*
            \)
          )*
          [^\(\)]*
          \)
        )*
        [^\(\)]*
        \)
      )*
      [^\(\)]*
      \)
    )
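The repeating shape of the expanded pattern suggests it could be generated for any chosen maximum depth instead of being written by hand. The following Python sketch builds an equivalent pattern recursively; `balanced` and `top_level_comma` are hypothetical helper names, and non-capturing groups `(?:...)` are swapped in for the plain groups of the original so that `split` returns only the pieces.

```python
import re

NON_PAREN = r"[^()]*"


def balanced(depth):
    """Pattern for text with balanced parentheses nested at most `depth` deep."""
    if depth == 0:
        return NON_PAREN
    inner = balanced(depth - 1)
    # Zero or more "text ( balanced-inner )" groups, then trailing text.
    return rf"(?:{NON_PAREN}\({inner}\))*{NON_PAREN}"


def top_level_comma(max_depth):
    # A comma is rejected when balanced text followed by an unmatched ")"
    # comes after it, which means the comma sits inside parentheses.
    return re.compile(rf",(?!{balanced(max_depth)}\))")


pattern = top_level_comma(3)
print(pattern.pattern)
print(pattern.split("a,f(b,g(c,d)),e"))  # ['a', 'f(b,g(c,d))', 'e']
```

With `max_depth=3` the generated pattern has the same structure as the one in the post. It uses only lookahead and non-capturing groups, so the printed string should be usable in other regex engines that support those features.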
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    It's quite possible that it works like this. If it works, then it works ;) You may be able to simplify the expression itself if you add some process logic around it, such as loops and branches, as you proposed in your previous post.

    Best,
    Marius