"[SOLVED] Separating CapitalizedString into single words"

monami555 Member Posts: 16  Maven
edited June 10 in Help
Hi all

I have been trying for quite a long time to solve the following problem but cannot find a way. Maybe someone has had a similar issue:

I have a set of examples which have attribute values like:

"CapitalizedStringIntoSingleWords"

but I want them in the form "Capitalized String Into Single Words" (split at each capital letter; I don't mind whether the resulting words are capitalized or not). I can use regular expressions to replace the capital letters easily, but then I only get something like:

" apitalized tring nto ingle ords"

... huh, that seems to be a more general problem, and I cannot think outside the box. :-[
Any ideas or help?

Cheers,
Monika

Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,642  RM Founder
    Hi,

    You can use the Replace operator with a regular expression and a capturing group, like this:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.017">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
        <process expanded="true" height="145" width="413">
          <operator activated="true" class="generate_data_user_specification" compatibility="5.1.017" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="45" y="30">
            <list key="attribute_values">
              <parameter key="text" value="&quot;CapitalizedStringIntoSingleWords&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="replace" compatibility="5.1.017" expanded="true" height="76" name="Replace" width="90" x="179" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="text"/>
            <parameter key="replace_what" value="([A-Z])"/>
            <parameter key="replace_by" value=" $1"/>
          </operator>
          <operator activated="true" class="trim" compatibility="5.1.017" expanded="true" height="76" name="Trim" width="90" x="313" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="text"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Replace" to_port="example set input"/>
          <connect from_op="Replace" from_port="example set output" to_op="Trim" to_port="example set input"/>
          <connect from_op="Trim" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    The "Trim" operator only removes the leading space, if there is one. If you would rather avoid that extra step, you can define the regular expression so that only capitals that are not at the start of the line are replaced. It looks like this:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.017">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
        <process expanded="true" height="145" width="413">
          <operator activated="true" class="generate_data_user_specification" compatibility="5.1.017" expanded="true" height="60" name="Generate Data by User Specification" width="90" x="45" y="30">
            <list key="attribute_values">
              <parameter key="text" value="&quot;CapitalizedStringIntoSingleWords&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="replace" compatibility="5.1.017" expanded="true" height="76" name="Replace" width="90" x="179" y="30">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="text"/>
            <parameter key="replace_what" value="((?&lt;!^)[A-Z])"/>
            <parameter key="replace_by" value=" $1"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Replace" to_port="example set input"/>
          <connect from_op="Replace" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
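    The regular expression for the "replace what" parameter is '((?<!^)[A-Z])' (without the quotes), and the value for "replace by" is ' $1' (note the leading space). The (?<!^) part is a negative lookbehind: it rejects a capital that sits directly at the start of the text.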
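
    The two replacements can also be checked outside RapidMiner. The following is only a minimal Java sketch, assuming the Replace operator applies "replace what" and "replace by" the same way Java's String.replaceAll does; the class and method names are made up for illustration and are not part of RapidMiner.

```java
class SplitCapitalized {

    // Variant 1: put a space before every capital letter, then trim the
    // leading space (mirrors the Replace + Trim process).
    static String splitThenTrim(String text) {
        return text.replaceAll("([A-Z])", " $1").trim();
    }

    // Variant 2: the negative lookbehind (?<!^) skips a capital at the very
    // start of the text, so no trimming is needed afterwards.
    static String splitWithLookbehind(String text) {
        return text.replaceAll("((?<!^)[A-Z])", " $1");
    }

    public static void main(String[] args) {
        String input = "CapitalizedStringIntoSingleWords";
        System.out.println(splitThenTrim(input));
        System.out.println(splitWithLookbehind(input));
    }
}
```

    Both methods should print "Capitalized String Into Single Words" for the example value.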

    Hope that helps,
    Ingo
  • monami555 Member Posts: 16  Maven
    Ahhhhhh $1 is what I was missing :-) Thanks!
  • monami555 Member Posts: 16  Maven
    Well, I still have the problem. As soon as I put the $1 anywhere in the "replace by" I get the following exception:

    Exception: java.lang.IndexOutOfBoundsException
    Message: No group 1
    Stack trace:

      java.util.regex.Matcher.group(Unknown Source)
      java.util.regex.Matcher.appendReplacement(Unknown Source)
      java.util.regex.Matcher.replaceAll(Unknown Source)
      com.rapidminer.operator.preprocessing.filter.AttributeValueReplace.applyOnFiltered(AttributeValueReplace.java:114)
      com.rapidminer.operator.preprocessing.filter.AbstractFilteredDataProcessing.apply(AbstractFilteredDataProcessing.java:136)

    I tried it with the single attribute value "CompanyEarningsAnnouncement", so there should not be any problems...

    This is a simplified example process that throws the same exception:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.017">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true" height="566" width="882">
          <operator activated="true" class="text:create_document" compatibility="5.1.004" expanded="true" height="60" name="Create Document" width="90" x="45" y="30">
            <parameter key="text" value="CompanyEarningsAnnouncement"/>
            <parameter key="add label" value="false"/>
            <parameter key="label_type" value="nominal"/>
          </operator>
          <operator activated="true" class="text:documents_to_data" compatibility="5.1.004" expanded="true" height="76" name="Documents to Data" width="90" x="45" y="120">
            <parameter key="text_attribute" value="subClass"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
          </operator>
          <operator activated="true" class="replace" compatibility="5.1.017" expanded="true" height="76" name="Replace (2)" width="90" x="45" y="210">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value="subClass"/>
            <parameter key="attributes" value="|subClass|superClass"/>
            <parameter key="regular_expression" value="#*"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="replace_what" value=".*#"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="5.1.017" expanded="true" height="76" name="Filter Examples" width="90" x="179" y="210">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="subClass=CompanyEarningsAnnouncement"/>
            <parameter key="invert_filter" value="false"/>
          </operator>
          <operator activated="true" class="replace" compatibility="5.1.017" expanded="true" height="76" name="Replace" width="90" x="179" y="120">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="subClass"/>
            <parameter key="attributes" value=""/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="nominal"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="file_path"/>
            <parameter key="block_type" value="single_value"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="single_value"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <parameter key="replace_what" value="[A-Z]"/>
            <parameter key="replace_by" value=" $1"/>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_op="Replace (2)" to_port="example set input"/>
          <connect from_op="Replace (2)" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Replace" to_port="example set input"/>
          <connect from_op="Replace" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    What could be the reason??
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,642  RM Founder
    Hi again,

    You have deleted the most important part of my solution: the parentheses "(" and ")" define a so-called capturing group, which can be reused in "replace by" as $X, where X is the number of the group. Your pattern "[A-Z]" contains no group, so $1 has nothing to refer to, hence the message "No group 1". Just use "([A-Z])" (this introduces a leading space, which can be removed with Trim as in my first process above) or "((?<!^)[A-Z])" (this does not introduce a leading space, as in my second process above) and you will be fine again.

    For more information about capturing groups, please check out Section 3.4 of the following tutorial:

    http://www.vogella.de/articles/JavaRegularExpressions/article.html#regex_grouping
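
    The failure from the missing parentheses can be reproduced in plain Java. This is a minimal sketch, assuming Replace hands both parameters to java.util.regex's Matcher.replaceAll, as the stack trace above suggests; the class name and the helper method replace are hypothetical and only serve the illustration.

```java
import java.util.regex.Pattern;

class CapturingGroupDemo {

    // Hypothetical helper: applies a "replace what" / "replace by" pair
    // through Matcher.replaceAll, as the stack trace indicates.
    static String replace(String text, String replaceWhat, String replaceBy) {
        return Pattern.compile(replaceWhat).matcher(text).replaceAll(replaceBy);
    }

    public static void main(String[] args) {
        String input = "CompanyEarningsAnnouncement";

        // Without parentheses the pattern has no group 1, so "$1" in the
        // replacement cannot be resolved and replaceAll throws.
        try {
            replace(input, "[A-Z]", " $1");
        } catch (IndexOutOfBoundsException e) {
            System.out.println("Failed: " + e.getMessage());
        }

        // With the capturing group, "$1" refers to the matched capital letter.
        System.out.println(replace(input, "((?<!^)[A-Z])", " $1"));
    }
}
```

    The first call should fail with the IndexOutOfBoundsException from the report, and the second should produce "Company Earnings Announcement".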

    Hope that helps,
    Ingo
  • monami555 Member Posts: 16  Maven
    That's right, sorry for not reading carefully enough.
    Thank you :)