Search for specific pattern within strings of characters

komal_chenthama Member Posts: 3 Contributor I
edited December 2018 in Help

Hi,

 

I have a table with thousands of rows, each containing a long string of characters without spaces. An example of a single row is below:

 

"MKFFAAAALFATSAMAAVCPDGGLFSNPLCCSSILLEAVGLDCTTPTAPVVTAGLFQANCASIGKQPACCVAPLAGQGILCNNPAGT"

 

I would like to filter out all the rows in my table that contain the following pattern: C...CC...C..C..CC..C, where "." represents any character, any number of times. Could anyone kindly suggest an operator or combination of operators for this task?

 

Answers

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 397 Unicorn

    Hi @komal_chenthama

     

    The Filter Examples operator allows you to match by a regular expression. I can build one to give you an example. In your question, does C mean the literal character "C", or can it be any character?

     

    @sgenzer (do you mind if I write a tutorial on regular expressions with RapidMiner, to be included in the RapidMiner documentation?)

     

    All the best,

     

    Rodrigo.

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,111 RM Data Scientist

    Hi @komal_chenthama,

     

    Filter Examples with the condition class set to "expression" and the matches(..) function is the way to go. Attached is an example process.


    ~Martin

     

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="8.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="8.2.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="136">
            <list key="attribute_values">
              <parameter key="string" value="&quot;MKFFAAAALFATSAMAAVCPDGGLFSNPLCCSSILLEAVGLDCTTPTAPVVTAGLFQANCASIGKQPACCVAPLAGQGILCNNPAGT&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="8.2.000" expanded="true" height="103" name="Filter Examples" width="90" x="246" y="136">
            <parameter key="parameter_expression" value="matches(string,&quot;C.+.+.+CC.+.+.+C.+.+C.+.+CC.+.+C&quot;)"/>
            <parameter key="condition_class" value="expression"/>
            <list key="filters_list"/>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
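
A note on whole-string versus substring matching, sketched in Python rather than RapidMiner. The example sequence starts with M and ends with T, so the motif sits in the middle of the string. Java's String.matches succeeds only when the entire string matches the expression; whether RapidMiner's matches(..) behaves the same way is an assumption worth checking in your version. If it does, wrapping the expression in .* on both sides lets the motif occur anywhere in the string. In this sketch, Python's re.fullmatch and re.search stand in for the two behaviours, and the variable names are illustrative only.

```python
import re

# Example row from the question.
SEQ = ("MKFFAAAALFATSAMAAVCPDGGLFSNPLCCSSILLEAVGLDCTTPTAPVV"
       "TAGLFQANCASIGKQPACCVAPLAGQGILCNNPAGT")

# Expression used in the example process above.
MOTIF = r"C.+.+.+CC.+.+.+C.+.+C.+.+CC.+.+C"

# Whole-string match: fails, because SEQ does not start and end with "C".
whole_string = re.fullmatch(MOTIF, SEQ) is not None

# Substring match: succeeds, because the motif occurs inside SEQ.
anywhere = re.search(MOTIF, SEQ) is not None

# Whole-string match with the motif wrapped in ".*": succeeds as well.
wrapped = re.fullmatch(".*" + MOTIF + ".*", SEQ) is not None

print(whole_string, anywhere, wrapped)  # prints: False True True
```

Note also that each ".+" requires at least one character, so ".+.+.+" demands at least three characters between the neighbouring C's.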
  • komal_chenthama Member Posts: 3 Contributor I

    Hi @rfuentealba

     

    Thanks for the quick response. In my example, I mean specifically the literal character "C".

     

    Cheers,

    Komal

     

    PS: I would very much appreciate a tutorial on regular expressions with RapidMiner.

     

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 397 Unicorn

    Hi @komal_chenthama

     

    The regular expression you are looking for is pretty simple:

     

    C(.+)CC(.+)C(.+)C(.+)CC(.+)C

     

    Here is a screenshot of how it works when used in the Replace operator.

     

    [Screenshot: Screen Shot 2018-06-26 at 12.38.28 PM.png]

    See? It recognizes only the patterns you want.

     

    Hope it helps.
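
To try the expression outside RapidMiner, here is a minimal Python sketch. Python, the helper name rows_with_motif, and the three short test rows are assumptions for illustration, not part of the thread. The helper keeps the rows in which the motif occurs anywhere in the string. A looser variant with .* instead of .+ is included for the case where "any number of times" is meant to include zero characters between the C's; the thread does not settle which reading is intended.

```python
import re

# Expression from the post above: at least one character in every gap.
STRICT = re.compile(r"C(.+)CC(.+)C(.+)C(.+)CC(.+)C")

# Hypothetical looser variant: gaps may be empty.
LOOSE = re.compile(r"C(.*)CC(.*)C(.*)C(.*)CC(.*)C")


def rows_with_motif(rows, pattern=STRICT):
    """Return the rows in which the pattern occurs anywhere in the string."""
    return [row for row in rows if pattern.search(row)]


rows = [
    # Example row from the question.
    "MKFFAAAALFATSAMAAVCPDGGLFSNPLCCSSILLEAVGLDCTTPTAPVV"
    "TAGLFQANCASIGKQPACCVAPLAGQGILCNNPAGT",
    "CACCACACACCAC",    # made-up row: exactly one character in every gap
    "CCCCCCCC",         # made-up row: the C's with nothing between them
    "MKFFAAAALFATSAM",  # made-up row: no C at all
]

strict_hits = rows_with_motif(rows)         # first two rows
loose_hits = rows_with_motif(rows, LOOSE)   # first three rows
print(len(strict_hits), len(loose_hits))    # prints: 2 3
```

The parentheses only capture the text in each gap; they do not change which rows match.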
