"looping through regex matches and groups"

markus_dressel Member Posts: 5 Contributor I
edited June 2019 in Help

Hi community,

I have what is probably an easy question about handling regex matches.

I have a document (loaded with the document operator), and I want to use a regex to retrieve certain parts of it. Say my regex produces three matches. When I run the process in RapidMiner, all three matches are returned together (appended/joined). Is there a way to loop through the individual regex matches, as I can in Java or Python?

For example like:

import re

s = "ABC12DEF3G56HIJ7"
pattern = re.compile(r'([A-Z]+)([0-9]+)')

for (letters, numbers) in re.findall(pattern, s):
    pass  # do anything

This is just sample code, not my specific task. I just want to know how to loop through regex matches.

 

I hope my question is quite clear :-)

 

Best regards,

 

Markus 

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi Markus,

     

    Have a look at the attached process. It builds something like this with operators. It uses the new Loop operator from 7.4. There is surely a way to build this in 7.3 as well.

     

    ~Martin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="text:create_document" compatibility="7.4.001" expanded="true" height="68" name="Create Document" width="90" x="112" y="238">
    <parameter key="text" value="Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
    <list key="attribute_values">
    <parameter key="regex" value="&quot;.*amet.*&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="7.4.000" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="112" y="136">
    <list key="attribute_values">
    <parameter key="regex" value="&quot;.*Lorem.*&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="append" compatibility="7.4.000" expanded="true" height="103" name="Append" width="90" x="246" y="34"/>
    <operator activated="true" class="concurrency:loop" compatibility="7.4.000" expanded="true" height="103" name="Loop" width="90" x="447" y="85">
    <parameter key="number_of_iterations" value="2"/>
    <parameter key="reuse_results" value="true"/>
    <process expanded="true">
    <operator activated="true" class="extract_macro" compatibility="7.4.000" expanded="true" height="68" name="Extract Macro" width="90" x="112" y="34">
    <parameter key="macro" value="myRegex"/>
    <parameter key="macro_type" value="data_value"/>
    <parameter key="attribute_name" value="regex"/>
    <parameter key="example_index" value="%{iteration}"/>
    <list key="additional_macros"/>
    </operator>
    <operator activated="true" class="delay" compatibility="7.4.000" expanded="true" height="103" name="Delay" width="90" x="246" y="85">
    <parameter key="delay" value="none"/>
    <description align="center" color="transparent" colored="false" width="126">Execution Order</description>
    </operator>
    <operator activated="true" class="text:extract_information" compatibility="7.4.001" expanded="true" height="68" name="Extract Information" width="90" x="447" y="136">
    <parameter key="query_type" value="Regular Expression"/>
    <list key="string_machting_queries"/>
    <list key="regular_expression_queries">
    <parameter key="matches_%{myRegex}" value="%{myRegex}"/>
    </list>
    <list key="regular_region_queries"/>
    <list key="xpath_queries"/>
    <list key="namespaces"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries"/>
    </operator>
    <connect from_port="input 1" to_op="Extract Macro" to_port="example set"/>
    <connect from_port="input 2" to_op="Delay" to_port="through 2"/>
    <connect from_op="Extract Macro" from_port="example set" to_op="Delay" to_port="through 1"/>
    <connect from_op="Delay" from_port="through 1" to_port="output 1"/>
    <connect from_op="Delay" from_port="through 2" to_op="Extract Information" to_port="document"/>
    <connect from_op="Extract Information" from_port="document" to_port="output 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="63"/>
    <portSpacing port="source_input 3" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    <portSpacing port="sink_output 3" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:documents_to_data" compatibility="7.4.001" expanded="true" height="82" name="Documents to Data" width="90" x="581" y="136">
    <parameter key="text_attribute" value="text"/>
    </operator>
    <connect from_op="Create Document" from_port="output" to_op="Loop" to_port="input 2"/>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
    <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
    <connect from_op="Append" from_port="merged set" to_op="Loop" to_port="input 1"/>
    <connect from_op="Loop" from_port="output 2" to_op="Documents to Data" to_port="documents 1"/>
    <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
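
Read as plain Python, the process above does roughly the following: it keeps a small table of regexes, loops over it once per row, and applies the current regex to the same document. This is only a loose analogy. The exact matching semantics of Extract Information are not reproduced, and the shortened `document` text and the `metadata` dictionary are assumptions made for illustration.

```python
import re

# Shortened stand-in for the text in Create Document.
document = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr."

# The two patterns stored in the "regex" attribute of the appended example set.
regexes = [".*amet.*", ".*Lorem.*"]

# One entry per loop iteration, keyed like the query name "matches_%{myRegex}".
metadata = {}
for iteration, regex in enumerate(regexes, start=1):
    found = re.search(regex, document)
    metadata[f"matches_{regex}"] = found.group(0) if found else None

for key, value in metadata.items():
    print(key, "->", value)
```

Note that this loops over patterns, not over the individual matches of one pattern.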
  • kayman Member Posts: 662 Unicorn

    You can use the Replace (Dictionary) operator for this purpose.

     

    The easiest way to proceed is to create a CSV containing the regexes you want to use (the "from" attribute) and the replacements (the "to" attribute), tell the operator to use regular expressions, and off you go. It will loop through the whole file and replace content accordingly.
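
For readers who think in Python, the dictionary idea can be sketched as below. This is not what the operator does internally, just a minimal illustration of applying a table of regex/replacement pairs in order. The column names `from` and `to`, the sample rules, and the helper names `load_rules` and `apply_rules` are assumptions.

```python
import csv
import io
import re

# Hypothetical dictionary file: one regex ("from") and its replacement ("to") per row.
DICTIONARY_CSV = "from,to\n[0-9]+,#\n[A-Z]{3},abc\n"


def load_rules(csv_text):
    """Compile every 'from' pattern and pair it with its 'to' replacement."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(re.compile(row["from"]), row["to"]) for row in reader]


def apply_rules(text, rules):
    """Apply each rule in file order; later rules see earlier replacements."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


rules = load_rules(DICTIONARY_CSV)
result = apply_rules("ABC12DEF3G56HIJ7", rules)
print(result)  # abc#abc#G#abc#
```

Because the rules run in sequence, their order in the file matters.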

     

  • markus_dressel Member Posts: 5 Contributor I

    Hi,

    Thank you for the quick response and the solution you provided. I have loaded your solution, but maybe I did not describe my problem correctly:

    Let's say we have a document with the following text:

    Item here is some important text Item

    Here is no important text

    Item here is some additional important text Item

     

    If I use the regex "(?s)(?i)Item.*?Item", I get two matches:

    1: Item here is some important text Item

    2: Item here is some additional important text Item

     

    See https://regex101.com/r/WYn2nm/1

     

    So the question is: how can I loop through each match and do something with it, keeping in mind that the number of matches varies between documents?

    Something like this:

      

    for match in regex.matches:
        if len(match) > 7:
            ...  # do stuff
        else:
            ...  # do other stuff

    Best regards and thank you for your great support

    Markus
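
In plain Python, the loop described above could look like the following sketch. It uses the sample text and regex from this post, with the inline flags `(?s)(?i)` written as `re.DOTALL | re.IGNORECASE`. The length threshold of 40 (instead of 7) and the two result lists are assumptions, chosen only so that both branches are exercised.

```python
import re

text = """Item here is some important text Item
Here is no important text
Item here is some additional important text Item"""

# Same pattern as "(?s)(?i)Item.*?Item", with the flags passed explicitly.
pattern = re.compile(r"Item.*?Item", re.DOTALL | re.IGNORECASE)

long_matches = []
short_matches = []

# finditer yields one match object per hit, however many hits there are.
for number, match in enumerate(pattern.finditer(text), start=1):
    snippet = match.group(0)
    if len(snippet) > 40:
        long_matches.append((number, snippet))   # "do stuff"
    else:
        short_matches.append((number, snippet))  # "do other stuff"

print("short:", short_matches)
print("long:", long_matches)
```

The number of iterations follows the number of matches, so nothing has to be known about the document in advance.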

  • kayman Member Posts: 662 Unicorn

    I see. Since you already know how to do it in Python, how about using an Execute Python operator? You just write your regex script, pass your data through it, and you are covered.

     

    This should be pretty simple. You can probably achieve it with plain RapidMiner operators too, but without a clear idea of your data and what you want to achieve, that is harder to support.

     

    Something like this:

     

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.3.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.3.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="python_scripting:execute_python" compatibility="7.2.000" expanded="true" height="82" name="regex_on_steroids" width="90" x="313" y="34">
    <parameter key="script" value="import pandas as pd&#10;import re&#10;&#10;def rm_main(data):&#10;&#10;&#9;for index,row in data.iterrows():&#10;&#9;&#9;# do something with data['myFieldThatNeedsRegexCleanup']&#10;&#9;&#9;pass&#10;&#9;# return the new frame for further analysis&#10;&#9;return data&#10;"/>
    </operator>
    <connect from_port="input 1" to_op="regex_on_steroids" to_port="input 1"/>
    <connect from_op="regex_on_steroids" from_port="output 1" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
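
The placeholder script in the process above could be filled in along these lines. This is a sketch under assumptions, not a tested RapidMiner process: the input column name `text`, the output columns `doc`, `match` and `text`, and the helper `matches_to_rows` are all hypothetical. The helper is plain Python, so it can be tried outside RapidMiner, and `rm_main` only wraps it in a pandas DataFrame with one row per match.

```python
import re

# Pattern from the thread, with the inline flags passed explicitly.
PATTERN = re.compile(r"Item.*?Item", re.DOTALL | re.IGNORECASE)


def matches_to_rows(documents, pattern=PATTERN):
    """Return one dict per regex match: document index, match number, matched text."""
    rows = []
    for doc_index, text in enumerate(documents):
        for match_number, match in enumerate(pattern.finditer(text), start=1):
            rows.append({"doc": doc_index, "match": match_number, "text": match.group(0)})
    return rows


def rm_main(data):
    # Assumes the incoming frame has a (hypothetical) "text" column.
    import pandas as pd  # imported here so the helper above runs without pandas
    rows = matches_to_rows(data["text"].tolist())
    return pd.DataFrame(rows, columns=["doc", "match", "text"])


# Quick check of the helper outside RapidMiner.
rows = matches_to_rows(["Item a Item x Item b Item", "nothing here"])
for row in rows:
    print(row)
```

With one row per match in the result, the length check from the question becomes an ordinary filter on the output example set.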