Regular expression - unwanted "noise" words

zacev Member Posts: 6 Contributor II


I was recently text mining some documents that follow a certain pattern. Each of them contains a section like this one:

· SIMATIC S7-300 CPUs with Profinet support: All versions < V3.2.12
· SIMATIC S7-300 CPUs without Profinet support: All versions < V3.3.12

I have been using the following expression: (AFFECTED\W+(?:\w+\W+){1,30}?DESCRIPTION) with an Extract Information operator to find "word A near word B", in this case a phrase whose length is unknown.

My goal is to get the list of affected products, but I would like to get rid of the first two words and the final "DESCRIPTION". I tried using [^] operators to exclude these "noise" words, but without effect. Can anyone help me with this case, maybe with a better pattern than the one I am using? I can't predict the number of words for the {} quantifier, so 30 is a rather fuzzy boundary. I am still learning the basic syntax, but I'd be grateful for any solution to the problem.
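For illustration, the difference between matching the whole span and capturing only the middle can be sketched in Python. This is a rough sketch under assumptions: it uses Python's `re` module rather than the Java regex engine behind RapidMiner, the sample text is a shortened excerpt of the advisory, and the second pattern is a hypothetical variant, not something confirmed for the Extract Information operator. A negated character class such as `[^...]` matches a single character that is not in the set, so it cannot exclude whole words; a capturing group around the wanted part is one way around that.

```python
import re

# Shortened excerpt of the advisory text used in the process.
text = (
    "certain conditions.\n"
    "AFFECTED PRODUCTS\n"
    "· SIMATIC S7-300 CPUs with Profinet support: All versions < V3.2.12\n"
    "· SIMATIC S7-300 CPUs without Profinet support: All versions < V3.3.12\n"
    "DESCRIPTION\n"
    "Products of the Siemens SIMATIC S7-300 CPU family have been designed\n"
)

# Original pattern: the single capturing group wraps everything,
# so both marker phrases end up in the result.
original = re.compile(r"(AFFECTED\W+(?:\w+\W+){1,30}?DESCRIPTION)")
whole = original.search(text).group(1)

# Hypothetical variant: the markers stay outside the group, and a lazy
# ".*?" replaces the word counter, so no upper bound such as 30 is needed.
# re.DOTALL lets "." match line breaks as well.
between = re.compile(r"AFFECTED PRODUCTS\s*(.*?)\s*DESCRIPTION", re.DOTALL)
products = between.search(text).group(1)

print(products)
```

Here `whole` still starts with AFFECTED PRODUCTS and ends with DESCRIPTION, while `products` holds only the two product lines.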


I provide the full process:

<?xml version="1.0" encoding="UTF-8"?><process version="7.2.000">
<operator activated="true" class="process" compatibility="7.2.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="text:create_document" compatibility="7.2.000" expanded="true" height="68" name="Create Document" width="90" x="179" y="340">
<parameter key="text" value="SSA-818183: Denial-of-Service Vulnerability in S7-300 CPU&#10;Publication Date 2016-06-08&#10;Last Update 2016-06-08&#10;Current Version V1.0&#10;CVSSv3 Base Score 7.5&#10;SUMMARY&#10;Siemens has released a firmware update for the SIMATIC S7-300 CPU family which fixes a&#10;vulnerability that could allow remote attackers to perform a Denial-of-Service attack under&#10;certain conditions.&#10;AFFECTED PRODUCTS&#10;· SIMATIC S7-300 CPUs with Profinet support: All versions &lt; V3.2.12&#10;· SIMATIC S7-300 CPUs without Profinet support: All versions &lt; V3.3.12&#10;DESCRIPTION&#10;Products of the Siemens SIMATIC S7-300 CPU family have been designed for discrete and&#10;continuous control in industrial environments such as manufacturing, food and beverages,&#10;and chemical industries worldwide.&#10;Detailed information about the vulnerability is provided below.&#10;VULNERABILITY CLASSIFICATION&#10;The vulnerability classification has been performed by using the CVSS scoring system in&#10;version 3 (CVSSv3) ( The CVSS environmental score is specific to&#10;the customer's environment and will impact the overall CVSS score. The environmental score&#10;should therefore be individually defined by the customer to accomplish final scoring.&#10;Vulnerability Description (CVE-2016-3949)&#10;Specially crafted packets sent to port 102/tcp (ISO-TSAP) or via Profibus could cause&#10;the affected device to go into defect mode. 
A cold restart is required to recover the&#10;system.&#10;CVSS Base Score 7.5&#10;CVSS Vector CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H/E:P/RL:O/RC:C&#10;Mitigating Factors&#10;· The attacker must have network access to the affected device.&#10;· Protection-level 3 (Read/Write protection) mitigates the issue.&#10;· Siemens recommends operating the devices only within trusted networks [2].&#10;SOLUTION&#10;Siemens has released SIMATIC S7-300 firmware version V3.2.12 and V3.3.12 [1] which fixes&#10;the vulnerability and recommends customers to update to the latest version.&#10;As a general security measure Siemens strongly recommends to keep the firmware up-todate&#10;and to protect network access to the S7-300 CPUs with appropriate mechanisms. It is&#10;advised to configure the environment according to our operational guidelines [2] in order to&#10;run the devices in a protected IT environment.&#10;Siemens Security Advisory by Siemens ProductCERT&#10;SSA-818183 © Siemens AG 2016 Page 2 of 2&#10;ACKNOWLEDGEMENTS&#10;Siemens thanks the following for their support and efforts:&#10;· Mate J. Csorba, DNV GL, Marine Cybernetics Services for coordinated disclosure of&#10;the vulnerability.&#10;· Amund Sole, Norwegian University of Science and Technology for coordinated&#10;disclosure of the vulnerability.&#10;ADDITIONAL RESOURCES&#10;[1] The firmware update for SIMATIC S7-300 CPUs can be obtained here:&#10;;[2] An overview of the operational guidelines for Industrial Security (with the cell protection&#10;concept):&#10;;[3] Information about Industrial Security by Siemens:&#10;;[4] For further inquiries on vulnerabilities in Siemens products and solutions, please&#10;contact the Siemens ProductCERT:&#10;;HISTORY DATA&#10;V1.0 (2016-06-08): Publication Date&#10;DISCLAIMER&#10;See:"/>
</operator>
<operator activated="true" class="text:extract_information" compatibility="7.2.000" expanded="true" height="68" name="Extract Information" width="90" x="380" y="340">
<parameter key="query_type" value="Regular Expression"/>
<list key="string_machting_queries">
<parameter key="Wydobylem" value="adasdas."/>
</list>
<list key="regular_expression_queries">
<parameter key="Affected Product" value="(AFFECTED\W+(?:\w+\W+){1,30}?DESCRIPTION)"/>
</list>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries"/>
</operator>
<operator activated="true" class="text:documents_to_data" compatibility="7.2.000" expanded="true" height="82" name="Documents to Data" width="90" x="514" y="340">
<parameter key="text_attribute" value="Text"/>
<parameter key="label_attribute" value="Test attribute"/>
</operator>
<operator activated="true" class="write_excel" compatibility="7.2.000" expanded="true" height="82" name="Write Excel" width="90" x="634" y="340">
<parameter key="excel_file" value="C:\Users\John\Desktop\save.xlsx"/>
</operator>
<connect from_op="Create Document" from_port="output" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_op="Documents to Data" to_port="documents 1"/>
<connect from_op="Documents to Data" from_port="example set" to_op="Write Excel" to_port="input"/>
<connect from_op="Write Excel" from_port="through" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

The results are exported to an Excel file with my custom name. The perfect solution would return only what is between the two key phrases AFFECTED PRODUCTS and DESCRIPTION. Thanks for any feedback.




  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder



    Can't you just use a Replace operator with the following regular expression:




    and in replace-by you use $1 to refer to the capturing group identified by the parentheses? This way you would extract just the part between the two elements, as you described. The only case where this would fail is when AFFECTED and/or DESCRIPTION is part of the text in between...
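    The expression from this reply is not shown, so the following is only a sketch of what a replace-with-group approach might look like, written in Python instead of RapidMiner. The helper name `extract_between`, the pattern, and the second sample string are assumptions for illustration. Python writes the group reference as `\1` where the Replace operator uses `$1`.

```python
import re


def extract_between(text: str,
                    start: str = "AFFECTED PRODUCTS",
                    end: str = "DESCRIPTION") -> str:
    """Replace the whole text by the part between start and end (hypothetical helper)."""
    pattern = re.compile(
        # Everything before the start marker and after the end marker is
        # matched but not captured; only the middle is in group 1.
        r".*?" + re.escape(start) + r"\s*(.*?)\s*" + re.escape(end) + r".*",
        re.DOTALL,  # "." also matches line breaks
    )
    # If the markers are missing, the text comes back unchanged.
    return pattern.sub(r"\1", text, count=1)


advisory = (
    "SUMMARY\n"
    "Siemens has released a firmware update.\n"
    "AFFECTED PRODUCTS\n"
    "· SIMATIC S7-300 CPUs with Profinet support: All versions < V3.2.12\n"
    "· SIMATIC S7-300 CPUs without Profinet support: All versions < V3.3.12\n"
    "DESCRIPTION\n"
    "Products of the Siemens SIMATIC S7-300 CPU family\n"
)
print(extract_between(advisory))

# Made-up input showing the failure case: the end marker also occurs
# inside the section, so the lazy group stops at its first occurrence.
tricky = "AFFECTED PRODUCTS\n· Product with DESCRIPTION field\nDESCRIPTION\nMore text\n"
print(extract_between(tricky))
```

    The second call returns a truncated result, which is the situation described above where one of the marker words is part of the text in between.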



