Entity extraction - matching strings.
Hello,
I just started out with RapidMiner. I am interested in mining text documents concerning security vulnerabilities and exposures.
For instance, reports concerning exposures always contain a list of affected products. Is it possible to match a phrase against a specific string? For example, the AFFECTED PRODUCTS section has the following description:
The following Philips XperIM Connect versions are affected:
- XperIM Connect system running Windows XP, Version 1.5.12 and prior versions.
I have learned the basics of text processing and would like to move on to this problem: I would like to print out the names of the affected products and series.
Thanks for any possible hints.
Best Answers
IngoRM Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
Hi,
I am not 100% sure I understood you correctly, but do you want to extract the product names and make them available as an extra attribute? And do the sentences all follow the same pattern of "...are affected: (product name).."?
If so, then the Replace operator, used with regular expressions and capturing groups, will be the solution. Regular expressions are a somewhat complex topic, but they are worth learning if you are serious about text analytics, and also for more complex data preparation tasks in general.
There are some online tutorials. A quick search brought up this one, which looked decent at first sight: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
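To make the idea of a capturing group concrete, here is a minimal sketch in Python rather than in RapidMiner, purely for illustration. The sample sentence is the one from the question; the variable names are made up for this example.

```python
import re

# Sample text taken from the question above.
text = (
    "The following Philips XperIM Connect versions are affected: "
    "- XperIM Connect system running Windows XP, Version 1.5.12 and prior versions."
)

# The parentheses form capturing group 1: whatever sits between
# "The following " and " versions are affected:".
pattern = re.compile(r"The following (.*) versions are affected:")

match = pattern.search(text)
# group(1) holds only the captured part, not the whole match.
product = match.group(1) if match else None
print(product)  # Philips XperIM Connect
```

The fixed words around the parentheses anchor the match, and only the part inside the parentheses is kept.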
Cheers,
Ingo
IngoRM
Hi,
Yes, this might be a possible solution. You might also want to check out the extension from Aylien to see if this helps.
My suggestion was actually much simpler than training an entity extraction model (which might indeed be necessary). I was just suggesting that, if the texts all follow the same structure, regular expressions and Replace could already do the trick.
Here is a process to show you what I mean:
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.2.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data_user_specification" compatibility="7.2.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
<list key="attribute_values">
<parameter key="Text" value="&quot;The following Philips XperIM Connect versions are affected: - XperIM Connect system running Windows XP, Version 1.5.12 and prior versions.&quot;"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="generate_copy" compatibility="7.2.000" expanded="true" height="82" name="Generate Copy" width="90" x="246" y="34">
<parameter key="attribute_name" value="Text"/>
<parameter key="new_name" value="Product"/>
</operator>
<operator activated="true" class="replace" compatibility="7.2.000" expanded="true" height="82" name="Replace" width="90" x="380" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Product"/>
<parameter key="replace_what" value="The following (.*) versions are affected:.*"/>
<parameter key="replace_by" value="$1"/>
</operator>
<connect from_op="Generate Data by User Specification" from_port="output" to_op="Generate Copy" to_port="example set input"/>
<connect from_op="Generate Copy" from_port="example set output" to_op="Replace" to_port="example set input"/>
<connect from_op="Replace" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
So you have a couple of options to explore now.
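For readers who want to try the same copy-then-replace idea outside RapidMiner, here is a rough Python analog of the process above. It is only a sketch: the row dictionaries stand in for an example set, the helper name `extract_product` is made up, and the second row is an invented example of a text that does not follow the pattern.

```python
import re

# Same pattern as the replace_what parameter in the process above.
# Python writes the group reference as \1 where the process uses $1.
PATTERN = re.compile(r"The following (.*) versions are affected:.*")


def extract_product(text: str) -> str:
    """Replace the whole matching text by capturing group 1.

    If the pattern does not match, the text is returned unchanged.
    """
    return PATTERN.sub(r"\1", text)


rows = [
    {"Text": "The following Philips XperIM Connect versions are affected: "
             "- XperIM Connect system running Windows XP, Version 1.5.12 "
             "and prior versions."},
    # Invented row that does not follow the pattern.
    {"Text": "No affected products are listed."},
]

# Mirror "Generate Copy" followed by "Replace": derive Product from Text.
for row in rows:
    row["Product"] = extract_product(row["Text"])

for row in rows:
    print(row["Product"])
```

Note that in this sketch a row that does not follow the pattern keeps its full text in the Product column, so it would be worth checking or filtering such rows before relying on the result.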
Cheers,
Ingo
Answers
Hi,
More precisely, I would like to extract information that has value for the end user. So instead of reading a whole document, the user could mine several reports and get the affected product names, as you mentioned. Could you expand a little on the possible solution in RapidMiner? Is it possible to get these results using only RM?
Edit: I've just discovered a plugin called information extraction for RM. There are several articles about it; maybe that would be an interesting solution too?