
Entity extraction - matching strings.

zacev Member Posts: 6 Contributor II
edited November 2018 in Help

Hello,

I just started out with RapidMiner. I am interested in mining text documents concerning security vulnerabilities and exposures.

For instance, reports on exposures always contain a list of affected products. Is it possible to match a phrase against a specific string? For example, the section titled AFFECTED PRODUCTS has the following description:

The following Philips XperIM Connect versions are affected:
- XperIM Connect system running Windows XP, Version 1.5.12 and prior versions.

I have learned the basics of text processing and would like to move on to solving this problem. As a result, I would like to print out the names of the affected products and series.

 

Thanks for any possible hints.

Best Answers

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Solution Accepted

    Hi,

     

    I am not 100% sure I understood you correctly, but do you want to extract the product names and make them available as an extra attribute?  And do the sentences all follow the same pattern of "...are affected: (product name).."?

     

    If I got you right, then the Replace operator, used with regular expressions and capturing groups, will be the solution.  Regular expressions are a somewhat complex topic, but they are worth learning if you are serious about text analytics, and also for more complex data preparation tasks in general.

     

    There are some online tutorials.  A quick search brought up this one, which looked decent at first sight: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
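To make the idea of a capturing group concrete, here is a minimal sketch in Python rather than RapidMiner. It assumes the sentence always has the form quoted in the question; the variable names are illustrative only. The parentheses in the pattern mark the part of the match that can be read back on its own.

```python
import re

# Sentence taken from the example in this thread.
text = "The following Philips XperIM Connect versions are affected:"

# The parentheses mark a capturing group: whatever is matched inside
# them can be read back separately from the full match.
match = re.search(r"The following (.*) versions are affected:", text)

# group(1) is the first capturing group; None signals "pattern not found".
product = match.group(1) if match else None
print(product)
```

If the sentence does not follow the expected pattern, `re.search` returns `None`, so the sketch falls back to `None` instead of raising an error.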

     

    Cheers,

    Ingo

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Solution Accepted

    Hi,

     

    Yes, this might be a possible solution.  You might also want to check out the extension from Aylien to see if this helps.

     

    My suggestion was actually much simpler than training an entity extraction model (which might indeed be necessary).  I was just suggesting that if the texts all follow the same structure, regular expressions and Replace could already do the trick.

     

    Here is a process to show you what I mean:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="7.2.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
    <list key="attribute_values">
    <parameter key="Text" value="&quot;The following Philips XperIM Connect versions are affected: - XperIM Connect system running Windows XP, Version 1.5.12 and prior versions.&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_copy" compatibility="7.2.000" expanded="true" height="82" name="Generate Copy" width="90" x="246" y="34">
    <parameter key="attribute_name" value="Text"/>
    <parameter key="new_name" value="Product"/>
    </operator>
    <operator activated="true" class="replace" compatibility="7.2.000" expanded="true" height="82" name="Replace" width="90" x="380" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Product"/>
    <parameter key="replace_what" value="The following (.*) versions are affected:.*"/>
    <parameter key="replace_by" value="$1"/>
    </operator>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Generate Copy" to_port="example set input"/>
    <connect from_op="Generate Copy" from_port="example set output" to_op="Replace" to_port="example set input"/>
    <connect from_op="Replace" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
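For readers without RapidMiner at hand, the process above can be approximated in a few lines of Python. This is only a sketch of the same idea, not the operators' actual implementation: the `Text` and `Product` names and the regular expression come from the process, while `extract_product`, the list-of-dicts table, and the second row are assumptions made for illustration. Note that Python writes the group reference as `\1` where the process uses `$1`.

```python
import re

# Same expression as the "replace_what" parameter in the process.
PATTERN = re.compile(r"The following (.*) versions are affected:.*")

def extract_product(text):
    """Return text with the whole match replaced by capturing group 1."""
    return PATTERN.sub(r"\1", text)

rows = [
    # Sentence from the process above.
    "The following Philips XperIM Connect versions are affected: "
    "- XperIM Connect system running Windows XP, Version 1.5.12 and prior versions.",
    # Made-up row that does not follow the pattern.
    "No affected products are listed in this report.",
]

# Rough analogue of "Generate Copy" followed by "Replace":
# keep Text untouched and derive Product from a copy of it.
table = [{"Text": t, "Product": extract_product(t)} for t in rows]

for row in table:
    print(row["Product"])
```

In this sketch a row that does not match the pattern keeps its full text in `Product`, so it is worth checking for such rows before trusting the extracted column.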

    So you have a couple of options to explore now.

     

    Cheers,

    Ingo

Answers

  • zacev Member Posts: 6 Contributor II

    Hi,

    More precisely, I would like to extract information that has value for the end user. So instead of reading the whole document, the user could mine several reports and get the affected product names, as you mentioned. Could you expand a little on the possible solution in RapidMiner? Is it possible to get results using only RM?

     

    Edit: I've just discovered a plugin called Information Extraction for RM. There are several articles about it; maybe that would be an interesting solution too?
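As a rough illustration of mining the list items themselves, the sketch below collects the "- " entries that follow a line ending in "are affected:" and pulls a version number out of each. It is written in Python, not RapidMiner, and rests on assumptions: `affected_items` and `VERSION` are hypothetical names, the report layout is taken from the single example in the question, and the closing "OTHER SECTION" heading is a made-up placeholder.

```python
import re

def affected_items(report):
    """Collect the '- ' list items that follow a line ending in 'are affected:'."""
    items = []
    in_section = False
    for line in report.splitlines():
        stripped = line.strip()
        if stripped.endswith("are affected:"):
            in_section = True
        elif in_section and stripped.startswith("- "):
            items.append(stripped[2:])
        elif in_section and stripped:
            # The first non-empty line that is not a list item ends the section.
            in_section = False
    return items

# Dotted version number after the word "Version", e.g. "Version 1.5.12".
VERSION = re.compile(r"Version (\d+(?:\.\d+)*)")

# Layout from the question; "OTHER SECTION" is a made-up placeholder heading.
report = """AFFECTED PRODUCTS
The following Philips XperIM Connect versions are affected:
- XperIM Connect system running Windows XP, Version 1.5.12 and prior versions.

OTHER SECTION
"""

for item in affected_items(report):
    found = VERSION.search(item)
    print(item, "->", found.group(1) if found else None)
```

Real reports will likely vary in wording and layout, which is where a trained entity extraction model may become necessary.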

  • Robin1992 Member Posts: 5 Contributor II
    Hi, I have a similar problem and am still looking for a solution. Do you have the final model that you produced in RapidMiner?
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    I would recommend using Entity Extraction operators from either Rosette or Aylien.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    I agree with @Telcontar120. I will say I now guide everyone to Rosette rather than Aylien. Aylien no longer supports their extension and it has a high error rate (i.e. numerous bugs, not user errors).

    Scott
