
RegEx for Names, Numbers and Email ids

7amritaarora7 Member Posts: 25 Contributor II
edited November 2018 in Help

Hi all

I want to extract names, numbers and emails from text, and I thought of doing so using RegEx. Can someone please suggest a regex that works well in RapidMiner?

Or is there another way to do this?

Thanks in advance

Amrita

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,529 RM Data Scientist

    Hi Amrita,

     

    I think it's very hard to design such a general regex. Email addresses are fairly easy with (.+)@(.+)\.(.+), but names?

    Have you tried the Aylien Extract Entities operator?
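
To make the regex idea concrete, here is a minimal Python sketch for the two easier cases, emails and numbers. The patterns are assumptions rather than anything RapidMiner ships: the email pattern is a slightly stricter variant of (.+)@(.+)\.(.+) that refuses whitespace and extra "@" signs, and "numbers" is assumed to mean phone-like digit runs. The function names and sample text are made up for illustration. The same patterns should carry over to RapidMiner's Java-flavored regular expressions, but that is untested here.

```python
import re

# Assumption: a deliberately simple email pattern. Unlike (.+)@(.+)\.(.+),
# it does not allow whitespace, commas, semicolons or extra "@" in an address.
EMAIL_RE = re.compile(r"[^\s@,;]+@[^\s@,;]+\.[A-Za-z]{2,}")

# Assumption: "numbers" means digit runs with an optional leading "+",
# where groups may be joined by a single space or hyphen (phone-like).
NUMBER_RE = re.compile(r"\+?\d+(?:[ -]\d+)*")


def extract_emails(text):
    """Return all substrings of text that look like email addresses."""
    return EMAIL_RE.findall(text)


def extract_numbers(text):
    """Return all phone-like digit sequences found in text."""
    return NUMBER_RE.findall(text)


# Hypothetical sample text, not taken from any real dataset.
sample = "Contact Mr Narender Choudhary at narender@example.com or +91 98765-43210."
print(extract_emails(sample))
print(extract_numbers(sample))
```

Names are left out on purpose: there is no comparable pattern for them, which is why an entity extraction operator is the more realistic route.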

     

    ~martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • 7amritaarora7 Member Posts: 25 Contributor II

    Hi @mschmitz

    Thanks for the suggestion about Extract Entities. I tried it and it works, but the accuracy is not good.

    For example, the database text contains "Mr Narender Choudhary", which is a name, but the operator extracts only "Choudhary", not the entire name.

    Is there any other solution for extracting names, or a way to improve this operator?


  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Oh, I have this issue all the time. I use the Generate Attributes operator and lots of text parsing. Here's an example of a "building block" I use that takes a person's name in the form "LastName, FirstName MiddleInitial" and creates three new attributes for Last, First and MI.

     

    <?xml version="1.0" encoding="UTF-8"?>
    <process version="7.2.002">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.2.002" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_attributes" compatibility="7.2.002" expanded="true" height="82" name="Generate Attributes" width="90" x="179" y="34">
            <list key="function_descriptions">
              <parameter key="StudentLastOrSurname" value="upper(prefix([Student_Name],index([Student_Name],&quot;,&quot;)))"/>
              <parameter key="FN" value="suffix([Student_Name],length([Student_Name])-length([StudentLastOrSurname])-2)"/>
              <parameter key="StudentFirstName" value="upper(prefix([FN],index([FN],&quot; &quot;)))"/>
              <parameter key="StudentMI" value="if(contains([FN],&quot; &quot;),&#10;upper(suffix([FN],length([FN])-index([FN],&quot; &quot;)-1)),&quot;&quot;)"/>
            </list>
          </operator>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>
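
For anyone who wants to check the parsing logic outside RapidMiner, the same split can be sketched in a few lines of Python. This is an illustration of the idea, not a translation of the operator: `split_name` is a hypothetical helper, and the handling of names without a comma is an assumption. It also makes the main limitation visible: the expressions rely on a comma after the last name, so a value such as "Mr Narender Choudhary" will not split.

```python
def split_name(student_name):
    """Split 'LastName, FirstName MiddleInitial' into (last, first, mi), uppercased."""
    last, comma, rest = student_name.partition(",")
    if not comma:
        # Assumption: a name without a comma is returned unsplit as the
        # "last name", with empty first name and middle initial.
        return student_name.strip().upper(), "", ""
    # Everything after the comma is "FirstName" or "FirstName MiddleInitial".
    first, _, mi = rest.strip().partition(" ")
    return last.strip().upper(), first.upper(), mi.strip().upper()


print(split_name("Choudhary, Narender K"))
print(split_name("Choudhary, Narender"))
print(split_name("Mr Narender Choudhary"))
```

If the data mixes both formats, checking for the comma first and routing the remaining rows to a different rule is probably the simplest fix.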

     

    Scott

     

  • 7amritaarora7 Member Posts: 25 Contributor II

    Hi @sgenzer

    I tried your solution with my database, but it didn't work.

    Any other way out?

     

    Regards

    Amrita

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hmm, can you post your process and a few rows of your data so I can take a look at it?

  • 7amritaarora7 Member Posts: 25 Contributor II

    Hi @sgenzer @mschmitz

     

    Thanks for the help. I worked on it again, and the issue is now solved using the Extract Entities operator. One thing to keep in mind when using this operator: the data being analyzed should be meaningful (no unnecessary special characters or line breaks). If it is not, the operator does not give accurate results.
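
The clean-up described above can be sketched roughly as follows in Python. `clean_text` is a hypothetical helper, and the set of characters it keeps (letters, digits, whitespace and a little punctuation) is an assumption that would need adjusting to the actual data. Inside RapidMiner, a Replace operator with similar regular expressions would likely do the same job.

```python
import re


def clean_text(text):
    """Remove line breaks and stray special characters before entity extraction."""
    # Turn line breaks and tabs into plain spaces.
    text = re.sub(r"[\r\n\t]+", " ", text)
    # Assumption: keep word characters, whitespace and . , @ ' + -
    # (enough for names, emails and phone numbers); blank out everything else.
    text = re.sub(r"[^\w\s.,@'+-]", " ", text)
    # Collapse the runs of spaces left behind by the two steps above.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical messy input, not taken from any real dataset.
raw = "Mr Narender\nChoudhary ### <narender@example.com>\r\n"
print(clean_text(raw))
```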

    Thanks again! :)

     

    Regards

    Amrita

  • 7amritaarora7 Member Posts: 25 Contributor II

    @mschmitz @sgenzer

     

    Hi

     

    Now that I have a process running perfectly using the Extract Entities operator by Aylien, my next step is to create a process of my own that works at least as well as Aylien, with a few improvements as an add-on. Will I need to train and test each category within Extract Entities, or is there another solution?

    I need some guidance on how to proceed with creating something similar to Extract Entities.


    Thanks in advance

    Regards

    Amrita
