Extract drug names

t_klok Member Posts: 3 Contributor I
edited December 2018 in Help

I am a medical doctor and doing research.

I have an Excel sheet with free text which contains drug names.

I want to extract these drug names and count how many drugs are mentioned in each field (Excel cell).

Any suggestions?


Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi @t_klok,

     

    this is one of those problems where I started with "hey, that's easy" and it turned out to be a 15-operator process. Maybe there is another way to do this? @sgenzer might find one :). Anyway, my solution is attached.

     

    You might want to link up with @SvenVanPoucke . He is a physician and our medical expert in the community.

     

    Best,

    Martin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="subprocess" compatibility="8.1.000" expanded="true" height="82" name="Subprocess (2)" width="90" x="45" y="34">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="8.1.000" expanded="true" height="68" name="Generate Data by User Specification (4)" width="90" x="45" y="34">
    <list key="attribute_values">
    <parameter key="text" value="&quot;This is a drug which includes mydrug&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="8.1.000" expanded="true" height="68" name="Generate Data by User Specification (5)" width="90" x="45" y="136">
    <list key="attribute_values">
    <parameter key="text" value="&quot;just a text&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="8.1.000" expanded="true" height="68" name="Generate Data by User Specification (6)" width="90" x="45" y="238">
    <list key="attribute_values">
    <parameter key="text" value="&quot;thirddrug in another text twice: thirddrug&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="append" compatibility="8.1.000" expanded="true" height="124" name="Append (2)" width="90" x="179" y="85"/>
    <connect from_op="Generate Data by User Specification (4)" from_port="output" to_op="Append (2)" to_port="example set 1"/>
    <connect from_op="Generate Data by User Specification (5)" from_port="output" to_op="Append (2)" to_port="example set 2"/>
    <connect from_op="Generate Data by User Specification (6)" from_port="output" to_op="Append (2)" to_port="example set 3"/>
    <connect from_op="Append (2)" from_port="merged set" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">Dummy data for drug texts. You can replace this with Read Excel</description>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="8.1.000" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
    <description align="center" color="transparent" colored="false" width="126">Att needs to be text to work with Process Documents</description>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="34">
    <parameter key="vector_creation" value="Term Occurrences"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34"/>
    <operator activated="false" class="operator_toolbox:FilterTokensUsingExampleSet" compatibility="0.11.000-SNAPSHOT" expanded="true" height="82" name="Filter Tokens Using ExampleSet" width="90" x="380" y="238">
    <parameter key="attribute" value="drugname"/>
    <description align="center" color="transparent" colored="false" width="126">only use specified drug names</description>
    </operator>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">Generate bag of words</description>
    </operator>
    <operator activated="true" class="subprocess" compatibility="8.1.000" expanded="true" height="82" name="Subprocess" width="90" x="447" y="136">
    <process expanded="true">
    <operator activated="true" class="generate_data_user_specification" compatibility="8.1.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
    <list key="attribute_values">
    <parameter key="drugname" value="&quot;mydrug&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="8.1.000" expanded="true" height="68" name="Generate Data by User Specification (2)" width="90" x="45" y="136">
    <list key="attribute_values">
    <parameter key="drugname" value="&quot;anotherdrug&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="generate_data_user_specification" compatibility="8.1.000" expanded="true" height="68" name="Generate Data by User Specification (3)" width="90" x="45" y="238">
    <list key="attribute_values">
    <parameter key="drugname" value="&quot;thirddrug&quot;"/>
    </list>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="append" compatibility="8.1.000" expanded="true" height="124" name="Append" width="90" x="179" y="85"/>
    <connect from_op="Generate Data by User Specification" from_port="output" to_op="Append" to_port="example set 1"/>
    <connect from_op="Generate Data by User Specification (2)" from_port="output" to_op="Append" to_port="example set 2"/>
    <connect from_op="Generate Data by User Specification (3)" from_port="output" to_op="Append" to_port="example set 3"/>
    <connect from_op="Append" from_port="merged set" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">Dummy data for drug names. You can replace this with read excel</description>
    </operator>
    <operator activated="true" class="set_role" compatibility="8.1.000" expanded="true" height="82" name="Set Role" width="90" x="581" y="136">
    <parameter key="attribute_name" value="drugname"/>
    <parameter key="target_role" value="id"/>
    <list key="set_additional_roles"/>
    <description align="center" color="transparent" colored="false" width="126">id will become header in transpose</description>
    </operator>
    <operator activated="true" class="transpose" compatibility="8.1.000" expanded="true" height="82" name="Transpose" width="90" x="715" y="136"/>
    <operator activated="true" class="data_to_weights" compatibility="8.1.000" expanded="true" height="82" name="Data to Weights" width="90" x="849" y="136"/>
    <operator activated="true" class="select_by_weights" compatibility="8.1.000" expanded="true" height="103" name="Select by Weights" width="90" x="1031" y="34">
    <description align="center" color="transparent" colored="false" width="126">Only let attributes through which were present in the lower exa</description>
    </operator>
    <operator activated="true" class="aggregate" compatibility="8.1.000" expanded="true" height="82" name="Aggregate" width="90" x="1184" y="34">
    <parameter key="use_default_aggregation" value="true"/>
    <parameter key="default_aggregation_function" value="sum"/>
    <list key="aggregation_attributes"/>
    </operator>
    <connect from_op="Subprocess (2)" from_port="out 1" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="Select by Weights" to_port="example set input"/>
    <connect from_op="Subprocess" from_port="out 1" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Transpose" to_port="example set input"/>
    <connect from_op="Transpose" from_port="example set output" to_op="Data to Weights" to_port="example set"/>
    <connect from_op="Data to Weights" from_port="weights" to_op="Select by Weights" to_port="weights"/>
    <connect from_op="Select by Weights" from_port="example set output" to_op="Aggregate" to_port="example set input"/>
    <connect from_op="Aggregate" from_port="example set output" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
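
For readers who want to follow the logic without opening the process, here is a minimal Python sketch of what the XML above appears to do: build a term-occurrence vector per row, keep only the columns that are drug names, and sum them. It is an illustration outside RapidMiner, not the process itself. The `tokenize` helper is an assumption that only roughly mimics the Tokenize operator, and the texts and drug names are the dummy values from the process.

```python
import re
from collections import Counter

# Dummy rows and dummy reference list, copied from the process above.
texts = [
    "This is a drug which includes mydrug",
    "just a text",
    "thirddrug in another text twice: thirddrug",
]
drug_names = {"mydrug", "anotherdrug", "thirddrug"}


def tokenize(text):
    """Split on non-letter characters (hypothetical stand-in for Tokenize)."""
    return [token for token in re.split(r"[^A-Za-z]+", text) if token]


# One term-occurrence vector (bag of words) per row.
word_vectors = [Counter(tokenize(text)) for text in texts]

# Keep only the drug-name columns, then sum them over all rows.
totals = Counter()
for vector in word_vectors:
    for token, count in vector.items():
        if token in drug_names:
            totals[token] += count

print(dict(totals))
```

Note that summing over all rows gives one total per drug for the whole sheet. A count per cell would need the per-row vectors kept separate instead of aggregated.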
  • t_klok Member Posts: 3 Contributor I

    Hi Martin,

     

    Rapid(Miner) answers...

    Thanks, I think I understand.

    But I would like to find the drug names using a list that already contains them.
    I do not want to enter all the reference drug names by hand.

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Hi,

     

    sure, you can replace the dummy-data subprocess with Read Excel and load the drug names from your file instead of generating them by hand. The generated rows were only there as dummy data.

     

    Best,

    Martin

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hi @t_klok - I'd want to see the data before really weighing in but just from what you describe I would use the Text Processing extension, tokenize, and then Filter Tokens (Dictionary) with the drug names. It's very similar to what @mschmitz built with his XML.


    Scott
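
The tokenize-then-dictionary-filter idea can be sketched in a few lines of Python to show what comes out per cell. This is only an illustration of the logic under stated assumptions, not the Text Processing extension: `count_drugs` is a hypothetical helper, the dictionary holds the dummy names from earlier in the thread, and matching is done on lowercased single-word tokens.

```python
import re


def count_drugs(cell_text, dictionary):
    """Tokenize one cell and keep only tokens found in the dictionary."""
    tokens = re.findall(r"[a-z]+", cell_text.lower())
    hits = [token for token in tokens if token in dictionary]
    return {"mentions": len(hits), "distinct": len(set(hits))}


# Dictionary entries must be lowercase, because the cell text is lowercased.
dictionary = {"mydrug", "anotherdrug", "thirddrug"}

cells = [
    "This is a drug which includes mydrug",
    "just a text",
    "thirddrug in another text twice: thirddrug",
]

per_cell = [count_drugs(cell, dictionary) for cell in cells]
for cell, counts in zip(cells, per_cell):
    print(counts, "<-", cell)
```

Reporting both numbers is a design choice: "mentions" counts every occurrence, while "distinct" counts each drug once per cell, which is probably closer to "how many drugs are noted in each field".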

     

  • DocMusher Member Posts: 333 Unicorn

    Hi, each country publishes a list of official drug names. Additionally, SNOMED can help you find drug names in a text.


  • DocMusher Member Posts: 333 Unicorn

    Hi,

    Please take a look at the technology Microsoft is testing: https://www.youtube.com/watch?v=c6exHAzNwy4#action=share

    Cheers Sven

  • t_klok Member Posts: 3 Contributor I

    Thank you all.

     

    I have a (large) list of drug names and I want to see whether the free-text fields in an Excel file contain any of these names.

    So I query an Excel file with free-text cells, and the reference is a file with all the drug names.

    I do not want to enter all these drug names one by one in RapidMiner.
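
As an illustration of the setup described here (a reference file of names, checked against free-text cells), the following is a minimal Python sketch outside RapidMiner. Everything in it is an assumption for the example: the reference list is taken to be a CSV export with a `drugname` header, the helper names are hypothetical, and "thirddrug forte" is an invented multi-word name added to show that longer names are matched before shorter ones.

```python
import csv
import io
import re


def load_drug_names(stream, column="drugname"):
    """Read names from a CSV stream that has a header row."""
    names = []
    for row in csv.DictReader(stream):
        name = (row.get(column) or "").strip()
        if name:
            names.append(name)
    return names


def build_matcher(names):
    """Compile one case-insensitive pattern covering all names."""
    if not names:
        raise ValueError("the reference list is empty")
    # Longest first, so a multi-word name wins over its shorter prefix.
    ordered = sorted({name.lower() for name in names}, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def drugs_in_cell(text, matcher):
    """Return the distinct reference names found in one free-text cell."""
    return sorted({match.group(0).lower() for match in matcher.finditer(text)})


# Stand-in for the reference file; in practice this would be an opened file.
reference = io.StringIO("drugname\nmydrug\nanotherdrug\nthirddrug\nthirddrug forte\n")
matcher = build_matcher(load_drug_names(reference))

found = drugs_in_cell("Patient takes MyDrug and thirddrug forte daily", matcher)
print(found, len(found))
```

The word-boundary anchors assume that names start and end with a letter or digit. Names ending in punctuation would need a different boundary rule.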
