Before, after, timestamp paragraph

DocMusher Member Posts: 333 Unicorn
edited November 2018 in Help

How can paragraphs be extracted from free text that uses different types of date-time representation?

 

Case:

Patient x, birth date February 5 1960

At age of 5 years, dental surgery

20/10/2012 laparoscopy with postoperative infection with pseudomonas, allergy for antibiotics without further investigation

2010 traffic accident

1976-03-10 ankle surgery

Today admitted to ICU

 

This text should be processed into two columns: Text and Date.

If this extraction is bulletproof, it would be possible to build a Sankey chart from a cohort of patients.

 

Thanks

Sven

Sankey.png

Answers

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    So each line becomes a text & date? 

     

    Date | Text

    N/A | At age of 5 years, dental surgery

    20/10/2012 | laparoscopy with postoperative infection with pseudomonas, allergy for antibiotics without further investigation

    01/06/2010 | traffic accident *estimated

    1976-03-10 | ankle surgery

    now() | admitted to ICU

     

     

     
  • DocMusher Member Posts: 333 Unicorn

    Thanks for the feedback.

    In fact I need a date column and a text column. The date sometimes has to be calculated, e.g. "at 5 years" should be calculated from the date of birth. The problem is that in free text a date sometimes relates to the text that follows it and sometimes to the text that precedes it.

    Is there a trick to end up with 2 columns, text and date?

    Thanks
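
To make the "date before or after the text" problem concrete, here is a minimal Python sketch that searches for a date anywhere in a line and returns whatever is left as the text. It is only an illustration outside RapidMiner: the function name `split_date_and_text` is made up for this example, and the two date shapes are the numeric ones from the case above.

```python
import re

# Date shapes taken from the example case; extend the list for other formats.
DATE_PATTERN = re.compile(
    r"\b(\d{4}-\d{1,2}-\d{1,2}"      # e.g. 1976-03-10
    r"|\d{1,2}/\d{1,2}/\d{4})\b"     # e.g. 20/10/2012
)


def split_date_and_text(line):
    """Return (date_string, text); date_string is None if no date is found."""
    match = DATE_PATTERN.search(line)
    if match is None:
        return None, line.strip()
    # Keep everything before and after the date as the text part.
    rest = line[:match.start()] + " " + line[match.end():]
    return match.group(1), " ".join(rest.split())


for line in ["20/10/2012 laparoscopy", "ankle surgery 1976-03-10"]:
    print(split_date_and_text(line))
```

The date is returned as a raw string here; converting it to a real date, and handling lines such as "2010 traffic accident" or "Today admitted to ICU", would need further rules.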

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    There is no simple "trick" IMO, it's going to get complex. However, with a structured database (perhaps look into Neo4j for this?) it should be possible.

     

    What you're looking to do is build relationships between date-stamped text entities and other date-stamped text entities.

    You then use those date-stamped text entities to help guide the extraction of new date-stamped text entities from text containing dates.
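
As a rough illustration of what "relationships between date-stamped entities" could look like once dates are extracted, the Python sketch below orders one patient's events by date and links each event to the next, then counts those links across a cohort, which is the kind of input a Sankey chart needs. This is a plain in-memory stand-in, not a graph database, and the function names and the patient ID are assumptions made for the example.

```python
from collections import Counter
from datetime import date


def transitions(events):
    """events: list of (date, label) for one patient -> list of (label, next_label)."""
    ordered = sorted(events, key=lambda event: event[0])
    return [(a[1], b[1]) for a, b in zip(ordered, ordered[1:])]


def count_transitions(cohort):
    """cohort: dict of patient_id -> events. Returns a Counter of (from, to) pairs."""
    counts = Counter()
    for events in cohort.values():
        counts.update(transitions(events))
    return counts


# Events from the example case; the 2010 date is the estimate used in the table above.
cohort = {
    "patient x": [
        (date(2012, 10, 20), "laparoscopy"),
        (date(1976, 3, 10), "ankle surgery"),
        (date(2010, 6, 1), "traffic accident"),
    ]
}
print(count_transitions(cohort))
```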

     

     

  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    Hi,

     

    Your best bet is to use a list of rules, built from experience, in Generate Attributes. For example:

     

     

    if(matches(text, "^ *([0-9]{4}-[0-9][0-9]?-[0-9][0-9]?) .*"), 
    date_parse_custom(replaceAll(text, "^ *([0-9-]+).*", "$1"), "yyyy-MM-dd", "en"),
    if(matches(text, "^ *([0-9][0-9]?/[0-9][0-9]?/[0-9]{4}) .*"),
    date_parse_custom(replaceAll(text, "^ *([0-9/]+).*", "$1"), "dd/MM/yyyy", "en"),
    if(matches(lower(text), "^ *(today|now).*"),
    date_now(),
    MISSING_DATE
    )
    )
    )

    This catches three of the dates in your input and is already becoming hard to read. Invalid dates such as 33/11/2017 can still cause conversion errors, so it would be best to apply this line by line inside an exception handler.
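
For comparison, the same three rules with the exception handling built in might look like this in Python. This is a sketch, not a port of the expression above: `parse_leading_date` and its `today` parameter are names invented for the example, and the word boundary after "today"/"now" is an addition so that words like "nowhere" do not match.

```python
import re
from datetime import date, datetime

# (regex for a leading date, strptime format), tried in order.
RULES = [
    (re.compile(r"^\s*(\d{4}-\d{1,2}-\d{1,2})\s"), "%Y-%m-%d"),
    (re.compile(r"^\s*(\d{1,2}/\d{1,2}/\d{4})\s"), "%d/%m/%Y"),
]
TODAY = re.compile(r"^\s*(today|now)\b", re.IGNORECASE)


def parse_leading_date(line, today=None):
    """Return a date for the line, or None when no rule yields a valid date."""
    for pattern, fmt in RULES:
        match = pattern.match(line)
        if match:
            try:
                return datetime.strptime(match.group(1), fmt).date()
            except ValueError:
                # Matched the shape but is not a real date, e.g. 33/11/2017.
                return None
    if TODAY.match(line):
        return today or date.today()
    return None


for line in ["1976-03-10 ankle surgery", "33/11/2017 follow-up"]:
    print(line, "->", parse_leading_date(line))
```

Returning `None` plays the role of `MISSING_DATE`: a bad row is flagged instead of stopping the whole run.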

     

     

    You might want to try using a library like lubridate in R to more easily convert the date string candidates you identified in the text.

     

    But it will be a mess anyway, I wish you good luck. 

     

    Regards,

    Balázs

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    @BalazsBarany the date extraction part is one of the easier parts. It's the link between those dates where it becomes trickier. (Which is why I suggested a graph database.)

     

    Here's an example of a process that I use for date extraction from text, where the date could be in multiple formats.

     

    https://community.rapidminer.com/t5/Original-Rapid-I-Forum/Extracting-date-from-textfiles/m-p/30203

     

    You could use this together with a predictive model trained on the historically captured text data to add context such as "5 years old" = date of birth + 5 years.
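
The "date of birth + 5 years" arithmetic itself needs no model once the phrase is recognised. A minimal Python sketch, assuming the phrase has the shape "age of N years" as in the example case (the pattern and the function name `date_from_age` are illustrative only):

```python
import re
from datetime import date

# Assumed phrasing, taken from the example line "At age of 5 years, dental surgery".
AGE_PATTERN = re.compile(r"\bage of (\d+) years?\b", re.IGNORECASE)


def date_from_age(line, birth_date):
    """Return the birthday on which the stated age was reached, or None."""
    match = AGE_PATTERN.search(line)
    if match is None:
        return None
    target_year = birth_date.year + int(match.group(1))
    try:
        return birth_date.replace(year=target_year)
    except ValueError:
        # Born on 29 February and the target year is not a leap year.
        return birth_date.replace(year=target_year, day=28)


print(date_from_age("At age of 5 years, dental surgery", date(1960, 2, 5)))
```

The result is only the start of a one-year window, so it should probably be flagged as estimated, like the 2010 entry in the table earlier in the thread.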

     

     

     
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    There's probably some good fuzzy matching solution to this... It would be very nice if RapidMiner had some fancy algorithms in some kind of data prep wizard to do that for us, eh? Just sayin', @IngoRM...  :)

     

    Meanwhile, I was actually doing something similar a year or so ago: trying to parse dates out of newspaper death notices. Here is the block I created (it's a complete mess, but maybe something inside is useful for you? It's a 100% brute-force solution):

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="subprocess" compatibility="8.2.000" expanded="true" height="124" name="DateOfObit" width="90" x="45" y="34">
    <process expanded="true">
    <operator activated="true" class="subprocess" compatibility="8.2.000" expanded="true" height="82" name="generate DateOfObitRAW" width="90" x="45" y="340">
    <process expanded="true">
    <operator activated="true" class="multiply" compatibility="8.2.000" expanded="true" height="103" name="Multiply (23)" width="90" x="45" y="85"/>
    <operator activated="true" class="generate_id" compatibility="8.2.000" expanded="true" height="82" name="Generate ID (42)" width="90" x="179" y="442"/>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Slice DateOfObitRAW" width="90" x="246" y="34">
    <parameter key="create_word_vector" value="false"/>
    <parameter key="add_meta_information" value="false"/>
    <parameter key="keep_text" value="true"/>
    <parameter key="select_attributes_and_weights" value="true"/>
    <list key="specify_weights">
    <parameter key="att1" value="1.0"/>
    </list>
    <process expanded="true">
    <operator activated="true" class="text:keep_document_parts" compatibility="7.5.000" expanded="true" height="68" name="DateOfObit Regex" width="90" x="313" y="34">
    <parameter key="extraction_regex" value="\s\w+\W+[0-9]+\W+1[0-9]+"/>
    </operator>
    <connect from_port="document" to_op="DateOfObit Regex" to_port="document"/>
    <connect from_op="DateOfObit Regex" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="generate_id" compatibility="8.2.000" expanded="true" height="82" name="Generate ID (43)" width="90" x="380" y="34"/>
    <operator activated="true" class="set_role" compatibility="8.2.000" expanded="true" height="82" name="Set Role (17)" width="90" x="514" y="34">
    <parameter key="attribute_name" value="text"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes (74)" width="90" x="648" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    </operator>
    <operator activated="true" class="join" compatibility="8.2.000" expanded="true" height="82" name="Join (32)" width="90" x="782" y="391">
    <list key="key_attributes"/>
    </operator>
    <operator activated="true" class="trim" compatibility="8.2.000" expanded="true" height="82" name="Trim (35)" width="90" x="916" y="136"/>
    <operator activated="true" class="rename" compatibility="8.2.000" expanded="true" height="82" name="Rename (55)" width="90" x="1050" y="136">
    <parameter key="old_name" value="text"/>
    <parameter key="new_name" value="DateOfObitRAW"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <operator activated="true" class="text_to_nominal" compatibility="8.2.000" expanded="true" height="82" name="Text to Nominal (4)" width="90" x="1184" y="136"/>
    <operator activated="true" class="replace" compatibility="8.2.000" expanded="true" height="82" name="Replace (80)" width="90" x="1318" y="136">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="DateOfObitRAW"/>
    <parameter key="replace_what" value=","/>
    <description align="center" color="transparent" colored="false" width="126">get rid of commas</description>
    </operator>
    <connect from_port="in 1" to_op="Multiply (23)" to_port="input"/>
    <connect from_op="Multiply (23)" from_port="output 1" to_op="Slice DateOfObitRAW" to_port="example set"/>
    <connect from_op="Multiply (23)" from_port="output 2" to_op="Generate ID (42)" to_port="example set input"/>
    <connect from_op="Generate ID (42)" from_port="example set output" to_op="Join (32)" to_port="right"/>
    <connect from_op="Slice DateOfObitRAW" from_port="example set" to_op="Generate ID (43)" to_port="example set input"/>
    <connect from_op="Generate ID (43)" from_port="example set output" to_op="Set Role (17)" to_port="example set input"/>
    <connect from_op="Set Role (17)" from_port="example set output" to_op="Select Attributes (74)" to_port="example set input"/>
    <connect from_op="Select Attributes (74)" from_port="example set output" to_op="Join (32)" to_port="left"/>
    <connect from_op="Join (32)" from_port="join" to_op="Trim (35)" to_port="example set input"/>
    <connect from_op="Trim (35)" from_port="example set output" to_op="Rename (55)" to_port="example set input"/>
    <connect from_op="Rename (55)" from_port="example set output" to_op="Text to Nominal (4)" to_port="example set input"/>
    <connect from_op="Text to Nominal (4)" from_port="example set output" to_op="Replace (80)" to_port="example set input"/>
    <connect from_op="Replace (80)" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">generate DateOfObitRAW</description>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="8.2.000" expanded="true" height="103" name="Filter Examples (59)" width="90" x="179" y="340">
    <parameter key="parameter_expression" value="finds(att1,&quot;.*[0-9].*&quot;)"/>
    <parameter key="condition_class" value="expression"/>
    <list key="filters_list"/>
    <description align="center" color="transparent" colored="false" width="126">contains a number (likely contains a date of obit)</description>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="8.2.000" expanded="true" height="103" name="Filter Examples (60)" width="90" x="313" y="34">
    <parameter key="parameter_expression" value="finds(att1,&quot;.*[0-9].*&quot;)"/>
    <list key="filters_list">
    <parameter key="filters_entry_key" value="DateOfObitRAW.is_not_missing."/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">DateOfObitRAW is not missing</description>
    </operator>
    <operator activated="true" breakpoints="after" class="subprocess" compatibility="8.2.000" expanded="true" height="124" name="fix wonky DateOfObitRAW records" width="90" x="514" y="136">
    <process expanded="true">
    <operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes (95)" width="90" x="45" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="DateOfObitRAW"/>
    <parameter key="invert_selection" value="true"/>
    <description align="center" color="transparent" colored="false" width="126">get rid of DateOfObitRAW</description>
    </operator>
    <operator activated="true" class="generate_copy" compatibility="8.2.000" expanded="true" height="82" name="Generate Copy (12)" width="90" x="179" y="34">
    <parameter key="attribute_name" value="att1"/>
    <parameter key="new_name" value="DateOfObitRAW"/>
    <description align="center" color="transparent" colored="false" width="126">copy att1 to DateOfObitRAW</description>
    </operator>
    <operator activated="true" class="subprocess" compatibility="8.2.000" expanded="true" height="82" name="Subprocess (187)" width="90" x="313" y="34">
    <process expanded="true">
    <operator activated="true" class="multiply" compatibility="8.2.000" expanded="true" height="103" name="Multiply (29)" width="90" x="45" y="34"/>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data (8)" width="90" x="246" y="34">
    <parameter key="create_word_vector" value="false"/>
    <parameter key="add_meta_information" value="false"/>
    <parameter key="keep_text" value="true"/>
    <parameter key="select_attributes_and_weights" value="true"/>
    <list key="specify_weights">
    <parameter key="DateOfObitRAW" value="1.0"/>
    </list>
    <process expanded="true">
    <operator activated="false" class="text:remove_document_parts" compatibility="7.5.000" expanded="true" height="68" name="Remove Document Parts" width="90" x="112" y="34">
    <parameter key="deletion_regex" value="[A-Za-z][a-z]+[-,]"/>
    <description align="center" color="transparent" colored="false" width="126">word with comma at end</description>
    </operator>
    <operator activated="true" class="text:remove_document_parts" compatibility="7.5.000" expanded="true" height="68" name="Remove Document Parts (2)" width="90" x="246" y="34">
    <parameter key="deletion_regex" value="[A-Z][-.]"/>
    <description align="center" color="transparent" colored="false" width="126">middle initials</description>
    </operator>
    <operator activated="true" class="text:remove_document_parts" compatibility="7.5.000" expanded="true" height="68" name="Remove Document Parts (3)" width="90" x="380" y="34">
    <parameter key="deletion_regex" value="[BCEGHIKLPQRTUVWXYZ][a-z]+"/>
    <description align="center" color="transparent" colored="false" width="126">words starting with any letter that is not a month letter</description>
    </operator>
    <operator activated="true" class="text:remove_document_parts" compatibility="7.5.000" expanded="true" height="68" name="Remove Document Parts (4)" width="90" x="514" y="34">
    <parameter key="deletion_regex" value="\s[a-z]+"/>
    <description align="center" color="transparent" colored="false" width="126">any word that does not start with a capital letter</description>
    </operator>
    <operator activated="true" class="text:remove_document_parts" compatibility="7.5.000" expanded="true" height="68" name="Remove Document Parts (6)" width="90" x="648" y="34">
    <parameter key="deletion_regex" value="A[^up][a-z]+|Au[^g][a-z]+|Ap[^r][a-z]+|August[a-z]+"/>
    <description align="center" color="transparent" colored="false" width="126">Words that start with A that are not Aug or Apr</description>
    </operator>
    <operator activated="true" class="text:remove_document_parts" compatibility="7.5.000" expanded="true" height="68" name="Remove Document Parts (5)" width="90" x="782" y="34">
    <parameter key="deletion_regex" value="M[^a][a-z]+|Ma[^ry][a-z]+|Mar[^c][a-z]+|Mary"/>
    <description align="center" color="transparent" colored="false" width="126">Words that start with M that are not May or March</description>
    </operator>
    <operator activated="true" class="text:remove_document_parts" compatibility="7.5.000" expanded="true" height="68" name="Remove Document Parts (7)" width="90" x="916" y="34">
    <parameter key="deletion_regex" value="F[^e][a-z]+|Fe[^b][a-z]+"/>
    <description align="center" color="transparent" colored="false" width="126">Words that start with F that are not Feb</description>
    </operator>
    <operator activated="true" class="text:remove_document_parts" compatibility="7.5.000" expanded="true" height="68" name="Remove Document Parts (8)" width="90" x="1050" y="34">
    <parameter key="deletion_regex" value="N[^o][a-z]+|No[^v][a-z]+"/>
    <description align="center" color="transparent" colored="false" width="126">Words that start with N that are not Nov</description>
    </operator>
    <operator activated="true" class="text:remove_document_parts" compatibility="7.5.000" expanded="true" height="68" name="Remove Document Parts (9)" width="90" x="1184" y="34">
    <parameter key="deletion_regex" value="D[^e][a-z]+|De[^c][a-z]+|Dec[^e][a-z]+|De[^c]\s"/>
    <description align="center" color="transparent" colored="false" width="126">Words that start with D that are not Dec</description>
    </operator>
    <operator activated="true" class="text:remove_document_parts" compatibility="7.5.000" expanded="true" height="68" name="Remove Document Parts (10)" width="90" x="1318" y="34">
    <parameter key="deletion_regex" value="S[^e][a-z]+|Se[^p][a-z]+"/>
    <description align="center" color="transparent" colored="false" width="126">Words that start with S that are not Sep</description>
    </operator>
    <operator activated="true" class="text:remove_document_parts" compatibility="7.5.000" expanded="true" height="68" name="Remove Document Parts (11)" width="90" x="1452" y="34">
    <parameter key="deletion_regex" value="J[^au][a-z]+|Ja[^n][a-z]+|Jan[^u][a-z]+|Ju[^ln][a-z]+|Ja[^n][a-z]+|Jul[^y][a-z]+|Ja[^n]|Jane"/>
    <description align="center" color="transparent" colored="false" width="126">Words that start with J that are not Jun or Jul or Jan</description>
    </operator>
    <operator activated="true" class="text:remove_document_parts" compatibility="7.5.000" expanded="true" height="68" name="Remove Document Parts (12)" width="90" x="1586" y="34">
    <parameter key="deletion_regex" value="[A-Z]\W+|[A-Z][a-z]\W+|[0-9][a-z]\W|th|\W[a-z]+"/>
    <description align="center" color="transparent" colored="false" width="126">Single letters followed by non-word chars</description>
    </operator>
    <connect from_port="document" to_op="Remove Document Parts (2)" to_port="document"/>
    <connect from_op="Remove Document Parts (2)" from_port="document" to_op="Remove Document Parts (3)" to_port="document"/>
    <connect from_op="Remove Document Parts (3)" from_port="document" to_op="Remove Document Parts (4)" to_port="document"/>
    <connect from_op="Remove Document Parts (4)" from_port="document" to_op="Remove Document Parts (6)" to_port="document"/>
    <connect from_op="Remove Document Parts (6)" from_port="document" to_op="Remove Document Parts (5)" to_port="document"/>
    <connect from_op="Remove Document Parts (5)" from_port="document" to_op="Remove Document Parts (7)" to_port="document"/>
    <connect from_op="Remove Document Parts (7)" from_port="document" to_op="Remove Document Parts (8)" to_port="document"/>
    <connect from_op="Remove Document Parts (8)" from_port="document" to_op="Remove Document Parts (9)" to_port="document"/>
    <connect from_op="Remove Document Parts (9)" from_port="document" to_op="Remove Document Parts (10)" to_port="document"/>
    <connect from_op="Remove Document Parts (10)" from_port="document" to_op="Remove Document Parts (11)" to_port="document"/>
    <connect from_op="Remove Document Parts (11)" from_port="document" to_op="Remove Document Parts (12)" to_port="document"/>
    <connect from_op="Remove Document Parts (12)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="generate_id" compatibility="8.2.000" expanded="true" height="82" name="Generate ID (50)" width="90" x="246" y="136"/>
    <operator activated="true" class="generate_id" compatibility="8.2.000" expanded="true" height="82" name="Generate ID (51)" width="90" x="380" y="34"/>
    <operator activated="true" class="set_role" compatibility="8.2.000" expanded="true" height="82" name="Set Role (20)" width="90" x="514" y="34">
    <parameter key="attribute_name" value="text"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes (96)" width="90" x="648" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    </operator>
    <operator activated="true" class="join" compatibility="8.2.000" expanded="true" height="82" name="Join (37)" width="90" x="782" y="85">
    <list key="key_attributes"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes (97)" width="90" x="916" y="85">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="DateOfObitRAW"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <operator activated="true" class="rename" compatibility="8.2.000" expanded="true" height="82" name="Rename (76)" width="90" x="1050" y="85">
    <parameter key="old_name" value="text"/>
    <parameter key="new_name" value="DateOfObitRAW"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <operator activated="true" class="trim" compatibility="8.2.000" expanded="true" height="82" name="Trim (49)" width="90" x="1184" y="85"/>
    <connect from_port="in 1" to_op="Multiply (29)" to_port="input"/>
    <connect from_op="Multiply (29)" from_port="output 1" to_op="Process Documents from Data (8)" to_port="example set"/>
    <connect from_op="Multiply (29)" from_port="output 2" to_op="Generate ID (50)" to_port="example set input"/>
    <connect from_op="Process Documents from Data (8)" from_port="example set" to_op="Generate ID (51)" to_port="example set input"/>
    <connect from_op="Generate ID (50)" from_port="example set output" to_op="Join (37)" to_port="right"/>
    <connect from_op="Generate ID (51)" from_port="example set output" to_op="Set Role (20)" to_port="example set input"/>
    <connect from_op="Set Role (20)" from_port="example set output" to_op="Select Attributes (96)" to_port="example set input"/>
    <connect from_op="Select Attributes (96)" from_port="example set output" to_op="Join (37)" to_port="left"/>
    <connect from_op="Join (37)" from_port="join" to_op="Select Attributes (97)" to_port="example set input"/>
    <connect from_op="Select Attributes (97)" from_port="example set output" to_op="Rename (76)" to_port="example set input"/>
    <connect from_op="Rename (76)" from_port="example set output" to_op="Trim (49)" to_port="example set input"/>
    <connect from_op="Trim (49)" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">1ST TRY</description>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="8.2.000" expanded="true" height="103" name="Filter Examples (61)" width="90" x="447" y="289">
    <parameter key="parameter_expression" value="!finds(DateOfObitRAW,&quot;[-.,?/]&quot;)"/>
    <parameter key="condition_class" value="expression"/>
    <list key="filters_list"/>
    <description align="center" color="transparent" colored="false" width="126">no punctuation</description>
    </operator>
    <operator activated="true" class="subprocess" compatibility="8.2.000" expanded="true" height="82" name="small punctuation fixes" width="90" x="581" y="340">
    <process expanded="true">
    <operator activated="true" class="trim" compatibility="8.2.000" expanded="true" height="82" name="Trim (50)" width="90" x="45" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="DateOfObitRAW"/>
    </operator>
    <operator activated="true" class="replace" compatibility="8.2.000" expanded="true" height="82" name="Replace (105)" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="DateOfObitRAW"/>
    <parameter key="replace_what" value="[()]|[.]\s+"/>
    <parameter key="replace_by" value=" "/>
    </operator>
    <operator activated="true" class="replace" compatibility="8.2.000" expanded="true" height="82" name="Replace (106)" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="DateOfObitRAW"/>
    <parameter key="replace_what" value="[-,]\s|\s\s+"/>
    <parameter key="replace_by" value=" "/>
    </operator>
    <operator activated="true" class="trim" compatibility="8.2.000" expanded="true" height="82" name="Trim (51)" width="90" x="447" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="DateOfObitRAW"/>
    </operator>
    <connect from_port="in 1" to_op="Trim (50)" to_port="example set input"/>
    <connect from_op="Trim (50)" from_port="example set output" to_op="Replace (105)" to_port="example set input"/>
    <connect from_op="Replace (105)" from_port="example set output" to_op="Replace (106)" to_port="example set input"/>
    <connect from_op="Replace (106)" from_port="example set output" to_op="Trim (51)" to_port="example set input"/>
    <connect from_op="Trim (51)" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">small punctuation fixes</description>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="8.2.000" expanded="true" height="103" name="Filter Examples (63)" width="90" x="715" y="340">
    <parameter key="parameter_expression" value="finds(prefix(DateOfObitRAW,1),&quot;[A-Z]&quot;)"/>
    <parameter key="condition_class" value="expression"/>
    <list key="filters_list"/>
    <description align="center" color="transparent" colored="false" width="126">starts with a capital letter</description>
    </operator>
    <operator activated="true" class="filter_examples" compatibility="8.2.000" expanded="true" height="103" name="Filter Examples (62)" width="90" x="581" y="136">
    <parameter key="parameter_expression" value="!finds(DateOfObitRAW,&quot;[-.,?/]&quot;)"/>
    <list key="filters_list">
    <parameter key="filters_entry_key" value="DateOfObitRAW.starts_with.1"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">starts with a number (year only)</description>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes (145)" width="90" x="715" y="34">
    <list key="function_descriptions">
    <parameter key="YearOfObit" value="DateOfObitRAW"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">YearOfObit</description>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes (99)" width="90" x="849" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="DateOfObitRAW"/>
    <parameter key="invert_selection" value="true"/>
    <description align="center" color="transparent" colored="false" width="126">get rid of DateOfObitRAW</description>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes (146)" width="90" x="715" y="187">
    <list key="function_descriptions">
    <parameter key="YearOfObit" value="suffix(DateOfObitRAW,4)"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">YearOfObit</description>
    </operator>
    <operator activated="true" class="subprocess" compatibility="8.2.000" expanded="true" height="82" name="Subprocess (188)" width="90" x="849" y="187">
    <process expanded="true">
    <operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes (147)" width="90" x="45" y="34">
    <list key="function_descriptions">
    <parameter key="MonthOfObit" value="prefix(DateOfObitRAW,index(DateOfObitRAW,&quot; &quot;))"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">MonthOfObit</description>
    </operator>
    <operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve month lookup (4)" width="90" x="45" y="187">
    <parameter key="repository_entry" value="month lookup"/>
    </operator>
    <operator activated="true" class="join" compatibility="8.2.000" expanded="true" height="82" name="Join (38)" width="90" x="179" y="34">
    <parameter key="join_type" value="left"/>
    <parameter key="use_id_attribute_as_key" value="false"/>
    <list key="key_attributes">
    <parameter key="MonthOfObit" value="MonthRAW"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes (98)" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="MonthOfObit"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <operator activated="true" class="rename" compatibility="8.2.000" expanded="true" height="82" name="Rename (77)" width="90" x="447" y="34">
    <parameter key="old_name" value="MonthMMM"/>
    <parameter key="new_name" value="MonthOfObit"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <connect from_port="in 1" to_op="Generate Attributes (147)" to_port="example set input"/>
    <connect from_op="Generate Attributes (147)" from_port="example set output" to_op="Join (38)" to_port="left"/>
    <connect from_op="Retrieve month lookup (4)" from_port="output" to_op="Join (38)" to_port="right"/>
    <connect from_op="Join (38)" from_port="join" to_op="Select Attributes (98)" to_port="example set input"/>
    <connect from_op="Select Attributes (98)" from_port="example set output" to_op="Rename (77)" to_port="example set input"/>
    <connect from_op="Rename (77)" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">parse out MonthOfObit</description>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes (148)" width="90" x="983" y="187">
    <list key="function_descriptions">
    <parameter key="DayOfObit" value="cut(DateOfObitRAW,index(DateOfObitRAW,&quot; &quot;)+1,2)"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">DayOfObit</description>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes (100)" width="90" x="1117" y="187">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="DateOfObitRAW"/>
    <parameter key="invert_selection" value="true"/>
    <description align="center" color="transparent" colored="false" width="126">get rid of DateOfObitRAW</description>
    </operator>
    <operator activated="true" class="union" compatibility="8.2.000" expanded="true" height="82" name="Union (55)" width="90" x="1251" y="34"/>
    <connect from_port="in 1" to_op="Select Attributes (95)" to_port="example set input"/>
    <connect from_op="Select Attributes (95)" from_port="example set output" to_op="Generate Copy (12)" to_port="example set input"/>
    <connect from_op="Generate Copy (12)" from_port="example set output" to_op="Subprocess (187)" to_port="in 1"/>
    <connect from_op="Subprocess (187)" from_port="out 1" to_op="Filter Examples (61)" to_port="example set input"/>
    <connect from_op="Filter Examples (61)" from_port="example set output" to_op="Filter Examples (62)" to_port="example set input"/>
    <connect from_op="Filter Examples (61)" from_port="unmatched example set" to_op="small punctuation fixes" to_port="in 1"/>
    <connect from_op="small punctuation fixes" from_port="out 1" to_op="Filter Examples (63)" to_port="example set input"/>
    <connect from_op="Filter Examples (63)" from_port="example set output" to_port="out 2"/>
    <connect from_op="Filter Examples (63)" from_port="unmatched example set" to_port="out 3"/>
    <connect from_op="Filter Examples (62)" from_port="example set output" to_op="Generate Attributes (145)" to_port="example set input"/>
    <connect from_op="Filter Examples (62)" from_port="unmatched example set" to_op="Generate Attributes (146)" to_port="example set input"/>
    <connect from_op="Generate Attributes (145)" from_port="example set output" to_op="Select Attributes (99)" to_port="example set input"/>
    <connect from_op="Select Attributes (99)" from_port="example set output" to_op="Union (55)" to_port="example set 1"/>
    <connect from_op="Generate Attributes (146)" from_port="example set output" to_op="Subprocess (188)" to_port="in 1"/>
    <connect from_op="Subprocess (188)" from_port="out 1" to_op="Generate Attributes (148)" to_port="example set input"/>
    <connect from_op="Generate Attributes (148)" from_port="example set output" to_op="Select Attributes (100)" to_port="example set input"/>
    <connect from_op="Select Attributes (100)" from_port="example set output" to_op="Union (55)" to_port="example set 2"/>
    <connect from_op="Union (55)" from_port="union" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    <portSpacing port="sink_out 3" spacing="0"/>
    <portSpacing port="sink_out 4" spacing="0"/>
    <description align="center" color="yellow" colored="false" height="50" resized="false" width="119" x="1362" y="39">CLEAN</description>
    </process>
    </operator>
    <operator activated="true" class="append" compatibility="8.2.000" expanded="true" height="103" name="Append (8)" width="90" x="715" y="34"/>
    <operator activated="true" class="subprocess" compatibility="8.2.000" expanded="true" height="82" name="Subprocess (191)" width="90" x="849" y="34">
    <process expanded="true">
    <operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes (111)" width="90" x="45" y="34">
    <list key="function_descriptions">
    <parameter key="MonthOfObit" value="prefix(DateOfObitRAW,index(DateOfObitRAW,&quot; &quot;))"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">MonthOfObit</description>
    </operator>
    <operator activated="true" class="retrieve" compatibility="8.2.000" expanded="true" height="68" name="Retrieve month lookup (3)" width="90" x="45" y="187">
    <parameter key="repository_entry" value="month lookup"/>
    </operator>
    <operator activated="true" class="join" compatibility="8.2.000" expanded="true" height="82" name="Join (33)" width="90" x="179" y="34">
    <parameter key="join_type" value="left"/>
    <parameter key="use_id_attribute_as_key" value="false"/>
    <list key="key_attributes">
    <parameter key="MonthOfObit" value="MonthRAW"/>
    </list>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="8.2.000" expanded="true" height="82" name="Select Attributes (75)" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="MonthOfObit"/>
    <parameter key="invert_selection" value="true"/>
    </operator>
    <operator activated="true" class="rename" compatibility="8.2.000" expanded="true" height="82" name="Rename (56)" width="90" x="447" y="34">
    <parameter key="old_name" value="MonthMMM"/>
    <parameter key="new_name" value="MonthOfObit"/>
    <list key="rename_additional_attributes"/>
    </operator>
    <connect from_port="in 1" to_op="Generate Attributes (111)" to_port="example set input"/>
    <connect from_op="Generate Attributes (111)" from_port="example set output" to_op="Join (33)" to_port="left"/>
    <connect from_op="Retrieve month lookup (3)" from_port="output" to_op="Join (33)" to_port="right"/>
    <connect from_op="Join (33)" from_port="join" to_op="Select Attributes (75)" to_port="example set input"/>
    <connect from_op="Select Attributes (75)" from_port="example set output" to_op="Rename (56)" to_port="example set input"/>
    <connect from_op="Rename (56)" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes (112)" width="90" x="983" y="34">
    <list key="function_descriptions">
    <parameter key="DayOfObit" value="cut(DateOfObitRAW,index(DateOfObitRAW,&quot; &quot;)+1,2)"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">DayOfObit</description>
    </operator>
    <operator activated="true" class="subprocess" compatibility="8.2.000" expanded="true" height="82" name="Subprocess (192)" width="90" x="1117" y="34">
    <process expanded="true">
    <operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes (113)" width="90" x="45" y="34">
    <list key="function_descriptions">
    <parameter key="YearOfObit" value="suffix(DateOfObitRAW,4)"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">YearOfObit</description>
    </operator>
    <operator activated="true" class="replace" compatibility="8.2.000" expanded="true" height="82" name="Replace (81)" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="YearOfObit"/>
    <parameter key="replace_what" value="[--]|\s"/>
    <description align="center" color="transparent" colored="false" width="126">get rid of space and dash in YearOfObit</description>
    </operator>
    <operator activated="true" class="declare_missing_value" compatibility="7.1.001" expanded="true" height="82" name="Declare Missing Value (58)" width="90" x="313" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="YearOfObit"/>
    <parameter key="mode" value="expression"/>
    <parameter key="expression_value" value="length(YearOfObit)&lt;4"/>
    <description align="center" color="transparent" colored="false" width="126">get rid of YearOfObit with length &amp;lt;4</description>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes (114)" width="90" x="447" y="34">
    <list key="function_descriptions">
    <parameter key="YearOfObit" value="if(prefix(YearOfObit,2)==&quot;10&quot;,concat(&quot;19&quot;,suffix(YearOfObit,2)),YearOfObit)"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">fix year 10xx</description>
    </operator>
    <connect from_port="in 1" to_op="Generate Attributes (113)" to_port="example set input"/>
    <connect from_op="Generate Attributes (113)" from_port="example set output" to_op="Replace (81)" to_port="example set input"/>
    <connect from_op="Replace (81)" from_port="example set output" to_op="Declare Missing Value (58)" to_port="example set input"/>
    <connect from_op="Declare Missing Value (58)" from_port="example set output" to_op="Generate Attributes (114)" to_port="example set input"/>
    <connect from_op="Generate Attributes (114)" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">YearOfObit</description>
    </operator>
    <operator activated="true" class="subprocess" compatibility="8.2.000" expanded="true" height="82" name="Subprocess (193)" width="90" x="1251" y="34">
    <process expanded="true">
    <operator activated="true" class="trim" compatibility="8.2.000" expanded="true" height="82" name="Trim (36)" width="90" x="45" y="136"/>
    <operator activated="true" class="filter_examples" compatibility="8.2.000" expanded="true" height="103" name="Filter Examples (46)" width="90" x="179" y="136">
    <list key="filters_list">
    <parameter key="filters_entry_key" value="YearOfObit.is_not_missing."/>
    <parameter key="filters_entry_key" value="MonthOfObit.is_not_missing."/>
    <parameter key="filters_entry_key" value="DayOfObit.is_not_missing."/>
    <parameter key="filters_entry_key" value="DayOfObit.does_not_contain.?"/>
    <parameter key="filters_entry_key" value="DayOfObit.does_not_contain.-"/>
    <parameter key="filters_entry_key" value="DayOfObit.does_not_contain.\."/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">day, month and year of obit are not missing</description>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="8.2.000" expanded="true" height="82" name="Generate Attributes (115)" width="90" x="313" y="34">
    <list key="function_descriptions">
    <parameter key="DateOfObitPARSE" value="date_parse(concat(MonthOfObit,&quot;/&quot;,DayOfObit,&quot;/&quot;,YearOfObit))"/>
    </list>
    <description align="center" color="transparent" colored="false" width="126">create DateOfObitPARSE</description>
    </operator>
    <operator activated="true" class="date_to_nominal" compatibility="8.1.002" expanded="true" height="82" name="Date to Nominal (4)" width="90" x="447" y="34">
    <parameter key="attribute_name" value="DateOfObitPARSE"/>
    <parameter key="date_format" value="MMM dd, yyyy"/>
    <parameter key="time_zone" value="SYSTEM"/>
    </operator>
    <operator activated="true" class="nominal_to_date" compatibility="8.2.000" expanded="true" height="82" name="Nominal to Date (4)" width="90" x="581" y="34">
    <parameter key="attribute_name" value="DateOfObitPARSE"/>
    <parameter key="date_format" value="MMM dd, yyyy"/>
    <parameter key="time_zone" value="SYSTEM"/>
    </operator>
    <operator activated="true" class="union" compatibility="8.2.000" expanded="true" height="82" name="Union (41)" width="90" x="715" y="187"/>
    <connect from_port="in 1" to_op="Trim (36)" to_port="example set input"/>
    <connect from_op="Trim (36)" from_port="example set output" to_op="Filter Examples (46)" to_port="example set input"/>
    <connect from_op="Filter Examples (46)" from_port="example set output" to_op="Generate Attributes (115)" to_port="example set input"/>
    <connect from_op="Filter Examples (46)" from_port="unmatched example set" to_op="Union (41)" to_port="example set 2"/>
    <connect from_op="Generate Attributes (115)" from_port="example set output" to_op="Date to Nominal (4)" to_port="example set input"/>
    <connect from_op="Date to Nominal (4)" from_port="example set output" to_op="Nominal to Date (4)" to_port="example set input"/>
    <connect from_op="Nominal to Date (4)" from_port="example set output" to_op="Union (41)" to_port="example set 1"/>
    <connect from_op="Union (41)" from_port="union" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    </process>
    <description align="center" color="transparent" colored="false" width="126">DateOfObitPARSE</description>
    </operator>
    <operator activated="true" class="append" compatibility="8.2.000" expanded="true" height="103" name="Append (7)" width="90" x="1318" y="391"/>
    <connect from_port="in 1" to_op="generate DateOfObitRAW" to_port="in 1"/>
    <connect from_op="generate DateOfObitRAW" from_port="out 1" to_op="Filter Examples (59)" to_port="example set input"/>
    <connect from_op="Filter Examples (59)" from_port="example set output" to_op="Filter Examples (60)" to_port="example set input"/>
    <connect from_op="Filter Examples (59)" from_port="unmatched example set" to_op="Append (7)" to_port="example set 2"/>
    <connect from_op="Filter Examples (60)" from_port="example set output" to_op="Append (8)" to_port="example set 1"/>
    <connect from_op="Filter Examples (60)" from_port="unmatched example set" to_op="fix wonky DateOfObitRAW records" to_port="in 1"/>
    <connect from_op="fix wonky DateOfObitRAW records" from_port="out 1" to_op="Append (8)" to_port="example set 2"/>
    <connect from_op="fix wonky DateOfObitRAW records" from_port="out 2" to_port="out 2"/>
    <connect from_op="fix wonky DateOfObitRAW records" from_port="out 3" to_port="out 3"/>
    <connect from_op="Append (8)" from_port="merged set" to_op="Subprocess (191)" to_port="in 1"/>
    <connect from_op="Subprocess (191)" from_port="out 1" to_op="Generate Attributes (112)" to_port="example set input"/>
    <connect from_op="Generate Attributes (112)" from_port="example set output" to_op="Subprocess (192)" to_port="in 1"/>
    <connect from_op="Subprocess (192)" from_port="out 1" to_op="Subprocess (193)" to_port="in 1"/>
    <connect from_op="Subprocess (193)" from_port="out 1" to_op="Append (7)" to_port="example set 1"/>
    <connect from_op="Append (7)" from_port="merged set" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="source_in 2" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    <portSpacing port="sink_out 3" spacing="0"/>
    <portSpacing port="sink_out 4" spacing="0"/>
    <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="674" y="156">delete me!!</description>
    <description align="center" color="yellow" colored="false" height="69" resized="false" width="126" x="835" y="112">fix missing DateOfObitRAW records</description>
    <description align="center" color="yellow" colored="false" height="50" resized="false" width="126" x="844" y="262">parse out MonthOfObit</description>
    </process>
    <description align="center" color="transparent" colored="false" width="126">DateOfObit</description>
    </operator>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    </process>
    </operator>
    </process>
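
For readers who do not have RapidMiner at hand, the parsing steps in the process above can be sketched roughly in Python. This is an illustration, not the process itself. The function names and the month table are hypothetical, the table stands in for the "month lookup" repository entry (whose contents are not shown in this thread), and the comma stripping on the day is an extra leniency that the operators shown do not perform.

```python
import re
from datetime import date
from typing import Optional

# Assumption: stand-in for the "month lookup" table (MonthRAW -> month number).
_MONTHS = ["january", "february", "march", "april", "may", "june", "july",
           "august", "september", "october", "november", "december"]


def lookup_month(raw: str) -> Optional[int]:
    """Map raw month text such as 'Feb.', 'Sept' or 'March' to 1..12."""
    key = raw.strip().rstrip(".").lower()
    for number, name in enumerate(_MONTHS, start=1):
        if key and (name == key or (len(key) >= 3 and name.startswith(key))):
            return number
    return None


def parse_obit_date(raw: str) -> Optional[date]:
    """Split a 'Month day year' string into parts and build a date, or return None."""
    raw = raw.strip()
    space = raw.find(" ")
    if space < 0:
        return None  # no month/day separator, nothing to split

    # MonthOfObit: text before the first space, resolved through the lookup.
    month = lookup_month(raw[:space])

    # DayOfObit: the two characters after the first space.
    # Stripping a trailing comma is an added leniency (not in the process shown).
    day_text = raw[space + 1:space + 3].strip().rstrip(",")

    # YearOfObit: last four characters with dashes and whitespace removed.
    year_text = re.sub(r"[-\s]", "", raw[-4:])
    if len(year_text) < 4:
        return None  # treated as missing, like Declare Missing Value
    if year_text.startswith("10"):
        year_text = "19" + year_text[2:]  # the "fix year 10xx" step

    # Rows with a missing or non-numeric part are left unparsed.
    if month is None or not day_text.isdigit() or not year_text.isdigit():
        return None
    try:
        return date(int(year_text), month, int(day_text))
    except ValueError:
        return None  # e.g. day 31 in a 30-day month


if __name__ == "__main__":
    for text in ["February 5 1960", "Feb. 5, 1960", "Oct ?? 2012", "2010"]:
        print(repr(text), "->", parse_obit_date(text))
```

As in the process, anything that cannot be split cleanly into month, day and year comes back unparsed (here as None), so those records can be routed to a separate clean-up branch instead of being silently dropped.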

     

    Scott

     
