Extracting dates from text files

Timo Member Posts: 2 Contributor I
Hi everybody,

My name is Timo, and I would be glad if you could help me with a problem.

I have a lot of text files, mostly press releases from different firms, and I would like to extract the date from each press release.
The problem is that there is no standard date format: sometimes it is "14.08.2008", and sometimes "04 November 05" or "14 November 2005".

I know how to tokenize, generate n-grams, and so on, but I don't know how to extract the date information from these files.
My idea was to work with the "generate n-grams" operator, but I don't know which regex to use.

Maybe you could help me :)

Thank you very much!

Timo

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,497 RM Data Scientist
    Hi,

    It is very hard to work with different timestamp standards. I guess you need to go the complex way: filter out the dates with a separate regex for each format, then loop with Generate Attributes and parse them.

    Something like [0-9][0-9]\.[0-9][0-9]\.[0-9]+ for the first one, or something similar. Maybe Keep Document Parts is the easiest operator for this.

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
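For readers who want to try the "one regex per format" idea outside RapidMiner, here is a minimal Python sketch. It is not part of the process posted in this thread. The names `DATE_PATTERNS` and `find_first_date` are made up for illustration, and the patterns only cover the three styles mentioned in the question ("14.08.2008", "04 November 05", "14 November 2005") with English month names.

```python
import re

# Assumption: English month names, written out in full.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# One hypothetical pattern per date style from the question.
DATE_PATTERNS = {
    # e.g. 14.08.2008
    "dotted": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"),
    # e.g. 04 November 05 or 14 November 2005
    "month_name": re.compile(r"\b\d{1,2} (?:" + MONTHS + r") (?:\d{4}|\d{2})\b"),
}


def find_first_date(text):
    """Return (pattern_name, matched_text) for the earliest match, or None."""
    best = None
    for name, pattern in DATE_PATTERNS.items():
        match = pattern.search(text)
        # Keep whichever pattern matches closest to the start of the text.
        if match and (best is None or match.start() < best[0]):
            best = (match.start(), name, match.group(0))
    return None if best is None else best[1:]
```

The returned pattern name tells you which format was found, so the matched text can then be handed to a parser written for that format.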
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Following on from Martin's note, here's a very quick example of a couple of regex ways to extract the dates and format them.
    It uses Cut Document and Select Subprocess so that you can add more date formats as you write the regular expressions. In this example it only selects the first date it finds in the document (with a press release, that's likely to be at the top).
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.4.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.4.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="subprocess" compatibility="6.4.000" expanded="true" height="76" name="Example Documents" width="90" x="45" y="120">
            <process expanded="true">
              <operator activated="true" class="text:create_document" compatibility="6.4.001" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="390">
                <parameter key="text" value="This is a press release from 12/05/2010&#10;sfsfsd&#10;sdfsdfsd&#10;fsdgsdgsd g sdg sdfg dfgg"/>
              </operator>
              <operator activated="true" class="text:create_document" compatibility="6.4.001" expanded="true" height="60" name="Create Document" width="90" x="45" y="255">
                <parameter key="text" value="This is a press release from Monday 12th May 2010&#10;sfsfsd&#10;sdfsdfsd&#10;fsdgsdgsd g sdg sdfg dfgg"/>
              </operator>
              <operator activated="true" class="text:documents_to_data" compatibility="6.4.001" expanded="true" height="94" name="Documents to Data" width="90" x="179" y="300">
                <parameter key="text_attribute" value="press_release"/>
              </operator>
              <connect from_op="Create Document (2)" from_port="output" to_op="Documents to Data" to_port="documents 2"/>
              <connect from_op="Create Document" from_port="output" to_op="Documents to Data" to_port="documents 1"/>
              <connect from_op="Documents to Data" from_port="example set" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="generate_id" compatibility="6.4.000" expanded="true" height="76" name="Generate ID" width="90" x="179" y="120"/>
          <operator activated="true" class="multiply" compatibility="6.4.000" expanded="true" height="94" name="Multiply" width="90" x="313" y="210"/>
          <operator activated="true" class="text:process_document_from_data" compatibility="6.4.001" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="75">
            <parameter key="create_word_vector" value="false"/>
            <parameter key="keep_text" value="true"/>
            <parameter key="select_attributes_and_weights" value="true"/>
            <list key="specify_weights">
              <parameter key="press_release" value="1.0"/>
            </list>
            <process expanded="true">
              <operator activated="true" class="text:cut_document" compatibility="6.4.001" expanded="true" height="60" name="Cut Document" width="90" x="112" y="30">
                <parameter key="query_type" value="Regular Expression"/>
                <list key="string_machting_queries"/>
                <list key="regular_expression_queries">
                  <parameter key="1" value="((([0-9]|[0-9])|([0-3][0-9]))(/)(([0-9]|[0-9])|([0-9][0-9]))(/)([1-2][0-9][0-9][0-9]))"/>
                  <parameter key="2" value="((([0-9]|[0-9])|([0-9][0-9]))(...)(January|February|March|April|May|June|July|August|September|October|November|December)(.)([1-2][0-9][0-9][0-9]))"/>
                </list>
                <list key="regular_region_queries"/>
                <list key="xpath_queries"/>
                <list key="namespaces"/>
                <list key="index_queries"/>
                <list key="jsonpath_queries"/>
                <process expanded="true">
                  <operator activated="true" class="text:documents_to_data" compatibility="6.4.001" expanded="true" height="76" name="Documents to Data (3)" width="90" x="112" y="30">
                    <parameter key="text_attribute" value="dateformat"/>
                  </operator>
                  <operator activated="true" class="extract_macro" compatibility="6.4.000" expanded="true" height="60" name="Extract Macro" width="90" x="179" y="210">
                    <parameter key="macro" value="query_key"/>
                    <parameter key="macro_type" value="data_value"/>
                    <parameter key="attribute_name" value="query_key"/>
                    <parameter key="example_index" value="1"/>
                    <list key="additional_macros"/>
                    <description align="center" color="transparent" colored="false" width="126">Extract the date format type for the subprocess selection</description>
                  </operator>
                  <operator activated="true" class="text_to_nominal" compatibility="6.4.000" expanded="true" height="76" name="Text to Nominal" width="90" x="313" y="210">
                    <parameter key="attribute_filter_type" value="single"/>
                    <parameter key="attribute" value="dateformat"/>
                  </operator>
                  <operator activated="false" class="handle_exception" compatibility="6.4.000" expanded="true" height="60" name="Handle Exception" width="90" x="514" y="390">
                    <process expanded="true">
                      <portSpacing port="source_in 1" spacing="0"/>
                      <portSpacing port="sink_out 1" spacing="0"/>
                    </process>
                    <process expanded="true">
                      <portSpacing port="source_in 1" spacing="0"/>
                      <portSpacing port="sink_out 1" spacing="0"/>
                    </process>
                    <description align="center" color="transparent" colored="false" width="126">You should really use 'Handle Exception' around the 'Select Subprocess' as there are bound to be some extracted dates that don't parse. Left disabled for illustration.</description>
                  </operator>
                  <operator activated="true" class="select_subprocess" compatibility="6.4.000" expanded="true" height="76" name="Select Subprocess" width="90" x="514" y="210">
                    <parameter key="select_which" value="%{query_key}"/>
                    <process expanded="true">
                      <operator activated="true" class="nominal_to_date" compatibility="6.4.000" expanded="true" height="76" name="Nominal to Date" width="90" x="112" y="30">
                        <parameter key="attribute_name" value="dateformat"/>
                        <parameter key="date_format" value="dd/MM/yyyy"/>
                      </operator>
                      <connect from_port="input 1" to_op="Nominal to Date" to_port="example set input"/>
                      <connect from_op="Nominal to Date" from_port="example set output" to_port="output 1"/>
                      <portSpacing port="source_input 1" spacing="0"/>
                      <portSpacing port="source_input 2" spacing="0"/>
                      <portSpacing port="sink_output 1" spacing="0"/>
                      <portSpacing port="sink_output 2" spacing="0"/>
                    </process>
                    <process expanded="true">
                      <operator activated="true" class="nominal_to_date" compatibility="6.4.000" expanded="true" height="76" name="Nominal to Date (2)" width="90" x="179" y="30">
                        <parameter key="attribute_name" value="dateformat"/>
                        <parameter key="date_format" value="dd'th' MMMMM yyyy"/>
                      </operator>
                      <connect from_port="input 1" to_op="Nominal to Date (2)" to_port="example set input"/>
                      <connect from_op="Nominal to Date (2)" from_port="example set output" to_port="output 1"/>
                      <portSpacing port="source_input 1" spacing="0"/>
                      <portSpacing port="source_input 2" spacing="0"/>
                      <portSpacing port="sink_output 1" spacing="0"/>
                      <portSpacing port="sink_output 2" spacing="0"/>
                    </process>
                    <description align="center" color="transparent" colored="false" width="126">Create a subprocess to parse each date format.</description>
                  </operator>
                  <operator activated="true" class="text:data_to_documents" compatibility="6.4.001" expanded="true" height="60" name="Data to Documents" width="90" x="447" y="75">
                    <list key="specify_weights"/>
                  </operator>
                  <operator activated="true" class="text:combine_documents" compatibility="6.4.001" expanded="true" height="76" name="Combine Documents" width="90" x="648" y="75"/>
                  <connect from_port="segment" to_op="Documents to Data (3)" to_port="documents 1"/>
                  <connect from_op="Documents to Data (3)" from_port="example set" to_op="Extract Macro" to_port="example set"/>
                  <connect from_op="Extract Macro" from_port="example set" to_op="Text to Nominal" to_port="example set input"/>
                  <connect from_op="Text to Nominal" from_port="example set output" to_op="Select Subprocess" to_port="input 1"/>
                  <connect from_op="Select Subprocess" from_port="output 1" to_op="Data to Documents" to_port="example set"/>
                  <connect from_op="Data to Documents" from_port="documents" to_op="Combine Documents" to_port="documents 1"/>
                  <connect from_op="Combine Documents" from_port="document" to_port="document 1"/>
                  <portSpacing port="source_segment" spacing="0"/>
                  <portSpacing port="sink_document 1" spacing="0"/>
                  <portSpacing port="sink_document 2" spacing="0"/>
                </process>
                <description align="center" color="transparent" colored="false" width="126">The following formats are supported: &lt;br/&gt;1 : dd/MM/yyyy&lt;br/&gt;2 : dd'th' MMMMM yyyy</description>
              </operator>
              <operator activated="true" class="text:combine_documents" compatibility="6.4.001" expanded="true" height="76" name="Combine Documents (2)" width="90" x="313" y="30"/>
              <connect from_port="document" to_op="Cut Document" to_port="document"/>
              <connect from_op="Cut Document" from_port="documents" to_op="Combine Documents (2)" to_port="documents 1"/>
              <connect from_op="Combine Documents (2)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
            <description align="center" color="transparent" colored="false" width="126">Magic happens here. :)</description>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="6.4.000" expanded="true" height="76" name="Select Attributes" width="90" x="648" y="120">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="text"/>
            <parameter key="invert_selection" value="true"/>
            <parameter key="include_special_attributes" value="true"/>
          </operator>
          <operator activated="true" class="join" compatibility="6.4.000" expanded="true" height="76" name="Join" width="90" x="648" y="255">
            <list key="key_attributes"/>
          </operator>
          <connect from_op="Example Documents" from_port="out 1" to_op="Generate ID" to_port="example set input"/>
          <connect from_op="Generate ID" from_port="example set output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_op="Join" to_port="left"/>
          <connect from_op="Join" from_port="join" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
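The control flow of the process above can be paraphrased in a few lines of Python. This is a rough sketch under assumptions, not a translation of the RapidMiner operators: `FORMATS` and `extract_date` are hypothetical names, the regular expressions are simplified versions of the two queries, and `strptime` stands in for Nominal to Date. Like the process, format "2" only handles the "th" suffix, and the month names assume an English (default C) locale.

```python
import re
from datetime import date, datetime

# Each entry pairs a query key with a regex and the format used to parse it,
# similar to a Cut Document query plus its Select Subprocess branch.
FORMATS = [
    # 1 : dd/MM/yyyy, e.g. 12/05/2010
    ("1", re.compile(r"\d{1,2}/\d{1,2}/[12]\d{3}"), "%d/%m/%Y"),
    # 2 : dd'th' MMMMM yyyy, e.g. 12th May 2010
    ("2",
     re.compile(r"\d{1,2}th (?:January|February|March|April|May|June|July|"
                r"August|September|October|November|December) [12]\d{3}"),
     "%dth %B %Y"),
]


def extract_date(text):
    """Return (query_key, date) for the first format that matches and parses."""
    for key, pattern, fmt in FORMATS:
        match = pattern.search(text)
        if match is None:
            continue
        try:
            return key, datetime.strptime(match.group(0), fmt).date()
        except ValueError:
            # The text looked like a date but is not a valid one
            # (for example 31/02/2010); this is the case the
            # Handle Exception note in the process warns about.
            continue
    return None
```

Adding another format means adding one more entry to the list, which corresponds to adding a regex query and a matching subprocess in the process.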
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,497 RM Data Scientist
    Wow, nice one!

    I got a new building block! Thanks!
  • Timo Member Posts: 2 Contributor I
    Hi Guys,

    Thank you very, very much for your help!

    JEdward, your process is awesome, I couldn't have done this by myself :)
    It works perfectly!

    Timo
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Thanks very much guys.  *blush*