"text-processing: extract dates from documents"

gero_schwenk Member Posts: 6 Contributor II
edited May 2019 in Help
Hello everyone!
I've got a question regarding the extraction of dates from documents and would be very grateful for any help... :)

My problem is as follows: I want to crawl and process web content for subsequent classification. Among other things, I would like to organize the documents by date in order to look for trends or link them to external events. To do this, I need to extract dates from them (that is, from the HTML document or the document's content itself).

Can anybody give me a hint on how to achieve this? I've seen that there is an "Extract Information" operator, but I don't know how to use it to achieve my goal... :( (I can't let it match a list of possible dates, which was my first idea...)

Any help is greatly appreciated!
Cheers,
Gero

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Gero,
    I think a combination of the "Cut Document" and "Extract Information" operators will help you. Unfortunately, it is a little tricky to combine them to match a certain document structure. If the date is the content of a div tag, try using XPath expressions specifying this tag.
    I will give a webinar on this topic on Wednesday, where I will show this in practice. More specifically, I will show how to extract posts from this forum, along with the poster and the date. There are still open slots for participants.

    Greetings,
    Sebastian
  • gero_schwenk Member Posts: 6 Contributor II
    Hi Sebastian!
    Thanks for the hint and your invitation! Unfortunately, I'm travelling on Wednesday, so I will miss it... Just to check whether I get the idea: you suggest that I

    1) look, for instance, for passages which start with a number between 1 and 31 and end with a 10 using regular regions in the "cut document" operator and

    2) extract those passages using "extract information" and save them as an attribute (the date) and finally

    3) join the table containing the new date attribute with the original term-document matrix by document ID.

    Am I right with this - at least in principle?
    Many thanks again and cheers:
    Gero
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    in principle yes, but you should really take a look at XPath, for example on Wikipedia.

    Greetings,
    Sebastian
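To illustrate the XPath suggestion outside RapidMiner, here is a minimal Python sketch using only the standard library. The page fragment, the `class="date"` attribute, and the function name `extract_dates` are assumptions made up for this example; they are not taken from the forum's real markup or from any RapidMiner operator. Note that `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed input, so real crawled HTML would probably need a more lenient parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (an assumption for illustration).
PAGE = """
<html>
  <body>
    <div class="post">
      <div class="date">12.03.2010</div>
      <div class="text">Some forum post content.</div>
    </div>
    <div class="post">
      <div class="date">15.03.2010</div>
      <div class="text">Another post.</div>
    </div>
  </body>
</html>
"""


def extract_dates(page):
    """Return the text of every <div class="date"> element in the page."""
    root = ET.fromstring(page.strip())
    # ".//div[@class='date']" selects all div elements, at any depth,
    # whose class attribute is exactly "date".
    matches = root.findall(".//div[@class='date']")
    return [div.text.strip() for div in matches if div.text]


dates = extract_dates(PAGE)
print(dates)
```

The point of addressing the tag directly is that the date is found by its position in the document structure, so nothing has to be guessed from the running text.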
  • gero_schwenk Member Posts: 6 Contributor II
    hi sebastian!
    thanks for the hint! I'll get into it...

    cheers,
    gero