"text-processing: extract dates from documents"

gero_schwenk · August 2010

Hello everyone!
I've got a question about extracting dates from documents and would be very grateful for help...

My problem is as follows: I want to crawl and process web content for subsequent classification. Among other things, I would like to organize the documents by date in order to look for trends or link them to external events. To do this, I need to extract dates from them (that is, from the HTML document or from the document's content itself).

Can anybody give me a hint on how to achieve this? I've seen that there is an "Extract Information" operator, but I don't know how to use it to achieve my goal...

(I can't let it match a list of possible dates, which was my first idea...)

Any help is greatly appreciated!
Cheers,
Gero

land · August 2010

Hi Gero,
I think a combination of the "Cut Document" and "Extract Information" operators will help you. Unfortunately, it is a little bit tricky to combine these to match a certain document structure. If the date is the content of a div tag, try to use XPath expressions specifying this tag.
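
For readers who want to see the XPath idea outside of RapidMiner, here is a minimal Python sketch. It is not how the operators work internally. The page fragment, the class name `post-date`, and the helper `extract_date_text` are all hypothetical, and the standard-library parser used here only accepts well-formed markup, so real-world HTML would likely need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment. The class name "post-date"
# is an assumption made for this illustration.
PAGE = """
<html>
  <body>
    <div class="post">
      <div class="post-date">August 2010</div>
      <div class="post-body">Hello everyone!</div>
    </div>
  </body>
</html>
"""


def extract_date_text(markup, class_name="post-date"):
    """Return the text of the first <div> with the given class, or None."""
    root = ET.fromstring(markup.strip())
    # ElementTree supports a limited XPath subset, including
    # attribute predicates such as [@class='...'].
    node = root.find(f".//div[@class='{class_name}']")
    if node is None or node.text is None:
        return None
    return node.text.strip()


if __name__ == "__main__":
    print(extract_date_text(PAGE))
```

The expression `.//div[@class='post-date']` selects the tag by its attribute, which is usually more robust than guessing what the date text itself looks like.
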
I will give a webinar on this topic on Wednesday, where I will show this in practice. More specifically, I will show how to extract posts from this forum along with the poster and the date. There are still open slots for participants.

Greetings,
Sebastian

gero_schwenk · August 2010

Hi Sebastian!
Thanks for the hint and your invitation! Unfortunately, I'm travelling on Wednesday, so I will miss it... Just to check whether I get the idea: you suggest that I

1) look, for instance, for passages which start with a number between 1 and 31 and end with a 10 using regular regions in the "cut document" operator and

2) extract those passages using "extract information" and save them as an attribute (the date) and finally

3) join the table with the new date-attribute with the original term-document-matrix by document ID.
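
Outside of RapidMiner, the three steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: dates are assumed to be written as day.month.year (for example `17.08.2010` or `17.08.10`), two-digit years are assumed to mean 20xx, and the names `first_date` and `attach_dates` as well as the sample data are hypothetical.

```python
import re
from datetime import date

# Assumed format: day.month.year with a 4- or 2-digit year,
# e.g. "17.08.2010" or "17.08.10".
DATE_RE = re.compile(
    r"\b(0?[1-9]|[12][0-9]|3[01])\.(0?[1-9]|1[0-2])\.(\d{4}|\d{2})\b"
)


def first_date(text):
    """Steps 1 and 2: return the first valid date found in text, or None."""
    for match in DATE_RE.finditer(text):
        day, month, year = (int(group) for group in match.groups())
        if year < 100:
            year += 2000  # assumption: two-digit years are 20xx
        try:
            return date(year, month, day)
        except ValueError:
            continue  # matched the pattern but is not a real date, e.g. 31.02.2010
    return None


def attach_dates(documents, term_rows):
    """Step 3: join the extracted date onto each term row by document ID."""
    dates = {doc_id: first_date(text) for doc_id, text in documents.items()}
    return {
        doc_id: {**row, "date": dates.get(doc_id)}
        for doc_id, row in term_rows.items()
    }


if __name__ == "__main__":
    documents = {1: "Posted on 17.08.2010 by Gero", 2: "no date here"}
    term_rows = {1: {"trend": 2}, 2: {"trend": 0}}
    print(attach_dates(documents, term_rows))
```

Validating each match with `date(...)` filters out strings that merely look like dates, which a pure pattern match cannot do.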

Am I right with this - at least in principle?
Many thanks again and cheers:
Gero

land · August 2010

Hi,
In principle yes, but you should really take a look at XPath, for example on Wikipedia.

Greetings,
Sebastian

gero_schwenk · August 2010

hi sebastian!
thanks for the hint! I'll get into it...

cheers,
gero
