Extracting date from textfiles
Hi everybody,
My name is Timo, and I would be glad if you could help me with my problem:
I have a lot of text files, mostly press releases from different firms, and I would like to extract the date from each of these press releases.
The problem is that there is no standard format for the date, e.g. sometimes it's "14.08.2008" and sometimes "04 November 05" or "14 November 2005".
I know how to tokenize, generate n-grams, and so on, but I don't know how to extract the date information from these files.
My idea was to work with the "generate n-grams" operator, but I don't know which regex I would have to insert.
Maybe you could help me.
Thank you very much!
Timo
Answers
It is very hard to work with different timestamp standards. I guess you need to go the complex way: filter out the dates with a separate regex for each format, then loop with Generate Attributes and parse them.
Something like [0-9][0-9]\.[0-9][0-9]\.[0-9]+ for the first format, for example. Maybe Keep Document Parts is the easiest operator to do this.
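To make the "one regex per format" idea concrete outside of RapidMiner, here is a minimal Python sketch. It only covers the three formats quoted in the question; the names NUMERIC_DATE, TEXT_DATE and find_date_strings are made up for this illustration, and English month names are an assumption.

```python
import re

# Numeric dates such as "14.08.2008" (day.month.year, year with 2 or 4 digits).
NUMERIC_DATE = re.compile(r"\b\d{1,2}\.\d{1,2}\.(?:\d{4}|\d{2})\b")

# Written-out dates such as "04 November 05" or "14 November 2005".
# Assumption: month names are spelled out in English.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
TEXT_DATE = re.compile(r"\b\d{1,2} (?:" + MONTHS + r") (?:\d{4}|\d{2})\b")


def find_date_strings(text):
    """Return all substrings matching either pattern, in order of appearance."""
    matches = [m for p in (NUMERIC_DATE, TEXT_DATE) for m in p.finditer(text)]
    matches.sort(key=lambda m: m.start())
    return [m.group(0) for m in matches]


sample = "Released 14.08.2008, updated 04 November 05 and 14 November 2005."
print(find_date_strings(sample))
```

The same expressions should carry over to RapidMiner operators that accept Java-style regular expressions, but that is not tested here.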
Cheers,
Martin
Dortmund, Germany
Here's a very quick example of a couple of RegEx ways to extract the dates & format them.
It uses Cut Document & Select Subprocess to allow you to add more date formats as you write the RegEx expressions. In this example it only selects the first date it finds in the document (as with a press release that's likely to be at the top).
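The RapidMiner process itself is not reproduced in this thread. As a rough analogue only, the following Python sketch shows the same idea: one entry per date format, easy to extend, returning only the first date found in the document in a single output format. FORMATS, full_year and first_date are hypothetical names, and mapping two-digit years to 2000-2099 is an assumption.

```python
import re
from datetime import date

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]
MONTHS = {name: i for i, name in enumerate(MONTH_NAMES, start=1)}

# One (pattern, converter) pair per date format; append more pairs as needed.
# Each converter returns (day, month, year_string).
FORMATS = [
    (re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4}|\d{2})\b"),
     lambda m: (int(m.group(1)), int(m.group(2)), m.group(3))),
    (re.compile(r"\b(\d{1,2}) (" + "|".join(MONTH_NAMES) + r") (\d{4}|\d{2})\b"),
     lambda m: (int(m.group(1)), MONTHS[m.group(2)], m.group(3))),
]


def full_year(year_string):
    # Assumption: two-digit years belong to 2000-2099.
    return int(year_string) if len(year_string) == 4 else 2000 + int(year_string)


def first_date(text):
    """Return the date that appears earliest in text as YYYY-MM-DD, or None."""
    best = None  # (start position, date object)
    for pattern, convert in FORMATS:
        m = pattern.search(text)
        if m and (best is None or m.start() < best[0]):
            day, month, year = convert(m)
            best = (m.start(), date(full_year(year), month, day))
    return best[1].isoformat() if best else None


print(first_date("Dortmund, 04 November 05. Updated 14.08.2008"))
```

Note that date() raises ValueError for impossible values such as "99.99.2008", so real input would need a guard around that call.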
I got a new building block! Thanks!
Dortmund, Germany
Thank you very, very much for your help!
JEdward, your process is awesome, I couldn't have done this by myself.
It works perfectly!
Timo