Counting dates within text files

philippw Member Posts: 3 Contributor I
edited November 2018 in Help
Hi everybody,

My name is Philipp and I am new to RapidMiner. I am sure you can help me with this problem:

I have a lot of text files (press releases) from different companies. For every company there is exactly one PDF file containing a specific number of press releases. For every company, I would like to measure the frequency of its press releases, i.e. see on which dates the press releases were announced.

Therefore, I would like to count the dates within the press releases' PDF files. Fortunately, since I have been using a specific database, every press release follows the same structure (including the date). The standard date format is always: (1-31) (January-December) (2003-2015), e.g. "1 January 2003"; "2 January 2003"...
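To make the date format concrete, here is a minimal Python sketch (outside RapidMiner) of a regular expression for it. The names `DATE_RE` and `find_dates` and the sample text are made up for illustration, and the sketch assumes the text has already been extracted from the PDF and that day, month and year are separated by single spaces.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Day 1-31, a full English month name, and a year from 2003 to 2015.
DATE_RE = re.compile(
    r"\b([1-9]|[12][0-9]|3[01]) (" + MONTHS + r") (200[3-9]|201[0-5])\b"
)

def find_dates(text):
    """Return every date string found in text, in order of appearance."""
    return [m.group(0) for m in DATE_RE.finditer(text)]

# Invented sample text, only for illustration.
sample = "1 January 2003 Acme announces results. 2 January 2003 Acme hires CEO."
print(find_dates(sample))
```

The pattern only checks the shape of the date, so an impossible date such as "31 February 2003" would still match.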

The desired outcome is the following: an Excel file with one example/row for every company's PDF file and 4758 attributes/columns (one attribute for every possible date in my time period 2003-2015 (13x366)). The actual cells should contain the summed number of press releases, i.e. for each company, how many press releases were announced on each possible date.
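As a rough sketch of that table in Python (again outside RapidMiner): the function names, the `texts` dictionary and the company names are hypothetical, and the output is CSV, which Excel can open. Note that enumerating the real calendar for 2003-2015 gives 4748 date columns (three leap years), slightly fewer than the 13x366 upper bound.

```python
import csv
import io
import re
from collections import Counter
from datetime import date, timedelta

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
DATE_RE = re.compile(
    r"\b(?:[1-9]|[12][0-9]|3[01]) (?:" + "|".join(MONTHS)
    + r") (?:200[3-9]|201[0-5])\b"
)

def all_dates(first=date(2003, 1, 1), last=date(2015, 12, 31)):
    """List every calendar date in the period, formatted like '1 January 2003'."""
    labels = []
    day = first
    while day <= last:
        labels.append(f"{day.day} {MONTHS[day.month - 1]} {day.year}")
        day += timedelta(days=1)
    return labels

def count_table(texts):
    """Build one row per company: company name followed by a count per date."""
    columns = all_dates()
    rows = []
    for company, text in texts.items():
        counts = Counter(DATE_RE.findall(text))
        rows.append([company] + [counts[label] for label in columns])
    return ["company"] + columns, rows

# Hypothetical input: company name -> text already extracted from its PDF.
texts = {
    "Acme": "1 January 2003 First release. 1 January 2003 Second release.",
    "Globex": "29 February 2004 Leap day release.",
}
header, rows = count_table(texts)

# Write the table as CSV; a string buffer stands in for a real file here.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
```

Matches that are not real calendar dates have no column and are silently dropped in this sketch.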

So far I have tried the following: creating a wordlist with every possible date, then tokenizing the press releases and generating n-grams of length 3. Unfortunately, although tokenization and n-gram generation run without problems, for some reason this doesn't work. Maybe there is some kind of keyword match that can be applied to the text files (which I haven't found so far)?
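For reference, this is roughly what the wordlist and 3-gram approach amounts to, sketched in Python. The function names and sample text are invented, and the two token patterns are only an assumption about what might differ between a working and a non-working setup: if the tokenizer keeps letters only, the day and the year are lost before the n-grams are built.

```python
import re
from collections import Counter

def trigrams(tokens):
    """Join every run of three consecutive tokens with a space."""
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def count_wordlist_hits(text, wordlist, token_pattern):
    """Tokenize with token_pattern, build 3-grams, count those in wordlist."""
    tokens = re.findall(token_pattern, text)
    return Counter(g for g in trigrams(tokens) if g in wordlist)

# Tiny invented wordlist and text, only for illustration.
wordlist = {"1 January 2003", "2 January 2003"}
text = "1 January 2003 Acme reports. 2 January 2003 Acme expands."

# Tokens made of letters or digits keep the day and year, so dates match.
with_digits = count_wordlist_hits(text, wordlist, r"[A-Za-z0-9]+")

# Tokens made of letters only drop the day and year, so nothing matches.
letters_only = count_wordlist_hits(text, wordlist, r"[A-Za-z]+")
```

The separator used when joining tokens also has to match the one used in the wordlist, otherwise no 3-gram will ever equal a wordlist entry.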

Attached you'll find what the desired outcome and one press release's header look like.



Thank you very much in advance!
