"performance - data diet - compress information"
I am analyzing 30K documents and creating a term frequency matrix with around 40K attributes, which results in 1.2 billion data points. That seems to be too much for my 2 GHz dual-core MacBook with 4 GB of RAM. Even when the results are computed, it takes forever to load the result perspective. Is there a process that can summarize redundant data?

As far as the creation of the term frequency matrix is concerned, I don't think it would make sense to change the pruning settings or anything like that, because crucial information would be lost. But a lot of documents refer to the same date, for example. Is it somehow possible to summarize the data for one date in a single line? Sorry for my English, it is a little tricky to explain: right now I have 30K rows, but they all fall within one year, so only 365 rows would actually be necessary.

On the other hand, there is of course the problem of tokens that have a similar meaning but appear in different forms in the TF matrix. I have tried stemming, of course, but I am not very happy with the result: the stemming operators mangle most of the words. Are there any stemming dictionaries around that transform, for example, verbs into nouns? That would preserve the logical relationships between words while unifying them and reducing the data set.
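To make the "one line per date" idea concrete, here is a minimal sketch in plain Python (not a RapidMiner process). It assumes each document is available as a date plus a list of tokens, and that summing raw term counts per date is acceptable for the analysis; both are assumptions, and weighted values such as TF-IDF would not add up this way. The function name `aggregate_by_date` and the toy corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def aggregate_by_date(docs):
    """Sum term counts of all documents that share a date.

    docs: iterable of (date_string, list_of_tokens) pairs.
    Returns a dict mapping each date to a Counter of term counts.
    Only terms that actually occur are stored (sparse representation).
    """
    per_date = defaultdict(Counter)
    for date, tokens in docs:
        per_date[date].update(tokens)
    return dict(per_date)


# Hypothetical toy corpus: two documents on one day, one on the next.
docs = [
    ("2010-01-01", ["market", "falls", "market"]),
    ("2010-01-01", ["market", "recovers"]),
    ("2010-01-02", ["rain", "falls"]),
]

daily = aggregate_by_date(docs)

# Size of a dense matrix before and after aggregation, using the
# figures from the post (30K documents, 40K attributes, 365 days).
dense_cells = 30_000 * 40_000   # one row per document
daily_cells = 365 * 40_000      # one row per date

# Number of values actually stored in the sparse toy result.
stored = sum(len(counts) for counts in daily.values())
```

With the numbers from the post, one row per date cuts the dense matrix from 1.2 billion cells to 14.6 million, and a sparse representation that keeps only non-zero counts would usually store far fewer values than that.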
any help is appreciated! have a good night, or good morning (depending on when you read this)