tokenizing by sentences and learning algorithms

lavramu · October 2013

Hi,

Thanks for the help so far. I have another question, and I am sorry to bother you again.

Most of the tutorials and examples I have seen so far for text classification with machine learning in RapidMiner use word vectors, tokenizing the text files into words before running any learning algorithm. My problem needs sentences rather than words. For this I use Tokenize, select "linguistic sentences", and then try to run the learning algorithm. So the text files contain sentences and are tokenized into sentences, not words.

Will this work in the same way? How does it differ from tokenizing into words? I know that Perl's Naive Bayes allows this.

Second question: what is the minimum amount of data needed for an algorithm to be able to learn?

Third question (more important): is there a difference between these two?
1) read in the text files (with the appropriate class) --> tokenize by sentence --> learning algorithm
2) read in the text files --> tokenize --> write to disk so that each file holds one sentence --> read each of these files --> learning algorithm

(Basically, I am trying to understand whether Tokenize ensures that the learning algorithm takes in the text sentence by sentence here.)
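To make the two options concrete, here is a rough sketch in plain Python (not RapidMiner) of the two data shapes I have in mind. It assumes that option 1 keeps one example per file, with each sentence acting as a feature, and that option 2 produces one example per sentence. I am not sure this is what RapidMiner actually does. The regex splitter, the function names and the toy documents are all made up for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def per_document_examples(docs):
    """Option 1 as I picture it: one example per file, sentences are the features."""
    return [(Counter(split_sentences(text)), label) for text, label in docs]


def per_sentence_examples(docs):
    """Option 2 as I picture it: one example per sentence, labelled with its file's class."""
    return [(s, label) for text, label in docs for s in split_sentences(text)]


# Toy labelled documents, invented for this sketch.
docs = [
    ("I liked it. Would buy again.", "pos"),
    ("It broke. I want a refund. Never again!", "neg"),
]

by_document = per_document_examples(docs)
by_sentence = per_sentence_examples(docs)

print(len(by_document))  # 2 examples, one per file
print(len(by_sentence))  # 5 examples, one per sentence
```

If that picture is right, the learner sees 2 examples in the first case and 5 in the second, so the two would not be equivalent.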

I didn't want to start new threads, so I have put everything here. Please help, thanks!
