tokenizing by sentences and learning algorithms
Hi,
Thanks for the help so far. I have another question, and I am sorry to bother you this way.
Most of the tutorials and problems I have seen so far for text classification through machine learning in RapidMiner tokenize the text files into words and build word vectors before running any learning algorithm. My problem, however, needs sentences rather than words. For this I use the Tokenize operator and select linguistic sentences, and then I try to run the learning algorithm. So the text files contain sentences and are tokenized into sentences, not words.
Will this work similarly? How is this different? I know that Perl's naive Bayes allows this.
Second question: what is the minimum amount of data needed for a learning algorithm to learn anything useful?
Third question (more important): is there a difference between these two?
1) read in the text files (with the appropriate class) --> tokenize by sentence --> learning algo
2) read in the text files --> tokenize --> write to disk so that each file has one sentence --> read each of these files --> learning algo
(Basically I am trying to understand whether tokenizing ensures that the learning algorithm takes the data in sentence by sentence here.)
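To illustrate what I mean, here is a rough Python sketch of the two pipelines, using scikit-learn's CountVectorizer and MultinomialNB as a stand-in for RapidMiner's operators (the documents, labels, and the naive period-based sentence splitter are all made up for illustration; RapidMiner's linguistic-sentence tokenizer is smarter than this):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: two documents with hypothetical class labels
docs = ["The battery is great. The screen is dim.",
        "Shipping was slow. The battery died fast."]
labels = [1, 0]

def sentence_tokenize(text):
    # Naive split on '.'; a real linguistic sentence tokenizer handles more cases
    return [s.strip() for s in text.split('.') if s.strip()]

# Pipeline 1: each document stays one example; the "tokens" are whole sentences
vec1 = CountVectorizer(tokenizer=sentence_tokenize, lowercase=False,
                       token_pattern=None)
X1 = vec1.fit_transform(docs)           # still 2 examples; features are sentences
clf1 = MultinomialNB().fit(X1, labels)

# Pipeline 2: each sentence becomes its own example carrying the document's label
sent_examples, sent_labels = [], []
for text, y in zip(docs, labels):
    for s in sentence_tokenize(text):
        sent_examples.append(s)
        sent_labels.append(y)
vec2 = CountVectorizer()                # ordinary word tokens within each sentence
X2 = vec2.fit_transform(sent_examples)  # 4 examples now, not 2
clf2 = MultinomialNB().fit(X2, sent_labels)

print(X1.shape[0], X2.shape[0])  # the two pipelines see different example counts
```

If this sketch is right, the two pipelines are genuinely different: in (1) the learner still sees one example per file and the sentences only act as features, while in (2) each sentence is a separate training example.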
I didn't want to start up new threads, so I put this all down here. Please help, thanks!