"[Text Mining] How to feed an SGML-format file into a dictionary?"
<< Big Picture >>
(1) Documents ==> (2) Dictionary Creation ==> (3) Text Representation (based on either the counts of the most frequently occurring words in the documents, or Boolean indicators of whether specific topic words appear in the documents) ==> (4) Model Induction (e.g. rule-based induction) ==> (5) Document Classification Rules
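As a rough illustration of steps (2) and (3), the sketch below builds a small dictionary from the most frequent words and then represents one document either by counts or by Boolean presence. The toy documents, the function names (`tokenize`, `build_dictionary`, `to_count_vector`, `to_boolean_vector`) and the simple lowercase-letters tokenizer are assumptions made for this example, not part of the original pipeline.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(docs, size):
    """Step (2): keep the `size` most frequent words across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:size]]


def to_count_vector(doc, dictionary):
    """Step (3), frequency variant: how often each dictionary word occurs in the document."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in dictionary]


def to_boolean_vector(doc, dictionary):
    """Step (3), Boolean variant: whether each dictionary word occurs at all."""
    present = set(tokenize(doc))
    return [word in present for word in dictionary]


# Toy documents invented for illustration only.
docs = [
    "wheat prices rise as wheat exports grow",
    "oil prices fall",
    "wheat and corn exports",
]
dictionary = build_dictionary(docs, 4)
count_vec = to_count_vector(docs[0], dictionary)
bool_vec = to_boolean_vector(docs[1], dictionary)
```

The resulting vectors are what a model-induction step (4) would take as input, one row per document.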
The input is the Reuters-21578 Text Categorization Collection Data Set from the UCI Machine Learning Repository; its data files are in SGML format (.sgm files).
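One possible way to get from the SGML files to a dictionary is sketched below. It assumes the usual layout of the collection, where each article sits inside a `<REUTERS ...>` element with `<TOPICS><D>...</D></TOPICS>` and a `<BODY>` inside `<TEXT>`, and it uses regular expressions rather than an XML parser because the files are not well-formed XML. The `SAMPLE` string is made up to mimic that layout, and `parse_sgml` and `count_words` are hypothetical helper names.

```python
import re
from collections import Counter

# Patterns for the tags this sketch relies on (assumed layout, see the lead-in).
DOC_RE = re.compile(r"<REUTERS[^>]*>(.*?)</REUTERS>", re.DOTALL)
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
D_RE = re.compile(r"<D>(.*?)</D>")
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def parse_sgml(text):
    """Split SGML text into one dict per article, holding its topics and body."""
    docs = []
    for match in DOC_RE.finditer(text):
        chunk = match.group(1)
        topics_match = TOPICS_RE.search(chunk)
        topics = D_RE.findall(topics_match.group(1)) if topics_match else []
        body_match = BODY_RE.search(chunk)
        body = body_match.group(1) if body_match else ""  # some articles have no body
        docs.append({"topics": topics, "body": body})
    return docs


def count_words(docs):
    """Count lowercase word occurrences over all article bodies."""
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(r"[a-z]+", doc["body"].lower()))
    return counts


# Invented sample that imitates the tag structure; not taken from the data set.
SAMPLE = """<REUTERS TOPICS="YES" NEWID="1">
<TOPICS><D>grain</D><D>wheat</D></TOPICS>
<TEXT><TITLE>WHEAT EXPORTS RISE</TITLE>
<BODY>Wheat exports rose. Wheat prices fell.</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="NO" NEWID="2">
<TOPICS></TOPICS>
<TEXT><TITLE>NO BODY HERE</TITLE></TEXT>
</REUTERS>"""

docs = parse_sgml(SAMPLE)
word_counts = count_words(docs)

# For a real file, something like the following should work (the file name is an
# assumption; latin-1 is used so that stray non-UTF-8 bytes do not raise errors):
# with open("reut2-000.sgm", encoding="latin-1") as f:
#     docs = parse_sgml(f.read())
```

Keeping the topics next to each body means the same parsed records can later supply both the dictionary words and the class labels for rule induction.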