Problem with special preprocessing of texts
Removing links/URLs and hashtags. Tweets may
contain URLs, hashtags and words starting with the ‘@’
character. We removed these entities since we found they had no
significance in our scoring approach.
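A minimal sketch of this removal step, assuming simple regular expressions are enough. The patterns and the name `strip_entities` are illustrative assumptions; the text does not specify how the entities were matched.

```python
import re

# Assumed patterns; the source does not say how entities were detected.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
TAG_RE = re.compile(r"[@#]\w+")  # @-mentions and hashtags


def strip_entities(tweet: str) -> str:
    """Remove URLs, hashtags and @-mentions, then collapse whitespace."""
    tweet = URL_RE.sub(" ", tweet)   # URLs first, so '#' inside a URL goes with it
    tweet = TAG_RE.sub(" ", tweet)
    return " ".join(tweet.split())


print(strip_entities("@bob check http://t.co/abc #wow so nice"))
```

URLs are removed before hashtags so that a fragment such as `#section` inside a link is not left behind as a partial URL.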
Replacing contractions. Contractions such as
‘didn’t’, ‘ain’t’ and ‘couldn’t’ are common in tweets.
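Assuming the contractions are replaced by their expanded forms (the text does not give the mapping), the step might look like the sketch below. The `CONTRACTIONS` table is a hypothetical, deliberately small example, and the expansion chosen for ‘ain’t’ is only one of several possible readings.

```python
import re

# Hypothetical mapping for illustration; a real list would be much longer.
CONTRACTIONS = {
    "didn't": "did not",
    "couldn't": "could not",
    "ain't": "is not",  # ambiguous: could also mean "am not", "are not", "have not"
    "can't": "cannot",
    "won't": "will not",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace known contractions with their expanded form."""
    # Tweets often contain the typographic apostrophe; normalise it first.
    text = text.replace("\u2019", "'")
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("I didn\u2019t go"))
```

This sketch emits the lower-case expansion regardless of the original capitalisation, which is usually acceptable when the text is lower-cased later in the pipeline.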
Elongation replacer. People often use elongation such as
‘loooooooove’ to emphasise words. Elongation can occur
at the beginning (‘ooooooh’), at the end (‘toooooo’) or in
the middle of a word.
Example: ooooooooh what a coooooool breeze => ooh what a cool breeze
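One simple way to approximate this step is to collapse any run of three or more identical letters down to two. This is an assumed, dictionary-free rule (the function name `shorten_elongation` is hypothetical), not necessarily the replacer used here; it reproduces the example above but turns ‘loooooooove’ into ‘loove’, so a dictionary check would be needed to recover ‘love’.

```python
import re

# A letter followed by two or more repeats of itself (digits are left alone).
_RUN_RE = re.compile(r"([A-Za-z])\1{2,}")


def shorten_elongation(text: str) -> str:
    """Collapse runs of 3+ identical letters to exactly two."""
    return _RUN_RE.sub(r"\1\1", text)


print(shorten_elongation("ooooooooh what a coooooool breeze"))
```

Legitimate double letters such as the ‘ee’ in ‘breeze’ are untouched because only runs of three or more are rewritten.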
WordNet lemmatizing. The WordNet lemmatizer is used to
obtain a valid, meaningful root word. Each word (except
slang and abbreviations) is lemmatized after tokenization.
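The skip-then-lemmatize logic can be sketched as follows. The lemmatizer is passed in as a function so the sketch stays independent of any particular library; the `SLANG` set and the name `lemmatize_tokens` are hypothetical, since the text does not say how slang and abbreviations were identified.

```python
from typing import Callable, Iterable, List

# Hypothetical slang/abbreviation list; the source does not provide one.
SLANG = frozenset({"lol", "omg", "idk", "btw"})


def lemmatize_tokens(
    tokens: Iterable[str],
    lemmatize: Callable[[str], str],
    skip: frozenset = SLANG,
) -> List[str]:
    """Lemmatize each token unless it is listed as slang/abbreviation."""
    return [tok if tok.lower() in skip else lemmatize(tok) for tok in tokens]
```

With NLTK and its WordNet corpus installed, `nltk.stem.WordNetLemmatizer().lemmatize` can be passed as `lemmatize`; note that it treats words as nouns unless a part-of-speech tag is supplied.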
Explicit negation handling. We used a WordNet-based antonym
replacer to replace a word preceded by
‘not’, ‘never’, etc. with its antonym.
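The replacement logic can be illustrated with a small sketch in which a toy dictionary stands in for the WordNet antonym lookup. The `ANTONYMS` table, the `NEGATIONS` set and the function name are assumptions made for illustration only.

```python
from typing import Dict, List

# Hypothetical toy lookup standing in for WordNet antonyms.
ANTONYMS: Dict[str, str] = {"good": "bad", "happy": "unhappy", "like": "dislike"}
NEGATIONS = frozenset({"not", "never"})


def replace_negations(tokens: List[str], antonyms: Dict[str, str] = ANTONYMS) -> List[str]:
    """Replace '<negation> <word>' with the antonym of <word> when one is known."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else None
        if tok.lower() in NEGATIONS and nxt in antonyms:
            out.append(antonyms[nxt])  # drop the negation, emit the antonym
            i += 2
        else:
            out.append(tok)            # keep the token unchanged
            i += 1
    return out


print(replace_negations("this is not good".split()))
```

When no antonym is known for the word after the negation, both tokens are kept, so the negation is not silently lost. In NLTK, candidate antonyms can be read from the lemmas of `wordnet.synsets(word)` via their `antonyms()` method.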