"Retaining selected word pairs when tokenizing"
When tokenizing text into single-word tokens, is there a way to keep selected pairs of words together as a single token?
For example, in soccer the term "centre forward" makes more sense as a single token. I looked at n-grams, but that also pairs words that I do not want to pair. I tried using the stem dictionary, but it does not seem to work across multiple tokens. I also tried placing the stemming step before tokenizing, e.g. to change "centre forward" to "centre-forward", but that doesn't appear to work either.
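
To make the desired behaviour concrete, here is a minimal sketch in plain Python rather than in the tool I am using. The function name `tokenize_keep_pairs` and the phrase list are hypothetical, and joining the pair with an underscore is just one assumed convention. It merges each selected pair before splitting on non-word characters, so only the listed pairs are kept together:

```python
import re

def tokenize_keep_pairs(text, pairs):
    """Split text into lowercase word tokens, keeping the listed word pairs together.

    Hypothetical helper for illustration only.
    """
    text = text.lower()
    for pair in pairs:
        words = pair.lower().split()
        joined = "_".join(words)
        # Match the words of the pair separated by any whitespace, on word boundaries
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        # Use a function as replacement so the joined text is inserted literally
        text = re.sub(pattern, lambda m: joined, text)
    # \w also matches "_", so the merged pairs survive as single tokens
    return re.findall(r"\w+", text)

tokens = tokenize_keep_pairs(
    "The centre forward scored; the centre held.", ["centre forward"]
)
print(tokens)
# ['the', 'centre_forward', 'scored', 'the', 'centre', 'held']
```

Note that "centre" on its own is still a separate token; only the exact pair is merged, which is what n-grams do not give me.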