Split text tokens where words have been concatenated

I have text tokens like

where the punctuation and/or spaces are missing in the original text.
Besides keeping a replacement list (for example, replacing "expressionresearch" with the two tokens "expression" and "research"), is there a smarter way to handle this situation?


  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,483 RM Data Scientist
    Whatever you do, you need a list of the words that can occur inside the tokens, i.e. some kind of dictionary.

    Then you could work with Generate Attributes functions such as contains or find.

    One approach might be to try a word tokenizer built for languages that are written without spaces between words, such as Jieba (link below). You can then provide it with your own dictionary of words to split by. Hope that helps.
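
    To illustrate the dictionary idea suggested above, here is a minimal Python sketch of dictionary-based splitting. It is not a RapidMiner operator and not Jieba's algorithm; the function name split_concatenated and the example vocabulary are made up for illustration. It assumes that every concatenated token is made up entirely of words from your dictionary, and it prefers the split with the fewest words.

```python
def split_concatenated(token, vocabulary):
    """Split a token into dictionary words; return [token] if no full split exists.

    Hypothetical helper for illustration only. `vocabulary` is assumed to be
    a set of known lowercase words.
    """
    n = len(token)
    # best[i] holds the fewest-words split of token[:i], or None if there is none.
    best = [None] * (n + 1)
    best[0] = []
    max_len = max((len(word) for word in vocabulary), default=0)

    for end in range(1, n + 1):
        # Only look back as far as the longest dictionary word.
        for start in range(max(0, end - max_len), end):
            piece = token[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate

    # Leave the token untouched when it cannot be fully covered by the dictionary.
    return best[n] if best[n] is not None else [token]


# Example vocabulary (an assumption; in practice use your own word list).
vocabulary = {"expression", "research", "express", "ion", "search"}
print(split_concatenated("expressionresearch", vocabulary))
```

    Preferring the fewest words keeps "expression" together instead of splitting it into "express" and "ion". Tokens that cannot be covered completely by the dictionary are returned unchanged, so ordinary words pass through as they are.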

