Split text tokens where words have concatenated

mob Member Posts: 37 Contributor II
edited June 2019 in Help
I have text tokens like
stylesexploration
expressionresearch
technologypractice
curriculaimprovisationsurvey

where the punctuation and/or spaces are missing in the original text.
Besides keeping a replacement list that maps "expressionresearch" to the two tokens "expression" and "research", is there a smarter way to handle this situation?

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,525 RM Data Scientist
    Whatever you do, you need a list of the words that can occur inside the tokens, i.e. some kind of dictionary.

    Then you might be able to do the split with Generate Attributes functions such as contains or find.


    ~Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    One approach might be to try a word tokenizer designed for languages written without spaces between words, such as Jieba, a Chinese text segmentation library (link below). You can then provide it with your own dictionary of words to split by. Hope that helps.

    https://github.com/fxsjy/jieba
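
Both answers come down to splitting against a word list. As a rough illustration of that idea outside RapidMiner, here is a minimal Python sketch that uses dynamic programming to find the split with the fewest dictionary words. The function name `split_concatenated` and the small `vocabulary` list are made up for this example. A real run would need a much larger dictionary, and a token containing any word missing from it is reported as unsplittable.

```python
from typing import Iterable, List, Optional


def split_concatenated(token: str, vocabulary: Iterable[str]) -> Optional[List[str]]:
    """Split `token` into the fewest words from `vocabulary`.

    Returns None when the token cannot be fully covered by known words.
    """
    vocab = {w.lower() for w in vocabulary}
    if not vocab:
        return None
    max_len = max(len(w) for w in vocab)
    text = token.lower()
    n = len(text)

    # best[i] holds the shortest word list covering text[:i], or None.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        # Only look back as far as the longest known word.
        for start in range(max(0, end - max_len), end):
            prefix = best[start]
            piece = text[start:end]
            if prefix is None or piece not in vocab:
                continue
            candidate = prefix + [piece]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[n]


# Hypothetical word list built from the tokens in the question.
vocabulary = [
    "styles", "exploration", "expression", "research", "technology",
    "practice", "curricula", "improvisation", "survey",
]

print(split_concatenated("expressionresearch", vocabulary))
# ['expression', 'research']
```

Preferring the fewest words keeps a long dictionary entry intact instead of breaking it into shorter words that also happen to be in the list. Tokens that come back as None can be left unchanged or collected for manual review.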