Split text tokens where words have been concatenated

mob Member Posts: 37 Contributor I
edited June 19 in Help
I have text tokens like
stylesexploration
expressionresearch
technologypractice
curriculaimprovisationsurvey

where the punctuation and/or spaces are missing in the original text.
Besides keeping a replacement list (e.g. replace "expressionresearch" with the two tokens "expression" and "research"), is there a smarter way to handle this situation?

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,155  RM Data Scientist
    Whatever you do, you need a list of the words that can occur inside the tokens, i.e. some kind of dictionary.

    Then you could use Generate Attributes functions such as contains or find to locate those words in each token.


    ~Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 564   Unicorn
    One approach might be to try a word segmenter built for languages that are written without spaces between words, such as Jieba, a Chinese text segmentation library (link below).  You can then provide it with your own dictionary of words to split by.  Hope that helps.

    https://github.com/fxsjy/jieba
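
Both answers rely on a dictionary of known words. Outside RapidMiner, that idea could be sketched in plain Python roughly as below. This is only an illustration under stated assumptions, not a RapidMiner operator: the function name split_concatenated and the small vocabulary are made up for the example, and a real word list would have to come from your own corpus or a general dictionary. The sketch tries every way of cutting a token into dictionary words and keeps the split with the fewest words.

```python
from functools import lru_cache


def split_concatenated(token, vocabulary):
    """Split `token` into the fewest words found in `vocabulary`.

    Hypothetical helper for illustration only. Returns a list of words,
    or None when the token cannot be fully covered by the vocabulary.
    """
    vocab = frozenset(word.lower() for word in vocabulary)
    max_len = max(map(len, vocab), default=0)  # longest word worth trying

    @lru_cache(maxsize=None)
    def best(start):
        # Best (shortest) segmentation of token[start:], as a tuple of words.
        if start == len(token):
            return ()
        candidates = []
        last_end = min(len(token), start + max_len)
        for end in range(start + 1, last_end + 1):
            piece = token[start:end].lower()
            if piece in vocab:
                rest = best(end)
                if rest is not None:
                    candidates.append((piece,) + rest)
        # Prefer the split with the fewest words; None if nothing fits.
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else None


# Assumed example vocabulary; replace it with your own word list.
VOCABULARY = {
    "styles", "exploration", "expression", "research",
    "technology", "practice", "curricula", "improvisation", "survey",
}

for token in ["stylesexploration", "curriculaimprovisationsurvey"]:
    print(token, "->", split_concatenated(token, VOCABULARY))
```

Tokens that cannot be covered completely by the word list return None, so they can be left unchanged or reviewed by hand. Preferring the fewest words is one possible tie-break; with a large dictionary, a frequency-based score would probably give more sensible splits.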