"Text Plugin - StringTokenizer doesn't tokenize alphanumeric strings"

James Member Posts: 4 Contributor I
edited May 23 in Help
I'm using the StringTokenizer from the Text plugin (v. 4.5) to create tokens from text.  The problem I am having is that my text documents contain alphanumeric codes that I would like to tokenize.  It appears the StringTokenizer only tokenizes words and removes any numeric or special characters.

For example, within my text document I may have three alphanumeric codes as shown below.  Using the StringTokenizer will result in a single token called "C" (all numbers and special characters are removed).  What I would like is for the StringTokenizer to find two distinct tokens ("C847_0" and "C372_-1") for this document, with "C847_0" occurring twice.
C847_0  C372_-1  C847_0
Are there any options in the StringTokenizer that I can set to allow alphanumeric tokens?  Or is there an operator that simply creates attributes by splitting on spaces and then I could simply filter out whatever type of attributes I don't need (e.g., purely numeric tokens)?
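
To make the desired behavior concrete, here is a minimal sketch in plain Java. It is not RapidMiner or Text plugin code; the class name WhitespaceTokenizerSketch and the methods tokenize and dropPurelyNumeric are hypothetical names used only for illustration. It assumes tokens are separated by runs of whitespace and that "purely numeric" means digits only.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of RapidMiner or the Text plugin.
class WhitespaceTokenizerSketch {

    // Split on runs of whitespace, leaving digits and special characters intact.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) { // guards against empty or all-whitespace input
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Drop tokens that consist only of digits (assumed meaning of "purely numeric").
    static List<String> dropPurelyNumeric(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!token.matches("\\d+")) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = dropPurelyNumeric(tokenize("C847_0  C372_-1  C847_0"));
        // A LinkedHashSet keeps first-seen order while removing repeats.
        Set<String> distinct = new LinkedHashSet<>(tokens);
        System.out.println(distinct); // prints [C847_0, C372_-1]
    }
}
```

In other words, the document above should yield the two distinct tokens "C847_0" and "C372_-1", with the first one counted twice.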

Any help would be appreciated. 

Thanks.

Answers

  • Ryujakk Member Posts: 17  Maven
    Hi there,

    I had the exact same problem. One solution might be to download RapidMiner 5 beta, which should be available quite soon according to Tobias Malbrecht. Or you can try the text plugin extension I wrote to solve this. It might still be buggy though...

    Here is the link if you want to try it out: http://www.filedropper.com/rapidminer-advancedstringtokenizer-46

    Just download the jar and copy it to "RapidMiner\lib\plugins". Remove the previous text plugin first though!

    - R
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,525   Unicorn
    Hi,
    although not official yet, the beta is already downloadable on SourceForge.

    Greetings,
      Sebastian
  • James Member Posts: 4 Contributor I
    Hi,

    Ryujakk - thanks for the link.  I'm unable to download the plugin right now because my work blocks access to filedropper.com, but I will try it out later.

    Sebastian - thank you for the information about the beta; I went ahead and installed the software (it looks great!).  Are the text operators part of the core in version 5.0? (The link below mentions the operators may be part of 5.0.)  If so, I'm having trouble locating the operators.

    http://rapid-i.com/rapidforum/index.php/topic,1183.0.html

    If not, do you know if a version 5.0 of the text plugin will be released soon? The reason I ask is because when I "install" version 4.5 (or 4.6) of the text plugin in RM5.0B, I can't find the operators anywhere.

    Thanks so much for your help.
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,525   Unicorn
    Hi,
    the RapidMiner Beta does not support plugins. I'm currently working on adapting (and extending) the mechanism, so that it works with 5.0.

    It's true that the new text processing operators (which are NOT just integrated from the TextPlugin, but have been completely redesigned) have been removed from the core again in the beta and will be published separately in the near future.

    Greetings,
      Sebastian
  • James Member Posts: 4 Contributor I
    Sebastian,

    Thanks for the info about plugins not being supported in the beta.  I look forward to seeing what the extended text operators will be capable of when they are released.

    Until the new text processing operators are released I will use the plugin from Ryujakk (thanks again!).  Ryujakk's AdvancedStringTokenizer did just what I needed; it was able to tokenize my text using spaces while retaining numeric and special characters.

    Thanks for your help.

    James
