"Text Plugin - StringTokenizer doesn't tokenize alphanumeric strings"

James · September 2009

I'm using the StringTokenizer (part of the Text plugin, v. 4.5) to create tokens from text. The problem is that my text documents contain alphanumeric codes that I would like to tokenize. It appears the StringTokenizer only tokenizes words and removes any numeric or special characters.

For example, my text document may contain three alphanumeric codes, as shown below. Using the StringTokenizer results in a single token called "C" (all numbers and special characters are removed). What I would like is for the StringTokenizer to find two distinct tokens ("C847_0" and "C372_-1") in this document.

C847_0  C372_-1  C847_0
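
To make the difference concrete, here is a minimal Python sketch (not the plugin's code). It assumes the tokenizer behaves as if it split on every non-letter character, which matches the result described above, and contrasts that with a plain whitespace split. The function names are made up for illustration.

```python
import re

def letters_only_tokens(text):
    # Assumed model of the observed behavior: every non-letter
    # character acts as a separator, so digits, "_" and "-" vanish.
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]

def whitespace_tokens(text):
    # Split on runs of whitespace only; digits and special
    # characters stay inside the token.
    return text.split()

doc = "C847_0  C372_-1  C847_0"
letters = letters_only_tokens(doc)
spaces = whitespace_tokens(doc)

print(sorted(set(letters)))  # ['C']
print(sorted(set(spaces)))   # ['C372_-1', 'C847_0']
```

With the letters-only split, all three codes collapse into the same token "C"; with the whitespace split, the two distinct codes survive.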

Are there any options in the StringTokenizer that I can set to allow alphanumeric tokens? Or is there an operator that simply creates attributes by splitting on spaces and then I could simply filter out whatever type of attributes I don't need (e.g., purely numeric tokens)?
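
The "split on spaces, then filter" idea could look roughly like the following Python sketch. It is only an illustration of the approach, not a RapidMiner operator: the helper names are hypothetical, and the sample document adds two purely numeric tokens ("42" and "3.14") that are not in the original example, just to show the filter at work.

```python
import re
from collections import Counter

# Matches tokens that are purely numeric, e.g. "42", "-7", "3.14".
NUMERIC = re.compile(r"[+-]?\d+(\.\d+)?")

def is_purely_numeric(token):
    return NUMERIC.fullmatch(token) is not None

def token_counts(text, drop_numeric=True):
    # Split on whitespace so alphanumeric codes stay intact.
    tokens = text.split()
    if drop_numeric:
        tokens = [t for t in tokens if not is_purely_numeric(t)]
    # One count per distinct token, similar to one attribute per token.
    return Counter(tokens)

# Hypothetical sample: the codes from the post plus two numeric tokens.
doc = "C847_0 42 C372_-1 C847_0 3.14"
counts = token_counts(doc)
print(counts)
```

Each key in the resulting counter would correspond to one attribute, with the count as its value; codes such as "C372_-1" are kept because they are not purely numeric.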

Any help would be appreciated.

Thanks.

Ryujakk · October 2009

Hi there,

I had the exact same problem. One solution might be to download RapidMiner 5 beta, which should be available quite soon according to Tobias Malbrecht. Or you can try the text plugin extension I wrote to solve this. It might still be buggy though...

Here is the link if you want to try it out: http://www.filedropper.com/rapidminer-advancedstringtokenizer-46

Just download the jar and copy it to "RapidMiner\lib\plugins". Remove the previous text plugin first, though!

- R

land · October 2009

Hi,
Although not official yet, the beta is already downloadable on SourceForge.

Greetings,
Sebastian

James · October 2009

Hi,

Ryujakk - thanks for the link. I'm unable to download the plugin right now because my work blocks access to filedropper.com, but I will try it out later.

Sebastian - thank you for the information about the beta; I went ahead and installed the software (it looks great!). Are the text operators part of the core in version 5.0? (The link below mentions the operators may be part of 5.0.) If so, I'm having trouble locating the operators.

http://rapid-i.com/rapidforum/index.php/topic,1183.0.html

If not, do you know if a version 5.0 of the text plugin will be released soon? The reason I ask is because when I "install" version 4.5 (or 4.6) of the text plugin in RM5.0B, I can't find the operators anywhere.

Thanks so much for your help.

land · October 2009

Hi,
The RapidMiner beta does not support plugins. I'm currently working on adapting (and extending) the plugin mechanism so that it works with 5.0.

It's true that the new text processing operators (which are NOT just integrated from the Text plugin, but have been completely redesigned) have been removed from the core again in the beta and will be published separately in the near future.

Greetings,
Sebastian

James · October 2009

Sebastian,

Thanks for the info about plugins not being supported in the beta. I look forward to seeing what the extended text operators will be capable of when they are released.

Until the new text processing operators are released, I will use the plugin from Ryujakk (thanks again!). Ryujakk's AdvancedStringTokenizer did just what I needed; it was able to tokenize my text using spaces while retaining numeric and special characters.

Thanks for your help.

James