Define splitting characters for tokenizer?
Hi!
I was playing around with the text plugin because it seemed to be the easiest way to try running SVMs on the data I am working with, and the examples already seem quite useful. However, the StringTokenizer does too much splitting for my files, e.g. it splits "get_file" at "_", "c:\windows" at "\", etc.
Is there a way to tell it to split only on blank spaces, only on newlines, etc? I tried making my own Tokenizer, but sadly the given one only calls edu.udo.cs.wvtool.generic.tokenizer.StringTokenizer which comes from a library...
Answers
I think a custom tokenizer is the way to go. You are right that having most of the text plugin hidden in a library makes it hard to extend. The text operators are being migrated into the core, so you can check with 5.0 whether implementing a custom tokenizer becomes easier for you.
Best,
Simon
Here's another way around it, if you can't wait for the next version:
- Store the texts in a nominal attribute in an example set
- Use the split operator to split the texts according to your needs and distribute the pieces over a number of attributes
- Use the MissingValueReplenishment operator to exchange missing values with a blank " ".
- Either change all generated nominal attributes to String attributes or use the filter_nominal_attributes parameter of the string text input.
- Perform the text input, but don't use a String tokenizer. All tokenization has already been carried out beforehand.
This should do the trick...
Greetings,
Sebastian
*scratch* Okay, I wanted to try that, but I noticed how little I actually know about what I can do...
Currently I have a lot of files looking like this:
Right now I can't even find a way to load them so that the text ends up as attributes (with the file names as ids and the folder as label or something)...
I had the same request as silentguy. I managed to hack together a new operator which lets the user specify which characters should be used as separators. You can download the modified plugin here: http://www.filedropper.com/rapidminer-advancedstringtokenizer-46
It's a temporary solution until 5.0 comes out, but it does the trick for me. It's still probably full of bugs though, so don't use it to secure a nuclear plant ;D
PM me if you want the source code or find any bugs!
I downloaded the text processing add-on for RM 5.0 and found that it contains only the simple Tokenize block which splits at any non-letter character. I want to be able to define my splitting characters (e.g., split only at whitespace and square brackets) so that RM will not split terms such as netshare1_user1 (which I want to keep as a single term).
Will this feature be offered? If so, when is it planned for release?
Thanks,
David
Looks like it is not worth making a plugin for a single operator :-) But of course, if you send me the source code, I'd love to build it into the next release of the text extension. Unfortunately, the link no longer seems to be working.
Cheers,
Simon
Indeed, for some reason my promotional 30-year account with 250 GB of storage space at filedropper seems to have been terminated.
Anyhow, I ported the operator to be compatible with RM 5.0. For now, you can download it at http://www.megaupload.com/?d=0WP7NMQG (source included.)
I added the following code to the package com.rapidminer.operator.text.io.tokenizer:
Basically, 2 lines changed from the original StringTokenizerOperator! It's all yours now.
- R
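For anyone reading along without the download, here is a minimal standalone sketch of the general idea in plain Java. It is not the operator's source and does not touch the RapidMiner or WVTool APIs: the class name DelimiterTokenizer and the method tokenize are hypothetical, and the only library call is the standard java.util.StringTokenizer with a user-supplied delimiter string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper for illustration only; not part of RapidMiner or WVTool.
final class DelimiterTokenizer {

    // Splits text at any character contained in `delimiters`.
    // Every other character (including '_' and '\') stays inside its token.
    static List<String> tokenize(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        // Two-argument constructor: delimiter characters are dropped
        // and empty tokens are never returned.
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Whitespace only: "get_file" and "c:\windows" are kept whole.
        System.out.println(tokenize("get_file c:\\windows", " \t\r\n"));
        // Whitespace plus square brackets: "netshare1_user1" is kept whole.
        System.out.println(tokenize("[netshare1_user1] login ok", " \t\r\n[]"));
    }
}
```

In an operator, the delimiter string would presumably come from a user-facing parameter instead of being hard-coded as it is in main here.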
Thank you very much. I will include it in the tokenizer now.
Greetings,
Sebastian