Define splitting characters for tokenizer?

silentguy · August 2009

Hi!
I was playing around with the text plugin because it seemed to be the easiest way to try running SVMs on the data I am working with, and the examples already seem quite useful. However, the StringTokenizer splits too aggressively for my files: it splits "get_file" at "_", "c:\windows" at "\", and so on.
Is there a way to tell it to split only on blank spaces, only on newlines, etc.? I tried writing my own tokenizer, but sadly the given one only calls edu.udo.cs.wvtool.generic.tokenizer.StringTokenizer, which comes from a library...

fischer · August 2009

Hi,

I think a custom tokenizer is the way to go. You are right that having most of the text plugin hidden in a library makes it hard to extend. The text operators are being migrated into the core, so you can check with 5.0 whether implementing a custom tokenizer becomes easier for you.

Best,
Simon

land · August 2009

Hi,
here's another way around, if you can't wait for the next version:

1. Store the texts in a nominal attribute in an example set.
2. Use the split operator to split the texts according to your needs and distribute the parts over a number of attributes.
3. Use the MissingValueReplenishment operator to replace missing values with a blank " ".
4. Either change all generated nominal attributes to String attributes or use the filter_nominal_attributes parameter of the string text input.
5. Perform the text input, but don't use a String tokenizer; all tokenization has already been carried out beforehand.

This should do the trick...

Greetings,
Sebastian

silentguy · August 2009

Whoops, missed the second post... thanks for the tip...
*scratch* okay, I wanted to try that, but I noticed how little I actually know about what I can do...
Currently I have a lot of files looking like this:

<Prozess 1>
<Thread 1:1>
get_file_attributes()
create_file()
vm_read()
vm_read()
vm_read()
vm_read()
vm_read()
vm_read()
enum_modules()
</Thread 1>
</Prozess 1>
<Prozess 2>
<Thread 2>
load_image()
get_system_directory()
get_file_attributes()
open_key()
open_key()
delete_key()
</Thread 2>
</Prozess 2>

Right now I can't even find a way to load them so that the text ends up as attributes (with the file names as ids and the folder as label or something)...
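
Outside RapidMiner, the layout described here (one folder per class, one trace file per example) can be read with a few lines of plain Java. This is only a sketch under assumptions: the class TraceLoader and its Row type are made up for illustration and are not part of RapidMiner or the text plugin, the files are assumed to be UTF-8, and "folder as label" is taken to mean the name of the directory that directly contains the file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of RapidMiner or the text plugin.
final class TraceLoader {

    /** One row per file: id = file name, label = parent folder name, text = file content. */
    static final class Row {
        final String id;
        final String label;
        final String text;

        Row(String id, String label, String text) {
            this.id = id;
            this.label = label;
            this.text = text;
        }

        @Override
        public String toString() {
            return id + " [" + label + "] " + text.length() + " chars";
        }
    }

    /** Walks the tree below root and turns every regular file into a Row. */
    static List<Row> load(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<Row> rows = new ArrayList<>();
        for (Path file : files) {
            // Assumption: the files are UTF-8 encoded text.
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            String label = file.getParent().getFileName().toString();
            rows.add(new Row(file.getFileName().toString(), label, text));
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java TraceLoader <root-folder>");
            return;
        }
        for (Row row : load(Paths.get(args[0]))) {
            System.out.println(row);
        }
    }
}
```

The text is kept whole, including the <Prozess ...> and <Thread ...> marker lines; whether those should be dropped or kept as tokens depends on the later tokenization step.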

Ryujakk · September 2009

Hi!

I had the same request as silentguy. I managed to hack together a new operator which lets the user specify which characters should be used as separators. You can download the modified plugin here: http://www.filedropper.com/rapidminer-advancedstringtokenizer-46
It's a temporary solution until 5.0 comes out, but it does the trick for me. It's still probably full of bugs though, so don't use it to secure a nuclear plant ;D
PM me if you want the source code or find any bugs!

dbrown · January 2010

I was using the "advanced string tokenizer" provided by Ryujakk for RM 4.6 and found it was just what I needed. Is there any plan to include this functionality in RM 5.0?

I downloaded the text processing add-on for RM 5.0 and found that it contains only the simple Tokenize block which splits at any non-letter character. I want to be able to define my splitting characters (e.g., split only at whitespace and square brackets) so that RM will not split terms such as netshare1_user1 (which I want to keep as a single term).
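
As a rough illustration of that requirement outside RapidMiner, here is a minimal sketch in plain Java that splits only at whitespace and square brackets. The class name BracketWhitespaceSplitter is hypothetical; this is not the Tokenize operator from the add-on, only java.util.regex from the standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical standalone sketch; not the Tokenize operator from the text processing add-on.
final class BracketWhitespaceSplitter {

    // Split at runs of whitespace and square brackets only; "_" and "\" are left alone.
    private static final Pattern SEPARATORS = Pattern.compile("[\\s\\[\\]]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : SEPARATORS.split(text)) {
            // split() yields an empty leading string when the text starts with a separator.
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [netshare1_user1, open_key, c:\windows]
        System.out.println(tokenize("[netshare1_user1] open_key c:\\windows"));
    }
}
```

Because the underscore is not in the character class, netshare1_user1 survives as a single term.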

Will this feature be offered? If so, when do you plan to release it?

Thanks,
David

fischer · January 2010

Hi,

looks like it is not worth making a plugin for a single operator :-) But of course, if you send me the source code, I'd love to build it into the next release of the text extension. Unfortunately, the link seems to be no longer working.

Cheers,
Simon

Ryujakk · January 2010

Hi,

Indeed, for some reason my promotional 30-year account with 250 GB of storage space at filedropper seems to have been terminated.

Anyhow, I ported the operator to be compatible with RM 5.0. For now, you can download it at http://www.megaupload.com/?d=0WP7NMQG (source included.)

I added the following code to the package com.rapidminer.operator.text.io.tokenizer:


/*
 *  RapidMiner
 *
 *  Copyright (C) 2001-2009 by Rapid-I and the contributors
 *
 *  Complete list of developers available at our web site:
 *
 *       http://rapid-i.com
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Affero General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Affero General Public License for more details.
 *
 *  You should have received a copy of the GNU Affero General Public License
 *  along with this program.  If not, see http://www.gnu.org/licenses/.
 */
package com.rapidminer.operator.text.io.tokenizer;

import java.util.ArrayList;
import java.util.List;

import com.rapidminer.operator.OperatorDescription;
import com.rapidminer.operator.UserError;
import com.rapidminer.operator.text.Document;
import com.rapidminer.operator.text.Token;
import com.rapidminer.operator.text.io.AbstractTokenProcessor;
import com.rapidminer.parameter.ParameterType;
import com.rapidminer.parameter.ParameterTypeString;

/**
 * This class tokenizes all tokens in the input.
 * The characters used as separators can be specified.
 * 
 * @author Ryujakk
 */
public class AdvancedTokenizerOperator extends AbstractTokenProcessor {

	public static final String SEPARATORS = "characters";

	public AdvancedTokenizerOperator(OperatorDescription description) {
		super(description);
	}

	@Override
	protected Document doWork(Document textObject) throws UserError {
		String separators = getParameterAsString(SEPARATORS);

		List<Token> newSequence = new ArrayList<Token>();
		for (Token token: textObject.getTokenSequence()) {
			char[] tokenChars = token.getToken().toCharArray();
			int start = 0;
			for (int i = 0; i < tokenChars.length; i++) {
				if (separators.indexOf(tokenChars[i]) >= 0) {
					if (i - start > 0) {
						newSequence.add(new Token(new String(tokenChars, start, i - start), token));
					}
					start = i + 1;
				}
			}
			if (tokenChars.length - start > 0)
				newSequence.add(new Token(new String(tokenChars, start, tokenChars.length - start), token));
		}
		textObject.setTokenSequence(newSequence);
		return textObject;
	}

	@Override
	public List<ParameterType> getParameterTypes() {
		List<ParameterType> types = super.getParameterTypes();
		types.add(new ParameterTypeString(SEPARATORS, "The characters used to separate individual tokens.", " "));
		return types;
	}
}

Basically, 2 lines changed from the original StringTokenizerOperator! It's all yours now.
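
To try the splitting logic without a RapidMiner installation, the inner loop of the operator can be sketched on plain strings. The class SeparatorSplitDemo below is a hypothetical stand-in: it uses String instead of the Document and Token classes and assumes the separators are given as one string of single characters, as in the "characters" parameter above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone class: mirrors the operator's inner loop on plain strings,
// so it has no RapidMiner dependency.
final class SeparatorSplitDemo {

    /** Splits text at every character that occurs in separators; empty pieces are skipped. */
    static List<String> split(String text, String separators) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (separators.indexOf(text.charAt(i)) >= 0) {
                if (i > start) {
                    parts.add(text.substring(start, i));
                }
                start = i + 1;
            }
        }
        // Whatever follows the last separator is the final piece.
        if (text.length() > start) {
            parts.add(text.substring(start));
        }
        return parts;
    }

    public static void main(String[] args) {
        // Separators: blank and newline only. prints [get_file, c:\windows, vm_read()]
        System.out.println(split("get_file  c:\\windows\nvm_read()", " \n"));
    }
}
```

Each character is compared individually, so a separator string of " \n" splits at blanks and newlines but leaves "_" and "\" inside the tokens.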

- R

land · January 2010

Hi,
thank you very much. I will include it in the tokenizer now.

Greetings,
Sebastian
