CSV to N-Gram Process

dsaraph · September 2011

I have a large CSV file that I'm trying to process in order to generate n-grams for a selected attribute. The process is currently as follows:

Read CSV => Select Attribute => Nominal to Text => Data to Documents => Process Documents => Mutual Information
=> Wordlist to Data

Now, within the Process Documents operator I have:

Tokenize => Transform Cases (to lower case) => Filter Tokens => Stem => Filter Stopwords => Generate N-grams

I can run the process without error, and it seems to generate results after "Process Documents", but when I go to view the results none are shown. The run time is also very short; I would expect it to take at least a few minutes for a file of this size.

Can anyone shed some light on possible breaks in this process, or suggest modifications so that I can get results?

Thanks.

dsaraph · September 2011

Update: I was able to get it to run on a small set of data (100 lines), but it seems inefficient at the moment. The mistake I had was in how I was reading the CSV file. I need to read a larger data set, and was wondering if there are ways to improve the efficiency so that it doesn't run out of memory. Perhaps writing to a file? I can run on a server, but I'm trying to see if there are ways to streamline this ahead of that.

JEdward · September 2011

What about storing the CSV in the local repository, so that the metadata is available for the process and doesn't have to be created each time?
That would mean splitting it into 2 or more processes.

e.g.
Process 1
Read CSV => Select Attribute => Nominal to Text => Store Result
Process 2
Load Result => Data to Documents => Process Documents => Mutual Information => Wordlist to Data

Could this help? (It has on some of mine reading large datasets.)

Best,
JEdward

dsaraph · September 2011

Thanks for the tip, I'll try to run it as two processes. I got rid of the mutual information matrix so the n-grams are created quite a bit faster, but still not to the efficiency I was hoping for. Do you know how I can increase the environment variable for memory allocation? I know that was a suggestion made in another post, but I can't seem to find how/where to do this.

Thanks.

dsaraph · September 2011

JEdward,

Is there a way that I can have the processes run one after another? I'm new to RM, so pardon the perhaps obvious questions. When I store the data, that's the end of one process, and when I retrieve it, that's the beginning of another. Is there a way to link these so that the whole thing runs in one fluid motion?

colo · September 2011

Hi,

you can easily chain processes by using the "Execute Process" operators in some sort of super-process. You can also drag processes from the repository view into the process view, which will create these operators readily set for the chosen processes.

For memory settings, please have a look at the installation guide, which should cover the relevant steps: http://rapid-i.com/content/view/17/211/lang,en/

Regards
Matthias

JEdward · September 2011

Hi,

I was about to post the same answer, but Matthias beat me to it.

He clearly gets up much earlier than I.

Regards,
JEdward

colo · September 2011

Hi JEdward,

sorry - I did not mean to capture the topic :-X

Unless you slept until noon, I doubt that I'm getting up much earlier. It all depends on the time zone...

Regards
Matthias

dsaraph · September 2011

Thanks for the great advice from the both of you, the process works quite well. I wanted to make some minor modifications to the output of the n-grams, to further improve efficiency. I don't require n-grams that are of one, two, or three words, only those that are 4 or greater. However, the n-gram operator for RM only asks for a "maximum" set of words, but no minimum. Is there a way that I would be able to set some sort of minimum for the n-gram output so that it would skip over 1, 2, and 3 word n-grams and only output those that are 4 or 5?

Your help is greatly appreciated.

dsaraph · September 2011

I was trying to figure out regular expression handling. For example, I want to make it so that if there are two or more underscores (e.g. "word_word_word"), the n-gram is kept, and otherwise it is discarded. This would give me all n-grams with 3 or more words in them. If I changed it to three underscores, then I would have n-grams with 4 or more words, and so on.

I wasn't able to figure out how to do this, though. It would be a great help if someone could provide further info on this or suggest another method.

colo · September 2011

Hi,

this sounds like you are creating token n-grams, not character n-grams, right? I didn't know this was possible until now, but I just found both of the operators.

It took me 2 minutes to extend the operator so that it now provides a min-length parameter. If you are able to build the extension from source yourself, I will provide you with the modified source code. Otherwise I could send you the jar file, ready for inclusion into your RapidMiner (E-Mail?).
If you want to use the regex approach instead (this won't reduce processing time, which my modification should do), try "Filter Tokens (by Content)", set the condition parameter to "matches" and use something like

(.*_.*){2,}

to keep only n-grams of at least 3 words (since two underscores are required).
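To see what that expression keeps and drops outside RapidMiner, here is a minimal sketch in plain Java. It assumes the "matches" condition means a full-string match, as `Matcher.matches()` does; the class `NGramRegexFilter` and its helpers `atLeastWords` and `keepMatching` are made up for illustration and are not part of RapidMiner.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class NGramRegexFilter {

    // Hypothetical helper: builds the pattern from the post for a given minimum word count.
    // Each repetition of the group must contain one underscore, so requiring
    // (minWords - 1) repetitions requires at least (minWords - 1) underscores.
    public static Pattern atLeastWords(int minWords) {
        if (minWords < 1) {
            throw new IllegalArgumentException("minWords must be at least 1");
        }
        return Pattern.compile("(.*_.*){" + (minWords - 1) + ",}");
    }

    // Keeps only the n-grams that the pattern matches as a whole string.
    public static List<String> keepMatching(List<String> ngrams, Pattern pattern) {
        return ngrams.stream()
                .filter(ngram -> pattern.matcher(ngram).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> ngrams = List.of("word", "word_word", "word_word_word", "a_b_c_d");

        // Two underscores required: prints [word_word_word, a_b_c_d]
        System.out.println(keepMatching(ngrams, atLeastWords(3)));

        // Three underscores required: prints [a_b_c_d]
        System.out.println(keepMatching(ngrams, atLeastWords(4)));
    }
}
```

Raising the repetition count from 2 to 3 is the only change needed to move from "3 or more words" to "4 or more words".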

Regards
Matthias

dsaraph · September 2011

Hi Matthias,

Yes, I'm creating token n-grams, in order to see patterns in text.

I'm not familiar with how to extend an operator, but if you can provide the source code I can give it a shot. My email is d.saraph@gmail.com, where the jar file can be sent as well. What would the reasoning be behind your modification being able to reduce processing time, as opposed to the token filter operator?

For now, I will try the filter token operator, but I would be happy to improve the efficiency of this process if possible.

Thank you for your help.

colo · September 2011

Hi,

I just sent the mail. I hope it gets delivered, since the attachment size is above 15 MB.

Creating all n-grams and then using additional computations to remove the unwanted ones will probably take more time than creating only the desired ones in the first place. That is the reasoning.

If you are also interested in the code, this is the slightly modified part of TermNGramGeneratorOperator:


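// Both bounds are read from the operator's parameters.
int maxLength = getParameterAsInt(PARAMETER_MAX_LENGTH);
int minLength = getParameterAsInt(PARAMETER_MIN_LENGTH);

// For every start position i, build the n-grams of minLength..maxLength tokens.
for (int i = 0; i < tokenList.size(); i++) {
	// j is the offset of the last token, so the n-gram has j + 1 tokens.
	for (int j = minLength - 1; j < maxLength; j++) {
		// Longer n-grams from this start position would run past the end of the list.
		if (i + j >= tokenList.size()) {
			break;
		}
		StringBuilder s = new StringBuilder();
		for (int z = i; z <= i + j; z++) {
			s.append(tokenList.get(z));
			// Join the tokens with underscores, without a trailing one.
			if (z != i + j) {
				s.append('_');
			}
		}
		if (s.length() > 0) {
			ngrams.add(new Token(s.toString(), tokenList.get(i)));
		}
	}
}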

And this adds the additional parameter:

types.add(new ParameterTypeInt(PARAMETER_MIN_LENGTH, "The minimal length of the ngrams.", 1, Integer.MAX_VALUE, 2, false));
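For anyone who wants to try the min/max idea without building the extension, the same loop can be sketched on plain strings. This is only an illustration under assumptions: it uses `String` tokens instead of RapidMiner's `Token` class, and the class name `MinLengthNGrams` and method `generate` are invented here, not taken from the extension source.

```java
import java.util.ArrayList;
import java.util.List;

public class MinLengthNGrams {

    // Builds underscore-joined n-grams whose word count lies in [minLength, maxLength].
    // Shorter n-grams are never created, which is the point of the min-length parameter.
    public static List<String> generate(List<String> tokens, int minLength, int maxLength) {
        if (minLength < 1 || maxLength < minLength) {
            throw new IllegalArgumentException("require 1 <= minLength <= maxLength");
        }
        List<String> ngrams = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int n = minLength; n <= maxLength; n++) {
                int end = start + n;
                // Longer n-grams from this start position would run past the end of the list.
                if (end > tokens.size()) {
                    break;
                }
                ngrams.add(String.join("_", tokens.subList(start, end)));
            }
        }
        return ngrams;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "quick", "brown", "fox", "jumps");

        // Prints [the_quick_brown_fox, the_quick_brown_fox_jumps, quick_brown_fox_jumps]
        System.out.println(generate(tokens, 4, 5));
    }
}
```

With a minimum of 4 and a maximum of 5, the five example tokens yield three n-grams instead of the twelve that a maximum of 5 alone (lengths 2 to 5) would produce.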

Hope this will help you somehow...

Best regards
Matthias

dsaraph · September 2011

Hi Matthias,

Thanks for this, as well as emailing me the .jar file. I'm going to try implementing this shortly and will post back on here regarding my progress.

dsaraph · September 2011

Hi Matthias,

Just wanted to report back that I was able to run the n-grams quite well, but in the end the results were not exactly what I was looking for so I'm going to be tinkering with the data for the next little bit. Thanks for all your help on this.

On another topic, I wanted to inquire if anyone was familiar with word clustering. For example, is there a way that I can cluster the text without considering the order (n-grams are formed based on the order of the words)... I was looking into some of the clustering operators but I'm not sure what would be applicable to what I'm trying to do. I was hoping there would be an operator that could just replace the n-gram operator in order to carry this out since I still wanted the pre-processing of the data, stemming, and filtering as I currently have. Any suggestions are greatly appreciated.

Thanks.

gunjanamit · June 2012

Hi,

I want to extract the number "2ADFH0B121AO92" from comments.

I have used read excel->nominal to text->process documents->n-gram ->14

But it's not working. Can you suggest what can be done, please?

MariusHelf · June 2012

gunjanamit, this is the fourth thread where you ask the very same question. You won't get your answers faster if you spam the forum. Btw, your question has been answered in this thread. You don't need n-grams at all, they serve a very different purpose.

sowmya_srvsn · May 2017

Any response on this? I have a similar use case where the order of the words within an n-gram doesn't matter, and I want to group "word1-word2-word3" the same as "word2-word1-word3".

Is that possible?

The link posted above doesn't work.

Thomas_Ott · May 2017

Can't you use the Extract Information operator for this?
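
As for grouping "word1-word2-word3" with "word2-word1-word3", one possible approach outside RapidMiner is to sort the words inside each n-gram so that every ordering maps to the same key. The sketch below is an assumption about what is wanted, not an operator-based answer confirmed by anyone in this thread; the class `UnorderedNGramGroups` and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

public class UnorderedNGramGroups {

    // Maps an n-gram to an order-independent key by sorting its words.
    public static String canonicalKey(String ngram, String delimiter) {
        String[] words = ngram.split(Pattern.quote(delimiter));
        Arrays.sort(words);
        return String.join(delimiter, words);
    }

    // Counts n-grams per canonical key, so reorderings of the same words share one count.
    public static Map<String, Integer> countUnordered(List<String> ngrams, String delimiter) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String ngram : ngrams) {
            counts.merge(canonicalKey(ngram, delimiter), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> ngrams = List.of(
                "word1-word2-word3",
                "word2-word1-word3",
                "word3-word4-word5");

        // Prints {word1-word2-word3=2, word3-word4-word5=1}
        System.out.println(countUnordered(ngrams, "-"));
    }
}
```

The same idea works for underscore-joined n-grams by passing "_" as the delimiter.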
