memory problem and Chinese characters recognition
sunnyfunghy
Member Posts: 19 Contributor II
Hi, everyone,
I have two questions about text mining:
1) After downloading the 20 Newsgroups data set from the internet, I have 20 directories, each containing more than 1000 data files. When I load all of them, RapidMiner reports that my computer does not have enough memory. It works if I load only 4 directories and predict on that data. Has anyone had a similar problem? Is there a way to solve it apart from buying a more powerful computer?
2) Can Chinese text be used for text mining in RapidMiner? If yes, how? Thanks a lot.
Sunny
Answers
This depends. Please check whether your RapidMiner installation really makes use of your memory. Go to the Results perspective and take a look at the Resource Monitor. You can use up to roughly 70% to 80% of your main memory for RapidMiner if you don't have any other programs running.
To your second question: yes, RapidMiner can handle any Unicode character. All you have to do is set the encoding for the input text files in the Process Documents from Files operator. The only thing I don't know: by default RapidMiner tokenizes by splitting at every position that is not a letter. I don't know whether Chinese has non-letter positions between words. Unfortunately my Chinese is a little bit rusty...
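To illustrate why the encoding has to be set explicitly rather than left to the operating system, here is a minimal Python sketch. It is not RapidMiner code; the sample text is taken from later in this thread, and cp1252 is only assumed to be the default code page on an English Windows installation.

```python
# Illustrative only: the same bytes read with two different encodings.
text = "線上 展示 使用"

# Bytes as they would sit on disk if the file was saved as UTF-8.
raw = text.encode("utf-8")

# Decoding with the encoding the file was written in restores the text
# on any machine, whatever its system language.
decoded_ok = raw.decode("utf-8")

# Decoding with a Western code page (assumed default: cp1252) produces
# unreadable characters instead of the Chinese text.
decoded_bad = raw.decode("cp1252", errors="replace")

print(decoded_ok == text)
print(decoded_bad == text)
```

The point is that the bytes in the file are the same on both machines; only the encoding assumed when reading them differs.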
Greetings,
Sebastian
I have tried it, but I want to ask about Chinese characters.
On a computer running the Chinese version of Windows, RapidMiner can read the Chinese words. On another computer running the English version of Windows, it cannot decode the Chinese characters. Would you mind telling me how to solve this?
Sunny
RapidMiner can now read the Chinese text, but the resulting words are not meaningful. In English, RapidMiner can separate the words because there is a space between them. Chinese has no such spaces. Can RapidMiner split Chinese text into meaningful words, as it does for English? I also know that RapidMiner has an operator called "Filter Stopwords (Dictionary)". Does RapidMiner provide a dictionary for Chinese? I know that I could segment Chinese text, such as newspaper articles, with a word segmentation method, but that takes a lot of time. Can anyone help me?
Many thanks,
Sunny
Actually, the best method in the Community Edition is simply to treat each single character as a word. You can use the Character N Gram with n=1 to achieve this.
As an enterprise customer you would have access to a sophisticated Chinese word splitter, but it comes with the disadvantage of high computational cost...
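The single-character approach can be sketched in a few lines of Python. This is only an illustration of the idea, not the RapidMiner operator; the function name `char_ngrams` is made up for this example.

```python
def char_ngrams(text, n=1):
    """Return all character n-grams of text, ignoring whitespace.

    With n=1 every single character becomes its own token, which is
    the "each character is a word" approach described above.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


tokens = char_ngrams("進行斷詞標記", n=1)
print(tokens)
```

With n=2 the same function would return overlapping character pairs, which is sometimes used as a cheap stand-in for real word segmentation.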
Greetings,
Sebastian
I have a further question.
A text file contains:
線上(Nc) 展示(VC) 使用(VC) 簡化(VHC) 詞類(Na) 進行(VC) 斷詞(VA) 標記(Na) Happy family Na
When I use the "Process Documents from Files" operator with Tokenize inside, it generates
線上
Nc
展示
VC
使用
VC
簡化
VHC
詞類
Na
進行
VC
斷詞
VA
標記
Na (2 times)
Happy
family
I would like to ask how to filter out the words that contain brackets. I have used "Filter Tokens (by Content)", but it can only filter one word. Can anyone tell me which operator is suitable and what expression to use? Thank you very much.
Sunny
Filtering words containing brackets should be possible by using a regular expression, like in this small example:
@Sebastian: I was wondering why I had to use ".*" although I selected the mode "contains match". The expression "[()]+" should have been enough in my opinion, but it didn't work.
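The difference between a "contains" test and a whole-token test can be shown with Python's `re` module. This is a sketch only: it is not the regex engine used by RapidMiner, and the idea that the "contains match" mode behaved like a whole-token match in that version is a guess consistent with the observation above, not something confirmed in this thread. The tokens come from splitting on whitespace, so the brackets are still part of each token.

```python
import re

BRACKETS = re.compile(r"[()]+")      # one or more round brackets
WRAPPED = re.compile(r".*[()]+.*")   # same, padded to cover a whole token

line = "線上(Nc) 展示(VC) Happy family"
tokens = line.split()  # whitespace split keeps the brackets in the token

# "Contains" semantics: the pattern may match anywhere inside the token.
contains = [t for t in tokens if BRACKETS.search(t)]

# Whole-token semantics: the pattern must cover the token from start to end.
whole = [t for t in tokens if BRACKETS.fullmatch(t)]

# Whole-token semantics with the ".*" padding.
whole_wrapped = [t for t in tokens if WRAPPED.fullmatch(t)]

print(contains)
print(whole)
print(whole_wrapped)
```

Under whole-token matching the bare "[()]+" selects nothing here, while the padded ".*[()]+.*" selects the same tokens as the "contains" test.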
Regards
Matthias
As far as I know, this is not the case if brackets are used inside character classes. So this wasn't the reason for my useless "contains match" attempt.
Regards
Matthias
You're quite right, POSIX bracket expressions don't need bracket escaping, and are not the reason for '[()]+' not to work - that is down to the plus sign. '[()]+' would match a round bracket followed by '+', so ' (+' or ' )+' but not...
';^)'
Hi everyone,
I have another memory problem. I need to load 20 directories with "Process Documents from Files" for prediction. Each directory contains at least 1000 samples. However, when I train and test with the following process (XML code), the computer reports after running for more than 1 hour that I do not have enough memory. My computer has a dual-core 3 GHz CPU and 2 GB of RAM. How can I change the process or increase the available memory (without buying more memory) so that I can process all of them?
What I am doing now is training on newspaper articles from different topics (entertainment, sports, international, religion, etc., at least 20 topics); in the testing part the computer then predicts which topic each article belongs to. So a large number of samples is needed for training. I look forward to hearing from you soon. Thank you very much for all your help.
Sunny
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.004">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.004" expanded="true" name="Process">
<process expanded="true" height="341" width="413">
<operator activated="true" class="text:process_document_from_file" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="112" y="75">
<list key="text_directories">
<parameter key="computer graphics" value="C:\Documents and Settings\sunny\Desktop\20_newsgroups\comp.graphics"/>
<parameter key="electronics" value="C:\Documents and Settings\sunny\Desktop\20_newsgroups\sci.electronics"/>
<parameter key="motorcycle" value="C:\Documents and Settings\sunny\Desktop\20_newsgroups\rec.motorcycles"/>
<parameter key="medicine" value="C:\Documents and Settings\sunny\Desktop\20_newsgroups\sci.med"/>
</list>
<process expanded="true" height="517" width="806">
<operator activated="true" class="text:tokenize" compatibility="5.1.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
<operator activated="true" class="text:transform_cases" compatibility="5.1.001" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="120"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.001" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="45" y="210"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.1.001" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="45" y="345"/>
<operator activated="true" class="text:stem_snowball" compatibility="5.1.001" expanded="true" height="60" name="Stem (Snowball)" width="90" x="313" y="30"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
<connect from_op="Stem (Snowball)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="split_validation" compatibility="5.1.004" expanded="true" height="130" name="Validation" width="90" x="313" y="30">
<process expanded="true" height="517" width="378">
<operator activated="true" class="neural_net" compatibility="5.1.004" expanded="true" height="76" name="Neural Net" width="90" x="127" y="66">
<list key="hidden_layers"/>
</operator>
<connect from_port="training" to_op="Neural Net" to_port="training set"/>
<connect from_op="Neural Net" from_port="model" to_port="model"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
</process>
<process expanded="true" height="517" width="378">
<operator activated="true" class="apply_model" compatibility="5.1.004" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="5.1.004" expanded="true" height="76" name="Performance" width="90" x="80" y="145"/>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
<portSpacing port="sink_averagable 3" spacing="0"/>
</process>
</operator>
<operator activated="true" class="text:process_document_from_file" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Files (2)" width="90" x="112" y="255">
<list key="text_directories">
<parameter key="graphics" value="C:\Documents and Settings\sunny\Desktop\20_newsgroups\comp.graphics"/>
</list>
<process expanded="true" height="517" width="806">
<operator activated="true" class="text:tokenize" compatibility="5.1.001" expanded="true" height="60" name="Tokenize (2)" width="90" x="112" y="30"/>
<operator activated="true" class="text:transform_cases" compatibility="5.1.001" expanded="true" height="60" name="Transform Cases (2)" width="90" x="112" y="120"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.001" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="112" y="210"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.1.001" expanded="true" height="60" name="Filter Tokens (2)" width="90" x="112" y="300"/>
<operator activated="true" class="text:stem_snowball" compatibility="5.1.001" expanded="true" height="60" name="Stem (2)" width="90" x="384" y="30"/>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
<connect from_op="Stem (2)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="apply_model" compatibility="5.1.004" expanded="true" height="76" name="Apply Model (2)" width="90" x="282" y="261">
<list key="application_parameters"/>
</operator>
<connect from_op="Process Documents from Files" from_port="example set" to_op="Validation" to_port="training"/>
<connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
<connect from_op="Validation" from_port="averagable 2" to_port="result 2"/>
<connect from_op="Process Documents from Files (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 3"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
</process>
</operator>
</process>
Please stop double-posting new questions (perhaps abandon your new topic http://rapid-i.com/rapidforum/index.php/topic,3476.0.html).
Do the memory problems occur before the model is trained? Otherwise you could store the model and then make your predictions in smaller pieces to avoid memory limitations. How much of your memory do you provide for RapidMiner?
I absolutely have to disagree on this: the plus sign is a valid quantifier saying that at least one bracket has to occur. In my experience character classes are mostly used with a following quantifier. This works fine in other areas; only the "contains match" option didn't do what I expected.
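The quantifier reading of the plus sign is easy to check. The sketch below uses Python's `re` module rather than the Java regex engine behind RapidMiner, but to my knowledge both treat a "+" after a character class as "one or more"; a literal plus sign has to be escaped.

```python
import re

pattern = re.compile(r"[()]+")

# "+" as a quantifier: one or more brackets in a row match completely.
one_or_more = pattern.fullmatch("(()") is not None

# A bracket followed by a literal "+" is NOT fully matched,
# because "+" is not a member of the character class.
literal_plus = pattern.fullmatch("(+") is not None

# To match a literal plus sign it must be escaped.
escaped = re.fullmatch(r"[()]\+", "(+") is not None

print(one_or_more, literal_plus, escaped)
```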
I'm not a real expert on regex, so I use a program called RegexBuddy, whose author is an expert. You can paste in a regex, get an interpretation, and export it, like this.. Perhaps you should take this up with him; I'd hate to think of people being misinformed.
http://www.regexbuddy.com/index.html
The memory problems usually occur in the Split Validation part, especially in the Neural Net model inside Split Validation. You can see the errors below:
Mar 28, 2011 4:51:21 PM CONFIG: Loading perspectives.
Mar 28, 2011 4:51:22 PM CONFIG: Ignoring update check. Last update check was on Mon Mar 28 09:57:03 CST 2011
Mar 28, 2011 4:51:22 PM INFO: Connecting to: http://www.myexperiment.org/workflows.xml?num=100
Mar 28, 2011 4:53:41 PM CONFIG: Saving property rapidminer.gui.confirm_exit=true
Mar 28, 2011 4:56:21 PM INFO: No filename given for result file, using stdout for logging results!
Mar 28, 2011 4:56:21 PM INFO: Loading initial data.
Mar 28, 2011 4:56:21 PM INFO: Process starts
Mar 28, 2011 4:57:58 PM INFO: Saving results.
Mar 28, 2011 4:57:58 PM INFO: Process finished successfully after 1:36
Mar 28, 2011 4:58:43 PM INFO: No filename given for result file, using stdout for logging results!
Mar 28, 2011 4:58:43 PM INFO: Loading initial data.
Mar 28, 2011 4:58:43 PM INFO: Process starts
Mar 28, 2011 5:00:16 PM INFO: Saving results.
Mar 28, 2011 5:00:16 PM INFO: Process finished successfully after 1:33
Mar 28, 2011 5:00:37 PM INFO: No filename given for result file, using stdout for logging results!
Mar 28, 2011 5:00:37 PM INFO: Loading initial data.
Mar 28, 2011 5:00:37 PM INFO: Process starts
Mar 28, 2011 5:02:09 PM INFO: Saving results.
Mar 28, 2011 5:02:09 PM INFO: Process finished successfully after 1:31
Mar 28, 2011 5:11:24 PM INFO: No filename given for result file, using stdout for logging results!
Mar 28, 2011 5:11:24 PM INFO: Loading initial data.
Mar 28, 2011 5:11:24 PM INFO: Process starts
Mar 28, 2011 5:12:08 PM INFO: Saving results.
Mar 28, 2011 5:12:08 PM INFO: Process finished successfully after 43 s
Mar 28, 2011 5:19:55 PM INFO: No filename given for result file, using stdout for logging results!
Mar 28, 2011 5:19:55 PM INFO: Loading initial data.
Mar 28, 2011 5:19:55 PM INFO: Process starts
Mar 28, 2011 5:21:06 PM INFO: ImprovedNeuralNet: No hidden layers defined. Using default hidden layer.
Mar 29, 2011 3:43:41 AM SEVERE: Process failed: GC overhead limit exceeded
Mar 29, 2011 3:43:41 AM SEVERE: Here: Process[1] (Process)
subprocess 'Main Process'
+- Process Documents from Files[1] (Process Documents from Files)
subprocess 'Vector Creation'
| +- Tokenize[8038] (Tokenize)
| +- Transform Cases[8038] (Transform Cases)
| +- Filter Stopwords (English)[8038] (Filter Stopwords (English))
| +- Filter Tokens (by Length)[8038] (Filter Tokens (by Length))
| +- Stem (Snowball)[8038] (Stem (Snowball))
+- Validation[1] (Split Validation)
subprocess 'Training'
==> | | +- Neural Net[1] (Neural Net)
subprocess 'Testing'
| +- Apply Model[0] (Apply Model)
| +- Performance[0] (Performance)
+- Process Documents from Files (2)[0] (Process Documents from Files)
subprocess 'Vector Creation'
| +- Tokenize (2)[0] (Tokenize)
| +- Transform Cases (2)[0] (Transform Cases)
| +- Filter Stopwords (2)[0] (Filter Stopwords (English))
| +- Filter Tokens (2)[0] (Filter Tokens (by Length))
| +- Stem (2)[0] (Stem (Snowball))
+- Apply Model (2)[0] (Apply Model)
Mar 29, 2011 9:11:03 AM INFO: Saved process definition at '//NewLocalRepository/ask result'.
Mar 29, 2011 9:13:38 AM INFO: Decoupling process from location //NewLocalRepository/ask result. Process is now associated with file //NewLocalRepository/ask result.
Mar 29, 2011 9:43:27 AM INFO: No filename given for result file, using stdout for logging results!
Mar 29, 2011 9:43:27 AM INFO: Loading initial data.
Mar 29, 2011 9:43:27 AM INFO: Process //NewLocalRepository/ask result starts
Mar 29, 2011 9:44:45 AM INFO: ImprovedNeuralNet: No hidden layers defined. Using default hidden layer.
Mar 29, 2011 9:15:02 PM SEVERE: Process failed: GC overhead limit exceeded
Mar 29, 2011 9:15:02 PM SEVERE: Here: Process[1] (Process)
subprocess 'Main Process'
+- Process Documents from Files[1] (Process Documents from Files)
subprocess 'Vector Creation'
| +- Tokenize[8038] (Tokenize)
| +- Transform Cases[8038] (Transform Cases)
| +- Filter Stopwords (English)[8038] (Filter Stopwords (English))
| +- Filter Tokens (by Length)[8038] (Filter Tokens (by Length))
| +- Stem (Snowball)[8038] (Stem (Snowball))
+- Validation[1] (Split Validation)
subprocess 'Training'
==> | | +- Neural Net[1] (Neural Net)
subprocess 'Testing'
| +- Apply Model[0] (Apply Model)
| +- Performance[0] (Performance)
+- Process Documents from Files (2)[0] (Process Documents from Files)
subprocess 'Vector Creation'
| +- Tokenize (2)[0] (Tokenize)
| +- Transform Cases (2)[0] (Transform Cases)
| +- Filter Stopwords (2)[0] (Filter Stopwords (English))
| +- Filter Tokens (2)[0] (Filter Tokens (by Length))
| +- Stem (2)[0] (Stem (Snowball))
+- Apply Model (2)[0] (Apply Model)
Mar 30, 2011 9:20:54 AM CONFIG: Saving property rapidminer.gui.log_level=ALL
Mar 30, 2011 9:20:59 AM FINER: Parameter 'logfile' is not set. Using default ('').
In the first run I set the data management parameter of "Process Documents from Files" to "double_sparse_array"; it ran for about one and a half hours and then failed with the memory error.
I then ran it again with the data management parameter changed to "long_sparse_array"; it ran for at least 7 hours and then failed with the memory error.
Both errors occur in the Split Validation part.
My computer has 2 GB of RAM and a dual-core 3 GHz CPU. How can I maximize the memory available to RapidMiner? And how can I know whether the memory is sufficient before running the process? It is quite troublesome to run for 7 hours and then get an error.
Thank you very much for your help
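On judging memory needs before a long run, a rough back-of-the-envelope estimate can already help. The Python sketch below is only that: the function name `estimate_gib` is made up, the example numbers (8000 documents, 30000 word attributes, 8 classes) are invented for illustration and are not taken from the log above, and the default hidden layer size of (attributes + classes) / 2 + 1 is my reading of the Neural Net operator's documentation and should be checked against your version.

```python
def estimate_gib(n_examples, n_attributes, n_classes, bytes_per_value=8):
    """Rough sizes in GiB for a dense data table and for the weights
    of a net with one hidden layer. All sizing rules are assumptions."""
    # Assumed default hidden layer size: (attributes + classes) / 2 + 1.
    hidden = (n_attributes + n_classes) // 2 + 1

    # Dense table: one value per example and attribute.
    data_bytes = n_examples * n_attributes * bytes_per_value

    # Weights input->hidden and hidden->output, one bias per unit.
    weights = (n_attributes + 1) * hidden + (hidden + 1) * n_classes
    weight_bytes = weights * bytes_per_value

    gib = 1024 ** 3
    return data_bytes / gib, weight_bytes / gib


# Hypothetical numbers, for illustration only.
data_gib, net_gib = estimate_gib(n_examples=8000, n_attributes=30000, n_classes=8)
print(f"data table: {data_gib:.2f} GiB, network weights: {net_gib:.2f} GiB")
```

Sparse storage can shrink the data table considerably, but under these assumptions the network weights alone grow with the square of the number of attributes. That would suggest reducing the number of attributes or trying a lighter learner before adding more documents.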