"Filter words from rapidminer"

sunnyfunghy Member Posts: 19 Contributor II
edited June 2019 in Help
Hi everyone,

I have a further question. Suppose a text file contains this line:
線上(Nc) 展示(VC) 使用(VC) 簡化(VHC) 詞類(Na) 進行(VC) 斷詞(VA) 標記(Na)  Happy family Na

When I use the "Process Documents from Files" operator with Tokenize inside, it generates these tokens:
線上
Nc
展示
VC
使用
VC
簡化
VHC
詞類
Na
進行
VC
斷詞
VA
標記
Na  (2 times)
Happy
family


I would like to know how to filter out the words inside the brackets (tags such as Nc and VC). I have tried "Filter Tokens (by Content)", but it only lets me filter one word. Can anyone tell me which operator is suitable and what settings it needs? Thank you very much.

Cheers,
Sunny

Answers

  • haddock Member Posts: 849 Maven
    Hi there,

    You can use the regex option of that operator. It shows up if you use 'matches' in the condition parameter.
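
    As a rough illustration outside RapidMiner, the same idea can be sketched in Python. The alternation `Nc|Na|VC|VHC|VA` is only an assumption built from the tags in the sample line, and `fullmatch` is used on the assumption that 'matches' means the regular expression must match the whole token. Inside RapidMiner you may also need to invert the condition so that matching tokens are dropped rather than kept.

    ```python
    import re

    # Sample line from the question: words followed by tags in brackets.
    text = ("線上(Nc) 展示(VC) 使用(VC) 簡化(VHC) 詞類(Na) "
            "進行(VC) 斷詞(VA) 標記(Na)  Happy family Na")

    # Rough stand-in for Tokenize: keep runs of word characters
    # (\w is Unicode-aware in Python 3, so the Chinese words survive).
    tokens = re.findall(r"\w+", text)

    # One regular expression covers all tags at once (assumed tag set).
    tag_pattern = re.compile(r"Nc|Na|VC|VHC|VA")

    # Drop every token that is, in its entirety, one of the tags.
    kept = [t for t in tokens if not tag_pattern.fullmatch(t)]

    print(kept)
    ```

    Because the alternation lists every tag, a single expression replaces the one-word-at-a-time filter.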
  • sunnyfunghy Member Posts: 19 Contributor II
    Hi,
    Thank you very much. I used a similar method, but with "Filter Stopwords (Dictionary)" instead. In the dictionary text file, I list the words that I need to filter out. Do you think that is more convenient? ;D
    Again, thank you very much for all your help.


    Sunny
  • sunnyfunghy Member Posts: 19 Contributor II
    Hi everyone,
    I have another problem, this time with memory. I need to feed 20 directories into "Process Documents from Files" for prediction, and each directory contains at least 1000 samples. However, when I train and test with the following model (XML code), the process runs for more than an hour and then the computer reports that there is not enough memory. My computer has a 3 GHz dual-core CPU and 2 GB of RAM. How can I change the model, or increase the available memory (without buying more RAM), so that I can process all of them?

    What I am doing is training on newspaper articles from different topics (entertainment, sports, international, religion, etc.; at least 20 topics), so that the computer can then predict which topic each article belongs to in the testing part. The training part therefore needs a large number of samples. I look forward to hearing from you. Thank you very much for all your help.

    Sunny


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.004">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.004" expanded="true" name="Process">
        <process expanded="true" height="341" width="413">
          <operator activated="true" class="text:process_document_from_file" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Files" width="90" x="112" y="75">
            <list key="text_directories">
              <parameter key="computer graphics" value="C:\Documents and Settings\sunny\Desktop\20_newsgroups\comp.graphics"/>
              <parameter key="electronics" value="C:\Documents and Settings\sunny\Desktop\20_newsgroups\sci.electronics"/>
              <parameter key="motorcycle" value="C:\Documents and Settings\sunny\Desktop\20_newsgroups\rec.motorcycles"/>
              <parameter key="medicine" value="C:\Documents and Settings\sunny\Desktop\20_newsgroups\sci.med"/>
            </list>
            <process expanded="true" height="517" width="806">
              <operator activated="true" class="text:tokenize" compatibility="5.1.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
              <operator activated="true" class="text:transform_cases" compatibility="5.1.001" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="120"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.001" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="45" y="210"/>
              <operator activated="true" class="text:filter_by_length" compatibility="5.1.001" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="45" y="345"/>
              <operator activated="true" class="text:stem_snowball" compatibility="5.1.001" expanded="true" height="60" name="Stem (Snowball)" width="90" x="313" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
              <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Snowball)" to_port="document"/>
              <connect from_op="Stem (Snowball)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="split_validation" compatibility="5.1.004" expanded="true" height="130" name="Validation" width="90" x="313" y="30">
            <process expanded="true" height="517" width="378">
              <operator activated="true" class="neural_net" compatibility="5.1.004" expanded="true" height="76" name="Neural Net" width="90" x="127" y="66">
                <list key="hidden_layers"/>
              </operator>
              <connect from_port="training" to_op="Neural Net" to_port="training set"/>
              <connect from_op="Neural Net" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true" height="517" width="378">
              <operator activated="true" class="apply_model" compatibility="5.1.004" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance" compatibility="5.1.004" expanded="true" height="76" name="Performance" width="90" x="80" y="145"/>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
              <portSpacing port="sink_averagable 3" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="text:process_document_from_file" compatibility="5.1.001" expanded="true" height="76" name="Process Documents from Files (2)" width="90" x="112" y="255">
            <list key="text_directories">
              <parameter key="graphics" value="C:\Documents and Settings\sunny\Desktop\20_newsgroups\comp.graphics"/>
            </list>
            <process expanded="true" height="517" width="806">
              <operator activated="true" class="text:tokenize" compatibility="5.1.001" expanded="true" height="60" name="Tokenize (2)" width="90" x="112" y="30"/>
              <operator activated="true" class="text:transform_cases" compatibility="5.1.001" expanded="true" height="60" name="Transform Cases (2)" width="90" x="112" y="120"/>
              <operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.001" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="112" y="210"/>
              <operator activated="true" class="text:filter_by_length" compatibility="5.1.001" expanded="true" height="60" name="Filter Tokens (2)" width="90" x="112" y="300"/>
              <operator activated="true" class="text:stem_snowball" compatibility="5.1.001" expanded="true" height="60" name="Stem (2)" width="90" x="384" y="30"/>
              <connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
              <connect from_op="Tokenize (2)" from_port="document" to_op="Transform Cases (2)" to_port="document"/>
              <connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
              <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
              <connect from_op="Filter Tokens (2)" from_port="document" to_op="Stem (2)" to_port="document"/>
              <connect from_op="Stem (2)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="apply_model" compatibility="5.1.004" expanded="true" height="76" name="Apply Model (2)" width="90" x="282" y="261">
            <list key="application_parameters"/>
          </operator>
          <connect from_op="Process Documents from Files" from_port="example set" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="model" to_op="Apply Model (2)" to_port="model"/>
          <connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
          <connect from_op="Validation" from_port="averagable 2" to_port="result 2"/>
          <connect from_op="Process Documents from Files (2)" from_port="example set" to_op="Apply Model (2)" to_port="unlabelled data"/>
          <connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I would recommend using a linear SVM instead of the Neural Net; that will probably already solve your problem. Then take a look at the System Monitor in the Result Perspective: what is the maximum amount of memory that RapidMiner is allowed to use? Is it anything useful, like 1.2 GB?

    Your setup also contains two major design faults:
    1. Process Documents is not inside the XValidation. This leads to unrealistically high performance estimates, because the mere presence of an attribute is already information that the learner can use, even if the word does not occur in the training set.
    2. During the apply step, the Process Documents operator must use exactly the same word list that was used during training. Otherwise the attribute sets will differ, and even where a word does occur, its scale will be completely different. So connect the word list output port of the training operator to the word list input port of the second one.
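
    To make the second point concrete, here is a minimal sketch in plain Python rather than RapidMiner. The helper names `build_wordlist` and `vectorize` are made up for this example, and it uses raw term counts as a simplification, whatever vector creation scheme the operator is actually set to. It only shows why the word list must come from the training documents and be reused unchanged on new documents.

    ```python
    from collections import Counter

    def build_wordlist(docs):
        """Collect the sorted vocabulary of the training documents only."""
        return sorted({word for doc in docs for word in doc.split()})

    def vectorize(doc, wordlist):
        """Count occurrences for each word list entry; unseen words are ignored."""
        counts = Counter(doc.split())
        return [counts[word] for word in wordlist]

    # Toy data standing in for already tokenized documents.
    train_docs = ["pixel shader render", "circuit voltage resistor"]
    new_doc = "render pixel pixel engine"

    # The word list is built once, from the training data.
    wordlist = build_wordlist(train_docs)
    train_vectors = [vectorize(doc, wordlist) for doc in train_docs]

    # The same word list is reused at apply time, so the attribute set is identical.
    new_vector = vectorize(new_doc, wordlist)

    print(wordlist)
    print(new_vector)
    ```

    The word "engine" never appeared during training, so it gets no attribute, and the new vector lines up column by column with the training vectors.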

    Greetings,
    Sebastian

  • sunnyfunghy Member Posts: 19 Contributor II
    Thank you very much.

    I think I understand, and the simulation runs well now. If you have time, would you mind sharing the models that you suggest? You could post the XML code here, because I am still quite confused about the two major design faults that you mentioned. Thank you very much.


    Sunny
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    See the sample processes in the Sample repository.

    Greetings,
    Sebastian