
"Any Text Processing 5 extension examples?"

thomas0221 Member Posts: 4 Contributor I
edited June 2019 in Help
Dear RapidMiner Experts,

I was able to get RapidMiner 4.6 and Text Plugin 4.6 to work with the help of "rapidminer-text-4.6-tutorial.pdf", "rapidminer-text-4.6-examples.zip", and other online resources, including the discussions in this user forum. However, when I try basic text mining tasks (such as ones based on the ideas in "rapidminer-text-4.6-examples.zip") in RapidMiner 5 with the Text Processing 5 extension, I have no luck. It seems that some members of this forum have figured out how to use the Text Processing 5 extension in RapidMiner 5 for basic tasks that we can accomplish in V4.6, so I wonder whether some experts could share working examples of text mining process XML files for RapidMiner 5. I understand that the RapidMiner 5 product team has limited resources and time, and thus has not yet had a chance to provide a complete tutorial and examples for the Text Processing 5 extension (for the same reason V4.6 has a Web Crawler but V5 does not yet). I wish some community members could help out by sharing sample XML files of text mining processes. I would greatly appreciate the help. Documentation, tutorials, and examples are the single defining factor in whether one can get the software working.

By the way, I have been using RapidMiner for only about 10 days and I am impressed with its rich features. In RapidMiner 5 I like the new flow design (compared to V4.6's tree process), the meta data shown on the design page, and the quick fix suggestions. However, I find that a process designed in RapidMiner 4.6 cannot be imported into RapidMiner 5, and RapidMiner 5's process XML file cannot be opened in V4.6. I understand the significant changes from V4.6 to V5: many operators were renamed and reorganized to be more logical. I guess one workaround for getting a V4.6 process XML to work in V5 is simply to redesign the process from scratch in V5.

Thanks,
Thomas

Answers

  • jennylynnoh Member Posts: 1 Contributor I
    I'm in the same boat.  I've been wrangling version 5 for a while now, and the farthest I've gotten is being able to set up the processes.  However, when I review the results, it says that every row has 0 tokens.  I must be missing something, but I have no idea what.  If I do manage to figure it out, I'll be sure to post a tutorial with screenshots on my blog.

    -Jen
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi all,
    in general, RapidMiner 4.x process files import very well into RapidMiner 5.0. We made a huge effort in writing an import mechanism, although the process structure has changed completely and several operators have been redesigned to make their parameter settings more user-friendly and understandable. We even wrote import rules for the old plugins, and we would have done the same for the Text Plugin. Unfortunately, we found the old version much too limited and hard to maintain, and it didn't fit well into the RapidMiner architecture of IO objects, because it tended to write everything into temporary files. So we decided to redesign it from scratch, keep the best ideas (and there were many), and combine them with an up-to-date way of handling data objects. The result is a more flexible, more powerful, and much faster (!) extension that, unfortunately, changed so much that old processes could not be adapted automatically. So only for processes containing operators of the former Text Plugin do you need to redesign your processes.

    Here I will give you a basic example of how to work with the Text Processing extension. The process below loads data that contains two attributes of type text. They are chosen for vector creation via the specify weights parameter of the Process Documents operator.

    Inside the Process Documents operator, all letters are first changed to lower case, then the texts are split into single tokens, and finally the tokens are stemmed. Each token of a document that reaches the end of the Process Documents subprocess becomes part of the word list, and hence a single attribute in the resulting word vector forming the example set.
    During this transformation, meta data might be attached to the documents. If you set a breakpoint inside the Process Documents operator, you will see all meta data to the right of the text. This meta data is added as additional attributes to the resulting ExampleSet if the add_meta_information parameter of the Process Documents operator is checked.
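    Conceptually, the lower-casing, tokenizing, and word-vector steps can be sketched in plain Python. This is only an illustrative sketch, not the RapidMiner implementation; the `stem` function here is a hypothetical toy suffix-stripper standing in for a real stemmer such as the German or Porter stemmers:

```python
from collections import Counter

def stem(token):
    # Toy suffix-stripper standing in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def process_document(text):
    # Mirrors the inner operator chain: Transform Cases -> Tokenize -> Stem.
    return [stem(t) for t in text.lower().split()]

def word_vectors(documents):
    # Every surviving token becomes one entry in the word list, and hence
    # one attribute (column) in each document's term-frequency vector.
    token_lists = [process_document(d) for d in documents]
    word_list = sorted({t for ts in token_lists for t in ts})
    vectors = [[Counter(ts)[w] for w in word_list] for ts in token_lists]
    return word_list, vectors

word_list, vectors = word_vectors(["Fast hotel fast city", "Breakfast missing fish"])
```

    Running this on two toy documents yields one row per document and one column per stemmed token, which is the example set shape that Process Documents produces (here with plain term frequencies; other vector creation schemes such as TF-IDF weight the counts differently).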

    Here's the process:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="586" width="683">
          <operator activated="true" class="retrieve" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
            <parameter key="repository_entry" value="D01 - ProcessedHotelCustomerSatisfaction_de"/>
          </operator>
          <operator activated="true" class="sample" expanded="true" height="76" name="Sample" width="90" x="179" y="30">
            <parameter key="sample_size" value="1000"/>
          </operator>
          <operator activated="true" class="nominal_to_binominal" expanded="true" height="94" name="Nominal to Binominal" width="90" x="313" y="30">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="customer_type|customer_group|customer_age"/>
          </operator>
          <operator activated="true" class="nominal_to_numerical" expanded="true" height="94" name="Nominal to Numerical" width="90" x="514" y="30">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="reasons_positive|reasons_negative"/>
            <parameter key="invert_selection" value="true"/>
          </operator>
          <operator activated="true" class="group_models" expanded="true" height="94" name="Group Models" width="90" x="514" y="165"/>
          <operator activated="true" class="store" expanded="true" height="60" name="Store" width="90" x="648" y="165">
            <parameter key="repository_entry" value="D02 - PreprocessingModels"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" expanded="true" height="76" name="Process Documents from Data" width="90" x="45" y="255">
            <parameter key="keep_text" value="true"/>
            <parameter key="prunde_below_percent" value="1.0"/>
            <parameter key="prune_above_percent" value="40.0"/>
            <parameter key="select_attributes_and_weights" value="true"/>
            <list key="specify_weights">
              <parameter key="reasons_negative" value="1.0"/>
              <parameter key="reasons_positive" value="1.0"/>
            </list>
            <process expanded="true" height="586" width="683">
              <operator activated="true" class="text:transform_cases" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="30"/>
              <operator activated="true" class="text:tokenize" expanded="true" height="60" name="Tokenize" width="90" x="179" y="30"/>
              <operator activated="true" class="text:stem_german" expanded="true" height="60" name="Stem (German)" width="90" x="313" y="30"/>
              <connect from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Stem (German)" to_port="document"/>
              <connect from_op="Stem (German)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="store" expanded="true" height="60" name="Store (3)" width="90" x="45" y="390">
            <parameter key="repository_entry" value="D02 - WordVectorData"/>
          </operator>
          <operator activated="true" class="x_validation" expanded="true" height="112" name="Validation" width="90" x="313" y="300">
            <process expanded="true">
              <operator activated="true" class="support_vector_machine" expanded="true" height="112" name="SVM" width="90" x="160" y="30">
                <parameter key="C" value="6.309573444801933E-4"/>
              </operator>
              <connect from_port="training" to_op="SVM" to_port="training set"/>
              <connect from_op="SVM" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_regression" expanded="true" height="76" name="Performance" width="90" x="227" y="30">
                <parameter key="root_mean_squared_error" value="false"/>
                <parameter key="absolute_error" value="true"/>
                <parameter key="squared_error" value="true"/>
                <parameter key="correlation" value="true"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="apply_model" expanded="true" height="76" name="apply on trainSet" width="90" x="514" y="300">
            <list key="application_parameters"/>
          </operator>
          <connect from_op="Retrieve" from_port="output" to_op="Sample" to_port="example set input"/>
          <connect from_op="Sample" from_port="example set output" to_op="Nominal to Binominal" to_port="example set input"/>
          <connect from_op="Nominal to Binominal" from_port="example set output" to_op="Nominal to Numerical" to_port="example set input"/>
          <connect from_op="Nominal to Binominal" from_port="preprocessing model" to_op="Group Models" to_port="models in 1"/>
          <connect from_op="Nominal to Numerical" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Nominal to Numerical" from_port="preprocessing model" to_op="Group Models" to_port="models in 2"/>
          <connect from_op="Group Models" from_port="model out" to_op="Store" to_port="input"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_op="Store (3)" to_port="input"/>
          <connect from_op="Process Documents from Data" from_port="word list" to_port="result 3"/>
          <connect from_op="Store (3)" from_port="through" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="model" to_op="apply on trainSet" to_port="model"/>
          <connect from_op="Validation" from_port="training" to_op="apply on trainSet" to_port="unlabelled data"/>
          <connect from_op="Validation" from_port="averagable 1" to_port="result 1"/>
          <connect from_op="apply on trainSet" from_port="labelled data" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>

    And here's what the data looks like:
    role     attribute          type
    label    score              numeric
    regular  reasons_negative   text
    regular  reasons_positive   text
    regular  customer_age       polynominal
    regular  customer_type      polynominal
    regular  customer_group     polynominal

    and here's a small snippet from the data:
    1 7.3 Hoher Preis für Internetnutzung. Schnelles Hotel - schnell in der City. 41-50 Jahre geschäftlich allein reisend
    2 8.7 Bei dem Preis für´s Frühstück fehlt uns ein wenig der Fisch aber es geht auch mal ohne. Auch nach unserem 3. Besuch in diesem Hotel. Alles in Ordnung, besonders das Personal, immer freundlich, immer hilfsbereit, kurz gesagt immer gut drauf. 51-60 Jahre geschäftlich als Paar reisend


    I hope this helps you get your processes running again. After that, you will discover the new possibilities bit by bit. Anyway, we will add a basic tutorial as soon as possible.

    Greetings,
    Sebastian
  • thomas0221 Member Posts: 4 Contributor I
    Hi Sebastian,

    Thank you so much for your example text processing XML code. Based on your example, I finally figured out how to use the Text Processing extension. What struck me (and maybe other newbies) is that in the RapidMiner 5 design workspace there are parent processes and child subprocesses. I need to navigate from a parent operator (such as Process Documents from Data or cross validation) into its child subprocess by double-clicking the parent. Then, on the child subprocess page, I can add Tokenize, stopword filter, stem, and so on. I should not add these operators at the parent level. Maybe this is the reason I did not get the RapidMiner 5 Text Processing extension to work in the first place: I put Process Documents from Data, Tokenize, stopword filter, stem, etc. all at the same level and tried to connect them. Anyway, this is only my partial understanding and I might be wrong. In the RapidMiner 4.6 Text Plugin's tree design mode, everything appears on the same page; moving to RapidMiner 5, I had to understand the parent-child subprocess relationship. Just in case other users want to see a simple example, I attach my text mining process XML file below. You can change the text directories to your local ones; I use the example data that comes with wvtool-1.1.

    Thomas
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <parameter key="logverbosity" value="3"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="1"/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="parallelize_main_process" value="false"/>
        <process expanded="true" height="622" width="300">
          <operator activated="true" class="text:process_document_from_file" expanded="true" height="76" name="Process Documents from Files" width="90" x="45" y="30">
            <list key="text_directories">
              <parameter key="atheism" value="C:\temp\temp\tests\wvtool-1.1\wvtool-1.1\examples\data\alt.atheism"/>
              <parameter key="christian" value="C:\temp\temp\tests\wvtool-1.1\wvtool-1.1\examples\data\soc.religion.christian"/>
            </list>
            <parameter key="file_pattern" value="*"/>
            <parameter key="extract_text_only" value="true"/>
            <parameter key="use_file_extension_as_type" value="true"/>
            <parameter key="content_type" value="0"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="create_word_vector" value="true"/>
            <parameter key="vector_creation" value="0"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="keep_text" value="false"/>
            <parameter key="prune_method" value="0"/>
            <parameter key="prunde_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_rank" value="5.0"/>
            <parameter key="prune_above_rank" value="5.0"/>
            <parameter key="datamanagement" value="7"/>
            <parameter key="parallelize_vector_creation" value="false"/>
            <process expanded="true" height="622" width="570">
              <operator activated="true" class="text:tokenize" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30">
                <parameter key="mode" value="0"/>
                <parameter key="characters" value=".:"/>
              </operator>
              <operator activated="true" class="text:transform_cases" expanded="true" height="60" name="Transform Cases" width="90" x="180" y="30">
                <parameter key="transform_to" value="0"/>
              </operator>
              <operator activated="true" class="text:filter_stopwords_english" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="315" y="30"/>
              <operator activated="true" class="text:stem_porter" expanded="true" height="60" name="Stem (Porter)" width="90" x="450" y="30"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
              <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
              <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="36"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="x_validation" expanded="true" height="112" name="Validation" width="90" x="180" y="30">
            <description>A cross-validation evaluating a decision tree model.</description>
            <parameter key="create_complete_model" value="false"/>
            <parameter key="average_performances_only" value="true"/>
            <parameter key="leave_one_out" value="false"/>
            <parameter key="number_of_validations" value="10"/>
            <parameter key="sampling_type" value="2"/>
            <parameter key="use_local_random_seed" value="false"/>
            <parameter key="local_random_seed" value="1992"/>
            <parameter key="parallelize_training" value="false"/>
            <parameter key="parallelize_testing" value="false"/>
            <process expanded="true" height="654" width="466">
              <operator activated="true" class="decision_tree" expanded="true" height="76" name="Decision Tree" width="90" x="45" y="30">
                <parameter key="criterion" value="gain_ratio"/>
                <parameter key="minimal_size_for_split" value="4"/>
                <parameter key="minimal_leaf_size" value="2"/>
                <parameter key="minimal_gain" value="0.1"/>
                <parameter key="maximal_depth" value="20"/>
                <parameter key="confidence" value="0.25"/>
                <parameter key="number_of_prepruning_alternatives" value="3"/>
                <parameter key="no_pre_pruning" value="false"/>
                <parameter key="no_pruning" value="false"/>
              </operator>
              <connect from_port="training" to_op="Decision Tree" to_port="training set"/>
              <connect from_op="Decision Tree" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true" height="654" width="466">
              <operator activated="true" class="apply_model" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                <list key="application_parameters"/>
                <parameter key="create_view" value="false"/>
              </operator>
              <operator activated="true" class="performance" expanded="true" height="76" name="Performance" width="90" x="179" y="30">
                <parameter key="use_example_weights" value="true"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Process Documents from Files" from_port="example set" to_op="Validation" to_port="training"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Thomas,
    if you are used to the tree, you can add the Tree as a view in RapidMiner 5, too. It will give you an overview of what your process is about. There have been only slight changes, because subprocesses are now modeled explicitly instead of the implicit design in RapidMiner 4.x.

    Greetings,
      Sebastian
  • thomas0221 Member Posts: 4 Contributor I
    Hi Sebastian,

    Thank you for your help. I did find the "Tree View" in RapidMiner V5, under "View" -> "Show View". So I can use the Tree View in RapidMiner V5.

    In RapidMiner V5, I see a new feature for searching operators by name. I can type in part of the name of an operator that I vaguely remember, and the software will find the relevant ones for me. However, in RapidMiner 4.6 I do not see such an operator search filter. Is there any way to search operators in RapidMiner V4.6?

    Moreover, RapidMiner V4.6 has a Box View from which I can export the process design to a JPEG file. In RapidMiner V5, I cannot find such a Box View. So does RapidMiner V5 only support the Flow View and Tree View? No Box View anymore?

    Thanks!

    Thomas
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    the search box in RapidMiner 4.6 is below the operator tree, but it is there. Otherwise you could use the new operator dialog, where you can filter and search by various properties.

    The Box View is gone now, because the data flow is now modeled explicitly rather than implicitly, so the process is no longer fully defined by the execution order of the operators alone.

    Greetings,
      Sebastian
  • Jepse Member Posts: 11 Contributor II
    @Sebastian:
    Can you provide a snippet of 100 rows (or more) of the file "D01 - ProcessedHotelCustomerSatisfaction_de"? I couldn't find it in the sample repository.
    Do you plan to provide samples for the new text processing extension?
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    we had already planned to deliver it with the first version... I will see what we can do.

    Greetings,
      Sebastian
  • Jepse Member Posts: 11 Contributor II
    Oh, nice! Can't wait for it :-)