"SVM - Text Mining"

nico · February 2014

Dear All,

I'm working on a small proof-of-concept project on sentiment analysis using a linear SVM. I'm running into performance issues: once the training dataset exceeds a few thousand rows, the process (specifically the SVM training) takes forever to run, and memory usage is very high too, at more than 8 GB.
Please refer to the XML below for more details.
I'm not an expert in this field, but my understanding is that an SVM's ability to learn can be independent of the dimensionality of the feature space, and that training time scales quadratically with the number of records.
I would therefore have hoped that ~20k example rows wouldn't be a problem.
I'm running RapidMiner on a decent machine (6 cores, 12 threads, 32 GB of RAM available) and have used parallel processing where available/applicable.

Many thanks
Nico

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.002">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <parameter key="parallelize_main_process" value="true"/>
    <process expanded="true">
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="30">
        <parameter key="prune_above_percent" value="95.0"/>
        <parameter key="prune_below_absolute" value="20"/>
        <parameter key="prune_above_absolute" value="200"/>
        <parameter key="prune_below_rank" value="0.25"/>
        <parameter key="prune_above_rank" value="0.75"/>
        <list key="specify_weights"/>
        <parameter key="parallelize_vector_creation" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="30"/>
          <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (Porter)" width="90" x="313" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="648" y="30">
            <parameter key="max_chars" value="100"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Stem (Porter)" to_port="document"/>
          <connect from_op="Stem (Porter)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="multiply" compatibility="6.0.002" expanded="true" height="94" name="Multiply" width="90" x="380" y="165"/>
      <operator activated="true" class="set_role" compatibility="5.3.013" expanded="true" height="76" name="Set Role" width="90" x="447" y="30">
        <parameter key="attribute_name" value="Sentiment"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles">
          <parameter key="ID" value="id"/>
        </list>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="6.0.002" expanded="true" height="76" name="Select Attributes" width="90" x="581" y="30">
        <parameter key="attribute_filter_type" value="no_missing_values"/>
        <parameter key="attribute" value="text"/>
      </operator>
      <operator activated="true" class="optimize_parameters_grid" compatibility="6.0.002" expanded="true" height="112" name="Optimize Parameters (Grid)" width="90" x="715" y="30">
        <list key="parameters">
          <parameter key="SVM (Linear).C" value="[-1.0;10;10;linear]"/>
        </list>
        <parameter key="parallelize_optimization_process" value="true"/>
        <process expanded="true">
          <operator activated="true" class="x_validation" compatibility="6.0.002" expanded="true" height="112" name="Validation" width="90" x="45" y="30">
            <parameter key="parallelize_training" value="true"/>
            <parameter key="parallelize_testing" value="true"/>
            <process expanded="true">
              <operator activated="true" class="nominal_to_binominal" compatibility="6.0.002" expanded="true" height="94" name="Nominal to Binominal" width="90" x="45" y="120">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="Sentiment"/>
                <parameter key="include_special_attributes" value="true"/>
              </operator>
              <operator activated="true" class="support_vector_machine_linear" compatibility="6.0.002" expanded="true" height="76" name="SVM (Linear)" width="90" x="246" y="75">
                <parameter key="kernel_cache" value="1000"/>
                <parameter key="C" value="10.0"/>
              </operator>
              <connect from_port="training" to_op="Nominal to Binominal" to_port="example set input"/>
              <connect from_op="Nominal to Binominal" from_port="example set output" to_op="SVM (Linear)" to_port="training set"/>
              <connect from_op="SVM (Linear)" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="6.0.002" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance" compatibility="6.0.002" expanded="true" height="76" name="Performance" width="90" x="179" y="210"/>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="log" compatibility="6.0.002" expanded="true" height="76" name="Log" width="90" x="246" y="75">
            <list key="log">
              <parameter key="C" value="operator.SVM (Linear).parameter.C"/>
              <parameter key="Performance" value="operator.Performance.value.performance"/>
            </list>
          </operator>
          <connect from_port="input 1" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="averagable 1" to_op="Log" to_port="through 1"/>
          <connect from_op="Log" from_port="through 1" to_port="performance"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_port="result 4"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 2"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>

MariusHelf · February 2014

Hi Nico,

The SVM's runtime is cubic rather than quadratic in the number of examples. You are right that it is a good learning scheme for high-dimensional data; however, its runtime increases linearly with the number of features: if you double the number of attributes, the runtime doubles, too.
This means that for high-dimensional data with many examples the SVM will need some time. But the runtime is also influenced by the parameter C. In general, the larger the C, the higher the runtime.
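To make the scaling rule concrete, here is a minimal Python sketch. It takes the cubic-in-examples, linear-in-features rule of thumb at face value; the function name `relative_runtime` and the 5,000-row baseline are hypothetical choices for illustration, and real runtimes also depend on C and on the data itself.

```python
def relative_runtime(n_examples, n_features, base_examples, base_features):
    """Rough runtime ratio versus a baseline run, assuming runtime grows
    cubically with the number of examples and linearly with the features."""
    return (n_examples / base_examples) ** 3 * (n_features / base_features)

# Hypothetical baseline of 5,000 rows: going to 20,000 rows (4x the examples)
# with the same 25,000 features.
print(relative_runtime(20_000, 25_000, 5_000, 25_000))

# Same 20,000 rows, but twice as many features.
print(relative_runtime(20_000, 50_000, 20_000, 25_000))
```

Under these assumptions, quadrupling the rows costs roughly 64 times the runtime, while doubling the attributes only doubles it.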
You should optimize C from 0.000001 to 1 in the first step on a *logarithmic* (not linear) scale. You can set that option in the Optimize Parameters dialog. Negative numbers for C are not allowed.
You can also add a Log operator after the X-Validation and log the C value of the SVM and the performance value of the X-Validation (not of the Performance operator, which provides only the accuracy of the last fold of the cross validation). If you see that the accuracy keeps increasing towards higher values of C, you should extend the range upwards. Otherwise, by going only up to C=1 you skip the expensive evaluations of higher C values.
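The difference between the two grid scales can be sketched in a few lines of Python. The helper names `linear_grid` and `log_grid` are made up for this illustration, and the sketch assumes that a grid with n steps yields n + 1 candidate values, which may not match RapidMiner's behavior exactly.

```python
import math

def linear_grid(low, high, steps):
    """Return steps + 1 evenly spaced values from low to high."""
    return [low + (high - low) * i / steps for i in range(steps + 1)]

def log_grid(low, high, steps):
    """Return steps + 1 values evenly spaced on a log10 scale (low must be > 0)."""
    lo, hi = math.log10(low), math.log10(high)
    return [10 ** (lo + (hi - lo) * i / steps) for i in range(steps + 1)]

# Grid from the original process: starts at a negative (invalid) C
# and puts almost every candidate above 1.
print([f"{c:g}" for c in linear_grid(-1.0, 10.0, 10)])

# Suggested grid: one candidate per order of magnitude between 1e-6 and 1.
print([f"{c:g}" for c in log_grid(1e-6, 1.0, 6)])
```

The logarithmic grid spends its candidates on the small, cheap values of C, which is where the first optimization pass is meant to look.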

The SVM is an algorithm that can't be parallelized due to its algorithmic properties. That means that only one thread can be used to calculate a single SVM. However, you can install the Parallel Execution extension from the Marketplace. It includes an X-Validation (parallel) operator that performs its folds in parallel. With your 12 threads you can easily calculate all 10 folds of an X-Validation in one go. Please note that you should not cascade different XXX (parallel) operators, e.g. Optimize Parameters (parallel) and X-Validation (parallel). That can lead to undefined behavior.

The "parallelize testing/training" parameters don't parallelize the algorithms themselves, but rather execute parallel branches in the process layout in parallel (e.g. if you have two process branches after a Multiply operator).

For further speed-up it is usually sufficient to use only a 5-fold cross validation for the optimization - for a rough estimate it's enough, and you can still confirm the results after the optimization with a standard 10-fold cross validation. Also the number of optimization steps could be reduced and refined later on.
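As a rough count of how much work the smaller setup saves, the sketch below multiplies grid points by folds. It assumes that n grid steps yield n + 1 values of C and that the original X-Validation ran with 10 folds; the function name `fold_trainings` is hypothetical, and any extra model trained on the full data after the folds is ignored.

```python
def fold_trainings(grid_points, folds):
    """Number of SVM trainings for a grid search wrapped around a k-fold cross validation."""
    return grid_points * folds

# Original process: 10 steps -> 11 values of C (assumed), 10 folds (assumed).
original = fold_trainings(grid_points=11, folds=10)

# Modified process: 6 steps -> 7 values of C (assumed), 5 folds.
reduced = fold_trainings(grid_points=7, folds=5)

print(original, reduced)
```

Under these assumptions the modified process trains 35 SVMs instead of 110, and each one sees 80% of the data instead of 90%.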

As a final remark please note that usually the steps on both sides of the X-Validation should be the same. Probably you can remove the Nominal to Binominal operator on the left side. The process should run fine despite the warning.

Please find the modified process below (does not use the X-Validation (parallel) yet). Since I don't have your data I can't test it, but you should be able to adapt it if necessary.

If you have any questions left, please come back to us. Additionally, for pre-sales support please contact one of our sales teams at http://rapidminer.com/about-us/contact-us/ .

Best regards,
Marius

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.002">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="6.0.002" expanded="true" name="Process">
    <parameter key="parallelize_main_process" value="true"/>
    <process expanded="true">
      <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="45" y="30">
        <parameter key="prune_above_percent" value="95.0"/>
        <parameter key="prune_below_absolute" value="20"/>
        <parameter key="prune_above_absolute" value="200"/>
        <parameter key="prune_below_rank" value="0.25"/>
        <parameter key="prune_above_rank" value="0.75"/>
        <list key="specify_weights"/>
        <parameter key="parallelize_vector_creation" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases" width="90" x="179" y="30"/>
          <operator activated="true" class="text:stem_porter" compatibility="5.3.002" expanded="true" height="60" name="Stem (Porter)" width="90" x="313" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="447" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="648" y="30">
            <parameter key="max_chars" value="100"/>
          </operator>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Stem (Porter)" to_port="document"/>
          <connect from_op="Stem (Porter)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="multiply" compatibility="6.0.002" expanded="true" height="94" name="Multiply" width="90" x="246" y="120"/>
      <operator activated="true" class="set_role" compatibility="5.3.013" expanded="true" height="76" name="Set Role" width="90" x="447" y="30">
        <parameter key="attribute_name" value="Sentiment"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles">
          <parameter key="ID" value="id"/>
        </list>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="6.0.002" expanded="true" height="76" name="Select Attributes" width="90" x="581" y="30">
        <parameter key="attribute_filter_type" value="no_missing_values"/>
        <parameter key="attribute" value="text"/>
      </operator>
      <operator activated="true" class="optimize_parameters_grid" compatibility="6.0.002" expanded="true" height="112" name="Optimize Parameters (Grid)" width="90" x="715" y="30">
        <list key="parameters">
          <parameter key="SVM (Linear).C" value="[1e-6;1;6;logarithmic]"/>
        </list>
        <parameter key="parallelize_optimization_process" value="true"/>
        <process expanded="true">
          <operator activated="true" class="x_validation" compatibility="6.0.002" expanded="true" height="112" name="Validation" width="90" x="45" y="30">
            <parameter key="number_of_validations" value="5"/>
            <parameter key="parallelize_training" value="true"/>
            <parameter key="parallelize_testing" value="true"/>
            <process expanded="true">
              <operator activated="true" class="nominal_to_binominal" compatibility="6.0.002" expanded="true" height="94" name="Nominal to Binominal" width="90" x="45" y="30">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="Sentiment"/>
                <parameter key="include_special_attributes" value="true"/>
              </operator>
              <operator activated="true" class="support_vector_machine_linear" compatibility="6.0.002" expanded="true" height="76" name="SVM (Linear)" width="90" x="179" y="30">
                <parameter key="kernel_cache" value="1000"/>
                <parameter key="C" value="10.0"/>
              </operator>
              <connect from_port="training" to_op="Nominal to Binominal" to_port="example set input"/>
              <connect from_op="Nominal to Binominal" from_port="example set output" to_op="SVM (Linear)" to_port="training set"/>
              <connect from_op="SVM (Linear)" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="6.0.002" expanded="true" height="76" name="Apply Model" width="90" x="45" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance" compatibility="6.0.002" expanded="true" height="76" name="Performance" width="90" x="179" y="30"/>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="log" compatibility="6.0.002" expanded="true" height="76" name="Log" width="90" x="246" y="75">
            <list key="log">
              <parameter key="C" value="operator.SVM (Linear).parameter.C"/>
              <parameter key="Performance" value="operator.Validation.value.performance"/>
            </list>
          </operator>
          <connect from_port="input 1" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="averagable 1" to_op="Log" to_port="through 1"/>
          <connect from_op="Log" from_port="through 1" to_port="performance"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_port="result 4"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 2"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="result 1" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="126"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>

nico · February 2014

Hi Marius,

Many thanks for the exhaustive reply.
I will give the updated process a go. I did use the X-Validation (parallel) before but that was in conjunction with other parallel operators and did not get very far so that might explain why.

Kind regards
Nico

nico · February 2014

Hi Marius,

I've run the edited process and am now hitting the memory allocation limit, as I am exceeding 8 GB.
The word vector I'm working with has 20k cases and about 25k individual words. Is this too much even for SVM and would you suggest some form of dimensionality reduction?
Ideally I would like to use a larger example set.

Thanks
Nico

MariusHelf · February 2014

Are you using parallelization? In that case the memory consumption is of course multiplied by the number of threads, and you should try with fewer concurrent threads.
Otherwise you could try to reduce the amount of examples as a first shot with the Sample (Stratified) operator.
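A back-of-the-envelope estimate shows why the thread count matters here. This is only a sketch: it assumes the 20k x 25k word vector is held as a dense table of 8-byte values and that each concurrent thread keeps its own copy. RapidMiner's actual storage may be more compact, and the function name `dense_matrix_gib` is made up for the example.

```python
def dense_matrix_gib(rows, cols, bytes_per_value=8):
    """Size in GiB of a dense rows x cols matrix of fixed-width values."""
    return rows * cols * bytes_per_value / 2**30

# 20k examples x 25k word attributes, as described above.
one_copy = dense_matrix_gib(20_000, 25_000)
print(f"one dense copy: {one_copy:.1f} GiB")

# Assumed worst case: every concurrent thread holds its own copy.
for threads in (1, 2, 4, 12):
    print(f"{threads:>2} concurrent copies: {one_copy * threads:.1f} GiB")
```

Under these assumptions a single copy already takes a few GiB, so even a handful of concurrent copies would exceed 8 GB.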

Best regards,
Marius
