"Exception: java.lang.ArrayIndexOutOfBoundsException"

bistoon_mf Member Posts: 3 Contributor I
edited June 2019 in Help
Hello there,

I am using RapidMiner (RM) for a simple text clustering task. I load my sentences from Excel and want to cluster them with the k-Means clustering operator. I am running into a strange situation: when I choose EuclideanDistance as the distance measure, the process works and produces a result. However, when I choose CorrelationSimilarity as the measure, it gives me an error. RM itself says that the current setup doesn't seem to have a problem, and when I check the log the error is: SEVERE: java.lang.ArrayIndexOutOfBoundsException.

Does anybody have any idea about the source of the error?

Answers

  • Nils_Woehler Member Posts: 463 Maven
    Hi,

    This seems to be a bug. Could you please post your process setup here so we can file a bug report?

    Best,
    Nils
  • bistoon_mf Member Posts: 3 Contributor I
    Thank you for the reply, Nils. Sure, this is the process:


    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.007">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.007" expanded="true" name="Process">
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="5.3.007" expanded="true" height="60" name="Read Excel" width="90" x="112" y="75">
            <parameter key="excel_file" value="/Users/mfarhadloo/Documents/engapps/Documents/SentimentAnalysis/Codes/Data/P5/Nouns/P5-BON.xlsx"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="5.3.007" expanded="true" height="76" name="Nominal to Text" width="90" x="246" y="75"/>
          <operator activated="true" class="text:data_to_documents" compatibility="5.3.000" expanded="true" height="60" name="Data to Documents" width="90" x="380" y="75">
            <list key="specify_weights"/>
          </operator>
          <operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="94" name="Process Documents" width="90" x="380" y="210">
            <parameter key="add_meta_information" value="false"/>
            <parameter key="keep_text" value="true"/>
            <parameter key="prune_method" value="percentual"/>
            <parameter key="prunde_below_percent" value="1.0"/>
            <parameter key="prune_above_percent" value="100.0"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="112" y="30"/>
              <operator activated="true" class="text:transform_cases" compatibility="5.3.000" expanded="true" height="60" name="Transform Cases" width="90" x="112" y="165"/>
              <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="5.3.000" expanded="true" height="76" name="Filter Stopwords (Dictionary)" width="90" x="112" y="300">
                <parameter key="file" value="/Users/mfarhadloo/Documents/engapps/Documents/SentimentAnalysis/Codes/english-stop copy.txt"/>
                <parameter key="encoding" value="SYSTEM"/>
              </operator>
              <operator activated="true" class="text:filter_by_length" compatibility="5.3.000" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="112" y="435">
                <parameter key="min_chars" value="2"/>
                <parameter key="max_chars" value="999"/>
              </operator>
              <operator activated="true" class="text:stem_porter" compatibility="5.3.000" expanded="true" height="60" name="Stem (Porter)" width="90" x="112" y="570"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
              <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
              <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
              <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
              <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="multiply" compatibility="5.3.007" expanded="true" height="76" name="Multiply" width="90" x="514" y="210"/>
          <operator activated="true" class="k_means" compatibility="5.3.007" expanded="true" height="76" name="Clustering" width="90" x="715" y="120">
            <parameter key="k" value="20"/>
            <parameter key="max_runs" value="100"/>
            <parameter key="determine_good_start_values" value="true"/>
            <parameter key="measure_types" value="NumericalMeasures"/>
            <parameter key="numerical_measure" value="CorrelationSimilarity"/>
            <parameter key="kernel_gamma" value="0.5"/>
          </operator>
          <operator activated="true" class="cluster_distance_performance" compatibility="5.3.007" expanded="true" height="94" name="Distance" width="90" x="916" y="120"/>
          <connect from_op="Read Excel" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="example set" to_op="Multiply" to_port="input"/>
          <connect from_op="Process Documents" from_port="word list" to_port="result 1"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_op="Distance" to_port="cluster model"/>
          <connect from_op="Clustering" from_port="clustered set" to_op="Distance" to_port="example set"/>
          <connect from_op="Distance" from_port="performance" to_port="result 2"/>
          <connect from_op="Distance" from_port="example set" to_port="result 3"/>
          <connect from_op="Distance" from_port="cluster model" to_port="result 4"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
          <portSpacing port="sink_result 5" spacing="0"/>
        </process>
      </operator>
    </process>
  • Skirzynski Member Posts: 164 Maven
    I can run this process with CorrelationSimilarity as the measure without any problem. Of course, I used my own dummy data. Could you provide a short snippet of your data for which this error occurs?
  • bistoon_mf Member Posts: 3 Contributor I
    My data consists of around 700 sentences. I noticed that after preprocessing and representing each sentence as a word vector, some of my sentences are represented by a zero vector (they don't contain any of the words in my word list)! Could this be the reason for the error I am encountering?
  • Skirzynski Member Posts: 164 Maven
    I couldn't reproduce an ArrayIndexOutOfBoundsException, but you will indeed face another problem with such an example and CorrelationSimilarity, since the correlation with a zero vector (or any other constant vector) is not defined (because its standard deviation is 0).

    We are currently evaluating whether we should allow CorrelationSimilarity for k-Means at all, because of this undefined input. As far as I know, it is not even clear whether k-Means converges with this measure. At least for my small, simple data set with the option "determine good start values" enabled, the process does not stop running.

    Nevertheless, I will get back to you after we have clarified this. In the meantime, could you post your exception so that I can at least see where it is thrown? Otherwise I cannot help, since I cannot reproduce the error.

    Besides that: are you sure you want to use CorrelationSimilarity? Typically CosineSimilarity is used in text mining, but it is often mixed up with CorrelationSimilarity because of the quite similar names. :)
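
The zero-vector point raised in the thread can be illustrated with a small sketch in plain Python. This is not RapidMiner's implementation: the functions `pearson_correlation`, `cosine_similarity`, and the sample vectors are hypothetical, written from the textbook formulas only. It shows that the correlation has no value as soon as one vector is constant (including all zeros), that cosine similarity tolerates a non-zero constant vector but is also undefined for an all-zero one, and that empty rows can be dropped before clustering.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation of two equal-length vectors, or None if undefined."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant vector has standard deviation 0, so the ratio would be 0/0.
        return None
    return cov / math.sqrt(var_x * var_y)


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors, or None if either has length 0."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return None
    return dot / (norm_x * norm_y)


# Hypothetical word vectors: two sentences sharing terms, one with no listed terms.
doc_a = [1.0, 0.0, 2.0, 0.0]
doc_b = [2.0, 0.0, 4.0, 0.0]
empty = [0.0, 0.0, 0.0, 0.0]

print(pearson_correlation(doc_a, doc_b))  # defined
print(pearson_correlation(doc_a, empty))  # None: zero vector is constant
print(cosine_similarity(doc_a, empty))    # None: zero vector has no direction

# One possible workaround (an assumption, not advice from the thread):
# drop all-zero rows before clustering.
rows = [doc_a, doc_b, empty]
non_empty_rows = [row for row in rows if any(value != 0 for value in row)]
print(len(non_empty_rows))
```

Whether an undefined similarity is what triggers the ArrayIndexOutOfBoundsException in RapidMiner is not established in the thread; the sketch only shows why such rows are problematic input for either measure.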