NullPointerException

DarmeDarme Member Posts: 10 Contributor II
edited November 2018 in Help
Hi,

I am a newbie to RapidMiner. I am trying to use Expectation Maximization to cluster some data. I have around 500,000 rows of data in a .csv file. I am using the process "Read CSV" -> Normalize -> Replace Missing Values -> Clustering.
However, I always get a NullPointerException at clustering time :(
Am I doing something wrong here?

Thanks in advance
Darme

Answers

  • MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Do you get an error dialog that allows you to submit a bug report? If so, please use the corresponding button.
    If there is no such dialog, please post your process setup and give us a detailed description of your data (number and types of attributes, and any particularities).

    Best regards,
    Marius
  • DarmeDarme Member Posts: 10 Contributor II
    Hi Marius,

    Thank you for your prompt reply. The following is the error message I get:

    The setup does not seem to contain any obvious errors, but you should check the log messages or activate the debug mode in the settings dialog in order to get more information about this problem.

    The log contains the following:

              subprocess 'Main Process'
                +- Read CSV[1] (Read CSV)
                +- Normalize[1] (Normalize)
                +- Replace Missing Values[1] (Replace Missing Values)
          ==>  +- Clustering[1] (Expectation Maximization Clustering)
    Apr 23, 2013 4:49:13 PM SEVERE: java.lang.NullPointerException

    The data has 11 attributes of types text, number and date. In the Normalize operator I have set the value type to numeric.
    In the clustering operator I have set the initial distribution to "randomly assigned examples".
    In Replace Missing Values I have set the attribute filter type to all and the default to average.

    Do you need any more information? Please let me know.

    Thanks again
    Darme
  • MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    It seems that you also have missing values in your nominal and/or date attributes. You should remove or replace all missing values before applying Expectation Maximization Clustering.

    Best regards,
    Marius
  • DarmeDarme Member Posts: 10 Contributor II
    Hi again,

    I added two Replace Missing Values steps to the process. One has the attribute filter type "value_type" set to text, with the default set to value and the replenishment value set to "extra".

    The other has the value type "date" and a replenishment value of 23/4/2013.

    I still get the same error. Am I still on the wrong path? Please help.

    Thank you very much
    Darme
  • MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Can you please post your process setup as described in the post linked in my signature?

    Additionally, try to set a breakpoint before the clustering operator and inspect the metadata for missing values.

    Best regards,
    Marius
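
    A minimal sketch of what such a breakpoint might look like in the process XML. It assumes that a breakpoint set in the GUI is stored as a `breakpoints` attribute on the operator element; the attribute and its value are an assumption here, while the operator class, name and `k` parameter are taken from the process posted in this thread:

    ```xml
    <!-- Assumption: breakpoints="before" pauses execution just before this operator runs,
         so the incoming example set can be inspected for missing values. -->
    <operator activated="true" breakpoints="before" class="expectation_maximization_clustering" compatibility="5.1.009" expanded="true" name="Clustering">
      <parameter key="k" value="3"/>
    </operator>
    ```

    In practice it is probably simpler to set the breakpoint from the operator's context menu in the GUI than to edit the XML by hand.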
  • DarmeDarme Member Posts: 10 Contributor II
    Hi Marius,

    Once again, thank you for your advice.
    I have attached the code of the process I am using, and I believe all the required information is there.

    Since I have a very large data set, if a breakpoint is set for clustering then I think I would need to step through each row of data one by one.
    Is there a way to stop only when a value is missing, similar to setting conditions on breakpoints?

    Thanks and Regards
    Darrshan

    Code:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.009">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.009" expanded="true" name="Process">
        <process expanded="true" height="494" width="709">
          <operator activated="true" class="read_csv" compatibility="5.1.009" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
            <parameter key="csv_file" value="C:\Users\yahoo\Desktop\CSEtemp.csv"/>
            <parameter key="column_separators" value=","/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations">
              <parameter key="0" value="Name"/>
            </list>
            <parameter key="encoding" value="windows-1252"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="StockCode.true.text.attribute"/>
              <parameter key="1" value="SectorKey.true.text.attribute"/>
              <parameter key="2" value="TimeKey.true.date.attribute"/>
              <parameter key="3" value="OpenPrice.true.real.attribute"/>
              <parameter key="4" value="ClosePrice.true.real.attribute"/>
              <parameter key="5" value="NetChange.true.real.attribute"/>
              <parameter key="6" value="ChangePercentage.true.real.attribute"/>
              <parameter key="7" value="Highest.true.real.attribute"/>
              <parameter key="8" value="Lowest.true.real.attribute"/>
              <parameter key="9" value="Volume.true.integer.attribute"/>
              <parameter key="10" value="TotalValue.true.real.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="normalize" compatibility="5.1.009" expanded="true" height="94" name="Normalize" width="90" x="45" y="255">
            <parameter key="attribute_filter_type" value="value_type"/>
          </operator>
          <operator activated="true" class="replace_missing_values" compatibility="5.1.009" expanded="true" height="94" name="Replace Missing Values (3)" width="90" x="179" y="345">
            <list key="columns"/>
          </operator>
          <operator activated="true" class="replace_missing_values" compatibility="5.1.009" expanded="true" height="94" name="Replace Missing Values" width="90" x="313" y="345">
            <parameter key="attribute_filter_type" value="value_type"/>
            <parameter key="value_type" value="text"/>
            <parameter key="default" value="value"/>
            <list key="columns">
              <parameter key="SectorKey" value="value"/>
              <parameter key="StockCode" value="value"/>
              <parameter key="TimeKey" value="value"/>
            </list>
            <parameter key="replenishment_value" value="extra"/>
          </operator>
          <operator activated="true" class="replace_missing_values" compatibility="5.1.009" expanded="true" height="94" name="Replace Missing Values (2)" width="90" x="447" y="345">
            <parameter key="attribute_filter_type" value="value_type"/>
            <parameter key="value_type" value="date"/>
            <parameter key="default" value="value"/>
            <list key="columns"/>
            <parameter key="replenishment_value" value="23/4/2013"/>
          </operator>
          <operator activated="true" class="expectation_maximization_clustering" compatibility="5.1.009" expanded="true" height="76" name="Clustering" width="90" x="514" y="75">
            <parameter key="k" value="3"/>
            <parameter key="add_as_label" value="true"/>
            <parameter key="use_local_random_seed" value="true"/>
            <parameter key="inital_distribution" value="randomly assigned examples"/>
          </operator>
          <connect from_op="Read CSV" from_port="output" to_op="Normalize" to_port="example set input"/>
          <connect from_op="Normalize" from_port="example set output" to_op="Replace Missing Values (3)" to_port="example set input"/>
          <connect from_op="Replace Missing Values (3)" from_port="original" to_op="Replace Missing Values" to_port="example set input"/>
          <connect from_op="Replace Missing Values" from_port="example set output" to_op="Replace Missing Values (2)" to_port="example set input"/>
          <connect from_op="Replace Missing Values (2)" from_port="example set output" to_op="Clustering" to_port="example set"/>
          <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
          <connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
  • MariusHelfMariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    No, you don't need to check each row one by one: just switch to the metadata view in the results perspective, and for each attribute you'll see the number of missing values.

    Anyway, my suspicion is that in the second Replace Missing Values operator you should select value_type nominal, polynominal or binominal instead of text (text is a special data type used only in the Text Processing extension).
    Experiment with that setting, *and* check the result with a breakpoint.

    Best regards,
    Marius
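
    A minimal sketch of that suggestion, based on the second Replace Missing Values operator from the process posted above. Only the `value_type` parameter is changed from text to nominal; everything else, including the replenishment value "extra", is copied from the original post, and the layout attributes are omitted for brevity:

    ```xml
    <operator activated="true" class="replace_missing_values" compatibility="5.1.009" expanded="true" name="Replace Missing Values">
      <parameter key="attribute_filter_type" value="value_type"/>
      <!-- Changed from "text": select nominal attributes instead -->
      <parameter key="value_type" value="nominal"/>
      <parameter key="default" value="value"/>
      <list key="columns"/>
      <parameter key="replenishment_value" value="extra"/>
    </operator>
    ```

    For this filter to match anything, StockCode and SectorKey would presumably also need to be declared with a nominal type rather than text in the Read CSV metadata.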
  • DarmeDarme Member Posts: 10 Contributor II
    Hi,

    As you advised, I changed the settings of the Replace Missing Values operator and also changed the data types in the Read CSV operator accordingly.
    Still I am getting the same result :(

    I also set breakpoints before clustering, and in the metadata view the "Missing value" column shows only "?". I also set breakpoints at each step and looked at the metadata, and the result was the same.

    Furthermore, I created the given schema on an MS SQL Server evaluation edition and ran a query to retrieve null values for the given data set. The result was that there are no null values.

    Do you think something else has gone wrong? Is any more information needed?

    Thanks again
    Darme
  • SkirzynskiSkirzynski Member Posts: 164 Maven
    I have tried to reproduce your error with my own data (with missing values included), but your process runs without an error. Your process XML says you are still using a quite old version (5.1). Could you update RapidMiner to 5.3.8 and check again?
  • DarmeDarme Member Posts: 10 Contributor II
    Hi again,

    I updated to 5.3.008 and still get the same error. Could it be some setting/configuration issue?
    Could you send me your XML file so that I can check it here?

    Many thanks again
    Darme
  • DarmeDarme Member Posts: 10 Contributor II
    Hi again,

    I tried out RM version 5.3.8 with modifications to the process, but the result is still the same.
    I have attached the XML code below.
    It seems something is fundamentally wrong, either in the way I am doing this or in the data.
    Could you please share your XML so I can try it out with my data?

    Thanks a lot
    Darme

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.008">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="read_csv" compatibility="5.3.008" expanded="true" height="60" name="Read CSV" width="90" x="45" y="120">
           <parameter key="csv_file" value="C:\Users\yahoo\Desktop\CSEtemp.csv"/>
           <parameter key="column_separators" value=","/>
           <parameter key="first_row_as_names" value="false"/>
           <list key="annotations">
             <parameter key="0" value="Name"/>
           </list>
           <parameter key="encoding" value="windows-1252"/>
           <list key="data_set_meta_data_information">
             <parameter key="0" value="StockCode.true.polynominal.attribute"/>
             <parameter key="1" value="SectorKey.true.binominal.attribute"/>
             <parameter key="2" value="TimeKey.true.date.attribute"/>
             <parameter key="3" value="OpenPrice.true.real.attribute"/>
             <parameter key="4" value="ClosePrice.true.real.attribute"/>
             <parameter key="5" value="NetChange.true.real.attribute"/>
             <parameter key="6" value="ChangePercentage.true.real.attribute"/>
             <parameter key="7" value="Highest.true.real.attribute"/>
             <parameter key="8" value="Lowest.true.real.attribute"/>
             <parameter key="9" value="Volume.true.integer.attribute"/>
             <parameter key="10" value="TotalValue.true.real.attribute"/>
           </list>
         </operator>
         <operator activated="true" class="normalize" compatibility="5.3.008" expanded="true" height="94" name="Normalize" width="90" x="45" y="255"/>
         <operator activated="true" class="replace_missing_values" compatibility="5.3.008" expanded="true" height="94" name="Replace Missing Values" width="90" x="112" y="390">
           <parameter key="attribute_filter_type" value="value_type"/>
           <parameter key="value_type" value="date"/>
           <parameter key="default" value="zero"/>
           <list key="columns"/>
         </operator>
         <operator activated="true" class="replace_missing_values" compatibility="5.3.008" expanded="true" height="94" name="Replace Missing Values (2)" width="90" x="246" y="390">
           <parameter key="attribute_filter_type" value="value_type"/>
           <parameter key="value_type" value="real"/>
           <list key="columns"/>
         </operator>
         <operator activated="true" class="replace_missing_values" compatibility="5.3.008" expanded="true" height="94" name="Replace Missing Values (3)" width="90" x="380" y="390">
           <parameter key="attribute_filter_type" value="value_type"/>
           <parameter key="value_type" value="binominal"/>
           <parameter key="default" value="value"/>
           <list key="columns">
             <parameter key="SectorKey" value="value"/>
           </list>
           <parameter key="replenishment_value" value="BFI"/>
         </operator>
         <operator activated="true" class="replace_missing_values" compatibility="5.3.008" expanded="true" height="94" name="Replace Missing Values (4)" width="90" x="514" y="390">
           <parameter key="attribute_filter_type" value="value_type"/>
           <parameter key="value_type" value="polynominal"/>
           <parameter key="default" value="value"/>
           <list key="columns">
             <parameter key="StockCode" value="value"/>
           </list>
           <parameter key="replenishment_value" value="AAAA"/>
         </operator>
         <operator activated="true" class="expectation_maximization_clustering" compatibility="5.3.008" expanded="true" height="76" name="Clustering" width="90" x="514" y="210">
           <parameter key="inital_distribution" value="randomly assigned examples"/>
         </operator>
         <connect from_op="Read CSV" from_port="output" to_op="Normalize" to_port="example set input"/>
         <connect from_op="Normalize" from_port="original" to_op="Replace Missing Values" to_port="example set input"/>
         <connect from_op="Replace Missing Values" from_port="original" to_op="Replace Missing Values (2)" to_port="example set input"/>
         <connect from_op="Replace Missing Values (2)" from_port="original" to_op="Replace Missing Values (3)" to_port="example set input"/>
         <connect from_op="Replace Missing Values (3)" from_port="original" to_op="Replace Missing Values (4)" to_port="example set input"/>
         <connect from_op="Replace Missing Values (4)" from_port="example set output" to_op="Clustering" to_port="example set"/>
         <connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
         <connect from_op="Clustering" from_port="clustered set" to_port="result 2"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
         <portSpacing port="sink_result 3" spacing="0"/>
       </process>
     </operator>
    </process>
  • SkirzynskiSkirzynski Member Posts: 164 Maven
    OK, I think I see where the problem is. It is a very subtle error that I had not spotted directly in your processes. You are connecting the second output port of your "Replace Missing Values" operators to the next operator. The three letters "ori" indicate that this is the original input, which is passed through without any changes, so your data still contains missing values. Please use the first port, "exa".

    For the NullPointerException we have already created an internal ticket.
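
    The fix can be sketched against the connections of the 5.3.008 process posted above, assuming the port names shown in that XML ("example set output" for the port labelled "exa", "original" for "ori"). Only the `from_port` values change:

    ```xml
    <!-- Assumption beyond the advice above: Normalize is also switched to its first port,
         since its "original" port would likewise pass the data through unchanged. -->
    <connect from_op="Normalize" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
    <connect from_op="Replace Missing Values" from_port="example set output" to_op="Replace Missing Values (2)" to_port="example set input"/>
    <connect from_op="Replace Missing Values (2)" from_port="example set output" to_op="Replace Missing Values (3)" to_port="example set input"/>
    <connect from_op="Replace Missing Values (3)" from_port="example set output" to_op="Replace Missing Values (4)" to_port="example set input"/>
    <connect from_op="Replace Missing Values (4)" from_port="example set output" to_op="Clustering" to_port="example set"/>
    ```

    With "original" in the chain, only the last Replace Missing Values operator had any effect on the data reaching the clustering step; the results of the earlier ones were discarded.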
  • DarmeDarme Member Posts: 10 Contributor II
    Many thanks for your advice.

    I used the above process with the "exa" output and got rid of the NullPointerException.
    However, I have some issues with the result.

    1. In the Replace Missing Values operator for date, I set the default to zero, and all of the date values have been replaced by "Jan 1, 1970".
    2. In the Replace Missing Values operator for real, I set the default to average, and in most of the columns the actual values have been replaced by the average.
    3. In the Replace Missing Values operator for binominal, I set the default value to "BFI", and all of the actual values have been replaced with it.

    Is it possible for me to do the clustering with the actual values? Is there any reason why the tool replaces actual values with the replacement values?

    In another experiment, keeping everything else the same, I altered the Replace Missing Values operator for date by setting a default value of 1/1/2009. Then I got the NullPointerException again.
    Could you explain this behaviour?

    Once again, thank you for your understanding and continued help in this regard. I hope for answers to my questions.

    Regards
    Darme
  • DarmeDarme Member Posts: 10 Contributor II
    Hi Marius,

    I managed to get results by trying out various options in the tool. Mainly, I used attribute_type for all attributes rather than their data types, and set one as the prediction. I guess that if we keep attributes in some data types there could be a NullPointerException, possibly because of data type mismatches. Please correct me if I am wrong here.

    Once again, thank you very much for all your help in this regard.

    P.S. Shall I mark this issue as solved?

    Regards
    Darrshan