importing data with null values

Legacy User Member Posts: 0 Newbie
edited November 2018 in Help
Is there a way to replace null values, or at least reject lines with nulls, during import? 

I am trying to import a file with scattered missing values and I can only import up to the first omission. 

The tutorial example for dealing with missing data uses '?' in the data file to mark missing values. My data has nothing in those positions. Here is an example of my data: the 1st and 3rd lines are complete, while the 2nd line is missing the 1st and last columns.
N282WN,WN,978,91,91,1525,1630,65,308,2,-1
,WN,1114,91,91,1850,1955,65,308,2,
N207WN,WN,1182,91,91,1405,1510,65,308,2,-1

This is the error I get:
[Error] Data format error in line 393: the line does not provide the expected number of columns (was: 10, expected: 11)! Stop reading...
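
The counts in this message (was: 10, expected: 11) are what one would see if a reader discarded the empty field after a trailing comma. The Python sketch below only illustrates that assumption using the three sample lines; it is not RapidMiner's parsing code, and both function names are hypothetical.

```python
import csv

SAMPLE = """N282WN,WN,978,91,91,1525,1630,65,308,2,-1
,WN,1114,91,91,1850,1955,65,308,2,
N207WN,WN,1182,91,91,1405,1510,65,308,2,-1
"""

def fields_keeping_empties(line):
    # csv.reader keeps empty fields, including the one after a trailing comma.
    return next(csv.reader([line]))

def fields_dropping_trailing_empties(line):
    # Hypothetical model of a reader that discards empty fields at the end of a line.
    fields = line.split(",")
    while fields and fields[-1] == "":
        fields.pop()
    return fields

# For each sample line: (columns with empties kept, columns with trailing empties dropped)
counts = [
    (len(fields_keeping_empties(line)), len(fields_dropping_trailing_empties(line)))
    for line in SAMPLE.splitlines()
]

for number, (kept, dropped) in enumerate(counts, start=1):
    print(f"line {number}: {kept} columns kept, {dropped} after dropping trailing empties")
```

Under this assumption only the second sample line comes up one column short, and only because its last field is empty; the empty first field does not change the count.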


Thanks much!!

Answers

  • steffen Member Posts: 347 Maven
    Hello b2

    I copied your data into a simple text file and loaded it with the operator "SimpleExampleSource" (default settings) in RapidMiner 4.2. I had no problems; the operator recognized all missing values.

    Idea: maybe line 393 of your data is corrupted, e.g. a comma is missing.

    Hope this was helpful.

    Steffen
  • Legacy User Member Posts: 0 Newbie
    Steffen,

    Thank you for your help.

    There are no missing commas.  Could it have to do with the fact that one of the missing fields is at the end or beginning of the line?  Is there an option I need to set?

    I am using the community version 4.1.

    I tried duplicating what you did.  I switched from ExampleSource to SimpleExampleSource and copied the input data back off this post into a new file.  I got a similar error.  This is the error:
    Error in: SimpleExampleSource (SimpleExampleSource) Could not read file  ...\twig.txt': Number of columns in line 1 was unexpected, was: 10, expected: 11

  • steffen Member Posts: 347 Maven
    Hello b2

    Maybe it depends on the version. I remember something like this but I am not sure....
    Is there any specific reason you cannot switch to 4.2?

    greetings

    Steffen
  • Legacy User Member Posts: 0 Newbie
    Hi Steffen,

    Everything you need for missing value replenishment is available: you can replace "unknown" values in metadata either by a constant (typically zero) or by the attribute's mean. There are also more sophisticated approaches, where a learner trained on complete values is used to guess the missing values, but I have never been able to understand how that operator works and is organized. You can use the "Sparse array management" option in your (file/database) ExampleSource if needed.

    This item could be a good wiki article in "data formats" ;D

    Cheers,
      Jean-Charles.
  • steffen Member Posts: 347 Maven
    Hello Jean-Charles
    jean-charles wrote:

    Everything you need for missing value replenishment is available: you can replace "unknown" values in metadata either by a constant (typically zero) or by the attribute's mean. There are also more sophisticated approaches, where a learner trained on complete values is used to guess the missing values, but I have never been able to understand how that operator works and is organized.
    Yes, but not during import.
    jean-charles wrote:

    You can use "Sparse array management" option in your (file/database)ExampleSource if needed.
    Why? As far as I can see, the sparse data format is for data with a lot of missing values or a small number of different values (for efficient storage).

    jean-charles wrote:

    This item could be a good wiki article in "data formats" ;D
    True, true...  :-[

    greetings

    Steffen
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi all,

    Actually, there was a bug in versions < 4.2 when reading CSV-like data with missing values at the end of lines. The new version 4.2, which is now available on our web site, no longer contains this bug, and everything should work fine, as Steffen has pointed out. So I would suggest upgrading to RM 4.2.

    Cheers,
    Ingo
  • Legacy User Member Posts: 0 Newbie
    Thank you all very much for your help.

    I have upgraded to 4.2 and the same error occurs.  I have found that it happens when I have missing integer-type data, but not when I have missing nominal-type data.  I am beginning to think this may be a follow-on to the bug in version 4.1.

    Is there a way to have the import skip incomplete lines?

    thank you.
  • Legacy User Member Posts: 0 Newbie
    ExampleSource was giving me trouble.

    CSVExampleSource works fine.
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi again,

    Maybe it would have worked with the ExampleSource operator, too (both operators are basically the same but with different parameter settings), so the problem might have something to do with quoting, line trimming, or the column separation parameter. Anyway, good to hear it works now ;D

    Cheers,
    Ingo
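
On the open question of replacing nulls or skipping incomplete lines during import: one workaround outside RapidMiner is to clean the file before importing it. The Python sketch below assumes plain comma-separated text with 11 columns, as in the sample from the original post. It either writes the '?' marker from the tutorial into empty fields or skips lines that contain empty fields. `clean_rows` and `to_csv` are hypothetical helper names, not RapidMiner operators.

```python
import csv
import io

MISSING = "?"  # missing-value marker used in the tutorial's data files, per the thread

SAMPLE = """N282WN,WN,978,91,91,1525,1630,65,308,2,-1
,WN,1114,91,91,1850,1955,65,308,2,
N207WN,WN,1182,91,91,1405,1510,65,308,2,-1
"""

def clean_rows(text, expected_columns, skip_incomplete=False):
    """Yield rows with empty fields replaced by MISSING, or without incomplete rows."""
    for row in csv.reader(io.StringIO(text)):
        if len(row) != expected_columns:
            continue  # wrong number of separators (or a blank line): drop the line
        has_gap = any(field.strip() == "" for field in row)
        if has_gap and skip_incomplete:
            continue  # reject lines that contain missing values
        yield [field if field.strip() else MISSING for field in row]

def to_csv(rows):
    """Serialize rows back to comma-separated text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()

filled = list(clean_rows(SAMPLE, expected_columns=11))
complete_only = list(clean_rows(SAMPLE, expected_columns=11, skip_incomplete=True))

print(to_csv(filled))
print(to_csv(complete_only))
```

Whether to fill or to skip depends on how many lines are affected: skipping is simplest when incomplete lines are rare, while filling keeps every record and leaves the handling of missing values to a later step.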