Spliting rows that has values in wrong type

hkhk Member Posts: 1 Contributor I
edited December 2018 in Help

Hi,

I had a data set of 8 millon rows in a txt file with tab delimited format without quotes.

I had 5 of the 14 columns with date values in dd.MM.yyyy format.

Problem 1

I am trying to import the file. In "Format your colums" step, if I choose the type of that colums as "date", it gives errors and all cells in columns turns "?"

So I selected "polynomial" and planed to convert attribute type to date later.

Problem 2 (the real one)

I imported the data and put "nominal to date" operator. When I run I got error in line 14.899:

 

Cannot parse date: Unparseable date: "0"

I find the line and I see that columns separated wrong. There was a tab character in a string in the a prior cell. So values moved one cell right. And this row was not the only one that moved.

 

I want to split the rows that has the values in wrong data type for spesified attributes. So I cant correct them manually.

How can I do that in Rapidminer?

 

Or any other ideas to figure theese problems out?

 

 

Answers

  • sgenzersgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hi @hk - welcome to the community. Some quick thoughts (would be much easier to help if you post XML and sample data - hint hint)...

     

    Problem 1: "changing it later" is always the approach I take. It's much easier in the long run.

     

    Problem 2: hard to debug without seeing your data and XML but off the cuff, try using "Handle Exception" as an error handling tool for cases like this. Another method is to use "Generate Attributes" instead of Nominal to Date, and using a form of if-then statements for this.

     

    Scott

     

  • kypexinkypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @hk

     

    Just my 5 cents. I often have to work with data files coming from different people and sometimes data comes in really messy formats and with weird errors. So sometimes, if it is CSV or TSV file it's much faster and easier for me to make things a bit nicier using just a text editor like (Sublime Text or similar) before passing it to RM process.

     

    In your case, if you have tab separated values, you should expect that file does not contain 2 consecutive tabs, right? I would just straightforwardly used Sublime Text for replacing 2 consecutive tabs to 1 tab symbol with find and replace function, or did some other plain text transformations to avoid potential problems like this one. 

Sign In or Register to comment.