RapidMiner

Contributor II kludikovsky

Is there an operator which allows to process files by regex

I have a file which is basically a CSV file, but it is malformed and needs to be corrected before it can be used. (Just to clarify: there is no option to change this.)

I have tried to clean up this file in RM before any further processing, but have failed.

It can easily be cleaned by applying a sequence of regexes.

Just to give an impression of what I would need to do:

  • remove lines before the header line
  • replace the header line
  • eliminate some lines
  • combine two consecutive lines into one line

Yes, I could do this manually up front, but that is not the intention, as it should be a repeatable process.

 

An option would be to write a Python procedure.

But maybe there is already something out there.
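
For illustration, the four steps listed above could be sketched in Python as a fixed sequence of regex substitutions on the raw file text. Everything specific here is an assumption, since the actual file is not shown: the header is assumed to start with `ID;`, the unwanted lines with `#`, continuation lines with whitespace, and the new header `id;name;comment` is made up. The function name `clean_csv_text` is hypothetical.

```python
import re


def clean_csv_text(text):
    """Apply a fixed sequence of regex steps to raw CSV text (all patterns are assumptions)."""
    # Fail early if the assumed header marker is missing, so step 2 does not
    # overwrite an unrelated first line.
    if not re.search(r"^ID;", text, flags=re.MULTILINE):
        raise ValueError("header line not found")

    # 1. Remove all lines before the header line (assumed to start with "ID;").
    text = re.sub(r"\A.*?^(?=ID;)", "", text, count=1,
                  flags=re.DOTALL | re.MULTILINE)

    # 2. Replace the header line, which is now the first line of the text.
    text = re.sub(r"\A[^\n]*", "id;name;comment", text, count=1)

    # 3. Eliminate unwanted lines (assumed to start with "#").
    text = re.sub(r"^#[^\n]*\n?", "", text, flags=re.MULTILINE)

    # 4. Combine two consecutive lines: a line starting with whitespace is
    #    assumed to continue the previous one.
    text = re.sub(r"\n[ \t]+", " ", text)
    return text


raw = (
    "Export 2017\n"
    "generated by X\n"
    "ID;Name;Text\n"
    "1;a;first\n"
    "  part two\n"
    "# note\n"
    "2;b;second\n"
)
cleaned = clean_csv_text(raw)
```

The cleaned text could then be written to a new file and read with Read CSV; the patterns would need to be adapted to the real file.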

5 REPLIES
Community Manager

Re: Is there an operator which allows to process files by regex

Hi @kludikovsky - this may be too simple a solution, but have you tried just using the regex built into the Select Attributes / Filter Examples operators to get rid of rows/columns via regex? And if the rows to be deleted are always in the same place, I would use Filter Example Range.

 

Scott

 

Scott Genzer
Senior Community Manager
RapidMiner, Inc.
Contributor II kludikovsky

Re: Is there an operator which allows to process files by regex

Hi @sgenzer,

 

I might be on the wrong path, but all those functions require an example set as input.

 

I am at the stage before having an example set.

 

I attach the file.

Community Manager

Re: Is there an operator which allows to process files by regex

OK, I understand. Yes, I would use the Read CSV operator first to "convert" the file to an example set, use those operators, and then go back to Write CSV if you want. If you really want to make changes to the actual CSV file without converting it to an example set, I would treat the CSV as a document and then use the text operators. But that sounds pretty icky to me!

 

Scott

Scott Genzer
Senior Community Manager
RapidMiner, Inc.
RM Staff

Re: Is there an operator which allows to process files by regex

Just an idea,

 

The Operator Toolbox has an operator called Read Lines (or so?), which gives you a collection of documents, each holding one line of the file. Afterwards, it's possible to use Loop Collection and Extract Information to do line-wise parsing.

 

Best,

Martin

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
Contributor II kludikovsky

Re: Is there an operator which allows to process files by regex

Hi @mschmitz,

I understand it's "Split Document into Collection".
But any idea how I could ensure the correct sequence of the lines afterwards and, more importantly, how to combine two consecutive lines into one line?

 

Regards,
Kurt
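
Outside RapidMiner, the ordering and merging question can be illustrated with a short line-wise Python sketch. A Python list keeps lines in file order, so the only thing needed is a rule for when a line belongs to the previous one. The rule used here is an assumption (a real record starts with digits followed by `;`), and the name `merge_consecutive` is hypothetical.

```python
import re


def merge_consecutive(lines, record_start=r"^\d+;"):
    """Append every line that does not start a new record to the previous line.

    `record_start` is an assumed pattern; adapt it to the real file.
    """
    start_re = re.compile(record_start)
    merged = []
    for line in lines:  # the list is processed in its original order
        if merged and not start_re.match(line):
            # Continuation: glue it onto the record collected just before.
            merged[-1] = merged[-1] + " " + line.strip()
        else:
            merged.append(line)
    return merged


rows = merge_consecutive(["1;a;first", "part two", "2;b;second"])
```

Because each continuation is attached to the last collected record, the output order follows the input order without any extra sorting step.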
