"Is there an operator that allows processing files by regex?"

kludikovsky Member Posts: 30 Maven
edited June 2019 in Help

I have a file which is basically a CSV file, but it is malformed and needs to be corrected before it can be used. (Just to clarify: there is no option to change how the file is produced.)

I have tried to clean up this file in RM before any further processing, but have failed.

It can easily be cleaned by applying a sequence of regexes.

Just to give an impression of what I would need to do:

  • remove lines before the header line
  • replace the header line
  • eliminate some lines
  • combine two consecutive lines into one line

Yes, I could do this manually up front, but that is not the intention, as it should be a repeatable process.

 

An option would be to write a Python procedure.

But maybe there is already something out there.
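For reference, a minimal sketch of what such a Python procedure could look like, covering the four steps listed above. The function name `clean_lines` and every pattern in the usage example are made up for illustration; the real patterns depend on the actual file.

```python
import re

def clean_lines(lines, header_re, new_header, drop_re, continuation_re):
    """Apply the four clean-up steps to a list of lines, in order."""
    header = re.compile(header_re)
    drop = re.compile(drop_re)
    cont = re.compile(continuation_re)

    # 1. Remove everything before the header line.
    start = next((i for i, line in enumerate(lines) if header.search(line)), None)
    if start is None:
        raise ValueError("header line not found")

    # 2. Replace the header line.
    cleaned = [new_header]

    for line in lines[start + 1:]:
        # 3. Eliminate unwanted lines.
        if drop.search(line):
            continue
        # 4. Join a continuation line onto the previous data line.
        if cont.search(line) and len(cleaned) > 1:
            cleaned[-1] += line.lstrip()
        else:
            cleaned.append(line)
    return cleaned

# Made-up sample data, not the attached file.
raw = [
    "Export created by some tool",
    "ID;Name;Val",
    "1;foo;",
    "   10",
    "# comment",
    "2;bar;20",
]
result = clean_lines(raw, r"^ID;", "id;name;value", r"^#", r"^\s+")
print(result)
```

The sketch assumes that a continuation line can be recognised by a pattern (here: leading whitespace). If that is not true for the real file, step 4 needs a different rule.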

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hi @kludikovsky - so this may be too simple a solution but have you tried just using the regex built into the Select Attributes / Filter Examples operators for getting rid of rows/columns via regex?  And if the rows to be deleted are always in the same place, I would use Filter Example Range.

     

    Scott

     

  • kludikovskykludikovsky Member Posts: 30 Maven

    Hi @sgenzer,

     

    I might be on the wrong path, but all those operators require an example set as input.

     

    I am at the stage before having an example set.

     

    I attach the file.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    ok I understand.  Yes I would use the Read CSV operator first to "convert" to an example set, use those operators, and then go back to Write CSV if you want.  If you really want to make changes on the actual CSV file without converting to example set, I would treat the CSV as a document and then use the text operators.  But that sounds pretty icky to me!
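For comparison, outside RapidMiner the "treat the file as one document" idea could be sketched in plain Python as an ordered list of regex substitutions applied to the whole text. The rules and the `apply_rules` name below are hypothetical and only illustrate the approach; they are not taken from the attached file.

```python
import re

# Hypothetical (pattern, replacement) pairs, applied in this order.
RULES = [
    (r"\A[\s\S]*?(?=^ID;)", ""),    # drop everything before the header line
    (r"^ID;.*$", "id;name;value"),  # replace the header line
    (r"^#.*\n", ""),                # eliminate lines starting with '#'
    (r";\n[ \t]+", ";"),            # join a line ending in ';' with an indented next line
]

def apply_rules(text, rules=RULES):
    """Run each substitution over the whole text, one after another."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    return text

# Made-up sample text.
sample = "Export header\nID;Name;Val\n1;foo;\n   10\n# comment\n2;bar;20\n"
cleaned = apply_rules(sample)
print(cleaned)
```

Because the rules run on the full text rather than on single lines, a pattern can span a line break, which is what makes joining two lines possible here.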

     

    Scott

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Just an idea,

     

    Operator Toolbox has an operator called Read Lines (or so?) which gives you a collection of documents, each containing one line of the file. Afterwards, it's possible to use Loop Collection and Extract Information to do line-wise parsing.
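To illustrate the line-wise idea outside RapidMiner, here is a rough Python sketch that visits each line in order and extracts fields with a regex. The three-field layout and the names `LINE_RE` and `parse_lines` are assumptions for the example only.

```python
import re

# Hypothetical layout: "<id>;<name>;<value>" on each data line.
LINE_RE = re.compile(r"^(?P<id>\d+);(?P<name>[^;]*);(?P<value>[^;]*)$")

def parse_lines(lines):
    """Return one dict per matching line, keeping the original line number."""
    records = []
    for number, line in enumerate(lines, start=1):
        match = LINE_RE.match(line.rstrip("\r\n"))
        if match is None:
            continue  # line does not look like a data row, skip it
        records.append({"line": number, **match.groupdict()})
    return records

# Made-up sample lines.
records = parse_lines(["some preamble", "1;foo;10", "2;bar;20\n"])
print(records)
```

Storing the line number with each record keeps the original order recoverable even if the records are reshuffled later.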

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • kludikovsky Member Posts: 30 Maven

    Hi @mschmitz,

    I understand it's "Split Document into Collection".
    But any idea how I could ensure the correct sequence of the lines afterwards and, more importantly, how to combine two consecutive lines into one line?
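To make clear what I mean by combining lines, here is a small Python sketch of the desired behaviour. It assumes, purely for illustration, that every record spans exactly two consecutive lines; `merge_pairs` is a made-up name.

```python
def merge_pairs(lines, separator=""):
    """Join lines 1+2, 3+4, ... into single lines.

    The input list is ordered, so the output keeps the original sequence.
    """
    stripped = [line.rstrip("\r\n") for line in lines]
    if len(stripped) % 2:
        raise ValueError("expected an even number of lines")
    return [
        first + separator + second
        for first, second in zip(stripped[0::2], stripped[1::2])
    ]

# Made-up sample lines.
merged = merge_pairs(["1;foo;\n", "10\n", "2;bar;\n", "20\n"])
print(merged)
```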

     

    Regards,
    Kurt
