data ETL for multiple text line file

Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
edited December 2018 in Help

I have a data extract from a system that is definitely not data-mining friendly.  Essentially the data is coming in a csv containing text blocks (paragraphs) where each block is an example and each line is an attribute, and the same relative line in each block is the same attribute, like this:

line 1

line 2

line 3

 

line 4

line 5

line 6

...

where line 1 and line 4 are the same attribute, line 2 & line 5, line 3 & line 6, etc.  A sample file is attached--the format is accurate but the data have been anonymized.  Each line may contain spaces or other special characters.  A blank line separates each block from the others.  The blocks all have the same number of lines so at least there are no missings to worry about.

 

I'd like to turn this into a typical exampleset where each block is an example with the relevant attributes, but I'm struggling to do the necessary conversions.  Any ideas or help from the community?  Thanks in advance!

 

 

Brian T.
Lindon Ventures 
Data Science Consulting from Certified RapidMiner Experts

Best Answer

  • BalazsBaranyBalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Solution Accepted

    Hi,

     

    I don't have a complete solution but I've done similar stuff before.

     

    You might want to read the file with Read CSV making sure that the entire lines end up in the result (e. g. separator = $, no header line).

    Then add an ID with Generate ID.

    Calculate the modulo so you have the same number for the same attribute.

    Pivot by the calculated modulo attribute.

     

    Regards,

     

    Balázs

     

Answers

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Thanks @BalazsBarany.  Very clever!  I was able to create a solution by using Generate Attribute for both a modulus attribute to define the attribute and a floor attribute to define the index, and then using those to pivot.  Quite simple actually.  

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
Sign In or Register to comment.