RapidMiner

RM Certified Expert
RM Certified Expert

data ETL for multiple text line file

I have a data extract from a system that is definitely not data-mining friendly.  Essentially the data is coming in a csv containing text blocks (paragraphs) where each block is an example and each line is an attribute, and the same relative line in each block is the same attribute, like this:

line 1

line 2

line 3

 

line 4

line 5

line 6

...

where line 1 and line 4 are the same attribute, line 2 & line 5, line 3 & line 6, etc.  A sample file is attached--the format is accurate but the data have been anonymized.  Each line may contain spaces or other special characters.  A blank line separates each block from the others.  The blocks all have the same number of lines so at least there are no missings to worry about.

 

I'd like to turn this into a typical exampleset where each block is an example with the relevant attributes, but I'm struggling to do the necessary conversions.  Any ideas or help from the community?  Thanks in advance!

 

 

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
2 REPLIES
Highlighted
Community Manager Community Manager
Community Manager
Solution

Re: data ETL for multiple text line file

Hi,

 

I don't have a complete solution but I've done similar stuff before.

 

You might want to read the file with Read CSV making sure that the entire lines end up in the result (e. g. separator = $, no header line).

Then add an ID with Generate ID.

Calculate the modulo so you have the same number for the same attribute.

Pivot by the calculated modulo attribute.

 

Regards,

 

Balázs

 

--
Balázs Bárány
Data Scientist, Vienna
https://datascientist.at
RM Certified Expert
RM Certified Expert

Re: data ETL for multiple text line file

Thanks @BalazsBarany.  Very clever!  I was able to create a solution by using Generate Attribute for both a modulus attribute to define the attribute and a floor attribute to define the index, and then using those to pivot.  Quite simple actually.  

 

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
Polls
How can RapidMiner increase participation in our new competitions?
Twitter Feed