Dealing with really large text / csv files

kayman · October 2018

I have to deal with a huge tab seperated file, containing thousands of lines but also thousands of columns, and the latter is bringing my system to it's knees. It takes ages to load even a small amount of records because the system probably wants to validate all of the content for all of these attributes.

So my idea was to load it as a text file, then chop it into rows using a regex looking for end of line, and then chop into attributes looking for the tab character, no questions asked.

part 1 was easy, and indeed my file now loads into seconds rather than a whole lot of minutes. But I'm not sure how to actually convert the rest of my data to columns using the cut command as this would require me to name all of my columns.

is there a better way to achieve this, or using different logic / operators?

jczogalla · October 2018

Hi @kayman,

you could potentially do somethign like the following:

1) Split the document with Cut Document into collection of rows

1a) Use the Replace Tokens operator with "(.*?)" and $1 to unwrap string values if necessary

2) Use Documents to Data operator to get an example set

3) Select only the text attribute

4) Use the Split operator with \t as delimiter with an ordered split to create genericly named attributes

5) Rename columns with Rename by Example Values (row 1)

This should do exactly what you want (I hope). The last step even deletes the name row from the example set

I hope this helps!
Jan

kayman · October 2018

Brilliant, I overlooked the split operator but indeed that worked perfectly !

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Dealing with really large text / csv files

Best Answer

Answers