Dealing with really large text / csv files

I have to deal with a huge tab-separated file containing thousands of lines but also thousands of columns, and the latter is bringing my system to its knees. It takes ages to load even a small number of records, because the system probably wants to validate the content of all of these attributes.


So my idea was to load it as a text file, chop it into rows using a regex that looks for the end of line, and then chop those into attributes by looking for the tab character, no questions asked.
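Outside RapidMiner, the splitting idea above can be sketched in a few lines of plain Python (the function name and sample text are just for illustration): split the raw text into rows on line endings, then split each row on tabs, with no validation of the values.

```python
import re

def split_rows_and_fields(text):
    """Chop raw text into rows on end-of-line, then chop each row
    into fields on the tab character, no questions asked."""
    rows = re.split(r"\r?\n", text.strip())
    return [row.split("\t") for row in rows]

sample = "a\tb\tc\n1\t2\t3"
print(split_rows_and_fields(sample))  # [['a', 'b', 'c'], ['1', '2', '3']]
```

Because nothing is parsed or type-checked, every value stays a raw string, which is exactly why this approach avoids the per-attribute validation cost.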


Part 1 was easy, and indeed my file now loads in seconds rather than a whole lot of minutes. But I'm not sure how to actually convert the rest of my data to columns using the Cut Document operator, as this would require me to name all of my columns.


Is there a better way to achieve this, or using different logic / operators?

RM Staff

Re: Dealing with really large text / csv files

Hi @kayman,


you could potentially do something like the following:

1) Split the document into a collection of rows with the Cut Document operator

1a) Use the Replace Tokens operator with "(.*?)" and $1 to unwrap quoted string values if necessary

2) Use the Documents to Data operator to get an example set

3) Select only the text attribute

4) Use the Split operator with \t as the delimiter and an ordered split to create generically named attributes

5) Rename the columns with Rename by Example Values (row 1)
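For comparison, the core of steps 4 and 5 can be sketched in plain Python (the generic `att1`, `att2`, … names mimic an ordered split; this is an illustrative stand-in, not the RapidMiner operators themselves):

```python
def parse_tabbed_text(text):
    """Split rows on tabs, assign generic column names, then rename
    the columns from the values in row 1 and drop that row."""
    rows = [line.split("\t") for line in text.splitlines()]
    # step 4: ordered split into generically named attributes
    columns = [f"att{i + 1}" for i in range(len(rows[0]))]
    records = [dict(zip(columns, row)) for row in rows]
    # step 5: rename by the example values in row 1, removing that row
    rename = records[0]
    return [{rename[k]: v for k, v in rec.items()} for rec in records[1:]]

data = parse_tabbed_text("id\tname\n1\tAlice\n2\tBob")
print(data)  # [{'id': '1', 'name': 'Alice'}, {'id': '2', 'name': 'Bob'}]
```

The key point, as in the operator chain, is that no column names are needed up front: the first data row supplies them after the fact.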


This should do exactly what you want (I hope). The last step even deletes the name row from the example set. :)

I hope this helps!


Re: Dealing with really large text / csv files

Brilliant, I had overlooked the Split operator, but indeed that worked perfectly!