ANNOUNCEMENT: RAPIDMINER 9.1 BETA HAS BEEN RELEASED TODAY!   PLEASE DOWNLOAD AND GIVE FEEDBACK. ENJOY AND HAPPY RAPIDMINING!   -- @sgenzer – Community Manager

Dealing with really large text / csv files

kaymankayman Member Posts: 259   Unicorn
edited November 9 in Help

I have to deal with a huge tab seperated file, containing thousands of lines but also thousands of columns, and the latter is bringing my system to it's knees. It takes ages to load even a small amount of records because the system probably wants to validate all of the content for all of these attributes.

 

So my idea was to load it as a text file, then chop it into rows using a regex looking for end of line, and then chop into attributes looking for the tab character, no questions asked.

 

part 1 was easy, and indeed my file now loads into seconds rather than a whole lot of minutes. But I'm not sure how to actually convert the rest of my data to columns using the cut command as this would require me to name all of my columns.

 

is there a better way to achieve this, or using different logic / operators?

Tagged:

Best Answer

  • jczogallajczogalla Posts: 77   RM Engineering
    Solution Accepted

    Hi @kayman,

     

    you could potentially do somethign like the following:

    1) Split the document with Cut Document into collection of rows

    1a) Use the Replace Tokens operator with "(.*?)" and $1 to unwrap string values if necessary

    2) Use Documents to Data operator to get an example set

    3) Select only the text attribute

    4) Use the Split operator with \t as delimiter with an ordered split to create genericly named attributes

    5) Rename columns with Rename by Example Values (row 1)

     

    This should do exactly what you want (I hope). The last step even deletes the name row from the example set :)

    I hope this helps!
    Jan

Answers

  • kaymankayman Member Posts: 259   Unicorn

    Brilliant, I overlooked the split operator but indeed that worked perfectly !

    jczogalla
Sign In or Register to comment.