Keep rmhdf5 size under control

kayman · November 2021

Hi there, when storing recordsets with text content, the size of the rmhdf5 files seems to behave quite weirdly. I have files that are around 1 MB in their original XML format but blow up to over 2 gigabytes when I convert them to recordsets and store them as hdf5.

Is this a known problem? Or is there a way to convert to the old format again, as this fills my disk a bit too fast? Loading them is no real issue, this goes pretty fast, it's just the file size that has me puzzled.

It's also not very consistent: other files similar in size can take up only a few kilobytes, so it seems to happen only occasionally, and I can't see any real reason for it, as structure and content are quite similar.

Marco_Boeck · January 2022

Hi @kayman,

Thank you for the very detailed write-up! We will look at ways to mitigate this in the future, because it is indeed quite undesirable when working with text.

Unfortunately, the old way of storing data cannot be brought back. As this is a very specific scenario, maybe a workaround would be to store the data as .xlsx or .csv for now? I know it's not a very appealing thought, but it would at least alleviate the size issue here.

Regards,
Marco

MartinLiebig · November 2021

Adding @jczogalla as the expert on it.

jczogalla · November 2021

Hi @kayman!

That sounds indeed a bit strange... can you share the xml files? And a minimal process?
If the text data repeats a lot, the file can be very small. But if, for example, a nominal column contains one very long string while all the others are very short, this could blow up the file size.

kayman · January 2022

Hi @jczogalla, @mschmitz

Apologies for the late reply, and best wishes for the new year.
The process is basically very simple: it gets some HTML files and converts them to a recordset where the original source data (say the (X)HTML body content) is stored in an attribute for further analysis later on.

I start my process by downloading a lot of HTML files and converting them to XML before I load them into RM in batches.

So assume I have some files called 1022102.xml, 1022103.xml and so on; these get loaded as a batch and stored as 1022 in my repository. They are combined in batches because there can be thousands of files to handle and I ran into problems with this on some operating systems. It also saves me some time later on in the process because there is less I/O handling.

The filesize on disk of (in this example) all files in the 1022xxxx range is around 7MB in total, for 327 files. Biggest file in the batch is 712K, just for reference.

Loading these into RM after some cleaning (HTML stripping to plain text) results in an RM repository of 20MB (so roughly 3 times the file size), with the biggest file now 3MB in hdf5 format. Not too much of an issue yet, it just seems that on average, when dealing with text, hdf5 is about 3 times the size of the standard format.

Each RM repo entry now contains 6 regular attributes, one of which is a text container. I've attached a single example (1022496.zip), which is a simple set of 15 examples. The content is open local government data, so there is no risk apart from getting bored.

The problem grows when I take it to the next step and start to split the content into paragraphs and combine everything in batches. The process is again fairly straightforward: I look at the text row (artikelVersie) and split it whenever I find more than one line break. That's when the thing starts to explode size-wise, and the combined 20MB suddenly becomes 3.2GB (attached as example).
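For anyone who wants to reproduce the split outside RapidMiner, here is a minimal Python sketch of "split whenever there is more than one line break". It is only an illustration: the function name `split_paragraphs` and the sample text are made up, and it is not the actual RapidMiner process used above.

```python
import re


def split_paragraphs(text):
    """Split text wherever more than one line break occurs in a row."""
    # Normalise Windows and old Mac line endings so "\r\n\r\n" counts as two breaks.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    # A line break followed by one or more further line breaks,
    # allowing spaces or tabs on the otherwise empty lines in between.
    parts = re.split(r"\n(?:[ \t]*\n)+", normalised)
    # Drop empty fragments and trim surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]


# Hypothetical sample: single line breaks stay inside a paragraph.
sample = "Artikel 1\nFirst line.\n\nArtikel 2\n\n\nArtikel 3"
paragraphs = split_paragraphs(sample)
```

Each returned paragraph would then become its own example, which is why a single long entry (for instance a converted table) ends up next to many very short ones.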

Now, indeed some of these examples can contain a lot of text (for instance when there were tables involved in the original files) while others contain little to no data. In the example there are a few entries containing 200K~300K characters while the average is around 500 characters. So if the format looks at the biggest size and reserves that much space for every example, it would indeed explain the blow-up.
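To get a feeling for how large that effect could be, here is a rough back-of-the-envelope sketch in Python. It assumes, purely as a hypothesis, that every value is padded to the length of the longest one and that one character takes one byte; I don't know whether rmhdf5 really stores text this way. The row count of 1,000 is invented for illustration, only the 500 and 300K character figures come from the data described above.

```python
def estimate_sizes(lengths):
    """Return (fixed_width_bytes, variable_width_bytes) for a text column.

    Assumes one byte per character and ignores all file overhead.
    """
    if not lengths:
        return 0, 0
    # Hypothesis: every row reserves as much space as the longest value.
    fixed_width = max(lengths) * len(lengths)
    # Alternative: every row only takes the space it actually needs.
    variable_width = sum(lengths)
    return fixed_width, variable_width


# Hypothetical batch: 999 average paragraphs plus one very long entry.
lengths = [500] * 999 + [300_000]
fixed, variable = estimate_sizes(lengths)
ratio = fixed / variable
```

Under these assumptions a single 300K-character outlier turns roughly 0.8 MB of actual text into about 300 MB of reserved space, so the outliers matter far more than the average length.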

I've never had this with the old file format, and while the hdf5 format can be quite handy in most scenarios, it's pretty annoying in this specific situation, where I need to combine files into batches because otherwise the number of files in a folder can also exceed limits.

Would there be a way to choose the export format? So old style instead of hdf5 depending on the situation?
I understand this is kind of exceptional and no (quick) fix may be around, but I can imagine that people working with variable-sized text data can run into the same problem rather quickly, yet less noticeably.

Thanks again!

kayman · January 2022

Thanks @Marco_Boeck,
I just decided to split the content into more batches rather than a single file. This kept the size under control, and while not optimal, it does the job. And there are indeed always other alternatives.
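In case it helps someone else, the idea of using smaller batches can be sketched in Python like this. The helper name `chunked`, the batch size of 50 and the generated file names are assumptions for illustration only; the real batching was done inside the RapidMiner process.

```python
def chunked(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical file names in the style of the 1022xxx range, 327 files in total.
file_names = [f"1022{n:03d}.xml" for n in range(327)]
batches = chunked(file_names, 50)
```

With fewer examples per stored file, one very long text affects fewer rows, which would fit with the size staying under control.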

Altair RapidMiner Community