what does double_sparse_array do?

siamak_want · October 2012

Hi forum,

I am working with large, sparse text datasets, and because of memory limitations I need to know what features RM offers for handling sparse data. I found an interesting option in the "Read CSV" operator called "Data management". One of its choices is "double_sparse_array". Could you please explain what "double_sparse_array" does and how I can use this format?

I expect to save memory by using "double_sparse_array".

Any help or ideas would be greatly appreciated.

Thanks.

Skirzynski · October 2012

Hi!

We call a dataset sparse if only a small fraction of its entries differ from a default value. This is especially common in text mining because the word vectors are very large. Typically you have hundreds of attributes, where every attribute represents a word that occurs in the texts. But a single document contains just a small fraction of these words and has a zero entry for every other attribute. So the default value is 0 for most of the attributes.

This default value is stored just once, and every entry with this value points to it. This is nearly as fast as the map implementation but needs less memory. You should use this representation if 50% or more of the data shares the same default value. And it is as easy as setting the combo box to "double_sparse_array".

Best regards
Marcin
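
For readers who want to see the idea in code, here is a minimal sketch of a sparse row as described above: the default value is kept once, and only entries that differ from it are stored. This is an illustration only, not RapidMiner's actual implementation; the class name SparseDoubleRow and its methods are hypothetical.

```java
import java.util.Arrays;

/** Illustrative sparse row (hypothetical, not RapidMiner code). */
final class SparseDoubleRow {
    private final double defaultValue;        // stored once for the whole row
    private int[] indices = new int[4];       // sorted attribute indices of non-default entries
    private double[] values = new double[4];  // values belonging to those indices
    private int used = 0;                     // number of non-default entries currently stored

    public SparseDoubleRow(double defaultValue) {
        this.defaultValue = defaultValue;
    }

    /** Returns the stored value, or the default if nothing is stored at this index. */
    public double get(int index) {
        int pos = Arrays.binarySearch(indices, 0, used, index);
        return pos >= 0 ? values[pos] : defaultValue;
    }

    /** Stores a value; default values take no space (an existing entry is removed). */
    public void set(int index, double value) {
        int pos = Arrays.binarySearch(indices, 0, used, index);
        if (pos >= 0) {
            if (value == defaultValue) {
                // Entry went back to the default: close the gap in both arrays.
                System.arraycopy(indices, pos + 1, indices, pos, used - pos - 1);
                System.arraycopy(values, pos + 1, values, pos, used - pos - 1);
                used--;
            } else {
                values[pos] = value;
            }
            return;
        }
        if (value == defaultValue) {
            return; // nothing to store
        }
        int insertAt = -pos - 1; // binarySearch encodes the insertion point this way
        if (used == indices.length) {
            indices = Arrays.copyOf(indices, used * 2);
            values = Arrays.copyOf(values, used * 2);
        }
        // Shift the tail right by one to keep the indices sorted.
        System.arraycopy(indices, insertAt, indices, insertAt + 1, used - insertAt);
        System.arraycopy(values, insertAt, values, insertAt + 1, used - insertAt);
        indices[insertAt] = index;
        values[insertAt] = value;
        used++;
    }

    public int storedEntries() {
        return used;
    }

    public static void main(String[] args) {
        SparseDoubleRow row = new SparseDoubleRow(0.0);
        row.set(7, 2.0);     // word 7 occurs twice in this document
        row.set(1200, 1.0);  // word 1200 occurs once
        // prints "2.0 0.0 2"
        System.out.println(row.get(7) + " " + row.get(8) + " " + row.storedEntries());
    }
}
```

Reads use a binary search over the stored indices, which is why a layout like this can stay close to a dense array in speed while storing only the non-default entries.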

siamak_want · October 2012

Hi Marcin,

Thanks for your valuable information about sparse datasets. As you said, I just need to use a "Read CSV" operator and set the data management to "double_sparse_array"; I don't need the "Read Sparse" operator. Please correct me if I am wrong. There is one more question on my mind: I think using the "double_sparse_array" format will increase access time compared to the normal data management. In other words, should I expect lower memory consumption but longer execution time? If so, could you explain how much the execution time will increase?

Again thanks for your previous answer.

Skirzynski · October 2012

Hi,

yes, you do not need the "Read Sparse" operator. This operator is useful to read a certain sparse data format directly, but it is not necessary to use it in your case.

I haven't done any experiments to test the performance of this data format, but in my own experience there wasn't any difference that mattered to me. I suggest you use the sparse data format if you have sparse data. From the source comments:

Should always be used if more than 50% of the data is sparse. As fast (or even faster than map implementation) but needs considerably less memory.

Best regards
Marcin
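
To put rough numbers on the memory side of this trade-off, here is a back-of-envelope sketch. It assumes a dense row costs one 8-byte double per attribute and a sparse row costs a 4-byte int index plus an 8-byte double per non-default entry; object headers and array overhead are ignored, and the real RapidMiner data rows may be laid out differently. The class name and the example sizes are hypothetical.

```java
/** Rough memory estimate per row (hypothetical helper, assumptions as stated above). */
final class SparseMemoryEstimate {

    /** Dense row: one 8-byte double for every attribute. */
    public static long denseBytes(long attributes) {
        return 8L * attributes;
    }

    /** Sparse row: a 4-byte index plus an 8-byte double for each non-default entry. */
    public static long sparseBytes(long nonDefaultEntries) {
        return (4L + 8L) * nonDefaultEntries;
    }

    public static void main(String[] args) {
        long attributes = 10_000; // hypothetical vocabulary size
        long nonDefault = 200;    // hypothetical number of distinct words in one document
        System.out.println(denseBytes(attributes));  // prints 80000
        System.out.println(sparseBytes(nonDefault)); // prints 2400
    }
}
```

Under these assumptions, a row in which half of the entries are non-default still comes out smaller in the sparse layout (60000 versus 80000 bytes for 10,000 attributes), which is in line with the 50% guideline quoted from the source comments.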

siamak_want · October 2012

Thanks a lot, Marcin. Based on your explanations, the "double_sparse_array" format looks very useful for text datasets, and I will use it as you suggested.

But I need to set the data management from my Java application, because I'm using RM inside it. I just run an XML "process" from my own program, like this:


myOwnProcess.run(myOwnIOContainer);

Does anybody know how I should set the data management of my IOContainer to "double_sparse_array"?

Again, any ideas about this question would be greatly appreciated.

Skirzynski · October 2012

What do you mean by "your IOContainer"? An IOContainer is just a list of IOObjects, such as an ExampleSet. Do you mean that you create an ExampleSet programmatically? In that case you should use a DataRowFactory that uses this data format.


...
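// Use the double sparse array type to match the "double_sparse_array" setting;
// the second argument is the decimal point character.
DataRowFactory dataRowFactory = new DataRowFactory(DataRowFactory.TYPE_DOUBLE_SPARSE_ARRAY, '.');
// Create an empty data row with room for 2 attributes.
DataRow row = dataRowFactory.create(2);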
...

siamak_want · October 2012

Thanks Marcin,

I think you have pointed exactly to my problem. I will create a DataRow as you explained. RM really is a powerful tool for data mining.

Thanks for your guidance.
