what does double_sparse_array do?
siamak_want
Member Posts: 98 Contributor II
Hi forum,
I am working with large, sparse text datasets, and because of memory limitations I need to know what RM offers for handling sparse datasets. I found an interesting option in the "Read CSV" operator called "Data management". One of its choices is "double_sparse_array". Could someone explain what "double_sparse_array" does and how I can use this format?
I expect to save memory by using "double_sparse_array".
Any help or ideas would be greatly appreciated.
Thanks.
Answers
We call a dataset sparse if only a small fraction of its entries differ from a default value. This is especially the case in text mining because of the very large word vectors. Typically you have hundreds of attributes, each representing a word that occurs in the texts. But a single document contains only a small fraction of these words and has a zero entry for every other attribute. So the default value is 0 for most of the attributes.
This default value is stored just once, and every entry with this value points to it. This is nearly as fast as the map implementation but needs less memory. You should use this representation if 50% or more of the data share the same default value. And it is as easy as setting the combo box to "double_sparse_array".
Best regards
Marcin
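To illustrate the idea described above, here is a minimal Java sketch of a sparse row that keeps the default value once and stores only the entries that differ from it. It is not RapidMiner's actual implementation; the class name `SparseDoubleRow` and its methods are hypothetical and chosen only for this example.

```java
import java.util.Arrays;

/** Hypothetical sparse row: only entries that were explicitly set are stored. */
final class SparseDoubleRow {
    private final double defaultValue;          // stored once for the whole row
    private int[] indices = new int[0];         // sorted column indices of stored entries
    private double[] values = new double[0];    // values[i] belongs to column indices[i]

    SparseDoubleRow(double defaultValue) {
        this.defaultValue = defaultValue;
    }

    /** Returns the stored value, or the default if the column was never set. */
    double get(int column) {
        int pos = Arrays.binarySearch(indices, column);
        return pos >= 0 ? values[pos] : defaultValue;
    }

    /** Stores a value; setting a never-stored column to the default is a no-op. */
    void set(int column, double value) {
        int pos = Arrays.binarySearch(indices, column);
        if (pos >= 0) {                 // column already stored: overwrite in place
            values[pos] = value;
            return;
        }
        if (value == defaultValue) {    // nothing to store for default entries
            return;
        }
        int insertAt = -(pos + 1);      // binarySearch encodes the insertion point
        int[] newIndices = new int[indices.length + 1];
        double[] newValues = new double[values.length + 1];
        System.arraycopy(indices, 0, newIndices, 0, insertAt);
        System.arraycopy(values, 0, newValues, 0, insertAt);
        newIndices[insertAt] = column;
        newValues[insertAt] = value;
        System.arraycopy(indices, insertAt, newIndices, insertAt + 1, indices.length - insertAt);
        System.arraycopy(values, insertAt, newValues, insertAt + 1, values.length - insertAt);
        indices = newIndices;
        values = newValues;
    }

    /** Number of entries that actually occupy memory. */
    int storedEntries() {
        return indices.length;
    }

    public static void main(String[] args) {
        SparseDoubleRow row = new SparseDoubleRow(0.0);
        row.set(7, 2.0);
        row.set(3, 1.0);
        System.out.println(row.get(3));           // a stored entry
        System.out.println(row.get(5));           // falls back to the default
        System.out.println(row.storedEntries());  // only two entries are kept
    }
}
```

In this sketch a read costs a binary search over the stored entries instead of a direct array access, which is the kind of trade-off to expect from any sparse layout.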
Thanks for your valuable information about sparse datasets. As you said, I just need to use a "Read CSV" operator and set the data management to "double_sparse_array", and I don't need the "Read Sparse" operator. Please correct me if I am wrong. There is one more question on my mind: I think using the "double_sparse_array" format will increase the access time compared to the normal data management. In other words, should I expect lower memory consumption but a longer execution time? If so, by how much will the execution time increase?
Again thanks for your previous answer.
Yes, you do not need the "Read Sparse" operator. That operator is useful for reading a certain sparse data format directly, but it is not necessary in your case.
I haven't run any experiments to measure the performance, but in my own experience there wasn't any difference that mattered to me. I suggest using the sparse data format if you have sparse data. From the source comments:
Best regards
Marcin
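As a rough back-of-the-envelope illustration of the memory side of this trade-off, the sketch below compares a dense row with a sparse row. It assumes 8 bytes per `double` in the dense layout and 12 bytes per stored entry (a 4-byte `int` index plus an 8-byte `double`) in the sparse layout, and it ignores object headers and anything specific to RapidMiner's internals. The class name, method names, and example sizes are hypothetical.

```java
/** Hypothetical estimate: dense vs. sparse row size under simplified assumptions. */
final class SparseMemoryEstimate {

    /** Dense layout: one 8-byte double per attribute. */
    static long denseBytes(int attributes) {
        return 8L * attributes;
    }

    /** Sparse layout: a 4-byte index plus an 8-byte double per non-default entry. */
    static long sparseBytes(int nonDefaultEntries) {
        return 12L * nonDefaultEntries;
    }

    /** True if the sparse layout needs fewer bytes under this simple model. */
    static boolean sparseIsSmaller(int attributes, int nonDefaultEntries) {
        return sparseBytes(nonDefaultEntries) < denseBytes(attributes);
    }

    public static void main(String[] args) {
        int attributes = 10_000;  // made-up vocabulary size
        int nonZero = 100;        // made-up number of words in one document
        System.out.println("dense:  " + denseBytes(attributes) + " bytes");
        System.out.println("sparse: " + sparseBytes(nonZero) + " bytes");
        System.out.println("sparse is smaller: " + sparseIsSmaller(attributes, nonZero));
    }
}
```

Under this simplified model the sparse layout stops paying off once roughly two thirds of the entries are non-default, so the 50% rule of thumb mentioned earlier in the thread is on the cautious side.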
But I need to set the data management in Java, because I'm using RM from within my own Java application. I just call an XML "process" from my own program, like this:
Does anybody know how I should set the data management of my IOContainer to "double_sparse_array"?
Again, any ideas about this question would be greatly appreciated.
I think you have pointed exactly to my problem. I will create a DataRow as you explained. Now I think RM is a powerful tool for data mining.
Thanks for your guidance.