Which Amazon instance to choose for a "loop in loop" process requiring a huge amount of memory
Hi everyone,
I have a "loop in loop" process:
- a Loop Values operator whose inner process loads an example set of 1,000 reviews to filter;
- the above nested inside a Loop Attributes operator that loads a dictionary: a dataset of 15 columns containing all the words to be found in the reviews. The largest attribute contains 2,500 values (rows).
It's impossible to run this process in RapidMiner Studio, which freezes after a while because of the number of columns created by the Loop Values operator (one column per word of each attribute column of the dictionary: 12,660 columns in total).
I first launched the process on RapidMiner AI Hub with an r4.xlarge instance, but it crashed; then I tried a more powerful one, r4.4xlarge (16 vCPUs and 122 GiB of memory), but it crashed again after a few minutes.
Is there a way to size the instance based on the number of columns?
thanks in advance for any suggestion
cheers
Answers
Loops inside loops have a huge computational complexity (I think it's O(n^2) here), which is undesirable in any programming language, not just RapidMiner Studio. Perhaps there is a way to simplify the search by applying some tricks? It also sounds like you would benefit from tokenizing words rather than using columns for your searches; see the sketch below.
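To make the tokenizing idea concrete, here is a minimal Python sketch (invented data, not your actual process): each review is tokenized once into a set, so checking a whole dictionary category becomes a single set intersection instead of one pass per word.

```python
# Minimal sketch of set-based dictionary matching (invented data).
# Tokenizing each review once turns "one inner loop per dictionary word"
# into a single set intersection per category.
import re

reviews = ["the battery life is great", "screen cracked after a week"]
dictionary = {
    "battery": {"battery", "charge", "power"},
    "display": {"screen", "display", "pixel"},
}

for text in reviews:
    tokens = set(re.findall(r"\w+", text.lower()))  # tokenize once
    labels = {cat: int(bool(tokens & words))        # 1 if any word matches
              for cat, words in dictionary.items()}
    print(labels)
```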
Do you mind sharing your process with us so we can check whether there is anything we can do?
About your question (is there a way to size the instance based on the number of columns?):
The number of columns isn't a real measure of memory consumption unless you know exactly how large each column is and how it's composed. I think your big issue isn't memory but optimization, though. (I may be wrong, but it's worth a shot.)
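For a rough sense of scale (assuming 8 bytes per numeric cell, a common double-precision layout; RapidMiner's actual in-memory storage may differ):

```python
# Back-of-envelope estimate for the final table only (assumes 8 bytes
# per numeric cell; the real in-memory layout may differ).
rows, cols, bytes_per_cell = 1_000, 12_660, 8
table_mb = rows * cols * bytes_per_cell / 1024**2
print(f"~{table_mb:.0f} MB")  # ~97 MB
# The final table itself fits easily in 122 GiB, so the crashes more
# likely come from intermediate copies made on every loop iteration.
```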
All the best,
Rod.
Thanks a lot for your reply.
Enclosed are the process file and two Excel files (dictionary and dataset).
Normally I use local data repositories to optimize access time (and to share the project via RapidMiner AI Hub), but for sharing with you I've changed the process to use Excel files and Read Excel operators.
The goal of this work is to build an automatic labelling process that produces a validation dataset for a deep learning classification task; this dataset will then be manually validated. Therefore, the output must be label columns (the categories of the dictionary) containing ones or zeros.
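To illustrate the output shape I'm after (one 0/1 column per dictionary category), here is a small pandas sketch with invented data; the real process reads the attached Excel files instead:

```python
# Sketch of the target output: one binary label column per dictionary
# category (invented data; the real process reads the Excel files).
import re
import pandas as pd

df = pd.DataFrame({"verbatim": ["love the battery life", "screen is dim"]})
dictionary = {"battery": {"battery", "power"}, "display": {"screen", "dim"}}

for cat, words in dictionary.items():
    df[cat] = [
        int(bool(set(re.findall(r"\w+", t.lower())) & words))
        for t in df["verbatim"]
    ]
print(df)  # the verbatim text plus a 0/1 column per category
```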
The process contains notes regarding pending questions.
Thanks a lot for any suggestion!
have a good day!
Love RapidMiner's capabilities and the RapidMiner community!
Best,
Please check if this is what you are trying to achieve.
You'll need to install the Text Mining extension in case you don't currently have it.
In the process, pay special attention to the vector creation parameter.
Also look at the prune method (especially for memory handling): it helps you keep only the columns that actually matter, avoiding those without any occurrences.
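For intuition, here is a rough pandas analogue of that pruning step (invented word-count columns):

```python
# Rough analogue of pruning: drop word-count columns with no occurrences.
import pandas as pd

counts = pd.DataFrame({
    "battery": [1, 0, 2],
    "pixel":   [0, 0, 0],   # never appears, so pruning removes it
    "screen":  [0, 1, 0],
})
pruned = counts.loc[:, counts.sum(axis=0) > 0]  # keep columns with hits
print(list(pruned.columns))  # ['battery', 'screen']
```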
I made a change to output the result you expect for the first point.
For the second case, could you share an example dictionary and some example rows? I guess we could use Replace (Dictionary) with some regex magic to accomplish that.
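In case it helps, the regex mechanics behind a replace dictionary look roughly like this in Python (invented mapping, just to show the idea; inside RapidMiner this would be the Replace (Dictionary) operator):

```python
# Sketch of dictionary-based replacement with word-boundary regexes
# (invented mapping, just to show the mechanics).
import re

mapping = {"batery": "battery", "scrn": "screen"}  # variant -> canonical
pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")

text = "the batery died and the scrn froze"
print(pattern.sub(lambda m: mapping[m.group(1)], text))
# -> "the battery died and the screen froze"
```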
Perhaps there's no relationship, but it seems to be correlated with the value of the "verbatim size" column.
You just need to adjust the Remove Rating step (a Select Attributes operator) inside the Loop Attributes operator.
In it, I removed all the numeric attributes we had before creating the new counting attributes.
If you change the attribute filter type to subset, you can remove as many numeric attributes (Rating and Verbatim) as you need, so they don't add their numbers to your totals.
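The same idea in pandas, for intuition (invented column names): exclude the pre-existing numeric attributes before summing the count columns.

```python
# Equivalent idea in pandas: exclude original numeric attributes
# (e.g. Rating) before summing the new counting attributes, so they
# don't inflate the totals. Invented column names.
import pandas as pd

df = pd.DataFrame({
    "Rating": [4, 5],          # original numeric attribute: exclude
    "battery_count": [1, 0],   # counting attribute: keep
    "screen_count": [0, 2],
})
count_cols = df.columns.difference(["Rating"])  # the "subset" filter
print(df[count_cols].sum(axis=1))               # per-row totals: 1, 2
```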