"Web crawling - overcome memory limit by splitting URLs into subsamples and then combining"
I retrieve data from many web pages (>30,000) with the "Get Pages" operator. I have imported all my URLs into the repository from an Excel file. I then process the pages with regex (I extract several categories) and write the category information to Excel, one row per URL. My process works fine with a small number of URLs, but my computer does not have enough memory to process all the web pages at once.

I would like to split the URLs into pieces of about 2,000 each, run the process on each piece separately, and at the end join the Excel files together. I looked at the sampling operators, but most of them produce a random sample, and I want to keep the order in which the URLs are crawled (if possible).

I think I need to write a loop, but I cannot figure out where to start. For example, I do not know which loop operator to use, or how to make it write several Excel files or sheets with different names (1-x). Could anybody help me with that?
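To make the intent concrete, here is a minimal sketch of the batching idea outside RapidMiner: split an ordered URL list into fixed-size chunks and write one numbered output file per chunk. Everything here is illustrative (the `chunked` helper, the placeholder URLs, and the `results_N.xlsx` naming are assumptions, not RapidMiner operators):

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, preserving their original order."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder URL list standing in for the Excel import (>30,000 rows).
urls = [f"https://example.com/page{i}" for i in range(1, 30001)]

for batch_no, batch in enumerate(chunked(urls, 2000), start=1):
    # In the real process, each batch would be fetched and parsed here,
    # and its rows written to a separately numbered file, e.g.:
    out_name = f"results_{batch_no}.xlsx"
    # write_rows(out_name, process(batch))  # placeholder for the real work
```

In RapidMiner itself, the equivalent would be a batch-looping operator around the Get Pages subprocess, with a macro (the loop counter) inserted into the Write Excel file name to produce the numbered outputs.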