"Web crawling: overcome the memory limit by splitting URLs into subsamples, then combining"
I retrieve data from many web pages (>30,000) with the Get Pages operator. I have imported all of my URLs into the repository from an Excel file. I then process the retrieved pages with regular expressions (extracting several categories) and write the category information to Excel, one row per URL.

The process works fine with a small number of URLs, but my computer does not have enough memory to process all of the web pages at once. I would like to split the URLs into batches of about 2,000 each, run the process on each batch separately, and join the resulting Excel files at the end. I looked at the sampling operators, but most of them produce a random sample, and I want to preserve the order in which the URLs are crawled (if possible).

I think I need a loop, but I cannot figure out where to start. For example, I do not know which loop operator to use, or how to make it write several Excel files or sheets with different names (1-x). Could anybody help me with that?
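To make the batching idea concrete outside of RapidMiner, here is a minimal Python sketch of the same pattern: slice an ordered URL list into fixed-size chunks and write one numbered output file per chunk, so the files can later be concatenated in order. The URLs, the chunk size, the `results_N.csv` naming scheme, and the "extracted-value" placeholder are all illustrative assumptions, not part of the original process; the real fetching and regex extraction would happen inside each batch.

```python
import csv

def chunked(items, size):
    """Yield consecutive slices of `items`, preserving the original order."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical stand-ins: in the real process each URL would be fetched
# (as Get Pages does) and parsed with a regular expression per category.
urls = [f"https://example.com/page{i}" for i in range(1, 7)]

for batch_no, batch in enumerate(chunked(urls, 2), start=1):
    # One numbered output file per batch (results_1.csv, results_2.csv, ...),
    # matching the "different names (1-x)" requirement from the question.
    with open(f"results_{batch_no}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "category"])           # header row
        for url in batch:
            writer.writerow([url, "extracted-value"])  # placeholder category
```

Because each chunk is a contiguous slice of the input list, concatenating `results_1.csv` through `results_x.csv` in numeric order reproduces the full result set in the original crawl order.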