Web crawling: overcoming the memory limit by splitting URLs into subsamples and combining the results

In777 Member Posts: 2 Contributor I
edited October 2019 in Help
Hello,

I retrieve data from a large number of web pages (>30000) with the "Get Pages" operator. I have imported all my URLs into the repository from an Excel file. I then process the pages with regular expressions to extract several categories, and write the categories to Excel with a separate row for each URL. My process works fine with a small number of URLs, but my computer does not have enough memory to process all the web pages at once. I would like to split them into batches of about 2000 URLs each and process every batch separately. At the end I will join the Excel files together. I looked at the sampling operators, but most of them produce a random sample, and I want to keep the order in which the URLs are crawled (if possible). I think I need to use a loop, but I cannot figure out where to start. For example, I do not know which loop operator to use (I think I need to loop over examples) or how to make it write the results to several Excel files (I presume I should write the results incrementally to an SQL database rather than to Excel). Could anybody help me with this issue?
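To make the batching idea concrete outside of RapidMiner, here is a minimal Python sketch of what I have in mind. It is not a RapidMiner process: the function names `consecutive_batches` and `crawl_in_batches`, the `results` table, the `<title>` regex and the `fetch` callable are all hypothetical placeholders chosen for illustration. It splits the URL list into consecutive batches (so the original order is kept) and appends each batch to an SQLite database before moving on, so only one batch is held in memory at a time.

```python
import re
import sqlite3
from typing import Callable, Iterator, Sequence, Tuple


def consecutive_batches(urls: Sequence[str], size: int) -> Iterator[Tuple[int, Sequence[str]]]:
    """Yield (start_index, batch) pairs; batches are consecutive, so order is preserved."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(urls), size):
        yield start, urls[start:start + size]


def crawl_in_batches(
    urls: Sequence[str],
    fetch: Callable[[str], str],           # hypothetical: returns the page text for a URL
    db_path: str,
    batch_size: int = 2000,
    pattern: str = r"<title>(.*?)</title>",  # placeholder for the real category regex
) -> int:
    """Process URLs batch by batch and append the results to SQLite.

    Returns the number of rows written.
    """
    regex = re.compile(pattern, re.IGNORECASE | re.DOTALL)
    written = 0
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS results ("
            "position INTEGER PRIMARY KEY, url TEXT, category TEXT)"
        )
        for start, batch in consecutive_batches(urls, batch_size):
            rows = []
            for offset, url in enumerate(batch):
                match = regex.search(fetch(url))
                category = match.group(1).strip() if match else None
                # 'position' records the original order of the URL in the input list
                rows.append((start + offset, url, category))
            conn.executemany("INSERT OR REPLACE INTO results VALUES (?, ?, ?)", rows)
            conn.commit()  # each finished batch is saved before the next one starts
            written += len(rows)
    finally:
        conn.close()
    return written
```

Because every row stores its original position, the combined result can be read back in crawl order with `SELECT * FROM results ORDER BY position`, and a run that stops partway can be resumed from the first missing position.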