How to know the needed time for each operator to run in RapidMiner

nourhan_taya Member Posts: 11 Contributor I
edited November 2018 in Help

Hi, 

I am applying text mining to financial-market prediction and need to extract 1,600 articles from their links. When I use the "Get Pages" operator, the run time reached 18 hours with no results, and I do not know when it will finish. So I would like to ask whether RapidMiner is running normally or not.

(Note: I am using RapidMiner 7.5. My PC runs Windows 10 with a 7th-generation Core i7 processor, 16 GB of RAM, and a 300 GB drive.)

Best Answer

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Solution Accepted

    To add to this: if you are calling some URLs, the site will slow down its responses if you make too many requests within a certain period of time.
    Solutions for this include:

    • Limiting the number of calls per batch
    • Increasing the number of individual IPs making the calls
    • Adding a delay between each call

    However, without knowing the site I don't know how easy or difficult this might be.
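    This isn't something you can script inside Get Pages itself, but the batch-plus-delay idea above can be sketched in plain Python (the function names and defaults here are illustrative, not a RapidMiner API):

```python
import time
from typing import Callable, Iterable, List

def batched(urls: List[str], batch_size: int) -> Iterable[List[str]]:
    # Split the URL list into fixed-size batches.
    for i in range(0, len(urls), batch_size):
        yield urls[i:i + batch_size]

def fetch_politely(urls: List[str], fetch: Callable[[str], str],
                   batch_size: int = 50, delay_s: float = 1.0,
                   pause_s: float = 60.0) -> List[str]:
    # Fetch each URL with a per-request delay, plus a longer
    # pause between batches to stay under the site's rate limits.
    results = []
    for batch in batched(urls, batch_size):
        for url in batch:
            results.append(fetch(url))
            time.sleep(delay_s)
        time.sleep(pause_s)
    return results
```

    At 1,600 URLs with a 1-second delay, the downloads alone take roughly 27 minutes before any parsing, which is one reason a long-running Get Pages can look stalled.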

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hello  @nourhan_taya - so process time varies a lot depending on many factors including your machine, the size and scope of the documents, etc...  One thing that I can definitely tell you is that RapidMiner loves RAM and multiple core processors.  FWIW, I just upgraded to 64GB of RAM with my 6-core Intel Xeon E5 to keep things humming along.

     

    If I were you, I'd use the Sample operator and grab a small sample of your documents first. Benchmark the sample and then gradually increase it so you can get a sense of whether the full number of docs is going to take 2 days or 2 years. :)
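    To put numbers on the sample-then-benchmark approach, a linear extrapolation from the sample timing is enough for a first estimate (a hypothetical helper, not a RapidMiner operator):

```python
def estimate_total_seconds(sample_seconds: float, sample_size: int,
                           total_size: int) -> float:
    # Linear extrapolation: seconds per document times total documents.
    return sample_seconds / sample_size * total_size

# e.g. if 50 sampled articles took 120 s, then 1600 articles need about
# 120 / 50 * 1600 = 3840 s, i.e. just over an hour -- assuming the
# per-document cost stays roughly constant.
```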

     

    [copying from this thread]

     

    Scott

  • nourhan_taya Member Posts: 11 Contributor I
    Many thanks, prof. sgenzer, for the reply. I will try this solution.
  • nourhan_taya Member Posts: 11 Contributor I

    Hi Mr JEdward,

    Many thanks for your reply. I am retrieving the links from Daily Mail archives. I have already used the third solution and raised the delay to 1000 ms. I will try the other solutions, but I didn't understand the second one. Does it mean increasing the number of computers doing the process?

    Thanks for help

  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn

    Yes, that's correct: increasing the number of computers used to do the process. However, sometimes those computers aren't actually using different IPs; all the outside connections go through a single pipe (common in small companies).
    To manage this, you can allocate the links out to each crawler and have each machine download its own share individually.
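    The allocation step can be as simple as a round-robin split of the link list across the crawler machines; a minimal sketch (the function name is illustrative):

```python
from typing import List

def partition(urls: List[str], n_workers: int) -> List[List[str]]:
    # Round-robin split: worker i gets every n_workers-th URL,
    # so each machine can download its share independently.
    return [urls[i::n_workers] for i in range(n_workers)]
```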

     

    There are also webcrawl services that you can pay for which will scale-out to get around any restrictions the web-host might have.  

  • nourhan_taya Member Posts: 11 Contributor I

    Many thanks, Mr. JEdward. I really appreciate your help! :smileyhappy:
