How to know the needed time for each operator to run in RapidMiner

nourhan_taya · August 2017

Hi,

I am applying text mining to financial market prediction, and I need to extract 1,600 articles from their links. When I use the "Get Pages" operator, the process ran for 18 hours without producing results, and I do not know when it will finish. So I would like to ask whether RapidMiner is running normally or not.

(Note: I am using RapidMiner 7.5. My PC runs Windows 10 with a 7th-generation Core i7 processor and 16 GB of RAM. The memory is 300 GB.)

JEdward · August 2017

To add to this: if you are calling URLs on a site, the site may slow down its responses when you make too many calls in a certain period of time.
Solutions for this include:

Limiting the number of calls to batches
Increasing the number of individual IPs making the call
Adding a delay between each call

However, without knowing the site I don't know how easy or difficult this might be.
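As a rough illustration of the first and third options (batches plus a delay between calls), here is a minimal Python sketch outside RapidMiner. The names `fetch_politely`, `batched`, `batch_size`, `delay_s` and `batch_pause_s` are made up for this example, and the `fetch` function is supplied by the caller (for instance a small wrapper around `urllib.request.urlopen`). Suitable delay values depend on the site and are not known here.

```python
import time
from typing import Callable, Dict, Iterator, List


def batched(urls: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size URLs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(urls), batch_size):
        yield urls[start:start + batch_size]


def fetch_politely(
    urls: List[str],
    fetch: Callable[[str], str],
    batch_size: int = 50,          # assumed value, tune per site
    delay_s: float = 1.0,          # pause between calls inside a batch
    batch_pause_s: float = 30.0,   # longer pause between batches
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch every URL, pausing between calls and between batches."""
    results: Dict[str, str] = {}
    batches = list(batched(urls, batch_size))
    for batch_index, batch in enumerate(batches):
        for url_index, url in enumerate(batch):
            results[url] = fetch(url)
            # No delay after the last call of a batch; the batch pause follows.
            if url_index < len(batch) - 1:
                sleep(delay_s)
        # No pause after the final batch.
        if batch_index < len(batches) - 1:
            sleep(batch_pause_s)
    return results
```

Passing `sleep` as a parameter is only there so the pacing can be checked without actually waiting; in real use the default `time.sleep` applies.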

sgenzer · August 2017

Hello @nourhan_taya - process time varies a lot depending on many factors, including your machine, the size and scope of the documents, etc. One thing that I can definitely tell you is that RapidMiner loves RAM and multi-core processors. FWIW, I just upgraded to 64 GB of RAM with my 6-core Intel Xeon E5 to keep things humming along.

If I were you, I'd use the Sample operator and grab a small sample of your documents first. Benchmark the sample and then gently increase it so you can get a sense of whether the full number of docs is going to take 2 days or 2 years. :)
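The same idea can be sketched outside RapidMiner in a few lines of Python: time a small random sample, then scale up. The function names `time_sample` and `estimate_total_seconds` are invented for this illustration, and the estimate assumes run time grows roughly linearly with the number of documents, which may not hold if the site throttles requests.

```python
import random
import time
from typing import Callable, List, Optional, Sequence


def time_sample(
    items: Sequence,
    process: Callable[[object], object],
    sample_size: int,
    seed: Optional[int] = None,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Process a random sample of items and return the elapsed seconds."""
    sample: List = random.Random(seed).sample(list(items), sample_size)
    start = clock()
    for item in sample:
        process(item)
    return clock() - start


def estimate_total_seconds(
    sample_seconds: float, sample_size: int, total_items: int
) -> float:
    """Linear extrapolation: average seconds per item times the total count."""
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")
    return sample_seconds / sample_size * total_items
```

For example, if a sample of 20 articles took 30 seconds, `estimate_total_seconds(30.0, 20, 1600)` gives 2400.0 seconds, or about 40 minutes for all 1,600. Repeating this with a larger sample shows whether the per-article time stays stable or starts to climb.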

[copying from this thread]

Scott

nourhan_taya · August 2017

Many thanks, Prof. sgenzer, for the reply. I will try this solution.

nourhan_taya · August 2017

Hi Mr JEdward,

Many thanks for your reply. I am retrieving the links from the Daily Mail archives. I have already used the third solution and maximized the delay time to 1000. I will try the other solutions, but I didn't understand the second one. Does it mean increasing the number of computers doing the process?

Thanks for your help.

JEdward · August 2017

Yes, that's correct: increasing the number of computers used to do the process. However, sometimes these computers aren't using different IPs, because all the outside connections go through a single pipe (common in small companies).
To manage this, you can allocate the links across the crawlers and have each one download its share individually.
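One simple way to allocate the links is round-robin, sketched below in Python. The function name `partition_round_robin` and the idea of writing each share to its own file for a separate machine are assumptions for this example, not something RapidMiner provides.

```python
from typing import List


def partition_round_robin(urls: List[str], n_workers: int) -> List[List[str]]:
    """Split urls into n_workers lists, dealing them out one at a time."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    # Worker i gets items i, i + n_workers, i + 2 * n_workers, ...
    return [urls[i::n_workers] for i in range(n_workers)]
```

With 1,600 links and 4 crawlers, each crawler would receive 400 links. Each share could then be saved as its own link list and processed on a separate machine.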

There are also webcrawl services that you can pay for which will scale-out to get around any restrictions the web-host might have.

nourhan_taya · August 2017

Many thanks, Mr. JEdward. I really appreciate your help. :)
