I am applying text mining to financial-market prediction, and I need to extract 1,600 articles from their links. When I use the "Get Pages" operator, the run has been going for 18 hours with no results, and I do not know when it will finish. Could you tell me whether RapidMiner is running normally or not?
(Note: I am using RapidMiner 7.5 on Windows 10, with a 7th-generation Core i7 processor, 16 GB of RAM, and 300 GB of storage.)
Hello @nourhan_taya - processing time varies a lot depending on many factors, including your machine and the size and scope of the documents. One thing that I can definitely tell you is that RapidMiner loves RAM and multi-core processors. FWIW, I just upgraded to 64 GB of RAM alongside my 6-core Intel Xeon E5 to keep things humming along.
If I were you, I'd use the Sample operator to grab a small sample of your documents first. Benchmark the sample, then gradually increase it so you can get a sense of whether the full set of docs is going to take 2 days or 2 years.
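The benchmark-then-extrapolate idea above can be sketched outside RapidMiner as well. This is a minimal illustration, not RapidMiner's actual implementation; `fetch` is a hypothetical callable that downloads one document:

```python
import time

def estimate_total_time(urls, fetch, sample_size=10):
    """Time a small sample of downloads and project the full run.

    `fetch` is any callable that retrieves one URL; swap in your
    real downloader. Returns the projected total time in seconds.
    """
    sample = urls[:sample_size]
    start = time.time()
    for url in sample:
        fetch(url)
    per_doc = (time.time() - start) / len(sample)
    return per_doc * len(urls)  # projected seconds for all URLs
```

Running this on 10 of the 1,600 links gives a rough projection for the whole job, which you can refine by increasing `sample_size`.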
[copying from this thread]
To add to this, if you are calling URLs on a site, the site will slow down its responses if you make too many calls within a certain period of time.
Solutions for this include:
However, without knowing the site I don't know how easy or difficult this might be.
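One of those workarounds, pausing between requests, can be sketched like this. This is a hedged illustration analogous to the delay setting in Get Pages, not its actual implementation; the `fetch` parameter and function names are assumptions:

```python
import time
import urllib.request

def fetch_politely(urls, delay_seconds=1.0, fetch=None):
    """Download pages one at a time, pausing between requests so the
    site is less likely to throttle or block repeated calls."""
    if fetch is None:
        # default fetcher; swap in your own downloader if preferred
        fetch = lambda url: urllib.request.urlopen(url, timeout=10).read()
    pages = []
    for url in urls:
        pages.append(fetch(url))
        time.sleep(delay_seconds)  # the polite pause between calls
    return pages
```

A longer `delay_seconds` trades total runtime for a lower chance of being rate-limited by the host.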
Hi Mr JEdward,
Many thanks for your reply. I am retrieving the links from the Daily Mail archives. I have already used the third solution and maximized the delay time to 1000. I will try the other solutions, but I didn't understand the second one. Does it mean increasing the number of computers doing the process?
Thanks for the help.
Yes, that's correct: increasing the number of computers used to do the process. However, sometimes these computers aren't using different IPs; instead, all outside connections go through a single pipe (common in small companies).
To work around this, you can allocate the links across the crawlers and have each one download its share individually.
There are also web-crawling services you can pay for, which will scale out to get around any restrictions the web host might have.
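The link-allocation idea above amounts to partitioning the URL list so each crawler machine works on its own share. A minimal round-robin sketch (names are illustrative, not part of any RapidMiner API):

```python
def split_links(urls, n_crawlers):
    """Round-robin the URL list into one sub-list per crawler,
    so each machine can download its share independently."""
    return [urls[i::n_crawlers] for i in range(n_crawlers)]
```

For example, splitting 1,600 links across 4 crawlers gives each machine 400 links to fetch in parallel, each from its own IP if the network allows it.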