Is it possible to crawl the links on the "IBM Watson News Explorer"?
jonas_boersch
Member Posts: 1 Learner I
Hello Community,
I can't manage to crawl the links to the news articles on IBM Watson's News Explorer. The "Crawl Web" operator stops after crawling the header of the web page. The links to the articles are in the "details" window on the left side of the page.
Can someone help me find a solution? I would be very thankful. The link to the web page is: http://news-explorer.mybluemix.net/?query=ipcc&type=unconstrained
Kind regards,
Jonas
Best Answer
rfuentealba RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
Good Sir @jonas_boersch,
I am sorry to inform you that your requirement is not feasible with the current RapidMiner tooling, because the "Get Page" and "Crawl Web" operators were developed before the proliferation of JavaScript-built, API-driven websites with Vue.js, Angular.js, Ember.js or React.js. These operators only download the HTML that the server sends and do not run the page's JavaScript, so the article links that the page loads afterwards never reach the crawler. All is not lost, though. To my knowledge there are two other options:
- Explore the page's code and find the original data sources. After a quick inspection I made for you, it seems feasible to find the REST services that the IBM Watson code calls.
- Use Selenium, a browser automation tool that can drive a headless web browser, so that the JavaScript runs and you get the fully rendered page. I would call this the hard way, because it is not easy to set up, but it is worth the time if you retrieve pages frequently.
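For the first option, a minimal Python sketch could look like the following. It assumes you have already found a JSON endpoint in the browser's developer tools (Network tab). The endpoint is not given in this thread, so the function takes it as a parameter, and the names `collect_links` and `fetch_article_links` are hypothetical helpers, not part of RapidMiner or of any IBM API. Because the layout of the response is unknown, the sketch simply walks the whole JSON document and keeps every string that looks like a web link.

```python
import json
import urllib.request


def collect_links(payload):
    """Recursively gather every string starting with http:// or https://
    from decoded JSON (nested dicts, lists and strings)."""
    links = []
    if isinstance(payload, dict):
        for value in payload.values():
            links.extend(collect_links(value))
    elif isinstance(payload, list):
        for item in payload:
            links.extend(collect_links(item))
    elif isinstance(payload, str):
        if payload.startswith(("http://", "https://")):
            links.append(payload)
    return links


def fetch_article_links(endpoint, timeout=30):
    """Download JSON from `endpoint` (an assumption: the URL you found
    yourself in the Network tab) and return its links without duplicates,
    in the order they first appear."""
    with urllib.request.urlopen(endpoint, timeout=timeout) as response:
        payload = json.load(response)
    return list(dict.fromkeys(collect_links(payload)))
```

You would call `fetch_article_links` with the endpoint you discovered and then filter the result, since a generic walk like this may also pick up image or logo URLs alongside the article links.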
Have a good day,
Rodrigo.
Answers
Scott