Is it possible to crawl the links on the "IBM Watson News Explorer"?

jonas_boersch · November 2018

Hello Community,

I can't manage to crawl the links to the news articles on IBM Watson's News Explorer. The "Crawl Web" operator stops after crawling the header of the web page. The links to the articles are in the "details" window on the left side of the page.

Can someone help me find a solution? I would be very thankful. The link to the webpage is: http://news-explorer.mybluemix.net/?query=ipcc&type=unconstrained

Kind regards,
Jonas

rfuentealba · November 2018

Good Sir @jonas_boersch, I regret to inform you that your requirement is not feasible with the current RapidMiner tooling, because the "Get Page" and "Crawl Web" operators were developed before the proliferation of JavaScript-built, API-driven websites made with Vue.js, Angular.js, Ember.js or React.js. On such sites the content is loaded by scripts in the browser, which would explain why the crawl stops at the page header. The sun has not set yet, though, and to my knowledge there are two other choices:

1. Explore the page's code and find the original data sources. After a quick inspection I made for you, it seems feasible to locate the REST servers that the IBM Watson page calls.
2. Use Selenium, a browser automation tool that can drive a headless web browser, so that the page's scripts run before you read the rendered page. I would call this the hard way: it is not easy to set up, but it is worth the time if you retrieve pages frequently.
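
To illustrate the first option, here is a minimal Python sketch, assuming the page's data source returns JSON that contains the article URLs somewhere in its structure. The endpoint is not identified in this thread, so `fetch_links` takes whatever address you find in your browser's network inspector, and both function names are hypothetical helpers of mine, not part of RapidMiner or of any IBM API.

```python
import json
from urllib.request import urlopen


def collect_links(node, found=None):
    """Recursively gather every http(s) URL string in a decoded JSON value."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for value in node.values():
            collect_links(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_links(item, found)
    elif isinstance(node, str):
        # Keep only strings that look like web links, skipping duplicates.
        if node.startswith(("http://", "https://")) and node not in found:
            found.append(node)
    return found


def fetch_links(endpoint):
    """Download JSON from an endpoint you located yourself and list its links.

    `endpoint` is an assumption: the real address must come from inspecting
    the page's network traffic.
    """
    with urlopen(endpoint, timeout=30) as response:
        return collect_links(json.load(response))
```

Because the shape of the real response is unknown, the helper walks the whole structure instead of relying on particular field names. The resulting list could then be fed to "Get Pages" or a similar operator.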

I apologise again for the bad news and can only hope one of these solutions matches your needs.

Have a good day,

Rodrigo.

sgenzer · November 2018

Hmm, I poked around the Bluemix site and it seems this News Explorer is quite old and no longer really supported. The last doc I can find is a blog article from 2016 (https://developer.ibm.com/watson/blog/2016/01/04/exciting-updates-for-news-explorer/) that says it uses the AlchemyData News API, which leads to a dead link.

Perhaps try a new news source?

Scott
