Is it possible to crawl the links on the "IBM Watson News Explorer"?

jonas_boersch Member Posts: 1 Learner I
edited January 2019 in Help
Hello Community,

I can't manage to crawl the links to the news articles on IBM Watson's News Explorer. The "Crawl Web" operator just stops after crawling the header of the web page. The links to the articles are in the "details" window on the left side of the page.

Can someone help me find a solution? I would be very thankful. The link to the webpage is: http://news-explorer.mybluemix.net/?query=ipcc&type=unconstrained

Kind regards,
Jonas
Best Answer

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Solution Accepted
    Good Sir @jonas_boersch, I am sorry to tell you that your requirement is currently not feasible with the RapidMiner tooling, because the "Get Page" and "Crawl Web" operators were developed before the proliferation of JavaScript-built, API-driven websites (Vue.js, Angular.js, Ember.js or React.js). Pages like this typically load their content with JavaScript after the initial HTML arrives, which would explain why the crawler only sees the header. The sun has not set, though, and to my knowledge there are two other choices:
    • Explore the page's code and find the original data sources. After a quick inspection, it seems feasible to find the REST endpoints that the News Explorer code calls.
    • Use Selenium, a browser automation tool that can drive a headless web browser, which runs the JavaScript and then gives you the fully rendered page. I would call this the hard way: it is not easy to set up, but it is worth the time if you retrieve pages frequently.
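For the first option, here is a minimal sketch in plain Python (outside RapidMiner), assuming you have already spotted a JSON endpoint in the browser's network tab. The actual endpoint is not identified here, so `fetch_json`, `extract_links` and the sample payload are hypothetical names and data for illustration only; the real response will most likely have a different shape.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Download a URL and parse the response body as JSON.

    The URL is whatever endpoint you find yourself; none is assumed here.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_links(node):
    """Recursively collect every http(s) string found in parsed JSON."""
    links = []
    if isinstance(node, dict):
        for value in node.values():
            links.extend(extract_links(value))
    elif isinstance(node, list):
        for item in node:
            links.extend(extract_links(item))
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        links.append(node)
    return links


# Made-up payload standing in for whatever the real endpoint returns.
sample = {
    "articles": [
        {"title": "First", "url": "https://example.com/a"},
        {"title": "Second", "url": "https://example.com/b"},
    ]
}
print(extract_links(sample))
```

With a real endpoint you would call `extract_links(fetch_json(endpoint_url))` instead of using the sample. Walking the whole structure avoids guessing the field names in advance, at the cost of also picking up non-article URLs such as image links.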
    I apologise again for the bad news, and can only hope one of these solutions matches your needs.

    Have a good day,

    Rodrigo.

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    edited November 2018
    hmm I poked around the Bluemix site and it seems this News Explorer is quite old and not really being supported. Last doc I can find is a blog article from 2016 (https://developer.ibm.com/watson/blog/2016/01/04/exciting-updates-for-news-explorer/) that says that it uses AlchemyData News API - which leads to a dead link :( Perhaps try a new news source?

    Scott