
How can I loop the Get Page operator?

RaffiH Member Posts: 9 Learner I
edited March 2020 in Help
For my bachelor thesis I need to get the text of many websites. I know how to get the text of a single page, but I need it for every page of each website. How can I achieve that?

Thank you very much in advance and stay healthy!

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Hi @RaffiH, this is classic web scraping. So the first question I always have to ask is: what is the legality of scraping these pages? Are they in the public domain, or are they proprietary? If it is the latter, you will most likely be violating their Terms of Service.

    Scott

  • RaffiH Member Posts: 9 Learner I
    @sgenzer Thank you very much for your answer!

    I want to analyze the sentiment of different company websites, as well as the words they use most often.
    This is one of the company websites I would like to analyze: https://www.breeze-technologies.de/.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Hi @RaffiH, in that case, why don't you just contact the company through their website and ask for permission? That way you can be sure.

    Scott
  • RaffiH Member Posts: 9 Learner I
    Hi @sgenzer, okay, and after I've done that? What do I have to do next?

    Rafael
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Just use Get Page on the URLs.
  • RaffiH Member Posts: 9 Learner I
    But what if I don't want just the first page? How can I get every page of a website from its URL?

    Thank you very much for your help!
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    @RaffiH then you're in a web crawling scenario - something that we generally do not condone. If you are 100% sure that you have the legal and ethical go-ahead to crawl the site, I'd use a Python library like Scrapy: https://scrapy.org
  • RaffiH Member Posts: 9 Learner I
    Thank you very much @sgenzer, but why is it not possible to loop Get Page? I just need the text on the website.
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Oh sure, you can use Get Page inside a Loop operator, but you'll need to loop over the values in an ExampleSet that contains the URLs. If you have a complete list of all the URLs of the site, that will work fine.
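For readers who would rather script this than build it in RapidMiner, the "loop over a list of URLs and fetch each page" idea from the last answer can be sketched in plain Python. This is only an illustration under assumptions, not RapidMiner's Get Page or Loop implementation: the names `TextExtractor`, `html_to_text`, `fetch_html`, and `get_pages` are made up for this example, only the standard library is used, and you still have to supply the URL list yourself (and have permission to fetch those pages).

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_html(url):
    """Download one page and decode it (falls back to UTF-8 if no charset is sent)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def get_pages(urls, fetch=fetch_html):
    """Loop over a list of URLs and map each URL to its extracted text.

    `fetch` is injectable so the loop can be tried without network access.
    """
    return {url: html_to_text(fetch(url)) for url in urls}
```

Calling `get_pages` with a list of page URLs returns a dictionary of URL to page text, which could then be fed into sentiment or word-frequency analysis. The sketch deliberately does not discover links on its own; as noted in the thread, following every link on a site is crawling, and a dedicated library such as Scrapy is the better fit once that is legally cleared.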