Returning website HTML code

bking Member Posts: 4 Contributor I
edited November 2018 in Help

I am a RapidMiner learner and need to download the HTML code for any given website in order to determine whether any of its pages include some form of login, form submission, or other workflow. The idea is to download the HTML and then search it for identifiers unique to such functionality. My questions are:


a) Is this the best way to accomplish the task?

b) What is the best sequence of operators to do so?


Thank you in advance for your help; it is greatly appreciated. BK

Best Answer

    Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    Solution Accepted

    If you have multiple pages to retrieve, you can also use a CSV file of URLs with the "Get Pages" operator.

    And if you need to crawl through an entire site, the very useful "Crawl Web" operator lets you specify crawling rules and crawling depth and saves all retrieved pages as HTML files, so it is well suited to your use case. Just be sure to observe any crawling rules posted in the T&C of the sites you are scraping.


    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
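The advice above about observing crawling rules can be partly automated outside RapidMiner. The following is a minimal Python sketch, not part of the original answer, that uses the standard-library `urllib.robotparser` to filter a URL list against a site's robots.txt before crawling. It assumes the robots.txt text has already been downloaded; `allowed_urls`, `SAMPLE_ROBOTS`, and the example.com URLs are hypothetical names chosen for illustration. Note that robots.txt is only one place crawl rules are published; a site's terms and conditions still have to be read by a person.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, user_agent="*"):
    """Return only the URLs that the given robots.txt text permits for user_agent."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content, used only to demonstrate the check.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    candidates = [
        "https://example.com/index.html",
        "https://example.com/private/login.html",
    ]
    print(allowed_urls(SAMPLE_ROBOTS, candidates))
```

The filtered list could then be written to the CSV file that feeds "Get Pages".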


    sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hi @bking - sure... my first thought is to use the "Get Page" operator in the Web Mining extension. That should do the trick nicely.



    bking Member Posts: 4 Contributor I

    Thank you, Scott & Brian. Very helpful... I used the Crawl Web and Filter Examples operators to grab each page's HTML and then filter based on keywords (yet to be defined by the web development team). Will keep you posted as the project develops. Thank you again.
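For readers who want to prototype the keyword step outside RapidMiner, here is a minimal Python sketch of the idea described in this thread: scan a page's HTML for markers of login or form submission. It is not the poster's actual process. Since the real keywords were still to be defined, the sketch assumes two stand-in identifiers, the `<form>` tag and `<input type="password">`; `FormFinder` and `classify_page` are hypothetical names.

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Counts <form> tags and notes whether any password input appears."""

    def __init__(self):
        super().__init__()
        self.form_count = 0
        self.has_password_field = False

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names; attribute values may be None.
        attr_map = dict(attrs)
        if tag == "form":
            self.form_count += 1
        elif tag == "input" and (attr_map.get("type") or "").lower() == "password":
            self.has_password_field = True


def classify_page(html):
    """Summarize one page: number of forms, and whether it looks like a login page."""
    finder = FormFinder()
    finder.feed(html)
    finder.close()
    return {"forms": finder.form_count, "login": finder.has_password_field}


if __name__ == "__main__":
    sample = '<form action="/login"><input type="password" name="pw"></form>'
    print(classify_page(sample))
```

Parsing tags rather than searching raw text avoids false matches on words like "form" in ordinary page copy, though pages that build their forms with JavaScript would not be detected this way.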


