Webmining: need help for web crawling with RapidMiner

TB161 Member Posts: 7 Newbie
edited June 2020 in Help
Hello community members,

I am looking for a way to do web crawling. I have read in the forums that https websites cannot easily be crawled with the "Crawl Web" operator; instead you would have to use a combination of "Get Pages" and "Loop", as described by Telcontar120, but I haven't found anything else about this approach yet.

I will briefly explain what I want to crawl. I would like to crawl the properties listed on a German real estate website (immowelt.de).

Typically, the results page can be accessed via a URL that encodes the location, the minimum and maximum number of rooms ("room from" / "room to"), whether to buy or rent, and the sort order:

immowelt.de/liste/muenchen/wohnungen/kaufen?roomi=2&rooma=2&sort=relevanz

The matching properties are then listed; the link to each one is made up of the constant "expose" and the ID of the offer, see below:

immowelt.de/projekte/expose/k2rb332

With the "web crawl" operator it would be easy, one would simply give the statement "expose" as a parameter for the crawl

How about "get pages" and "loop"? The ID doesn't count up, I would be very grateful if you could help me.
I wish you and your families a nice weekend 

Regards

TB161

Answers

  • kayman Member Posts: 662 Unicorn
    A typical workflow could look like this:

    Crawl the first page and, next to your regular content, also extract the indicator for the number of results.

    For your example this would be 

    8 Objekte zum Kauf (insgesamt 141 Wohneinheiten im Projekt)
    (in English: "8 objects for sale (a total of 141 residential units in the project)")


    So we know there are 8 in total, and the site shows 6 per page, so we can create a macro that stores the number of result pages (the ceiling of 8 divided by 6 gives 2 pages); see the sketch below.
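
    A minimal sketch of both steps in plain Python, just to illustrate the idea (inside RapidMiner this would be a regular-expression extraction plus a macro calculation); it assumes the indicator text looks exactly like the line quoted above and 6 listings per page:

    import math
    import re

    # Indicator text as quoted above (assumed to be extracted from the crawled page)
    indicator = "8 Objekte zum Kauf (insgesamt 141 Wohneinheiten im Projekt)"

    # The leading number is the count of listings matching the query
    total_listings = int(re.match(r"(\d+)", indicator).group(1))

    # The site shows 6 listings per page, so ceil(8 / 6) = 2 result pages
    listings_per_page = 6
    page_count = math.ceil(total_listings / listings_per_page)
    print(total_listings, page_count)  # 8 2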

    Next you need to do some reverse engineering to understand how the website moves from one page to another. If you are lucky it's something like mysite.com/page?nextpage=2, so you can create a loop flow where you crawl the page but increment the page parameter each time, for example

    mysite.com/page?nextpage=3
    mysite.com/page?nextpage=4
    ... 

    until the last page you need.
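
    A rough sketch of that loop, using the hypothetical mysite.com URL scheme from above (in RapidMiner this would be a Loop operator that increments a macro and feeds each URL to Get Pages):

    # Hypothetical pagination scheme; the real parameter must come from the site
    base_url = "https://mysite.com/page?nextpage={}"
    page_count = 2  # from the ceiling calculation above

    page_urls = [base_url.format(n) for n in range(1, page_count + 1)]
    for url in page_urls:
        print(url)
    # https://mysite.com/page?nextpage=1
    # https://mysite.com/page?nextpage=2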

    Now, your page seems to load dynamically (not moving to a new page but just appending to the previous load), so it's not straightforward in this case. You'll probably need to look at the page load sequence (using your browser's inspect tools, Network tab) to see which request is made behind the scenes.
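
    If you do find such a background request in the Network tab, a sketch of calling it directly could look like this (the endpoint and parameters below are purely hypothetical placeholders; you would copy the real ones from your browser):

    import requests

    # Placeholder endpoint and parameters; immowelt's real background request
    # will have a different URL and query string (copy them from the Network tab).
    endpoint = "https://www.example.com/api/listings"
    params = {"location": "muenchen", "rooms": 2, "page": 1}

    response = requests.get(endpoint, params=params, timeout=30)
    response.raise_for_status()
    print(response.text[:500])  # inspect the payload to see how listings come back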

    Hope this gets you started
  • TB161 Member Posts: 7 Newbie
    Hello Kayman,

    thank you for your suggestions. I tried it over the last few days, but unfortunately my experience is limited.
    Therefore I will use ParseHub for the crawling and do the rest in RapidMiner.

    Thanks for your support !!

    regards TB
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    You could also store the html of the original page with your query results, then extract all the links out of that page (using regular expressions) and put them in a csv file, and then use the "Get Pages" operator instead. Either way, some creative workarounds are needed here. How I wish RapidMiner would fix the https issue for the Crawl Web operator!
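
    A minimal sketch of that workaround in Python (the file names are placeholders; in RapidMiner the same steps would be document processing plus a written link file fed into Get Pages):

    import csv
    import re

    # Raw html of the stored results page (placeholder file name)
    with open("results_page.html", encoding="utf-8") as f:
        html = f.read()

    # Collect all listing links; they contain the constant "expose".
    # Relative links may still need the domain prefixed before use.
    links = sorted(set(re.findall(r'href="([^"]*expose[^"]*)"', html)))

    # One URL per row so the file can serve as input for the Get Pages operator
    with open("expose_links.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])
        for link in links:
            writer.writerow([link])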
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • TB161 Member Posts: 7 Newbie
    Hello Brian...

    good idea, this could work... but doesn't the html only contain the "first" page of results?

    When the results span several pages, I don't know how to crawl them.

    Regards

    TB
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    My suggestion assumes that the html links for all the pages with the property ids must be embedded in the raw html of the page somewhere (don't you click on a specific property to view it?). So you can save that raw html as a document, then use document processing to extract all the links, then put those links in a file to use as the input to the Get Pages operator.
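
    Combining this with the page loop idea from above, a sketch that walks every result page and collects all listing links into one file might look like this (the paginated URL scheme and file name are again placeholders, not immowelt's real parameters):

    import csv
    import re
    import requests

    # Placeholder pagination scheme; the real parameter name has to come from
    # inspecting how the site moves from one result page to the next.
    base_url = "https://www.example.com/liste/muenchen/wohnungen/kaufen?page={}"
    page_count = 2

    links = set()
    for page in range(1, page_count + 1):
        html = requests.get(base_url.format(page), timeout=30).text
        links.update(re.findall(r'href="([^"]*expose[^"]*)"', html))

    # Deduplicated link list as input for the Get Pages operator
    with open("expose_links.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])
        for link in sorted(links):
            writer.writerow([link])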
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts