Web Mining, Crawl Web crawling rules...please explain?

Cash Member Posts: 11 Contributor II
I used RapidMiner in my MBA program, and it's been almost three years since I last touched it. I just started a position where I'll be using it again and I'm a bit rusty. I'm trying to scrape a site for some data (names, phone numbers, addresses, etc.) and put it into an Excel file, but I can't figure out the parameters. I think my main issue is understanding what the crawling rules are. What do they mean? Which should I be applying? I've Googled this and searched here, but I only find instructions specific to other users' questions. Can anyone explain what these rules are and what they do?

  • [Deleted User] Posts: 0 Learner III
  • Cash Member Posts: 11 Contributor II
    All I see is a brief description of Web Mining and the option to download it.
  • kayman Member Posts: 509 Unicorn
    Web crawling is such a wide field, and depends so heavily on the structure of a website or page, that it would help if you gave some examples of which sites you want to crawl and what you need from each page.

  • Cash Member Posts: 11 Contributor II
    @kayman the site I'm trying to build a list from is here:  https://www.naadac.org/sap-directory?locsearch=22314&loccountry=US&locdistance=any&sortdir=distance-asc

    I'm just trying to capture the names, locations, and phone numbers. I used SelectorGadget to help me figure out the CSS selectors I need, and this is what it gave me: .places-app-location-citystatezip, a, .places-app-location-street, .places-app-location-name
  • Cash Member Posts: 11 Contributor II
    @kayman Thanks! That seems like a solid solution, albeit a bit beyond my ability. If it borders on unethical crawling, it's something I'd tend to stay away from, since this is for work and I don't want to do anything that might be questionable under our company policy. I'll see about getting the data another way, or even just copying it manually. Thanks again!
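
For anyone landing on this thread later, here is a rough Python sketch of how the class names from SelectorGadget could be turned into spreadsheet rows. It is not the approach discussed above and it does not touch the live site: it parses a saved HTML string. The sample markup, the wrapping `places-app-location` container, the `tel:` links for phone numbers, and the column names are all assumptions made up for illustration, since the real page structure was never posted here. Only the three `places-app-location-*` field classes come from the thread.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup: invented for this sketch, not copied from the real page.
SAMPLE_HTML = """
<div class="places-app-location">
  <div class="places-app-location-name">Example Counseling Center</div>
  <div class="places-app-location-street">123 Example St</div>
  <div class="places-app-location-citystatezip">Alexandria, VA 22314</div>
  <a href="tel:555-0100">555-0100</a>
</div>
<div class="places-app-location">
  <div class="places-app-location-name">Sample Recovery Services</div>
  <div class="places-app-location-street">456 Sample Ave</div>
  <div class="places-app-location-citystatezip">Arlington, VA 22201</div>
  <a href="tel:555-0199">555-0199</a>
</div>
"""

# Class names from the thread, mapped to (assumed) column names.
FIELDS = {
    "places-app-location-name": "name",
    "places-app-location-street": "street",
    "places-app-location-citystatezip": "city_state_zip",
}
COLUMNS = ["name", "street", "city_state_zip", "phone"]


class DirectoryParser(HTMLParser):
    """Collects one dict per directory entry from the HTML it is fed."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # column currently being captured, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Assumption: each entry sits in a container with this class.
        if "places-app-location" in classes:
            self.rows.append({})
        for cls in classes:
            if cls in FIELDS:
                self._field = FIELDS[cls]
        # Assumption: phone numbers are links whose href starts with "tel:".
        if tag == "a" and (attrs.get("href") or "").startswith("tel:"):
            self._field = "phone"

    def handle_data(self, data):
        text = data.strip()
        if self._field and text and self.rows:
            row = self.rows[-1]
            row[self._field] = (row.get(self._field, "") + " " + text).strip()

    def handle_endtag(self, tag):
        self._field = None  # stop capturing once any element closes


def extract_rows(html):
    parser = DirectoryParser()
    parser.feed(html)
    return parser.rows


def to_csv(rows):
    """Returns CSV text that Excel can open; missing fields become blanks."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = extract_rows(SAMPLE_HTML)
print(to_csv(rows))
```

If the real page nests its elements differently, the container and phone-link assumptions are the parts most likely to need changing. Whether collecting the data this way is acceptable at all is a separate question for the site's terms of use and your own company policy.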
