Trouble with URLs

chuckb Member Posts: 2 Contributor I
edited November 2018 in Help

I'm just getting started with data mining and am having trouble with URLs.

I'm trying to scrape Google Shopping for prices. The URL I have is:


In the crawl rules I have:
store_with_matching_url = .+seller.+
follow_links_with_matching_url = .+start.+

I have two problems: 1) the first page is not stored. I get an error saying the URL does not match the filter. 2) The crawler does not follow the link.

I'm not sure how to fix this.

Also, is there a way to pull the ?q=036725235182 part from my database? The number is the UPC of one of my products. Ultimately I would like to query 10,000+ records and crawl all UPCs in the database one at a time. If anyone knows of some examples to get my project off the ground, I would much appreciate it.

Thanks in advance,
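
As a rough illustration of the UPC question above, the loop could be sketched outside RapidMiner in Python. Everything named here is an assumption: the thread does not show the real search URL, so `BASE_URL` is a placeholder, and the `products` table with a `upc` column is hypothetical. UPCs are stored as text so that leading zeros survive.

```python
import sqlite3
from urllib.parse import urlencode

# Placeholder only: the real search URL is not shown in the thread.
BASE_URL = "http://example.com/search"

def build_urls(conn, base_url=BASE_URL):
    """Yield one query URL per UPC in the (hypothetical) products table."""
    for (upc,) in conn.execute("SELECT upc FROM products ORDER BY upc"):
        yield base_url + "?" + urlencode({"q": upc})

# Small in-memory database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (upc TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?)",
    [("036725235182",), ("012345678905",)],
)

urls = list(build_urls(conn))
```

Each generated URL could then be handed to the crawler one at a time.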



  • Miguel_B_scher Member Posts: 9 Contributor II
    Hello chuckb.
    If I understand correctly, you want to store the seller URLs, right? Not the price, and not the URL under "Relevance".
    Instead of using the Crawl Web operator, just use an XPath or a regular expression to extract the URLs. You can do this with the Cut Document operator after a "Get Page" operator.
    For the regular expression, you could simply cut out the links that contain "http://www.google.com/products/seller?". That will give you all the right URLs.
    Of course, you will also have to crawl the other result pages (1 - 25 of 80) with an XPath or regular expression.

    Just play around a bit with the Cut Document / Extract Information operators. With Get Page, you just enter the URL that you want to crawl.
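
The regular-expression idea above can be tried outside RapidMiner first. The following is only a sketch in Python, not the Cut Document operator itself; the sample HTML and its `id` parameters are made up for illustration, and only the seller prefix comes from the post.

```python
import re

# Prefix taken from the post above; everything after it is matched up to
# the next quote, whitespace, or angle bracket.
SELLER_PREFIX = "http://www.google.com/products/seller?"
SELLER_URL_RE = re.compile(re.escape(SELLER_PREFIX) + r"[^\"'\s<>]*")

def extract_seller_urls(html):
    """Return every seller URL found in the page source, in document order."""
    return SELLER_URL_RE.findall(html)

# Made-up page fragment for illustration.
sample_html = (
    '<a href="http://www.google.com/products/seller?id=1">Shop A</a>'
    '<a href="http://www.google.com/products/other?id=2">Other</a>'
    '<a href="http://www.google.com/products/seller?id=3">Shop B</a>'
)

seller_urls = extract_seller_urls(sample_html)
```

A pattern that works here should carry over to the operator's regular-expression setting with little change.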

  • chuckb Member Posts: 2 Contributor I
    So I'm able to store the page as expected, but I'm having trouble following the next link. It's a JavaScript link...

    Does anyone know the proper XPath to follow this link? Below is the content of the link:

    <a onmousedown="return logClick('\x2Fproducts\x2Fcatalog?hl=en\x26q=036725235182\x26cid=6520112632679181641\x26cpo=1\x26sa=N\x26start=5', 'electronics', 'Overview', 'tabless', '6520112632679181641', 'ps-sellers-frame_Next \x26raquo\x3B')" onclick="reloadSection('#start=5', 'ps-sellers');" href="javascript:void(0);">Next »</a>
  • colo Member Posts: 236 Maven
    Hi Chuck,

    I think it's not possible to analyse attribute contents with XPath. You will probably have to fetch the attribute containing the relevant data and then use regular expressions or a custom script to extract the parts you need and build a valid URL from this afterwards.
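
The approach described above (fetch the attribute, then pull the path out with a regular expression and build a URL) might look roughly like this in Python. It is a sketch under assumptions: the host `http://www.google.com` is taken from the seller URL mentioned earlier in the thread, and the helper name `next_page_url` is hypothetical. The sample anchor is the one posted above.

```python
import re
from urllib.parse import urljoin

# Assumed host, based on the seller URL quoted earlier in the thread.
BASE = "http://www.google.com"

def next_page_url(anchor_html, base=BASE):
    """Extract the first logClick(...) argument and turn it into a full URL."""
    match = re.search(r"logClick\('([^']*)'", anchor_html)
    if match is None:
        return None
    # Decode JavaScript \xNN escapes such as \x2F ('/') and \x26 ('&').
    path = re.sub(
        r"\\x([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        match.group(1),
    )
    return urljoin(base, path)

anchor = r'''<a onmousedown="return logClick('\x2Fproducts\x2Fcatalog?hl=en\x26q=036725235182\x26cid=6520112632679181641\x26cpo=1\x26sa=N\x26start=5', 'electronics', 'Overview', 'tabless', '6520112632679181641', 'ps-sellers-frame_Next \x26raquo\x3B')" onclick="reloadSection('#start=5', 'ps-sellers');" href="javascript:void(0);">Next »</a>'''

next_url = next_page_url(anchor)
```

Whether the decoded URL returns the same content as the JavaScript reload is not confirmed in the thread, so it is worth checking by hand first.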
