Trouble with URLs

chuckb · September 2011

Hi,

I'm just getting started with data mining and am having trouble with URLs.

I'm trying to scrape Google Shopping for prices. The URL I have is:

http://www.google.com/products/catalog?q=036725235182&hl=en&cid=6520112632679181641&os=sellers

In the crawl rules I have:
store_with_matching_url = .+seller.+
follow_links_with_matching_url = .+start.+

I have two problems: 1) the first page is not stored; I get an error saying the URL does not have the filter results in it, and 2) the crawler does not follow the link.

I'm not sure how to fix this.
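One way to narrow this down is to test the two patterns outside the crawler. The sketch below uses Python's re module as a stand-in and assumes the crawl rules are applied as full-match regular expressions, which is an assumption about the crawler and not something confirmed in this thread. rule_matches is a hypothetical helper name.

```python
import re

# Patterns copied from the crawl rules in this post.
STORE_RULE = r".+seller.+"
FOLLOW_RULE = r".+start.+"

# Start URL from this post.
START_URL = (
    "http://www.google.com/products/catalog"
    "?q=036725235182&hl=en&cid=6520112632679181641&os=sellers"
)


def rule_matches(pattern: str, url: str) -> bool:
    """Return True if the whole URL matches the pattern.

    Full-match semantics are assumed here; the crawler may behave differently.
    """
    return re.fullmatch(pattern, url) is not None


if __name__ == "__main__":
    print("store rule on start URL: ", rule_matches(STORE_RULE, START_URL))
    print("follow rule on start URL:", rule_matches(FOLLOW_RULE, START_URL))
```

Under that assumption the store rule matches the start URL (the trailing "s" in "sellers" satisfies the final .+), so the pattern itself may not be what triggers the first error. The follow rule can only match a link whose URL actually contains "start".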

Also, is there a way to pull the ?q=036725235182 value from my database? The number is the UPC of one of my products. Ultimately I would like to query 10,000+ records and crawl all UPCs in the database one at a time. If anyone knows of some examples to get my project off the ground, I would much appreciate it.
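To make the URL-building step concrete, here is a minimal sketch in Python rather than in the crawler itself. Everything about the database is assumed for illustration: an in-memory SQLite database, a hypothetical products table with a upc column, and a second placeholder UPC. Only the q and hl parameters are set, because the cid value in the URL above belongs to one specific product and cannot be derived from the UPC alone.

```python
import sqlite3
from urllib.parse import urlencode

BASE_URL = "http://www.google.com/products/catalog"


def build_search_url(upc: str) -> str:
    """Build a search URL for one UPC (q and hl only; cid is product-specific)."""
    return BASE_URL + "?" + urlencode({"q": upc, "hl": "en"})


def iter_upcs(conn: sqlite3.Connection):
    """Yield UPCs one at a time from the hypothetical products table."""
    for (upc,) in conn.execute("SELECT upc FROM products ORDER BY upc"):
        yield upc


if __name__ == "__main__":
    # Stand-in database; a real project would connect to its own store.
    conn = sqlite3.connect(":memory:")
    # UPCs are stored as TEXT so that leading zeros are preserved.
    conn.execute("CREATE TABLE products (upc TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO products (upc) VALUES (?)",
        [("036725235182",), ("000000000000",)],  # second value is a placeholder
    )
    for upc in iter_upcs(conn):
        print(build_search_url(upc))
```

Keeping the UPC as a string matters here: 036725235182 starts with a zero, which a numeric column would drop.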

Thanks in advance,

Chuck

Miguel_B_scher · September 2011

Hello chuckb.
If I understand correctly, you want to store the seller URLs, right? Not the price, and not the URL under "Relevance".
Instead of using the Crawl Web operator, just use an XPath or a regular expression to extract the URLs. You can do this with the Cut Document operator after a "Get Page" operator.
For the regular expression, you could just cut out the URLs that contain "http://www.google.com/products/seller?". That will give you all the right URLs.
Of course, you will also have to crawl the other result pages (1 - 25 of 80) with an XPath or a regular expression.
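As a rough illustration of the regular-expression idea outside RapidMiner, the Python sketch below pulls every href that starts with the seller prefix. It assumes the seller links appear as absolute URLs in double-quoted href attributes; the sample HTML, the "seller=example" parameter, and the extract_seller_urls name are made up for the example.

```python
import re
from html import unescape

SELLER_PREFIX = "http://www.google.com/products/seller?"

# Capture the full href value when it starts with the seller prefix.
SELLER_URL_RE = re.compile(r'href="(' + re.escape(SELLER_PREFIX) + r'[^"]*)"')


def extract_seller_urls(page_html: str) -> list:
    """Return all seller URLs found in href attributes, with HTML entities decoded."""
    return [unescape(url) for url in SELLER_URL_RE.findall(page_html)]


# Made-up page fragment: one seller link and one non-seller link.
SAMPLE_HTML = """
<a href="http://www.google.com/products/seller?seller=example&amp;q=036725235182">Example Shop</a>
<a href="http://www.google.com/products/catalog?q=036725235182">Relevance</a>
"""

if __name__ == "__main__":
    for url in extract_seller_urls(SAMPLE_HTML):
        print(url)
```

The same pattern should carry over to a regular-expression query in the Cut Document or Extract Information operators, though the exact operator settings are not shown here.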

Just play a little bit with the Cut Document / Extract Information operators. With Get Page you just enter the URL that you want to crawl.

Bye
Miguel

chuckb · September 2011

So I'm able to store the page as expected, but I'm having trouble following the next link. It's a JavaScript link...

Does anyone know the proper XPath to follow this link? Below is the content of the link:

<a onmousedown="return logClick('\x2Fproducts\x2Fcatalog?hl=en\x26q=036725235182\x26cid=6520112632679181641\x26cpo=1\x26sa=N\x26start=5', 'electronics', 'Overview', 'tabless', '6520112632679181641', 'ps-sellers-frame_Next \x26raquo\x3B')" onclick="reloadSection('#start=5', 'ps-sellers');" href="javascript:void(0);">Next »</a>

colo · September 2011

Hi Chuck,

I think it's not possible to analyse attribute contents with XPath. You will probably have to fetch the attribute containing the relevant data and then use regular expressions or a custom script to extract the parts you need and build a valid URL from this afterwards.
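A minimal sketch of that idea in Python, assuming the onmousedown attribute value has already been fetched as a string (for example via an attribute query). It takes the first argument of logClick, decodes the \xNN escapes, and joins the result with the site root. The function names and the choice of http://www.google.com as the base are assumptions for illustration.

```python
import re
from urllib.parse import urljoin

BASE = "http://www.google.com"  # assumed site root for the relative path

# Attribute value copied from the link in the post above (raw strings keep the backslashes).
ONMOUSEDOWN = (
    r"return logClick('\x2Fproducts\x2Fcatalog?hl=en\x26q=036725235182"
    r"\x26cid=6520112632679181641\x26cpo=1\x26sa=N\x26start=5', "
    r"'electronics', 'Overview', 'tabless', '6520112632679181641', "
    r"'ps-sellers-frame_Next \x26raquo\x3B')"
)

FIRST_ARG_RE = re.compile(r"logClick\('([^']*)'")   # first quoted argument
HEX_ESCAPE_RE = re.compile(r"\\x([0-9A-Fa-f]{2})")  # JavaScript \xNN escapes


def decode_js_hex(text: str) -> str:
    """Replace each \\xNN escape with the character it encodes."""
    return HEX_ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)


def next_page_url(onmousedown: str, base: str = BASE):
    """Build an absolute URL from the first logClick argument, or return None."""
    match = FIRST_ARG_RE.search(onmousedown)
    if match is None:
        return None
    return urljoin(base, decode_js_hex(match.group(1)))


if __name__ == "__main__":
    print(next_page_url(ONMOUSEDOWN))
```

The decoded URL contains start=5, so it would also satisfy a follow rule like .+start.+, whereas the link's href (javascript:void(0);) never will.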

Regards
Matthias
