"Web Mining crawling prices of an internet page"

luiz_vidal Member Posts: 14 Contributor II
edited June 2019 in Help

Guys, 

 

I am trying to build a process that crawls pages from a site in order to get the prices of a variety of products. The idea is the following: a loop fetches the pages one by one and saves each one to disk; afterwards I want to read the saved HTML back and extract only the product name and price, for example. But I haven't been able to get it working. Would you please help me?
I was able to fetch the pages in sequence, but I can't save them to disk properly, because each iteration overwrites the previous file.

 

First I want to collect the pages:

https://www.buscape.com.br/cerveja?pagina=1

https://www.buscape.com.br/cerveja?pagina=2

...

https://www.buscape.com.br/cerveja?pagina=200

Here is my process:

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="concurrency:loop" compatibility="8.0.001" expanded="true" height="103" name="Loop" width="90" x="179" y="34">
<process expanded="true">
<operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="246" y="238">
<parameter key="url" value="https://www.buscape.com.br/cerveja?pagina=%{page}"/>
<list key="crawling_rules">
<parameter key="follow_link_with_matching_url" value="cerveja"/>
</list>
<parameter key="retrieve_as_html" value="true"/>
<parameter key="add_content_as_attribute" value="true"/>
<parameter key="write_pages_to_disk" value="true"/>
<parameter key="output_dir" value="C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\Cerveja"/>
</operator>
<operator activated="true" class="generate_macro" compatibility="8.0.001" expanded="true" height="82" name="Generate Macro" width="90" x="112" y="34">
<list key="function_descriptions">
<parameter key="page" value="%{page}"/>
</list>
</operator>
<connect from_port="input 1" to_op="Generate Macro" to_port="through 1"/>
<connect from_op="Crawl Web" from_port="example set" to_port="output 2"/>
<connect from_op="Generate Macro" from_port="through 1" to_port="output 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="source_input 2" spacing="0"/>
<portSpacing port="sink_output 1" spacing="0"/>
<portSpacing port="sink_output 2" spacing="0"/>
<portSpacing port="sink_output 3" spacing="0"/>
</process>
</operator>
<connect from_op="Loop" from_port="output 1" to_port="result 2"/>
<connect from_op="Loop" from_port="output 2" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>

After all the pages have been collected, I was trying to use XPath to extract only the fields I need from the HTML.

But somehow, when I copy and paste the XPath from Chrome, it doesn't work.

 

Could you guys please help me create a simple example process?

 

Thanks in advance.


Best Answer

  • luiz_vidal Member Posts: 14 Contributor II
    Solution Accepted

    Ugh,

    After almost giving up, I was able to retrieve the piece of data I want. The problem is that it brings back only the first match it finds.

    I need to find a way to fetch all product names and prices:

    //*[@name="priceProduct"]

    //*[@name="productName"]
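
    One way to get all of them, and the values rather than just the elements, would be to select the value attribute of those hidden inputs and run the query inside Cut Document, so that each product becomes its own segment (just a sketch; the h: prefix is the RM namespace convention miner explains below):

    //h:input[@name="productName"]/@value

    //h:input[@name="priceProduct"]/@value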

Answers

  • miner Member Posts: 13 Contributor II

    Hi luiz_vidal,

     

    I came across that issue a few days ago.

    Just copy&paste the xml from google wont work due to namespace

    Chrome gives

    //*[@id="product_383527"]/div/div[1]/div[3]/div[1]/a/span for the first product (Paulistânia Puro Malte Premium Lager Garrafa 600 ml 1 Unidade) and

    //*[@id="product_383527"]/div/div[2]/div[1]/div[1]/a/span for the price (14,99).

    In RM you have to use //*[@id="product_383527"]/h:div/h:div[1]/h:div[3]/h:div[1]/h:a/h:span

    and //*[@id="product_383527"]/h:div/h:div[2]/h:div[1]/h:div[1]/h:a/h:span

    In short: every element step gets the h: prefix, while the attribute test stays as it is.

    See the discussion here: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Extracting-Information-With-XPath/td-p/9883

     

    Cheers

    miner

  • luiz_vidal Member Posts: 14 Contributor II

    Hey,

    Thanks for your reply

     

    I still can't make it work, though.

     

    <?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="false" class="concurrency:loop" compatibility="8.0.001" expanded="true" height="103" name="Loop" width="90" x="179" y="34">
    <parameter key="iteration_macro" value="page"/>
    <process expanded="true">
    <operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="246" y="238">
    <parameter key="url" value="https://www.buscape.com.br/cerveja?pagina=%{page}"/>
    <list key="crawling_rules">
    <parameter key="follow_link_with_matching_url" value="cerveja"/>
    </list>
    <parameter key="retrieve_as_html" value="true"/>
    <parameter key="add_content_as_attribute" value="true"/>
    <parameter key="write_pages_to_disk" value="true"/>
    <parameter key="output_dir" value="C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\Cerveja\"/>
    </operator>
    <operator activated="true" class="rename_file" compatibility="8.0.001" expanded="true" height="82" name="Rename File" width="90" x="514" y="238">
    <parameter key="file" value="C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\Cerveja\0.txt"/>
    <parameter key="new_name" value="%{page}.txt"/>
    </operator>
    <operator activated="true" class="generate_macro" compatibility="8.0.001" expanded="true" height="82" name="Generate Macro" width="90" x="112" y="34">
    <list key="function_descriptions">
    <parameter key="page" value="%{page} + 1"/>
    </list>
    </operator>
    <connect from_port="input 1" to_op="Generate Macro" to_port="through 1"/>
    <connect from_op="Crawl Web" from_port="example set" to_op="Rename File" to_port="through 1"/>
    <connect from_op="Rename File" from_port="through 1" to_port="output 2"/>
    <connect from_op="Generate Macro" from_port="through 1" to_port="output 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="source_input 2" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    <portSpacing port="sink_output 3" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:process_document_from_file" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="112" y="187">
    <list key="text_directories">
    <parameter key="cerveja" value="C:\Users\luiz.vidal\Desktop\Cloudera\SEFA-PA\Cerveja"/>
    </list>
    <parameter key="encoding" value="ISO-8859-1"/>
    <process expanded="true">
    <operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="45" y="34">
    <parameter key="query_type" value="XPath"/>
    <list key="string_machting_queries"/>
    <list key="regular_expression_queries"/>
    <list key="regular_region_queries"/>
    <list key="xpath_queries">
    <parameter key="product" value="//*[@id=&amp;quot;product_383527&quot;]/h:div/dh:iv[1]/h:div[3]/h:div[1]/h:a/h:span"/>
    </list>
    <list key="namespaces"/>
    <parameter key="assume_html" value="false"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries"/>
    <process expanded="true">
    <operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="447" y="85">
    <parameter key="query_type" value="XPath"/>
    <list key="string_machting_queries"/>
    <list key="regular_expression_queries"/>
    <list key="regular_region_queries"/>
    <list key="xpath_queries">
    <parameter key="product" value="//*[@id=&amp;quot;product_383527&quot;]/h:div/h:div[2]/h:div[1]/h:div[1]/h:a/h:span"/>
    </list>
    <list key="namespaces"/>
    <list key="index_queries"/>
    <list key="jsonpath_queries"/>
    </operator>
    <connect from_port="segment" to_op="Extract Information" to_port="document"/>
    <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
    <portSpacing port="source_segment" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <connect from_port="document" to_op="Cut Document" to_port="document"/>
    <connect from_op="Cut Document" from_port="documents" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Process Documents from Files" from_port="example set" to_port="result 1"/>
    <connect from_op="Process Documents from Files" from_port="word list" to_port="result 2"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

    Any idea what I am doing wrong?

  • miner Member Posts: 13 Contributor II

    I'm not quite sure.

    The website uses product ids for reference.

    For the first product I looked at, it was //*[@id="product_383527"]. Assuming the id changes for every product, that XPath only works for this specific product.

    Then you would have to go up the tree to get a "non-id-related" node and then pick the detail from there.

    That would be /html/body/main/div[3]/div/div[3]/section?
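
    With the namespace prefixes, that would presumably become

    /h:html/h:body/h:main/h:div[3]/h:div/h:div[3]/h:section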

  • luiz_vidal Member Posts: 14 Contributor II

    Sorry,

    I know nothing about XPath; I've been trying all day to get these:

    <input value="Brahma Pilsen Lata 350 ml 1 Unidade" name="productName" type="hidden">

    <input value="6.75" name="priceProduct" type="hidden">

    I try and try, and the extraction returns me only true, false, or ?.

    name="productName"], it returns TRUE or FALSE.. but what I want is the value for productName and for priceProduct.. which will probably have to be return on a list.. or a huge string to be split.. I dont know yet.
    A victory would be just getting one value returned =)

  • miner Member Posts: 13 Contributor II

    Hi @luiz_vidal

     

    xpath can be a mess...

    A good way to test XPath strings is to use Google Docs, where you can quickly copy the XPath from Chrome into a spreadsheet and check the result. This is much faster than testing the structure in RM.

    On YouTube you can find a lot of XPath and Google Docs tutorials.

    My recommendation is the video by community member el chief; find it here: https://www.youtube.com/watch?v=UG6223p9fZE
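
    For a quick test, a Google Sheets cell like the following should list every product name on the first page (the query is the hidden-input one from this thread; no h: prefix is needed outside RM):

    =IMPORTXML("https://www.buscape.com.br/cerveja?pagina=1", "//input[@name='productName']/@value")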

     

    Cheers

    miner

  • luiz_vidal Member Posts: 14 Contributor II

    Overall,

     

    It was a matter of getting to know XPath and configuring it correctly across the operators.

     

    Thanks for your help

     

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    "xpath can be a mess..." 

     

    Definitely agree, but it's powerful when it works.

  • canh99alex Member Posts: 1 Contributor I

    Help me please. Which currency is best to mine? https://en.bitcoinwiki.org/wiki/Web_mining It is written there that experts advise "monero".

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    Hi,

     

    Trying the XPaths in a shell environment can make things faster.

     

    A simple command line tool is XML Shell:

     

    http://www.xmlsh.org/CommandXPath

     

    You can also find the same functionality in Python's scrapy, but it is overkill for your actual needs.
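
    For instance, a quick interactive check in the scrapy shell could look like this (a sketch reusing the URL and hidden-input query from this thread):

    scrapy shell "https://www.buscape.com.br/cerveja?pagina=1"
    >>> response.xpath('//input[@name="productName"]/@value').extract()

    No namespace prefixes are needed there, since scrapy parses the page as plain HTML.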

     

    Regards,

    Sebastian
