Web crawling on a Google page

Juju147 Member Posts: 5 Contributor II
edited November 2018 in Help
Hi everyone,

I have a question about the Crawl Web operator.

I am trying to use it on a Google search results page, but unfortunately I cannot reach the links returned by the search.

For example, my Google page is: https://www.google.fr/search?q=F&oq=f&aqs=chrome.4.69i60l3j69i59l3.2352j0j8&sourceid=chrome&espv=210&es_sm=93&ie=UTF-8

This is my process:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.013">
 <context>
   <input/>
   <output/>
   <macros/>
 </context>
 <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
   <process expanded="true">
     <operator activated="true" class="web:crawl_web" compatibility="5.3.001" expanded="true" height="60" name="Crawl Web" width="90" x="45" y="30">
       <parameter key="url" value="https://www.google.fr/search?q=F&amp;oq=f&amp;aqs=chrome.4.69i60l3j69i59l3.2352j0j8&amp;sourceid=chrome&amp;espv=210&amp;es_sm=93&amp;ie=UTF-8"/>
       <list key="crawling_rules">
         <parameter key="store_with_matching_url" value=".+facebook.+"/>
         <parameter key="follow_link_with_matching_url" value=".+facebook.+"/>
       </list>
       <parameter key="output_dir" value="C:\Users\Julien\Desktop\S5\WEBMINING"/>
       <parameter key="max_pages" value="100"/>
       <parameter key="max_depth" value="1"/>
       <parameter key="delay" value="500"/>
       <parameter key="max_threads" value="10"/>
     </operator>
     <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
     <portSpacing port="source_input 1" spacing="0"/>
     <portSpacing port="sink_result 1" spacing="0"/>
     <portSpacing port="sink_result 2" spacing="0"/>
   </process>
 </operator>
</process>


I am trying to reach the Facebook link in the search results, but it doesn't work.

Can you help me?

Sincerely,

Ju