Crawl Web a news site for a specific keyword

ittaj_goldberge Member Posts: 6 Contributor I
edited November 2018 in Help


Hi everyone

I am new here! I have a problem with Crawl Web which I'm not able to solve; I've tried and googled for weeks now. (It seems pretty simple, but I just don't get it.)


I want to crawl a news site (here: http://www.bbc.com/) for a keyword (here: .*zuckerberg.*) and save 100 results as .txt files.

But it just doesn't work; I've tried everything and can't get it done.


I hope you can help me; please see my process in .xml below.

Thank you very much for your help in advance!

<?xml version="1.0" encoding="UTF-8"?>

-<process version="8.2.000">






-<operator name="Process" expanded="true" compatibility="8.2.000" class="process" activated="true">

<parameter value="init" key="logverbosity"/>

<parameter value="2001" key="random_seed"/>

<parameter value="never" key="send_mail"/>

<parameter value="" key="notification_email"/>

<parameter value="30" key="process_duration_for_mail"/>

<parameter value="SYSTEM" key="encoding"/>

-<process expanded="true">

-<operator name="Crawl Web" expanded="true" compatibility="7.3.000" class="web:crawl_web" activated="true" y="34" x="112" width="90" height="68">

<parameter value="http://www.bbc.com/" key="url"/>

-<list key="crawling_rules">

<parameter value=".*tech.*" key="follow_link_with_matching_url"/>

<parameter value=".*zuckerberg.*" key="store_with_matching_url"/>

<parameter value=".*news.*" key="follow_link_with_matching_url"/>

<parameter value=".*zuckerberg.*" key="store_with_matching_content"/>


<parameter value="false" key="write_pages_into_files"/>

<parameter value="true" key="add_pages_as_attribute"/>

<parameter value="txt" key="extension"/>

<parameter value="100" key="max_pages"/>

<parameter value="4" key="max_depth"/>

<parameter value="web" key="domain"/>

<parameter value="1000" key="delay"/>

<parameter value="2" key="max_threads"/>

<parameter value="10000" key="max_page_size"/>

<parameter value="rapid-miner-crawler" key="user_agent"/>

<parameter value="true" key="obey_robot_exclusion"/>

<parameter value="false" key="really_ignore_exclusion"/>


-<operator name="Process Documents from Data" expanded="true" compatibility="8.1.000" class="text:process_document_from_data" activated="true" y="34" x="313" width="90" height="82">

<parameter value="false" key="create_word_vector"/>

<parameter value="TF-IDF" key="vector_creation"/>

<parameter value="true" key="add_meta_information"/>

<parameter value="true" key="keep_text"/>

<parameter value="none" key="prune_method"/>

<parameter value="3.0" key="prune_below_percent"/>

<parameter value="30.0" key="prune_above_percent"/>

<parameter value="0.05" key="prune_below_rank"/>

<parameter value="0.95" key="prune_above_rank"/>

<parameter value="double_sparse_array" key="datamanagement"/>

<parameter value="auto" key="data_management"/>

<parameter value="false" key="select_attributes_and_weights"/>

<list key="specify_weights"/>

-<process expanded="true">

-<operator name="Extract Content" expanded="true" compatibility="7.3.000" class="web:extract_html_text_content" activated="true" y="34" x="45" width="90" height="68">

<parameter value="true" key="extract_content"/>

<parameter value="5" key="minimum_text_block_length"/>

<parameter value="true" key="override_content_type_information"/>

<parameter value="true" key="neglegt_span_tags"/>

<parameter value="true" key="neglect_p_tags"/>

<parameter value="true" key="neglect_b_tags"/>

<parameter value="true" key="neglect_i_tags"/>

<parameter value="true" key="neglect_br_tags"/>

<parameter value="true" key="ignore_non_html_tags"/>


<operator name="Unescape HTML Document" expanded="true" compatibility="7.3.000" class="web:unescape_html" activated="true" y="34" x="179" width="90" height="68"/>

-<operator name="Write Document" expanded="true" compatibility="8.1.000" class="text:write_document" activated="true" y="34" x="313" width="90" height="82">

<parameter value="true" key="overwrite"/>

<parameter value="SYSTEM" key="encoding"/>


-<operator name="Write File" expanded="true" compatibility="8.2.000" class="write_file" activated="true" y="136" x="447" width="90" height="68">

<parameter value="file" key="resource_type"/>

<parameter value="C:\Users\Ittaj\Desktop\rapidminer\tests\%{t}-%{a}.txt" key="filename"/>

<parameter value="application/octet-stream" key="mime_type"/>


<connect to_port="document" to_op="Extract Content" from_port="document"/>

<connect to_port="document" to_op="Unescape HTML Document" from_port="document" from_op="Extract Content"/>

<connect to_port="document" to_op="Write Document" from_port="document" from_op="Unescape HTML Document"/>

<connect to_port="document 1" from_port="document" from_op="Write Document"/>

<connect to_port="file" to_op="Write File" from_port="file" from_op="Write Document"/>

<portSpacing spacing="0" port="source_document"/>

<portSpacing spacing="0" port="sink_document 1"/>

<portSpacing spacing="0" port="sink_document 2"/>



<connect to_port="example set" to_op="Process Documents from Data" from_port="Example Set" from_op="Crawl Web"/>

<connect to_port="result 1" from_port="example set" from_op="Process Documents from Data"/>

<portSpacing spacing="0" port="source_input 1"/>

<portSpacing spacing="0" port="sink_result 1"/>

<portSpacing spacing="0" port="sink_result 2"/>







    sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hmm I think your XML code is broken. Can you please just go to the XML panel and copy and paste it into this thread?

    ittaj_goldberge Member Posts: 6 Contributor I

    thanks, I'll try it again:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
      <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="web:crawl_web" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="112" y="34">
            <parameter key="url" value="http://www.bbc.com/"/>
            <list key="crawling_rules">
              <parameter key="follow_link_with_matching_url" value=".*tech.*"/>
              <parameter key="follow_link_with_matching_url" value=".*news.*"/>
              <parameter key="store_with_matching_url" value=".*zuckerberg.*"/>
              <parameter key="store_with_matching_content" value=".*zuckerberg.*"/>
            </list>
            <parameter key="write_pages_into_files" value="false"/>
            <parameter key="add_pages_as_attribute" value="true"/>
            <parameter key="max_pages" value="100"/>
            <parameter key="max_depth" value="4"/>
            <parameter key="max_threads" value="2"/>
            <parameter key="max_page_size" value="10000"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="313" y="34">
            <parameter key="create_word_vector" value="false"/>
            <parameter key="keep_text" value="true"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="45" y="34"/>
              <operator activated="true" class="web:unescape_html" compatibility="7.3.000" expanded="true" height="68" name="Unescape HTML Document" width="90" x="179" y="34"/>
              <operator activated="true" class="text:write_document" compatibility="8.1.000" expanded="true" height="82" name="Write Document" width="90" x="313" y="34"/>
              <operator activated="true" class="write_file" compatibility="8.2.000" expanded="true" height="68" name="Write File" width="90" x="447" y="136">
                <parameter key="filename" value="C:\Users\Ittaj\Desktop\rapidminer\tests\%{t}-%{a}.txt"/>
              </operator>
              <connect from_port="document" to_op="Extract Content" to_port="document"/>
              <connect from_op="Extract Content" from_port="document" to_op="Unescape HTML Document" to_port="document"/>
              <connect from_op="Unescape HTML Document" from_port="document" to_op="Write Document" to_port="document"/>
              <connect from_op="Write Document" from_port="document" to_port="document 1"/>
              <connect from_op="Write Document" from_port="file" to_op="Write File" to_port="file"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
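
    For reference, the crawling rules in the process above boil down to two checks: which links to follow, and which pages to store. They can be sketched in plain Python; the helper names and the case-insensitive matching are assumptions of this sketch, not the operator's exact semantics (RapidMiner applies Java regex matching to the whole string):

    ```python
    import re

    # Rough Python equivalents of the four crawling rules in the process above.
    # Assumption: matching is case-insensitive (done here by lower-casing).
    FOLLOW_URL = re.compile(r".*tech.*|.*news.*")
    STORE_URL = re.compile(r".*zuckerberg.*")
    STORE_CONTENT = re.compile(r".*zuckerberg.*", re.DOTALL)

    def should_follow(url):
        """follow_link_with_matching_url: visit a link if its URL matches."""
        return FOLLOW_URL.fullmatch(url.lower()) is not None

    def should_store(url, content):
        """store_with_matching_url / store_with_matching_content."""
        return (STORE_URL.fullmatch(url.lower()) is not None
                or STORE_CONTENT.fullmatch(content.lower()) is not None)
    ```

    A page is kept when either store rule fires, so a page whose URL doesn't mention Zuckerberg can still be stored if its body does.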
    kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @ittaj_goldberge


    This type of setting works for me, retrieving articles with 'zuc' in them:


    Screenshot 2018-05-11 22.57.21.png


    When you say "it doesn't work", what exactly do you mean? Does the process hang, or deliver wrong results?

    ittaj_goldberge Member Posts: 6 Contributor I

    hi @kypexin


    I tried a lot of different variants (in rule application/value, and also depth and links).

    Usually the process runs for a second and there are no results. Sometimes I got a few results (fewer than 20, but I need around 100).


    I'm trying it right now with your rules; it's been running for 2 minutes. I will update soon.


    ittaj_goldberge Member Posts: 6 Contributor I

    So I tried it again with your rules, and I only got 8 results, with some duplicates.

    Any idea how I can crawl a news site for Zuckerberg and get 100 results?

    Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    @ittaj_goldberge does the news site have more than 8 Zuckerberg articles? You might have to increase the max_depth parameter to dig deeper.

    ittaj_goldberge Member Posts: 6 Contributor I

    hi @Thomas_Ott

    When I go to the search bar on bbc.com and look for Zuckerberg, there are thousands of results.



    Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    @ittaj_goldberge I'm by no means a web crawling expert, but lately for some client work I was exposed to web browser automation. Websites have gotten smart: to prevent people from crawling them, they use various scripts to hide content that isn't on the first page or 'above the fold.'


    I suspect that this is the case here. The link you shared was really a search query; it requires a browser to access and probably doesn't work with a web crawler like RapidMiner's. So that could be the problem.
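
    One quick way to check whether content is hidden from crawlers is to fetch the raw HTML and count keyword hits; if the count is zero but results are visible in a browser, the page is filled in by JavaScript and a plain crawler won't see it. A minimal Python sketch (the URL and keyword are just this thread's example):

    ```python
    import re
    import urllib.request

    def keyword_hits(html, keyword):
        """Count case-insensitive occurrences of a keyword in raw HTML."""
        return len(re.findall(re.escape(keyword), html, re.IGNORECASE))

    # Example (requires network): this fetches what a crawler sees,
    # before any JavaScript runs, and counts keyword occurrences.
    if __name__ == "__main__":
        url = "http://www.bbc.com/"
        with urllib.request.urlopen(url, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        print(keyword_hits(html, "zuckerberg"))
    ```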

    kypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    If this is the case, as @Thomas_Ott mentioned, you could also play around with the 'user agent' and 'obey robot exclusion' parameters of the Crawl Web operator (namely, change the user agent string, disable the checkbox, and then compare the results):
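
    As a side note on the 'obey robot exclusion' checkbox: it corresponds to the standard robots.txt mechanism, which can allow or block specific user agents. A minimal sketch with Python's urllib.robotparser (the robots.txt rules below are invented for illustration, not BBC's actual file):

    ```python
    import urllib.robotparser

    # A made-up robots.txt that blocks one named crawler entirely
    # but lets everyone else fetch public pages.
    rp = urllib.robotparser.RobotFileParser()
    rp.parse([
        "User-agent: rapid-miner-crawler",
        "Disallow: /",
        "",
        "User-agent: *",
        "Disallow: /private/",
    ])

    # The named agent is blocked everywhere...
    print(rp.can_fetch("rapid-miner-crawler", "http://www.bbc.com/news"))  # False
    # ...while a generic agent may fetch public pages.
    print(rp.can_fetch("my-test-agent", "http://www.bbc.com/news"))        # True
    ```

    This is why changing the user agent string (or unchecking the box) can change what the crawler is allowed to retrieve.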



    MultanTVHD Member Posts: 1 Newbie
    hi, your answer is in this website