Extracting the 10 most representative keywords from a web page

singing_bird_1 Member Posts: 16 Contributor I
edited December 2018 in Help

Hi all,

I am new to RapidMiner.

I want to know how I can extract the 10 most representative keywords from a web page.

Is there a node that can do this? If not, please tell me how I can do it.

I want to give the URL of a web page as input and get the 10 most representative keywords of that page as output.

Thanks in advance.


Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    You're going to use the Get Page operator, do some HTML cleaning with another operator, then put the result into a Text Processing routine. I'm running out the door, but do take a look through the Community for some XML examples.
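
To make the "HTML cleaning" step concrete, here is a minimal sketch of the idea outside RapidMiner, using only the Python standard library. It is not what the Get Page or Extract Content operators do internally; the names `TextExtractor` and `html_to_text` are hypothetical, and skipping only `script` and `style` content is an assumption about what counts as noise.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

The cleaned string is what would then go into tokenizing and word-vector creation.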

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    As @Thomas_Ott suggests, this is definitely possible, but it will require a series of operators. Working with text from web pages can be quite tricky because of all the extra HTML and formatting.

    It also depends on what you mean by "10 most representative" words.  Many times, the most frequent words are not necessarily the words that capture the main topic of the page.  So even after you have done text processing and have a word vector, you need to think about what exactly your definition of "most representative" might mean.  Different ways of calculating the word vector can help with that: TF-IDF vs term frequency, for example.
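
As an illustration of why the weighting scheme matters, here is a small Python sketch (standard library only) that ranks the terms of one document by TF-IDF. It is one possible definition of "most representative", not RapidMiner's implementation; the function names, the letters-only tokenizer, and the unsmoothed `log(N / df)` weighting are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(docs, doc_index, k=10):
    """Return the k highest-scoring TF-IDF terms of docs[doc_index]."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

With plain term frequency, a word such as "the" would rank first in most pages. Under this weighting, a term that occurs in every document scores zero, so it drops to the bottom of the list.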

     

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Might I suggest using this process from my Tutorial page here: http://www.neuralmarkettrends.com/use-rapidminer-discover-twitter-content as a starting point. TTYL!

  • singing_bird_1 Member Posts: 16 Contributor I

    By "10 most representative keywords" I mean that, of all the keywords extracted from the page, I want only the 10 that best describe the content or context of the page.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Yes, I agree with @Telcontar120. I would learn how to use the Text Processing Extension so you can tokenize, create word vectors, and so on.

     

    Scott

  • singing_bird_1 Member Posts: 16 Contributor I

    Thanks, all, for your replies.

    I am now preprocessing the web pages: first I filter out the HTML tags, then I will start the text preprocessing.

    I have a question, please. I am at the first step, removing the HTML tags.

    I included 9 URLs in a CSV file to be processed, but after removing the HTML tags I get the text of only one web page, not all 9.

    How can I get the text, with the HTML tags removed, for more than one URL?
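
For comparison, the intended behavior (one cleaned text per URL in the CSV) can be sketched in plain Python. This does not diagnose the RapidMiner process; it only shows the expected shape of the result. The helper names are hypothetical, the regex tag stripping is deliberately crude, and the URL is assumed to sit in the first column of a CSV with no header row.

```python
import csv
import io
import re
from urllib.request import urlopen


def read_urls(csv_text):
    """Return the first column of each non-empty CSV row (no header assumed)."""
    rows = csv.reader(io.StringIO(csv_text))
    return [row[0].strip() for row in rows if row and row[0].strip()]


def fetch(url):
    """Download a page and decode it as UTF-8, replacing invalid bytes."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def strip_tags(html):
    """Crude tag removal: replace tags with spaces, then collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", without_tags).strip()


def texts_for_urls(urls, fetch_page=fetch):
    """Return one (url, cleaned_text) pair per input URL, in input order."""
    return [(url, strip_tags(fetch_page(url))) for url in urls]
```

With 9 URLs in the file, `texts_for_urls(read_urls(csv_text))` should return 9 pairs. Seeing one row per URL is the check to look for in the RapidMiner result as well.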

  • singing_bird_1 Member Posts: 16 Contributor I

    Here is the XML for my process:

    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="open_file" compatibility="7.5.003" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
    <parameter key="filename" value="C:\Users\Mennatollah\Desktop\url_test_test.csv"/>
    </operator>
    <operator activated="true" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="179" y="34">
    <parameter key="use_quotes" value="false"/>
    <parameter key="parse_numbers" value="false"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations"/>
    <list key="data_set_meta_data_information"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="7.5.003" expanded="true" height="82" name="Nominal to Text" width="90" x="246" y="187"/>
    <operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="Get Pages" width="90" x="447" y="34">
    <parameter key="link_attribute" value="att1"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.5.003" expanded="true" height="103" name="Multiply" width="90" x="380" y="289"/>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="581" y="34">
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_method" value="by ranking"/>
    <parameter key="prune_below_rank" value="0.009"/>
    <parameter key="prune_above_rank" value="0.095"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="45" y="34">
    <parameter key="ignore_non_html_tags" value="false"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.5.003" expanded="true" height="103" name="Multiply (2)" width="90" x="448" y="44"/>
    <connect from_port="document" to_op="Extract Content" to_port="document"/>
    <connect from_op="Extract Content" from_port="document" to_op="Multiply (2)" to_port="input"/>
    <connect from_op="Multiply (2)" from_port="output 1" to_port="document 2"/>
    <connect from_op="Multiply (2)" from_port="output 2" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    <portSpacing port="sink_document 3" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Open File" from_port="file" to_op="Read CSV" to_port="file"/>
    <connect from_op="Read CSV" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Get Pages" to_port="Example Set"/>
    <connect from_op="Get Pages" from_port="Example Set" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_port="result 2"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>
     

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    hello @singing_bird_1 - I'm glad you're making progress. Can you please re-post your XML inside the </> tool so that we can copy and paste it ourselves into RapidMiner?

     

    Thanks.


    Scott

     

  • singing_bird_1 Member Posts: 16 Contributor I

    I have attached the XML code.

    Thank you.

  • singing_bird_1 Member Posts: 16 Contributor I
    <?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="open_file" compatibility="7.5.003" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
    <parameter key="filename" value="C:\Users\Mennatollah\Desktop\url_test_test.csv"/>
    </operator>
    <operator activated="true" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="179" y="34">
    <parameter key="use_quotes" value="false"/>
    <parameter key="parse_numbers" value="false"/>
    <parameter key="first_row_as_names" value="false"/>
    <list key="annotations"/>
    <list key="data_set_meta_data_information"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="7.5.003" expanded="true" height="82" name="Nominal to Text" width="90" x="246" y="187"/>
    <operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="Get Pages" width="90" x="447" y="34">
    <parameter key="link_attribute" value="att1"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.5.003" expanded="true" height="103" name="Multiply" width="90" x="380" y="289"/>
    <operator activated="true" class="textSmiley Tonguerocess_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="581" y="34">
    <parameter key="keep_text" value="true"/>
    <parameter key="prune_method" value="by ranking"/>
    <parameter key="prune_below_rank" value="0.009"/>
    <parameter key="prune_above_rank" value="0.095"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="45" y="34">
    <parameter key="ignore_non_html_tags" value="false"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="7.5.003" expanded="true" height="103" name="Multiply (2)" width="90" x="448" y="44"/>
    <connect from_port="document" to_op="Extract Content" to_port="document"/>
    <connect from_op="Extract Content" from_port="document" to_op="Multiply (2)" to_port="input"/>
    <connect from_op="Multiply (2)" from_port="output 1" to_port="document 2"/>
    <connect from_op="Multiply (2)" from_port="output 2" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    <portSpacing port="sink_document 3" spacing="0"/>
    </process>
    </operator>
    <connect from_op="Open File" from_port="file" to_op="Read CSV" to_port="file"/>
    <connect from_op="Read CSV" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Get Pages" to_port="Example Set"/>
    <connect from_op="Get Pages" from_port="Example Set" to_op="Multiply" to_port="input"/>
    <connect from_op="Multiply" from_port="output 1" to_port="result 2"/>
    <connect from_op="Multiply" from_port="output 2" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

    Here is the XML code.

    Thank you.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @singing_bird_1, OK, we're making some progress. Thank you for pasting your XML. It seems that you are running RM 7.5, which is an old version. Some of your operators were updated in 7.6, and you have pasted things like

    "textSmiley Tonguerocess_document_from_data"

    in your XML which does not work well.  :)  Can you please try updating RapidMiner to 7.6, opening your process, going to the XML tab, copying exactly what is there, and pasting it here again in this thread?

     

    Scott

     
