Extracting the 10 most representative keywords from a web page

singing_bird_1 · September 2017

Hi all,

I am new to RapidMiner.

I want to know how I can extract the 10 most representative keywords from a web page.

Is there a node that can do this? If not, please tell me how I can do it.

I want to give the URL of a web page as input and get the 10 most representative keywords of that page as output.

Thanks in advance.

Thomas_Ott · September 2017

You're going to use the Get Page operator, do some HTML cleaning with another operator, and then put the result into a Text Processing routine. I'm running out the door, but do take a look through the Community for some XML examples.
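For readers who want to see the fetch-and-clean steps outside RapidMiner, here is a minimal sketch in plain Python using only the standard library. It is not the RapidMiner operators themselves; the names `fetch_html`, `TextExtractor`, and `html_to_text` are made up for illustration, and real pages will usually need more cleaning than this.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url):
    """Download a page and decode it (assumes UTF-8 if no charset is declared)."""
    with urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Strip tags from an HTML string and return the remaining text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Calling `html_to_text(fetch_html(url))` would give the raw text that then goes into tokenizing and word-vector creation.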

Telcontar120 · September 2017

As @Thomas_Ott suggests, this is definitely possible, but it will require a series of operators. Working with text from web pages can be quite tricky because of all the extra HTML and formatting.

It also depends on what you mean by "10 most representative" words. Many times, the most frequent words are not necessarily the words that capture the main topic of the page. So even after you have done text processing and have a word vector, you need to think about what exactly your definition of "most representative" might mean. Different ways of calculating the word vector can help with that: TF-IDF vs term frequency, for example.
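To make the difference concrete, here is a small sketch in plain Python (standard library only, not RapidMiner). It ranks the words of one page by raw term frequency and by TF-IDF against a set of pages. The function names, the tiny stopword list, and the unsmoothed `log(N / df)` weighting are assumptions chosen for illustration; RapidMiner's word vector options may compute and normalize these values differently.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def top_by_tf(doc_tokens, k=10):
    """Top k words of one document by raw term frequency."""
    return [w for w, _ in Counter(doc_tokens).most_common(k)]


def top_by_tfidf(corpus_tokens, doc_index, k=10):
    """Top k words of corpus_tokens[doc_index] by TF-IDF within the corpus."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each word once per document
    doc_tokens = corpus_tokens[doc_index]
    tf = Counter(doc_tokens)
    scores = {
        w: (count / len(doc_tokens)) * math.log(n_docs / doc_freq[w])
        for w, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

With three pages that all repeat the word "page" but only one of which mentions "python", term frequency puts "page" first for that document, while TF-IDF pushes it down because it appears everywhere and brings "python" to the top.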

Thomas_Ott · September 2017

Might I suggest using this process from my Tutorial page here: http://www.neuralmarkettrends.com/use-rapidminer-discover-twitter-content as a starting point. TTYL!

singing_bird_1 · September 2017

By "10 most representative keywords" I mean that, of all the keywords extracted from the page, I want only the 10 that best describe the content or context of the page.

sgenzer · September 2017

Yes, I agree with @Telcontar120. I would learn how to use the Text Processing Extension so you can tokenize, create word vectors, etc.

Scott

singing_bird_1 · October 2017

Thanks, all, for your replies.

I am now preprocessing the web pages.

First I filter out the HTML tags, then I will start the rest of the preprocessing.

I have a question, please. I am at the first step, removing the HTML tags.

I included 9 URLs in a CSV file to be processed, but after removing the HTML tags I get the text of only one URL (one web page), not of all 9 web pages.

How can I get the text after removing the HTML tags for more than one URL?

singing_bird_1 · October 2017

Here is the XML for my process:

<?xml version="1.0" encoding="UTF-8"?>
<process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="open_file" compatibility="7.5.003" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
        <parameter key="filename" value="C:\Users\Mennatollah\Desktop\url_test_test.csv"/>
      </operator>
      <operator activated="true" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="179" y="34">
        <parameter key="use_quotes" value="false"/>
        <parameter key="parse_numbers" value="false"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.5.003" expanded="true" height="82" name="Nominal to Text" width="90" x="246" y="187"/>
      <operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="Get Pages" width="90" x="447" y="34">
        <parameter key="link_attribute" value="att1"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.5.003" expanded="true" height="103" name="Multiply" width="90" x="380" y="289"/>
      <operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="581" y="34">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="by ranking"/>
        <parameter key="prune_below_rank" value="0.009"/>
        <parameter key="prune_above_rank" value="0.095"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="45" y="34">
            <parameter key="ignore_non_html_tags" value="false"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="7.5.003" expanded="true" height="103" name="Multiply (2)" width="90" x="448" y="44"/>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Multiply (2)" to_port="input"/>
          <connect from_op="Multiply (2)" from_port="output 1" to_port="document 2"/>
          <connect from_op="Multiply (2)" from_port="output 2" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
          <portSpacing port="sink_document 3" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Open File" from_port="file" to_op="Read CSV" to_port="file"/>
      <connect from_op="Read CSV" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Get Pages" to_port="Example Set"/>
      <connect from_op="Get Pages" from_port="Example Set" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_port="result 2"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

sgenzer · October 2017

Hello @singing_bird_1, I'm glad you're making progress. Can you please re-post your XML using the </> tool so that we can copy and paste it into RapidMiner ourselves?

Thanks.

Scott

singing_bird_1 · October 2017

I have attached the XML code.

Thank you.

singing_bird_1 · October 2017

<?xml version="1.0" encoding="UTF-8"?>
<process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="open_file" compatibility="7.5.003" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
        <parameter key="filename" value="C:\Users\Mennatollah\Desktop\url_test_test.csv"/>
      </operator>
      <operator activated="true" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="179" y="34">
        <parameter key="use_quotes" value="false"/>
        <parameter key="parse_numbers" value="false"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations"/>
        <list key="data_set_meta_data_information"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="7.5.003" expanded="true" height="82" name="Nominal to Text" width="90" x="246" y="187"/>
      <operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="Get Pages" width="90" x="447" y="34">
        <parameter key="link_attribute" value="att1"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="7.5.003" expanded="true" height="103" name="Multiply" width="90" x="380" y="289"/>
      <operator activated="true" class="textSmiley Tonguerocess_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="581" y="34">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="by ranking"/>
        <parameter key="prune_below_rank" value="0.009"/>
        <parameter key="prune_above_rank" value="0.095"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="45" y="34">
            <parameter key="ignore_non_html_tags" value="false"/>
          </operator>
          <operator activated="true" class="multiply" compatibility="7.5.003" expanded="true" height="103" name="Multiply (2)" width="90" x="448" y="44"/>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Multiply (2)" to_port="input"/>
          <connect from_op="Multiply (2)" from_port="output 1" to_port="document 2"/>
          <connect from_op="Multiply (2)" from_port="output 2" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
          <portSpacing port="sink_document 3" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Open File" from_port="file" to_op="Read CSV" to_port="file"/>
      <connect from_op="Read CSV" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Get Pages" to_port="Example Set"/>
      <connect from_op="Get Pages" from_port="Example Set" to_op="Multiply" to_port="input"/>
      <connect from_op="Multiply" from_port="output 1" to_port="result 2"/>
      <connect from_op="Multiply" from_port="output 2" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>

Here is the XML code.

Thank you.

sgenzer · October 2017

Hello @singing_bird_1, OK, we're making some progress. Thank you for pasting your XML. It seems that you are running RM 7.5, which is an old version. Some of your operators were updated in 7.6, and your pasted XML contains things like

"textSmiley Tonguerocess_document_from_data"

which do not work well. Can you please try updating RapidMiner to 7.6, opening your process, going to the XML tab, copying exactly what is there, and pasting it here again in this thread?

Scott
