RapidMiner

Extracting the most representative 10 keywords from web page

Contributor II singing_bird_1

Hi all,

I am new to RapidMiner.

I want to know how I can extract the 10 most representative keywords from a web page.

Is there an operator that can do this? If not, please tell me how I can do it.

I want to give the URL of a web page as input and get the 10 most representative keywords of that page as output.

Thanks in advance.

11 REPLIES
RM Certified Expert

Re: Extracting the most representative 10 keywords from web page

You're going to use the Get Page operator, do some HTML cleaning with another operator, and then feed the result into a Text Processing routine. I'm running out the door, but do take a look through the Community for some XML examples.
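For readers who want to see the same three steps (get the page, clean the HTML, process the text) outside RapidMiner, here is a minimal Python sketch using only the standard library. It is not the RapidMiner process itself: the names `TextExtractor`, `html_to_text`, `tokenize`, `top_keywords`, and the tiny `STOP_WORDS` list are all made up for illustration, and fetching the page (for example with `urllib.request`) is left out so that the sketch works on an HTML string you already have.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on", "with"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Step 2: strip the markup and keep only the readable text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Step 3a: lowercase, split on non-letters, drop stop words and short tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) > 2 and w not in STOP_WORDS]


def top_keywords(html, k=10):
    """Step 3b: rank the remaining words by raw frequency and keep the top k."""
    counts = Counter(tokenize(html_to_text(html)))
    return [word for word, _ in counts.most_common(k)]
```

Calling `top_keywords(page_html)` returns up to 10 words ranked by plain frequency. That is only one possible definition of "representative", and probably not the best one.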

RM Certified Expert

Re: Extracting the most representative 10 keywords from web page

As @Thomas_Ott suggests, this is definitely possible, but it will require a series of operators. Working with text from web pages can be quite tricky because of all the extra HTML and formatting.

It also depends on what you mean by the "10 most representative" words. Often, the most frequent words are not the ones that capture the main topic of the page. So even after you have done the text processing and have a word vector, you need to decide what "most representative" means for your purpose. Different ways of calculating the word vector can help with that: TF-IDF versus plain term frequency, for example.
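To make the difference concrete, here is a small Python sketch that ranks the same page by term frequency and by TF-IDF. It uses one common textbook variant (idf = log(N / df)). RapidMiner's own word-vector calculation may weight or normalize differently, so treat the numbers as illustrative only. The function names and the three tiny "pages" are invented for the example.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Relative frequency of each term within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: n / total for term, n in counts.items()}


def tf_idf(docs):
    """Return one {term: score} dict per document.

    idf = log(N / df), so a term that occurs in every document scores 0.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    scores = []
    for tokens in docs:
        tf = term_frequency(tokens)
        scores.append({t: f * math.log(n_docs / doc_freq[t]) for t, f in tf.items()})
    return scores


def top_terms(scores, k=10):
    """Terms with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical, already-tokenized pages (made up for illustration).
pages = [
    ["home", "home", "home", "rapidminer", "rapidminer", "text"],
    ["home", "home", "recipes", "pasta"],
    ["home", "home", "football", "scores"],
]

by_tf = top_terms(term_frequency(pages[0]), k=1)
by_tfidf = top_terms(tf_idf(pages)[0], k=1)
```

Here "home" is the most frequent word on the first page, so it wins under term frequency. Because it appears on every page, its TF-IDF score is zero, and "rapidminer" comes out on top instead. Note that TF-IDF needs more than one document to be meaningful, which is one reason to process several pages together.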

 

 

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts
RM Certified Expert

Re: Extracting the most representative 10 keywords from web page

Might I suggest using this process from my Tutorial page here: http://www.neuralmarkettrends.com/use-rapidminer-discover-twitter-content as a starting point. TTYL!

Contributor II singing_bird_1

Re: Extracting the most representative 10 keywords from web page

By "10 most representative keywords" I mean that, of all the keywords extracted from the page, I want only the 10 that best describe the content or context of the page.

Community Manager

Re: Extracting the most representative 10 keywords from web page

Yes, I agree with @Telcontar120. I would learn how to use the Text Processing extension so that you can tokenize, create word vectors, and so on.

 

Scott

Scott Genzer
Senior Community Manager
RapidMiner, Inc.
Contributor II singing_bird_1

Re: Extracting the most representative 10 keywords from web page

Thanks, all, for your replies.

I am now preprocessing the web pages.

First I filter out the HTML tags, then I will start the text preprocessing.

I have a question, please. I am at the first step, removing the HTML tags.

I put 9 URLs in a CSV file to be processed, but after removing the HTML tags I get the text of only one URL (one web page), not of all 9 web pages.

How can I get the text, with the HTML tags removed, for more than one URL?

Contributor II singing_bird_1

Re: Extracting the most representative 10 keywords from web page

Here is the XML for my process:

<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="open_file" compatibility="7.5.003" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
<parameter key="filename" value="C:\Users\Mennatollah\Desktop\url_test_test.csv"/>
</operator>
<operator activated="true" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="179" y="34">
<parameter key="use_quotes" value="false"/>
<parameter key="parse_numbers" value="false"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.5.003" expanded="true" height="82" name="Nominal to Text" width="90" x="246" y="187"/>
<operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="Get Pages" width="90" x="447" y="34">
<parameter key="link_attribute" value="att1"/>
</operator>
<operator activated="true" class="multiply" compatibility="7.5.003" expanded="true" height="103" name="Multiply" width="90" x="380" y="289"/>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="581" y="34">
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="by ranking"/>
<parameter key="prune_below_rank" value="0.009"/>
<parameter key="prune_above_rank" value="0.095"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="45" y="34">
<parameter key="ignore_non_html_tags" value="false"/>
</operator>
<operator activated="true" class="multiply" compatibility="7.5.003" expanded="true" height="103" name="Multiply (2)" width="90" x="448" y="44"/>
<connect from_port="document" to_op="Extract Content" to_port="document"/>
<connect from_op="Extract Content" from_port="document" to_op="Multiply (2)" to_port="input"/>
<connect from_op="Multiply (2)" from_port="output 1" to_port="document 2"/>
<connect from_op="Multiply (2)" from_port="output 2" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
<portSpacing port="sink_document 3" spacing="0"/>
</process>
</operator>
<connect from_op="Open File" from_port="file" to_op="Read CSV" to_port="file"/>
<connect from_op="Read CSV" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Get Pages" to_port="Example Set"/>
<connect from_op="Get Pages" from_port="Example Set" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_port="result 2"/>
<connect from_op="Multiply" from_port="output 2" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
 

Community Manager

Re: Extracting the most representative 10 keywords from web page

Hello @singing_bird_1, I'm glad you're making progress. Can you please re-post your XML using the </> tool so that we can copy and paste it into RapidMiner ourselves?

 

Thanks.


Scott

 

Scott Genzer
Senior Community Manager
RapidMiner, Inc.
Contributor II singing_bird_1

Re: Extracting the most representative 10 keywords from web page

I have attached the XML code.

Thank you.