Can RapidMiner run an automated, regular (say daily) search for a list of words across a list of URLs, and return the link of each matching page?
I have a list of words, and I want to regularly retrieve every web link where any of these words appears on any of the sites in my predefined URL list.
You can use the Get Pages operator to retrieve the contents of a number of websites whose links you provide in a data table.
You can then use the Text Processing extension to count the words that appear on the different sites. Our website provides some links to video tutorials for the text mining extension:
http://rapid-i.com/content/view/189/212/lang,en/
To focus on the contents of the websites and remove all HTML tags, you can use the Extract Content operator.
Finally, to execute the job regularly, you should use the RapidAnalytics server, also available on our website.
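For readers who want to see the logic behind the tag-stripping and word-counting steps, here is a minimal Python sketch. It is not RapidMiner code and does not reproduce what Extract Content or the Text Processing extension do internally; the class TextExtractor, the function count_words, and the sample HTML are all hypothetical, and the tokenizer is a simplification that only handles ASCII letters and digits.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def count_words(html, target_words):
    """Return {word: occurrences} for each target word in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    # Simplified tokenizer (assumption): ASCII letters and digits only.
    tokens = Counter(re.findall(r"[a-z0-9]+", text))
    return {w: tokens[w.lower()] for w in target_words}


# Hypothetical page content standing in for a downloaded site.
sample_html = (
    "<html><body><h1>Data mining</h1><p>Mining text data.</p>"
    "<script>var mining = 1;</script></body></html>"
)
counts = count_words(sample_html, ["mining", "data", "forecast"])
print(counts)
```

The word inside the script block is ignored, so only the visible text contributes to the counts.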
Thank you. I'm almost there. However, to finish the job, after I extract the words with Extract Content as you suggest, I still need to produce, for every extracted word, a list or folder of the matching pages (the URL links in a document, or the HTML pages in a folder, etc.). How can I do this?
In other words, my job is to filter a predefined list of sites (the filter being a list of various words), AND THE RESULT must be the specific WEB LINKS to the pages on those sites where the words appear.
After the Process Documents operator you should have a table that contains the occurrences of each word (columns) in each document (rows), along with the URL of the page in the URL attribute.
Now you can iterate over your target words and use Filter Examples to keep only those rows where the column for the current word contains a value greater than zero. Then you can write the URLs of the matching documents to the hard disk, e.g. with the Write Excel or Write CSV operator.
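The filtering logic described above can be sketched in plain Python. This only illustrates the idea and is not a RapidMiner process; the table contents, the example.com URLs, and the helper names urls_per_word and write_matches are hypothetical. It assumes one row per document, a URL field, and one count column per word.

```python
import csv
import io


def urls_per_word(rows, target_words):
    """Map each target word to the URLs of rows where its count is > 0."""
    matches = {}
    for word in target_words:
        matches[word] = [row["URL"] for row in rows if row.get(word, 0) > 0]
    return matches


def write_matches(matches, out):
    """Write one (word, url) pair per line in CSV form."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["word", "url"])
    for word, urls in matches.items():
        for url in urls:
            writer.writerow([word, url])


# Hypothetical word-occurrence table: one row per document.
rows = [
    {"URL": "http://example.com/a", "mining": 3, "forecast": 0},
    {"URL": "http://example.com/b", "mining": 0, "forecast": 2},
    {"URL": "http://example.com/c", "mining": 1, "forecast": 1},
]
matches = urls_per_word(rows, ["mining", "forecast"])

# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
write_matches(matches, buffer)
print(buffer.getvalue())
```

To write to an actual file instead, pass a handle opened with open(path, "w", newline="") in place of the buffer.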
Does that help? If you have any questions left, please attach the XML of your process so that we can use it as a basis for our answer.
Below is the process, as far as I could get. Can I count on you to make it work and finish this job (that is, to finally get the list of URLs of the pages where the searched words appear on the predefined list of websites)?
Just a friendly reminder: this is a community forum where members of the community can help each other out. Sometimes, when time allows, we do chip in and provide answers to some questions. However, there is never a guarantee that we will answer in this forum. If you do need support with fixed response times, please contact us and inquire about enterprise support.
_________________________________________________________ Team Lead Software Engineering | RapidMiner GmbH