RapidMiner

11-01-2017 01:38 PM

Hello RapidMiners -

Today I had the task of extracting and organizing content from a Google Scholar query. Google does a very good job of preventing scraping and crawling, so you have to start "old school": go to each page of your search results and save the HTML as a text file. Once you have done that, you can clean it all up and organize it. I searched for the keyword "rapidminer" (of course), saved the first 100 pages (tedious but not too bad), and then used the attached process to clean it all up. Maybe some of you will find this useful?

 

Scott

 

Scott Genzer
Senior Community Manager
RapidMiner, Inc.
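
For readers who want to try the same cleanup outside RapidMiner, here is a minimal Python sketch of the idea in the post: read each saved page and pull out the result titles. It is not the attached process. It assumes that each result title sits in an `<h3>` tag with the CSS class `gs_rt`, that the pages were saved as `.txt` files in one folder, and that they are UTF-8 encoded; the class name, the file pattern, and the function names are all assumptions for illustration and may need adjusting to match your saved pages.

```python
from html.parser import HTMLParser
from pathlib import Path


class TitleExtractor(HTMLParser):
    """Collect the text inside every <h3> tag that carries a given CSS class."""

    def __init__(self, title_class="gs_rt"):  # class name is an assumption
        super().__init__()
        self.title_class = title_class
        self.titles = []
        self._buffer = None  # None while we are outside a title heading

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and self.title_class in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._buffer is not None:
            # Join the text pieces and collapse runs of whitespace.
            title = " ".join("".join(self._buffer).split())
            if title:
                self.titles.append(title)
            self._buffer = None


def extract_titles(html_text, title_class="gs_rt"):
    """Return the list of result titles found in one saved page."""
    parser = TitleExtractor(title_class)
    parser.feed(html_text)
    parser.close()
    return parser.titles


def extract_from_folder(folder, pattern="*.txt"):
    """Return (file name, title) pairs for every saved page in a folder."""
    rows = []
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        rows.extend((path.name, title) for title in extract_titles(text))
    return rows
```

Calling `extract_from_folder("scholar_pages")` on a hypothetical folder of saved pages would give one row per result, which can then be written to CSV or loaded into RapidMiner for further organizing.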
Comments
Learner III puserc

Would you please give us the XML version of this model?

I ran into some problems running it in RapidMiner 8.2.001.



Community Manager

Hi @puserc - the XML is in the attachment to the article. A ".rmp" file in RapidMiner is exactly the same as the XML you see.

Learner III puserc

I know. The problem is that I couldn't run it directly; there are issues with some of the nodes. That's why I asked for the XML version.

Community Manager

Just open that .rmp file in any text editor, then copy and paste the XML into the RapidMiner XML panel. That should do the trick.
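
If you would rather check the file from a script first, here is a small Python sketch of the same idea: it reads a .rmp file, confirms that it parses as well-formed XML, and returns the text ready to paste. The function name `read_process_xml` is made up for this example, and the sketch assumes the file is UTF-8 encoded.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def read_process_xml(rmp_path):
    """Return the XML text of a .rmp file after checking that it parses."""
    data = Path(rmp_path).read_bytes()
    # Raises xml.etree.ElementTree.ParseError if the file is not well-formed XML.
    ET.fromstring(data)
    # Assumption: the file is UTF-8; "utf-8-sig" also drops a leading BOM if present.
    return data.decode("utf-8-sig")
```

Printing the result of `read_process_xml("my_process.rmp")` (a hypothetical file name) gives the same text you would see when opening the file in a text editor.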