‎11-01-2017 01:38 PM

 Hello RapidMiners -


So today I had the task of extracting and organizing content from a Google Scholar query. Google does a very good job of preventing scraping/crawling, so you have to start "old school": go to each page of your search results and save the HTML as a text file. Once you have done that, you can clean it all up, organize it, and so on. I did a search for the keyword "rapidminer" (of course), saved the first 100 pages (tedious but not too bad), and then used the attached process to clean it all up. Maybe some of you will find this useful?
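For anyone who wants to try the same clean-up idea outside RapidMiner, here is a rough Python sketch. It is not the attached process, just an illustration of the "parse the saved pages" step. It assumes each result title sits in an `<h3 class="gs_rt">` element, which is what Google Scholar pages have used but could change at any time, and that the pages were saved as `.txt` files in one folder. The names `ScholarTitleParser`, `extract_titles`, and `extract_from_folder` are made up for this example.

```python
from html.parser import HTMLParser
from pathlib import Path


class ScholarTitleParser(HTMLParser):
    """Collect the text inside each <h3 class="gs_rt"> element.

    Assumption: "gs_rt" is the class Google Scholar puts on result titles.
    Check a saved page and adjust if the markup differs.
    """

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        # A class attribute may hold several names, so split before matching.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and "gs_rt" in classes:
            self._in_title = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_title:
            # Join the text pieces and collapse runs of whitespace.
            self.titles.append(" ".join("".join(self._buf).split()))
            self._in_title = False

    def handle_data(self, data):
        # Markers such as [PDF] inside the heading are kept as plain text.
        if self._in_title:
            self._buf.append(data)


def extract_titles(html_text):
    """Return the result titles found in one saved page."""
    parser = ScholarTitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.titles


def extract_from_folder(folder):
    """Read every saved .txt page in a folder and gather all titles."""
    titles = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        titles.extend(extract_titles(text))
    return titles
```

Calling `extract_from_folder("scholar_pages")` (a hypothetical folder name) would return one list of titles across all saved pages, which you could then write out to CSV or load into RapidMiner for further organizing.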




Scott Genzer
Senior Community Manager
RapidMiner, Inc.