Google Scholar Citation Extraction

sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
edited December 2018 in Knowledge Base
Screen Shot 2017-11-01 at 1.35.57 PM.png

 Hello RapidMiners -

 

Today I had to extract and organize content from a Google Scholar query. Google does a very good job of preventing scraping and crawling, so you have to start "old school": go to each page of your search results and save the HTML as a text file. Once you have done that, you can clean it all up and organize it. I searched for the keyword "rapidminer" (of course), saved the first 100 pages (tedious but not too bad), and then used the attached process to clean it all up. Maybe some of you will find this useful?
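
For readers who would rather script the clean-up step, here is a minimal Python sketch of the same idea. It is not the attached RapidMiner process, only an illustration of pulling titles, bylines, years, and citation counts out of saved result pages. The class names `gs_ri`, `gs_rt`, and `gs_a`, and the "Cited by N" link text, are assumptions about how Google Scholar marks up its results; check them against your own saved files. The folder layout in `parse_saved_pages` is hypothetical.

```python
import re
from html import unescape
from pathlib import Path

# Assumed markup: each result sits in <div class="gs_ri">, with the title in
# <h3 class="gs_rt"> and the author/venue/year byline in <div class="gs_a">.
TAG = re.compile(r"<[^>]+>")
TITLE = re.compile(r'<h3[^>]*class="gs_rt"[^>]*>(.*?)</h3>', re.S)
BYLINE = re.compile(r'<div[^>]*class="gs_a"[^>]*>(.*?)</div>', re.S)
CITED = re.compile(r"Cited by (\d+)")
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def text_of(fragment):
    """Strip tags, decode HTML entities, and collapse whitespace."""
    return " ".join(unescape(TAG.sub("", fragment)).split())


def parse_scholar_page(html):
    """Return one dict per result block found in a saved results page."""
    rows = []
    # Everything between one 'gs_ri' marker and the next is treated as one result.
    for chunk in re.split(r'<div[^>]*class="gs_ri"[^>]*>', html)[1:]:
        title = TITLE.search(chunk)
        byline = BYLINE.search(chunk)
        cited = CITED.search(chunk)
        byline_text = text_of(byline.group(1)) if byline else ""
        year = YEAR.search(byline_text)
        rows.append({
            "title": text_of(title.group(1)) if title else "",
            "byline": byline_text,
            "year": int(year.group()) if year else None,
            "cited_by": int(cited.group(1)) if cited else 0,
        })
    return rows


def parse_saved_pages(folder):
    """Parse every saved .txt/.html file in `folder` (hypothetical layout)."""
    rows = []
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() in {".txt", ".html", ".htm"}:
            text = path.read_text(encoding="utf-8", errors="replace")
            rows.extend(parse_scholar_page(text))
    return rows
```

Regular expressions are fragile against markup changes, so treat this as a starting point; a proper HTML parser would be more robust if the page structure differs from what is assumed here.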

 

Scott

 


Comments

  • puserc Member Posts: 6 Contributor I

    Would you please give us the XML version of this model?

    I ran into some problems running it in RapidMiner 8.2.001.



  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hi @puserc - the XML is there in the attachment to the article. A ".rmp" file in RapidMiner is exactly the same as the XML you see. :)

  • puserc Member Posts: 6 Contributor I

    I know. The problem is that I couldn't run it directly; there are issues with some of the nodes. That's why I asked for the XML version.

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Just open that .rmp in any text editor, then copy and paste the XML into the RapidMiner XML panel. That should do the trick.
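
Because a .rmp file is plain XML, it can also be inspected outside RapidMiner before pasting it in. The sketch below is one way to list the operators a process uses, which may help narrow down which nodes fail to load in a given version. It assumes the file nests `<operator>` elements carrying `name`, `class`, and `compatibility` attributes; verify that against your own file, since any attribute that is missing simply comes back as `None`.

```python
import xml.etree.ElementTree as ET


def list_operators(rmp_xml):
    """Return (name, class, compatibility) for every <operator> element.

    `rmp_xml` is the content of a .rmp file, as str or bytes.
    The attribute names are assumptions about the file layout.
    """
    root = ET.fromstring(rmp_xml)
    return [
        (op.get("name"), op.get("class"), op.get("compatibility"))
        for op in root.iter("operator")
    ]
```

For example, `list_operators(Path("process.rmp").read_bytes())` (with `from pathlib import Path`, and `process.rmp` as a placeholder file name) returns one tuple per operator in document order.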

  • 19316071 Member Posts: 1 Contributor I

    Hi @sgenzer

    I am new to RapidMiner and have the same task: I want to extract Google Scholar citations. I have worked through the beginner-level RapidMiner tutorial. Could you please explain in a little more detail how you built the process, to give me a head start? It would be a great help.

     

    I am also keen to learn text mining in depth in RapidMiner, for extracting information from published research articles. Could you or anyone else please recommend some good learning resources?

     

    Thanks in anticipation

    Mudassar

    19316071@student.westernsydney.edu.au
