"Scrape a website and download hyperlinked pdf files"

gary_molloy · October 2017

I can scrape in Python, but how do I download and store hyperlinked PDFs or other files in their native format using RapidMiner?

Telcontar120 · October 2017

Is the "Open File" operator not doing what you want? It lets you get a file from any URL or file path as a file object, which can then be stored. If you have multiple files, you can put the operator in a loop and use macros to supply each URL or path.

If you want to scrape actual web pages, then use "Get Page" or "Get Pages" instead.
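For comparison with the Python side of the question, the "one file per loop iteration" idea can be sketched outside RapidMiner roughly as below. This is only an illustration under assumptions, not how the RapidMiner operators work internally: the helper names `extract_file_links` and `download`, the `example.com` URLs, and the `downloads` folder are all made up for the example, and only the Python standard library is used.

```python
import shutil
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_file_links(html, base_url, extensions=(".pdf",)):
    """Return absolute, de-duplicated links whose path ends in one of `extensions`.

    Hypothetical helper: relative links are resolved against `base_url`.
    """
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        # Compare only the URL path, so query strings do not hide the extension.
        path = urlparse(absolute).path.lower()
        if path.endswith(tuple(extensions)) and absolute not in links:
            links.append(absolute)
    return links


def download(url, target_dir):
    """Stream one file to `target_dir` unchanged (binary copy) and return its path.

    Hypothetical helper: the file name is taken from the last URL path segment.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / Path(urlparse(url).path).name
    with urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```

Usage would be along the lines of fetching a page, then looping over the links, one download per iteration (the URL here is a placeholder):

```python
page_url = "https://example.com/reports/"  # assumed URL, replace with the real page
html = urlopen(page_url).read().decode("utf-8", errors="replace")
for link in extract_file_links(html, page_url):
    download(link, "downloads")
```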

sgenzer · October 2017

Hello @gary_molloy - if you use the "Crawl Web" operator (Web Mining extension), there is an option to "write pages to disk". This saves the PDFs as normal files. I have done this many times.

Scott
