RapidMiner

Downloading PDFs

Contributor II

Hi, I'm using the "Crawl Web" process to download PDF documents on a Windows 7 Pro machine, using version 5.3.008 of RapidMiner. Is there a way of getting RapidMiner to download the documents in question without modifying them? The resulting files are corrupted in at least two different ways.

When I try to download a PDF document directly, I get the following message:
"There was an error opening this document. The file is damaged and could not be repaired."

When I try to download a document that is accessed via a link such as ...download.php?id=..., I can open the resulting document, but it shows only a series of empty pages.

Investigating these two types of files in Notepad suggests that the latter version is much closer to being the correct format, which is ironic in a sense since the pathname doesn't include the PDF name in that case.
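For a more precise comparison than Notepad allows, the files can be inspected as raw bytes. The following is a minimal sketch, not part of RapidMiner: the function names `pdf_markers` and `inspect_file` are hypothetical, and it only checks for the `%PDF-` header at the start and the `%%EOF` marker near the end, which a normal PDF is expected to have.

```python
def pdf_markers(data: bytes) -> dict:
    """Report whether the usual PDF start and end markers are present."""
    return {
        # A well-formed PDF begins with a "%PDF-" header such as "%PDF-1.4".
        "has_header": data.startswith(b"%PDF-"),
        # The "%%EOF" marker normally sits within the last bytes of the file.
        "has_eof": b"%%EOF" in data[-1024:],
        "size": len(data),
    }


def inspect_file(path: str) -> dict:
    """Read a file in binary mode and report its PDF markers."""
    # Binary mode ("rb") ensures no text decoding touches the bytes.
    with open(path, "rb") as handle:
        return pdf_markers(handle.read())
```

Running `inspect_file` on a copy saved manually and on the copy written by the crawler, and comparing the `size` values, may show whether the crawler's output has lost or changed bytes.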

I have left the Encoding settings as the SYSTEM default, although I have tried one or two alternative settings to no avail.
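One possible reason the encoding setting matters, offered as an assumption rather than a confirmed diagnosis of what Crawl Web does internally: if binary PDF data is decoded as text and then written back out, any byte that has no mapping in the chosen encoding gets replaced. The sketch below illustrates that effect using `cp1252`, a common Windows default; the helper names are hypothetical.

```python
def roundtrip_as_text(data: bytes, encoding: str = "cp1252") -> bytes:
    """Decode bytes as text and encode them again, as a text pipeline might."""
    # errors="replace" substitutes a placeholder for anything unmappable,
    # which silently alters binary content.
    text = data.decode(encoding, errors="replace")
    return text.encode(encoding, errors="replace")


def count_changed_bytes(original: bytes, altered: bytes) -> int:
    """Count positions at which two equal-length byte strings differ."""
    return sum(1 for a, b in zip(original, altered) if a != b)


# The ASCII header survives, but bytes undefined in cp1252 do not.
sample = b"%PDF-1.4\n" + bytes([0x81, 0x8D, 0x90]) + b" stream data"
damaged = roundtrip_as_text(sample)
```

In this illustration the plain-ASCII parts of the file stay intact while the binary bytes are altered, which would be consistent with a file that looks roughly right in Notepad but cannot be rendered properly.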

Can anyone help?

Thanks!
2 REPLIES
Super Contributor

Re: Downloading PDFs

What do you mean by downloading it directly? Do you mean from the browser? If so, the file is probably corrupted on the server, and RapidMiner has no chance of getting it right. If I misunderstood you, please let me know.

Best regards,
Marius
Contributor II

Re: Downloading PDFs

Hi Marius, thanks for the reply.

Sorry for the confusion. I simply meant the case where RapidMiner tries to download a PDF via a direct URL, such as:
www.website.com/folder1/otherfolder/filename.pdf

Downloading the PDFs manually via the right-click options works fine. I can also do it via another WGet application. It is only when I try to get RapidMiner to download the documents that I get the problems mentioned.