Downloading PDFs
Hi, I'm using the "Crawl Web" process to download PDF documents on a Windows 7 Pro machine, using version 5.3.008 of RapidMiner. Is there a way of getting RapidMiner to download the documents in question without modifying them? The resulting files are corrupted in at least two different ways.
When I try to download a PDF document directly, I get the following message:
"There was an error opening this document. The file is damaged and could not be repaired."
When I try to download a document that is accessed via a link such as ...download.php?id=..., I can open the resulting document, but it looks like multiple empty pages.
Inspecting both kinds of file in Notepad suggests that the latter is much closer to the correct format, which is ironic in a sense, since in that case the URL doesn't include the PDF's file name.
I have left the Encoding settings as the SYSTEM default, although I have tried one or two alternative settings to no avail.
Can anyone help?
Thanks!
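One possible explanation, which nothing in this thread confirms, is that the downloaded bytes are passed through a text encoding step before being written to disk. The sketch below is not RapidMiner code; it only illustrates, in Python, what such a round trip does to binary data. The function name `roundtrip_through_text` and the sample bytes are made up for the illustration, and the sample is not a real PDF.

```python
def roundtrip_through_text(data: bytes, encoding: str = "utf-8") -> bytes:
    """Decode bytes as text and encode them again, as a text-oriented
    pipeline might do (hypothetical helper, for illustration only)."""
    text = data.decode(encoding, errors="replace")
    return text.encode(encoding)


# Stand-in for PDF content: an ASCII header followed by high-bit bytes
# of the kind found in compressed streams. This is NOT a valid PDF.
sample = b"%PDF-1.4\n" + bytes([0xE2, 0xE3, 0xCF, 0xD3]) + b"\n" + bytes(range(128, 256))

damaged = roundtrip_through_text(sample)

# The ASCII header survives, so the file still "looks like" a PDF in a
# text editor, but the binary part no longer matches the original.
print("header intact:", damaged.startswith(b"%PDF-"))
print("bytes unchanged:", damaged == sample)
```

If something like this is happening, changing the encoding setting would only change how the binary part gets damaged, which would be consistent with the alternative settings not helping.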
Answers
Best regards,
Marius
Sorry for the confusion. By downloading "directly" I simply meant the case where RapidMiner downloads a PDF via a direct URL, such as:
www.website.com/folder1/otherfolder/filename.pdf
Downloading the PDFs manually via the right-click options works fine, and I can also download them with a Wget application. The problems I mentioned only occur when RapidMiner downloads the documents.
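For comparison, this is roughly what a byte-preserving download looks like outside RapidMiner. It is a minimal Python sketch using only the standard library, not a description of what the right-click download or Wget do internally. The names `download_binary` and `looks_like_pdf` are made up for this example, and checking that the file starts with `%PDF-` is a simplified sanity check rather than full PDF validation.

```python
import urllib.request
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # marker at the start of a PDF file


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the data begins with the PDF header marker."""
    return data.startswith(PDF_MAGIC)


def download_binary(url: str, target: Path) -> bool:
    """Fetch url as raw bytes, write them to target unchanged, and
    report whether the result starts like a PDF (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        data = response.read()  # bytes, no text decoding involved
    target.write_bytes(data)    # binary write, no newline translation
    return looks_like_pdf(data)


# Example call, using the placeholder URL from the post above:
# download_binary("http://www.website.com/folder1/otherfolder/filename.pdf",
#                 Path("filename.pdf"))
```

Comparing the size and first bytes of a file saved this way with the one RapidMiner produces for the same URL would show whether the content is being altered after download.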