
"Crawl Web Link-Page pairs are incorrect"

CharlieFirpo Member Posts: 48 Contributor II
edited June 2019 in Help
Dear all!

When I use the Crawl Web operator with the "add pages as attribute" parameter checked, the result consists of Link-Page pairs (the number of examples depends on the "max pages" parameter). But when I open the URL stored in the Link attribute (Url 1) and look at its HTML, the actual content differs from the content of the Page attribute. How can that be?

If I don't store the page in Crawl Web and instead use a Get Pages operator, with its "link attribute" parameter set to the Link attribute (from Crawl Web) and its "page attribute" parameter set to Page, the Link-Page pairs are mismatched too (just as in Crawl Web). The output of Get Pages also contains a URL attribute next to the Link and Page attributes (plus a few more attributes). This URL attribute holds the real URL (Url 2) that the Page attribute's value belongs to. So the HTML content at the URL attribute's value (Url 2) is the same as the Page attribute's content, but it is different from the content at the Link attribute's value (Url 1).

But I don't understand why the Link attribute and the URL attribute are different, or why the Page attribute's values don't belong to the Link attribute's values.


  • CharlieFirpo Member Posts: 48 Contributor II

    So the difference between the content at the input URL (the Link attribute's value) and the Page value (the HTML content) arises because Crawl Web does not have cookies enabled. In my web browser cookies are enabled, so when I opened the URLs and looked at their content, it was different from the value of Crawl Web's Page attribute.

    In the Get Pages operator it is possible to enable cookies, but Crawl Web has no parameter to enable cookies. Or does it?
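
To illustrate the effect outside RapidMiner, here is a minimal Python sketch. It is not how Crawl Web or Get Pages work internally. It assumes a hypothetical site (the `/article` and `/consent` paths, the `session=ok` cookie, and the `DemoHandler`, `fetch` and `run_demo` names are all made up for the demo) that redirects clients without a cookie to another page. A client without cookie support then ends up with a final URL and page content that differ from the link it requested, while a cookie-enabled client gets the expected page on its next visit.

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical site: /article is only served to clients sending a cookie."""

    def do_GET(self):
        cookies = self.headers.get("Cookie", "")
        if self.path == "/article" and "session=ok" not in cookies:
            # No cookie yet: hand one out and redirect to the consent page.
            self.send_response(302)
            self.send_header("Location", "/consent")
            self.send_header("Set-Cookie", "session=ok; Path=/")
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        if self.path == "/article":
            body = b"<html>article</html>"
        else:
            body = b"<html>consent</html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(opener, url):
    """Return (final URL after redirects, page content) for the requested link."""
    with opener.open(url) as resp:
        return resp.geturl(), resp.read().decode()


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    link = f"http://127.0.0.1:{server.server_port}/article"
    no_proxy = urllib.request.ProxyHandler({})  # talk to the local server directly
    try:
        # Client without cookie support (stands in for the crawler).
        plain = urllib.request.build_opener(no_proxy)
        crawler_result = fetch(plain, link)

        # Client with a cookie jar (stands in for the browser).
        with_cookies = urllib.request.build_opener(
            no_proxy, urllib.request.HTTPCookieProcessor(CookieJar())
        )
        fetch(with_cookies, link)  # first visit only collects the cookie
        browser_result = fetch(with_cookies, link)
    finally:
        server.shutdown()
        server.server_close()
    return link, crawler_result, browser_result


if __name__ == "__main__":
    link, crawler_result, browser_result = run_demo()
    print("requested link:   ", link)
    print("without cookies:  ", crawler_result)
    print("with cookies:     ", browser_result)
```

In this sketch the requested link plays the role of the Link attribute, and the final URL returned by `fetch` plays the role of the URL attribute: without cookies the two differ, and the content belongs to the final URL rather than to the link.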
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Unfortunately, Crawl Web does not support the option to set cookies. But isn't it sufficient to log in with the Get Page operator and then use Crawl Web?

    Best regards,