Need to crawl webpages requiring login details

Vineet Member Posts: 16 Contributor II
edited November 2018 in Help
Hello,
I need to crawl certain websites, but they require login details to be entered.
I can't figure out how to provide my login details in order to get access to my homepage. Is there an operator, or any other way, to do that?
Please help!
Thanks and Regards,
Vineet

Answers

  • Skirzynski Member Posts: 164 Maven
    Hey,

    The Get Page operator has an option to accept cookies. Enable it and send your credentials (username and password) to the login page as POST parameters. A website usually stores your session in a cookie, so subsequent requests from any Get Page or Get Pages operator are handled by the website in the same session (using the stored cookies). If your credentials are correct, you are then logged in and can fetch the login-protected pages.

    Happy crawling!
      Marcin
  • Vineet Member Posts: 16 Contributor II
    Hello Marcin,
    I appreciate your quick reply. That is what I was trying to do, but somehow I can't get it to work.
    Here is my code for your reference.
    Please tell me where I am going wrong.

    Thanks and Regards,
    Vineet
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.2.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
        <process expanded="true" height="505" width="614">
          <operator activated="true" class="web:get_webpage" compatibility="5.2.003" expanded="true" height="60" name="Get Page" width="90" x="45" y="30">
            <parameter key="url" value="http://www.gmail.com"/>
            <parameter key="user_agent" value="  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.12 Safari/537.4 "/>
            <parameter key="read_timeout" value="1000"/>
            <parameter key="accept_cookies" value="all"/>
            <parameter key="request_method" value="POST"/>
            <list key="query_parameters">
              <parameter key="&amp;Email" value="infospace007@gmail.com"/>
              <parameter key="&amp;Passwd" value="infospace"/>
            </list>
            <list key="request_properties">
              <parameter key="&amp;Email" value="infospace007@gmail.com"/>
              <parameter key="&amp;Passwd" value="infospace"/>
            </list>
          </operator>
          <operator activated="true" class="read_excel" compatibility="5.2.008" expanded="true" height="60" name="Read Excel" width="90" x="45" y="120">
            <parameter key="excel_file" value="F:\Try\AuthLinks.xlsx"/>
            <parameter key="imported_cell_range" value="A2:B3"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="A.true.integer.attribute"/>
              <parameter key="1" value="B.true.binominal.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="web:retrieve_webpages" compatibility="5.2.003" expanded="true" height="60" name="Get Pages" width="90" x="246" y="120">
            <parameter key="link_attribute" value="B"/>
            <parameter key="page_attribute" value="MyPage"/>
            <parameter key="user_agent" value="  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.12 Safari/537.4 "/>
            <parameter key="accept_cookies" value="all"/>
            <parameter key="request_method" value="POST"/>
          </operator>
          <connect from_op="Get Page" from_port="output" to_port="result 1"/>
          <connect from_op="Read Excel" from_port="output" to_op="Get Pages" to_port="Example Set"/>
          <connect from_op="Get Pages" from_port="Example Set" to_port="result 2"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
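
For comparison, here is a minimal sketch of how the login step Marcin describes (cookies enabled, credentials sent as POST parameters) might be configured. It reuses only the parameter keys that already appear in the process above. The URL and the field names "username" and "password" are placeholders, not values from this thread: they would have to be replaced with the address the site's login form actually submits to and with the names of that form's input fields. The sketch also assumes that query parameter keys are plain field names with no leading ampersand, and that credentials belong in query_parameters only, since request_properties appears to set HTTP request headers rather than form fields.

```xml
<!-- Sketch only: the URL and both field names below are hypothetical placeholders. -->
<operator activated="true" class="web:get_webpage" compatibility="5.2.003" expanded="true" name="Login">
  <!-- Assumed: POST goes to the address the login form submits to, not the site's front page. -->
  <parameter key="url" value="https://example.com/login"/>
  <!-- Keep the session cookie so later Get Page / Get Pages requests can reuse it. -->
  <parameter key="accept_cookies" value="all"/>
  <parameter key="request_method" value="POST"/>
  <list key="query_parameters">
    <!-- Assumed: keys are the form's input names, written without a leading ampersand. -->
    <parameter key="username" value="YOUR_USERNAME"/>
    <parameter key="password" value="YOUR_PASSWORD"/>
  </list>
</operator>
```

Whether a plain form POST like this is enough depends on the site. Login forms that rely on hidden tokens or JavaScript may need more than these two fields.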