Read Excel table with 300+ URLs and get page information

Naveen_Vimalan Member Posts: 3 Newbie
I would like to get information such as the response code, response message, content type, etc. for the URLs in my Excel table. My process chain is Read Excel -> Store -> Handle Exception (Get Pages) -> Store. For some reason the result contains only the URL instead of all the information I want. I hope someone can help.

This is the process XML:
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.9.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="9.9.000" expanded="true" height="68" name="Read Excel" width="90" x="112" y="136">
        <parameter key="excel_file" value="/Users/XXX/datamining/excel/Leuphana.xlsx"/>
        <parameter key="sheet_selection" value="sheet number"/>
        <parameter key="sheet_number" value="1"/>
        <parameter key="imported_cell_range" value="A1"/>
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="first_row_as_names" value="true"/>
        <list key="annotations"/>
        <parameter key="date_format" value=""/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="German (Germany)"/>
        <parameter key="read_all_values_as_polynominal" value="false"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="Links.true.file_path.attribute"/>
        </list>
        <parameter key="read_not_matching_values_as_missings" value="false"/>
      </operator>
      <operator activated="true" class="store" compatibility="9.9.000" expanded="true" height="68" name="Store" width="90" x="246" y="136">
        <parameter key="repository_entry" value="../data/Leuphana_Links"/>
      </operator>
      <operator activated="true" class="handle_exception" compatibility="9.9.000" expanded="true" height="82" name="Handle Exception" width="90" x="380" y="136">
        <parameter key="add_details_to_log" value="false"/>
        <process expanded="true">
          <operator activated="true" class="web:retrieve_webpages" compatibility="9.7.000" expanded="true" height="68" name="Get Pages" width="90" x="179" y="34">
            <parameter key="link_attribute" value="Links"/>
            <parameter key="page_attribute" value="Inhalt"/>
            <parameter key="random_user_agent" value="true"/>
            <parameter key="connection_timeout" value="10000"/>
            <parameter key="read_timeout" value="10000"/>
            <parameter key="follow_redirects" value="true"/>
            <parameter key="accept_cookies" value="all"/>
            <parameter key="cookie_scope" value="thread"/>
            <parameter key="request_method" value="POST"/>
            <parameter key="delay" value="none"/>
            <parameter key="delay_amount" value="1000"/>
            <parameter key="min_delay_amount" value="0"/>
            <parameter key="max_delay_amount" value="1000"/>
          </operator>
          <connect from_port="in 1" to_op="Get Pages" to_port="Example Set"/>
          <connect from_op="Get Pages" from_port="Example Set" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
        <process expanded="true">
          <connect from_port="in 1" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="store" compatibility="9.9.000" expanded="true" height="68" name="Store (2)" width="90" x="514" y="136">
        <parameter key="repository_entry" value="../data/Leuphana_Result"/>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Store" to_port="input"/>
      <connect from_op="Store" from_port="through" to_op="Handle Exception" to_port="in 1"/>
      <connect from_op="Handle Exception" from_port="out 1" to_op="Store (2)" to_port="input"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process> 

Answers

  • ceaperez Member Posts: 333 Unicorn
    edited April 2021
    Hi @Naveen_Vimalan,

    Please read this thread about the Get Pages operator:

    network connection with Get Pages - operator — RapidMiner Community

    I attached a simple process that handles this; please give it a try.

    Best


  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 360 RM Data Scientist
    edited April 2021
    Hi @Naveen_Vimalan

    The process you posted is broken. Are you able to attach the Excel file or the process file (.rmp)? I have built some web scraping and web mining processes to get reviews from Indeed, Yelp, G2, etc. Attached is the one I use for storing the HTML web pages as the first step.

    HTH!
    YY

  • Naveen_Vimalan Member Posts: 3 Newbie
    Hi @yyhuang,
    I attached the Excel file and the .rmp file below. I also added a screenshot of the kind of result I want for all 300+ URLs; the screenshot shows it for only 4 of them.

    Best Regards,
    Naveen
  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 360 RM Data Scientist
    Are you able to import the process I shared, @Naveen_Vimalan?

    Generally, Get Page works better than Get Pages.
  • Naveen_Vimalan Member Posts: 3 Newbie
    OK, thanks for the answer. Is it possible to use the Read Excel operator with it, or where do I put my Excel file in the process you sent me?
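
For reference, the per-URL lookup asked about in the original question (response code, response message, content type for each link) can also be sketched outside RapidMiner. The Python below is only a minimal illustration under stated assumptions, not the process attached in this thread: the function names `fetch_info` and `collect_info` are made up for this sketch, it sends GET requests rather than the POST configured in the posted process, and loading the `Links` column from the Excel file is left to the caller.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, Optional[object]]


def fetch_info(url: str, timeout: float = 10.0) -> Row:
    """Request one URL and return its status code, reason phrase, and content type.

    Hypothetical helper for this sketch; it never raises, so one bad link
    does not stop a run over several hundred URLs.
    """
    try:
        request = urllib.request.Request(url, method="GET")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return {
                "url": url,
                "response_code": response.status,
                "response_message": response.reason,
                "content_type": response.headers.get("Content-Type"),
            }
    except urllib.error.HTTPError as err:
        # 4xx/5xx answers still carry a code, a message, and headers.
        headers = err.headers
        return {
            "url": url,
            "response_code": err.code,
            "response_message": str(err.reason),
            "content_type": headers.get("Content-Type") if headers else None,
        }
    except (urllib.error.URLError, OSError, ValueError) as err:
        # DNS failures, timeouts, malformed URLs: record the error text instead.
        return {
            "url": url,
            "response_code": None,
            "response_message": str(err),
            "content_type": None,
        }


def collect_info(urls: Iterable[str],
                 fetch: Callable[[str], Row] = fetch_info) -> List[Row]:
    """Apply `fetch` to every URL, one request at a time, keeping input order."""
    return [fetch(url) for url in urls]
```

Calling `collect_info(urls)` with the list of links returns one dictionary per URL, including the ones that failed. The list itself could come from something like `pandas.read_excel(path)["Links"]`; that loading step is an assumption and is not shown here.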