Reading values using XPATH and extracting from metadata to an attribute

EMDuell Member Posts: 21 Contributor I
edited November 2018 in Help
Hello,

This seems like it should be possible, but I've hit a few bumps in the road and am hoping someone can offer a few suggestions. I am trying to mine some data from a page that I access through Google, which requires logging into a Google account first. Here are the steps:

1) Access Google's login page, allowing Google to set a cookie for the session
2) Read hidden variables on the authentication form (the GALX token is what I'm interested in here)
3) Post values back to the form that include the tokens you picked up along with your username and password
4) Voila - you are authenticated
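
For illustration, steps 2 and 3 can be sketched outside RapidMiner in plain Python. This is a minimal sketch, not a working Google login: the sample form, the action URL, and the `Email`/`Passwd` field names are assumptions made up for the example, and only the `GALX` token name comes from the steps above. It shows how to collect the hidden inputs from a login form and fold them into a URL-encoded POST body.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Attribute values can be None when written without "=value".
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.hidden[attr["name"]] = attr.get("value") or ""


def extract_hidden_inputs(html):
    """Step 2: read the hidden variables from the authentication form."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.hidden


def build_login_body(hidden, email, password):
    """Step 3: combine the tokens with the credentials for the POST back."""
    fields = dict(hidden)
    fields["Email"] = email      # assumed field name, for illustration only
    fields["Passwd"] = password  # assumed field name, for illustration only
    return urlencode(fields)


# Hypothetical stand-in for the page fetched in step 1.
SAMPLE_FORM = """
<form action="https://accounts.example/ServiceLoginAuth" method="post">
  <input type="hidden" name="GALX" value="abc123">
  <input type="hidden" name="continue" value="https://www.google.com/">
  <input type="text" name="Email">
  <input type="password" name="Passwd">
</form>
"""

hidden = extract_hidden_inputs(SAMPLE_FORM)
body = build_login_body(hidden, "user@example.com", "secret")
print(hidden)
print(body)
```

In a real session the same cookie store would have to be reused between the GET in step 1 and the POST in step 3, which this sketch leaves out.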

My process to parse the initial query result doesn't seem to be working: RapidMiner does not pick up the GALX attribute. So that's the first place I'm stuck. The second is that, once I have the token in my metadata, how do I get it out to use in the POST back?

Thanks in advance for your help. The process XML is below.

-Eric

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:get_webpage" compatibility="5.3.001" expanded="true" height="60" name="Get Page" width="90" x="45" y="75">
        <parameter key="url" value="https://accounts.google.com/ServiceLogin?hl=en&amp;continue=https://www.google.com/"/>
        <parameter key="user_agent" value="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1"/>
        <parameter key="accept_cookies" value="all"/>
        <list key="query_parameters"/>
        <list key="request_properties"/>
      </operator>
      <operator activated="true" class="text:extract_information" compatibility="5.3.002" expanded="true" height="60" name="Extract Information" width="90" x="179" y="75">
        <parameter key="query_type" value="XPath"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries"/>
        <list key="xpath_queries">
          <parameter key="GALX" value="//input[@name='GALX']/@value"/>
        </list>
        <list key="namespaces"/>
        <list key="index_queries"/>
      </operator>
      <connect from_op="Get Page" from_port="output" to_op="Extract Information" to_port="document"/>
      <connect from_op="Extract Information" from_port="document" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>


Answers

  • EMDuell Member Posts: 21 Contributor I
    Hi all,

    Has anyone else worked through the details of authenticating with Google using the operators in RapidMiner?

    -Eric
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 564 Unicorn
    Whilst I'm sure there are better ways, this process gets the GALX token.
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.015">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="web:get_webpage" compatibility="5.3.001" expanded="true" height="60" name="Get Page" width="90" x="45" y="30">
            <parameter key="url" value="https://accounts.google.com/ServiceLogin?hl=en&amp;continue=https://www.google.com/"/>
            <parameter key="user_agent" value="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1"/>
            <parameter key="accept_cookies" value="all"/>
            <list key="query_parameters"/>
            <list key="request_properties"/>
          </operator>
          <operator activated="true" class="text:html_to_xml" compatibility="5.3.002" expanded="true" height="60" name="Html To Xml" width="90" x="112" y="120"/>
          <operator activated="true" class="text:write_document" compatibility="5.3.002" expanded="true" height="76" name="Write Document" width="90" x="246" y="120">
            <parameter key="file" value="C:\test.xml"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <operator activated="true" class="read_xml" compatibility="5.3.015" expanded="true" height="60" name="Read XML" width="90" x="179" y="30">
            <parameter key="xpath_for_examples" value="//html:html/html:body/html:div/html:div/html:div/html:form/html:input"/>
            <enumeration key="xpaths_for_attributes">
              <parameter key="xpath_for_attribute" value="attribute::name"/>
              <parameter key="xpath_for_attribute" value="attribute::value"/>
            </enumeration>
            <list key="namespaces">
              <parameter key="html" value="http://www.w3.org/1999/xhtml"/>
            </list>
            <parameter key="use_default_namespace" value="false"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="Token.true.polynominal.attribute"/>
              <parameter key="1" value="Value.true.polynominal.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="5.3.015" expanded="true" height="76" name="Filter Examples" width="90" x="313" y="30">
            <parameter key="condition_class" value="attribute_value_filter"/>
            <parameter key="parameter_string" value="Token = GALX"/>
          </operator>
          <connect from_op="Get Page" from_port="output" to_op="Html To Xml" to_port="document"/>
          <connect from_op="Html To Xml" from_port="document" to_op="Write Document" to_port="document"/>
          <connect from_op="Write Document" from_port="file" to_op="Read XML" to_port="file"/>
          <connect from_op="Read XML" from_port="output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
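
The process above registers the prefix `html` for `http://www.w3.org/1999/xhtml` and uses it in every XPath step. One plausible reason an unprefixed query such as `//input[@name='GALX']/@value` returns nothing is that, once the page has been converted to XHTML, the elements live in that namespace. The following is a small Python sketch of that effect using the standard library's `xml.etree.ElementTree`; the sample document is made up, and this only illustrates the namespace behaviour, not RapidMiner's internals.

```python
import xml.etree.ElementTree as ET

# Same prefix-to-URI mapping as the "namespaces" list in the process above.
XHTML_NS = {"html": "http://www.w3.org/1999/xhtml"}

# Hypothetical XHTML, standing in for the output of Html To Xml.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <form>
      <input type="hidden" name="GALX" value="abc123"/>
      <input type="hidden" name="continue" value="https://www.google.com/"/>
    </form>
  </body>
</html>"""

root = ET.fromstring(SAMPLE_XHTML)

# Unprefixed step: looks for <input> in no namespace, so nothing matches.
unqualified = root.findall(".//input[@name='GALX']")

# Prefixed step: matches the namespaced <input> element.
qualified = root.findall(".//html:input[@name='GALX']", XHTML_NS)

# ElementTree's XPath subset cannot select attributes directly,
# so the value is read from the matched element instead.
galx_value = qualified[0].get("value") if qualified else None

print(len(unqualified), len(qualified), galx_value)
```

If the same holds inside RapidMiner, adding a namespace prefix to the original Extract Information query may be enough, but that is an assumption to verify against the operator's documentation.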