Xpath returning ?

b00122599 Member Posts: 20 Contributor II
Hey folks,

I am using XPath for the first time with RapidMiner, via the Extract Information operator, but I keep getting "?" as the output for my attribute. I have checked in Chrome that the XPath is correct, and I've tried placing variants of h: /h: //h: at the start of the XPath in the query expression field. However, no matter how I edit this field, I still get "?" as the result for the attribute.

Any pointers would be much appreciated.

Cheers,

Neil. 

Best Answers

  • b00122599 Posts: 20 Contributor II
    Solution Accepted
    Hey folks, thanks for the kind replies. I think I need to go do some more learning and come back with a better question. Thanks again!
  • b00122599 Posts: 20 Contributor II
    Solution Accepted
    Hey folks, problem solved: Excel had added formatting to my URLs when I was importing the links! All working now! Thanks again for the help!
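
For anyone landing here with the same symptom: the thread does not say exactly what formatting Excel introduced, so the following is only a sketch under assumptions. It assumes the damage was something like non-breaking spaces, stray whitespace, or wrapping quotes around the links, and shows one way to clean a column of URLs in Python before feeding them to Get Pages. The function name `clean_url` and the sample values are made up for illustration.

```python
from urllib.parse import urlsplit

# Characters ASSUMED (not confirmed by the thread) to sneak in from spreadsheet cells:
# spaces, tabs, line breaks, non-breaking space, zero-width space, and quotes.
_JUNK = " \t\r\n\u00a0\u200b\"'"

def clean_url(raw):
    """Return a trimmed URL, or None if the value does not look like an http(s) URL."""
    url = raw.strip(_JUNK)
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return url

# Hypothetical cell values, as they might arrive from a spreadsheet export.
samples = [
    "\u00a0https://www.ntfa.net/universe/english/index.php?act=view&char=Afterburner ",
    '"https://academy.rapidminer.com/"',
    "not a link",
]
cleaned = [clean_url(s) for s in samples]
for raw, url in zip(samples, cleaned):
    print(repr(raw), "->", url)
```

Values that come back as `None` are worth inspecting by hand, since a link that looks fine in a cell can still carry invisible characters that make the page fetch fail and leave the XPath with nothing to match.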

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,581 Community Manager
    Hi @b00122599, hmm, I think we really need the XML that you're trying to parse, and your RapidMiner process XML.
  • MarcoBarradas RapidMiner Certified Analyst, Member Posts: 78 Unicorn
    Hi @b00122599, take a look at this thread:
    https://community.rapidminer.com/discussion/14888/xpath-commands-working-in-google-docs-but-not-in-rapidminer
    The next recommended step would be to take the free courses at the RapidMiner Academy:
    https://academy.rapidminer.com/

    If you have more questions, feel free to ask us. We do have the answer you are searching for, but I'm trying to show you the next steps for answering all the questions that will come after you figure out how to get past the "?" that is currently bothering you.

    Best regards! 


  • b00122599 Member Posts: 20 Contributor II
    Hey folks,

    Sorry for reopening, but I'm still stuck. I am getting the correct results with XPath in Google Sheets using "//*[@id="centerFrameWhite"]/p[1]/b" on the website https://www.ntfa.net/universe/english/index.php?act=view&char=Afterburner .

    However, I have tried this several different ways in RapidMiner with no success. Any help is much appreciated. I tried to follow the other link above but couldn't get it working.

    XML is below:


    <?xml version="1.0" encoding="UTF-8"?><process version="9.5.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.5.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="9.5.000" expanded="true" height="68" name="Read Excel" width="90" x="45" y="136">
            <parameter key="excel_file" value="D:\OneDrive\College\profilessmall.xlsx"/>
            <parameter key="sheet_selection" value="sheet number"/>
            <parameter key="sheet_number" value="1"/>
            <parameter key="imported_cell_range" value="A1"/>
            <parameter key="encoding" value="SYSTEM"/>
            <parameter key="first_row_as_names" value="true"/>
            <list key="annotations"/>
            <parameter key="date_format" value=""/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="locale" value="English (United States)"/>
            <parameter key="read_all_values_as_polynominal" value="false"/>
            <list key="data_set_meta_data_information"/>
            <parameter key="read_not_matching_values_as_missings" value="true"/>
            <parameter key="datamanagement" value="double_array"/>
            <parameter key="data_management" value="auto"/>
          </operator>
          <operator activated="true" class="web:retrieve_webpages" compatibility="9.0.000" expanded="true" height="68" name="Get Pages" width="90" x="246" y="136">
            <parameter key="link_attribute" value="LINKS"/>
            <parameter key="random_user_agent" value="false"/>
            <parameter key="user_agent" value="googlebot"/>
            <parameter key="connection_timeout" value="10000"/>
            <parameter key="read_timeout" value="10000"/>
            <parameter key="follow_redirects" value="true"/>
            <parameter key="accept_cookies" value="none"/>
            <parameter key="cookie_scope" value="global"/>
            <parameter key="request_method" value="GET"/>
            <parameter key="delay" value="none"/>
            <parameter key="delay_amount" value="1000"/>
            <parameter key="min_delay_amount" value="0"/>
            <parameter key="max_delay_amount" value="1000"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="8.2.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="136">
            <parameter key="create_word_vector" value="true"/>
            <parameter key="vector_creation" value="TF-IDF"/>
            <parameter key="add_meta_information" value="true"/>
            <parameter key="keep_text" value="false"/>
            <parameter key="prune_method" value="none"/>
            <parameter key="prune_below_percent" value="3.0"/>
            <parameter key="prune_above_percent" value="30.0"/>
            <parameter key="prune_below_rank" value="0.05"/>
            <parameter key="prune_above_rank" value="0.95"/>
            <parameter key="datamanagement" value="double_sparse_array"/>
            <parameter key="data_management" value="auto"/>
            <parameter key="select_attributes_and_weights" value="false"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:extract_information" compatibility="8.2.000" expanded="true" height="68" name="Extract Information" width="90" x="246" y="34">
                <parameter key="query_type" value="XPath"/>
                <list key="string_machting_queries"/>
                <parameter key="attribute_type" value="Nominal"/>
                <list key="regular_expression_queries"/>
                <list key="regular_region_queries"/>
                <list key="xpath_queries">
                  <parameter key="Robotname" value="h://*[@id=&quot;centerFrameWhite&quot;]/h:p[1]/h:b"/>
                </list>
                <list key="namespaces"/>
                <parameter key="ignore_CDATA" value="true"/>
                <parameter key="assume_html" value="true"/>
                <list key="index_queries"/>
                <list key="jsonpath_queries"/>
              </operator>
              <operator activated="true" class="web:extract_html_text_content" compatibility="9.0.000" expanded="true" height="68" name="Extract Content" width="90" x="447" y="34">
                <parameter key="extract_content" value="true"/>
                <parameter key="minimum_text_block_length" value="500"/>
                <parameter key="override_content_type_information" value="true"/>
                <parameter key="neglegt_span_tags" value="true"/>
                <parameter key="neglect_p_tags" value="true"/>
                <parameter key="neglect_b_tags" value="true"/>
                <parameter key="neglect_i_tags" value="true"/>
                <parameter key="neglect_br_tags" value="true"/>
                <parameter key="ignore_non_html_tags" value="true"/>
              </operator>
              <connect from_port="document" to_op="Extract Information" to_port="document"/>
              <connect from_op="Extract Information" from_port="document" to_op="Extract Content" to_port="document"/>
              <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Get Pages" to_port="Example Set"/>
          <connect from_op="Get Pages" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
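
For readers comparing the Google Sheets expression with the one in the process above, here is a rough sketch of how a namespace prefix behaves in XPath. It uses Python's standard `xml.etree.ElementTree` rather than RapidMiner, and it assumes that with `assume_html` enabled the page is treated as XHTML with `h` bound to the XHTML namespace; that is an assumption about the operator, not something confirmed in this thread. The markup is a made-up stand-in for the real page. In XPath a prefix belongs on an element name (`h:p`, `h:b`), so a prefix placed in front of `//`, as in `h://*`, is not a valid step.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML snippet standing in for the fetched page (NOT the real site markup).
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="centerFrameWhite">
      <p><b>Afterburner</b></p>
      <p><b>Second paragraph</b></p>
    </div>
  </body>
</html>"""

# Assumed binding: prefix "h" maps to the XHTML namespace.
NS = {"h": "http://www.w3.org/1999/xhtml"}
root = ET.fromstring(XHTML)

# Prefix on each named element step; the wildcard step and the attribute test need none.
with_prefix = root.find(".//*[@id='centerFrameWhite']/h:p[1]/h:b", NS)

# Same path without prefixes: the elements live in the XHTML namespace, so nothing matches.
without_prefix = root.find(".//*[@id='centerFrameWhite']/p[1]/b")

# Print "?" for a missing match, mirroring the missing value seen in RapidMiner.
print(with_prefix.text if with_prefix is not None else "?")
print(without_prefix.text if without_prefix is not None else "?")
```

If the same distinction applies inside RapidMiner, the unprefixed query that works in Google Sheets would need the prefix on `p` and `b` but not before the leading `//*`.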
