"Webmining - Divide a Token into separate Values?"

T-Unit, Member, Posts: 12, Contributor II
edited June 2019 in Help
Hello,

I'm trying to build a small web mining process. I want to extract numeric values (the bid price and the ask price of a financial asset) from a website. My process looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:get_webpage" compatibility="5.3.001" expanded="true" height="60" name="Get Page" width="90" x="45" y="30">
        <parameter key="url" value="http://zertifikat.finanzen.net/optionsscheine/Auf-uu6dzh/UU6DZH"/>
        <parameter key="random_user_agent" value="true"/>
        <parameter key="follow_redirects" value="false"/>
        <list key="query_parameters"/>
        <list key="request_properties"/>
      </operator>
      <operator activated="true" class="web:extract_html_text_content" compatibility="5.3.001" expanded="true" height="60" name="Extract Content (2)" width="90" x="179" y="30">
        <parameter key="minimum_text_block_length" value="1"/>
      </operator>
      <operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="313" y="30">
        <parameter key="string" value="stk"/>
      </operator>
      <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="447" y="30">
        <parameter key="mode" value="specify characters"/>
        <parameter key="characters" value=" "/>
      </operator>
      <operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (2)" width="90" x="581" y="30">
        <parameter key="string" value="stk"/>
        <parameter key="invert condition" value="true"/>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="5.3.002" expanded="true" height="76" name="Documents to Data" width="90" x="715" y="30">
        <parameter key="text_attribute" value="Kurse"/>
      </operator>
      <connect from_op="Get Page" from_port="output" to_op="Extract Content (2)" to_port="document"/>
      <connect from_op="Extract Content (2)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
      <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Tokenize" to_port="document"/>
      <connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
      <connect from_op="Filter Tokens (2)" from_port="document" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
My problem is that the bid price and the ask price are written to the same attribute (in this case "Kurse"), but I need them as separate attributes (e.g. "bid" and "ask"). I thought tokenizing would help, but I don't know whether it's possible to address an individual token (e.g. after tokenizing, save token 1 as "bid" and token 2 as "ask"). I know there is an operator called "Extract Token Number", so that was my first idea for solving this.
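
To make the idea concrete, here is a minimal Python sketch of the behaviour I'm after. It is not RapidMiner code. The function name `split_bid_ask`, the sample string, and the assumption that the two prices are the first two whitespace-separated tokens left after dropping anything containing "stk" are all made up for illustration.

```python
def split_bid_ask(text, drop_substring="stk"):
    """Split text on whitespace and map the first two kept tokens to bid/ask.

    Assumption: tokens containing `drop_substring` (case-insensitive) are
    noise, and the first two remaining tokens are bid and ask, in that order.
    """
    tokens = [t for t in text.split() if drop_substring not in t.lower()]
    if len(tokens) < 2:
        raise ValueError("expected at least two price tokens, got %r" % tokens)
    # Values are kept as strings; number parsing is deliberately left out.
    return {"bid": tokens[0], "ask": tokens[1]}


# Hypothetical sample line, not taken from the real page.
sample = "0,41 0,43 Stk"
result = split_bid_ask(sample)
print(result)
```

In other words, each token would be addressed by its position and written to its own attribute, instead of everything ending up in a single text attribute.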

Maybe someone has a better idea or a smarter way to solve this problem?

Regards,
Thomas