"Crawling numeric data from webpage"

markus_lauterbamarkus_lauterba RapidMiner Certified Analyst, Member Posts: 2 Contributor I
edited May 2019 in Help
Spoiler
webpage crawling numerics tokenize 

 

Hey, 

I have a problem with a webpage crawling issue where I would like to read prices from a table like this:

https://www.epexspot.com/de/marktdaten/intradaycontinuous/intraday-table/2018-10-24/DE 

 

In order to get the content of the webpage I built the following process. Please have a look here:

 

<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<operator activated="true" class="web:get_webpage" compatibility="7.3.000" expanded="true" height="68" name="get page" width="90" x="45" y="34">
<parameter key="url" value="https://www.epexspot.com/de/marktdaten/intradaycontinuous/intraday-table/2018-10-24/DE"/>
<parameter key="random_user_agent" value="true"/>
<parameter key="connection_timeout" value="50000"/>
<parameter key="read_timeout" value="50000"/>
<parameter key="follow_redirects" value="true"/>
<parameter key="accept_cookies" value="all"/>
<parameter key="cookie_scope" value="global"/>
<parameter key="request_method" value="GET"/>
<list key="query_parameters"/>
<list key="request_properties"/>
<parameter key="override_encoding" value="false"/>
<parameter key="encoding" value="SYSTEM"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<operator activated="true" class="text:documents_to_data" compatibility="8.1.000" expanded="true" height="82" name="Documents to Data (2)" width="90" x="179" y="34">
<parameter key="text_attribute" value="Inhalt"/>
<parameter key="add_meta_information" value="true"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="data_management" value="auto"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<operator activated="true" class="replace" compatibility="9.0.003" expanded="true" height="82" name="Replace" width="90" x="380" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Inhalt"/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="replace_what" value="[a-zA-Z €äöü()ÄÖÜß/?&amp;@=&amp;quot;'â&#137;¥]"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<operator activated="true" class="replace" compatibility="9.0.003" expanded="true" height="82" name="Replace (3)" width="90" x="581" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Inhalt"/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="nominal"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="file_path"/>
<parameter key="block_type" value="single_value"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="single_value"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<parameter key="replace_what" value="132012,.@.2018.."/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="9.0.003">
<operator activated="true" class="store" compatibility="9.0.003" expanded="true" height="68" name="Store" width="90" x="849" y="34">
<parameter key="repository_entry" value="../Data/Extract"/>
</operator>
</process>

Now I have the content of the webpage in an attribute called "Inhalt" but it's all in one cell :( and I dont know how to get it splitt into the format I wish to achieve.

After the header in the content attribute "Inhalt" it looks roughly like this:

'07-08
<>

<>25,01<>
<>50,00<>
<>36,00<>
<>44,20<>
<>44,20<>
<>44,72<>
<>43,39<>
<>4.058,0<>
<>3.553,8<>'

 

So I would like to transform this text into a column time(07-08), price1(25,01), price2(50,00), price3(36,00) ... 

As it is no text all the text modules offered by rapidminer doesnt seem to help me to cut the content into accessible pieces as I wish. 

So could you tell which operator is best to find numbers in such a html-like-code? I think tokenize is getting me into the right direction but I dont know how to get the data into several columns then... 

 

I am happy for any hint and help

 

Greets,

Markus

 

 

Tagged:

Best Answer

  • JEdwardJEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 578 Unicorn
    Solution Accepted
    If you are having trouble with that operator (sometimes it doesn't get it exactly right) and you also want to skip using code to write a webcrawling script. You can use a tool like Octoparse to download the data.

    (Please note, the free version is limited to 10k records per extract so if needing more than that in your table consider an alternative.)
    See here for alternatives: https://alternativeto.net/software/octoparse/

Answers

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Get the free Web Data Extraction extension, and then check out the Read HTML Table operator.  I think it does what you are trying to do manually!  

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • markus_lauterbamarkus_lauterba RapidMiner Certified Analyst, Member Posts: 2 Contributor I

    Hey, thanks alot for your answers!

     

    Acutally I left RapidMiner for a solution and used BeautifulSoup Extension in Python as it's free  ;-) 

     

    But thanks a lot for the hint of the Web Table Extraction. It doesnt work in my case but looks like a great tools for "simple" tables. I am sure I will use it soon fo another use case. 

Sign In or Register to comment.