Downloading a webpage every 5 minutes?

MikeR Member Posts: 2 Contributor I
Hi everybody,

I'm new to this forum, so I hope I have posted this in the right place.

I am writing my bachelor thesis about an online forum and want to monitor the activity on it.
At the bottom of the front page, www.lydmaskinen.dk, there is a count of the people online at that moment.
Does anyone know a way I can download this information every 5 minutes over a given time period?

I thought about downloading the whole source code of the webpage every 5 minutes and afterwards logging the data manually in an Excel spreadsheet.
There might of course be a much cleverer way to do this, but I consider that a luxury problem at the moment.

But does anyone know a simple way of doing this?

Thanks,
- Mike(DK)

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Hello Mike,

    You can use the Web Mining and Text Mining extensions to get the information. It works quite well with a small regular expression.

    Attached is a process that extracts the number of registered users. It's straightforward to get the number of guests in the same way.

    You can run this process automatically on a RapidMiner Server. Then you can store the information directly in a repository and work with it. By the way, there is an academic program that would allow you to get a RapidMiner Server for your thesis. If you need more information, just write an email to me: mschmitz@rapidminer.com
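
For readers without access to a RapidMiner Server, the five-minute schedule from the question can also be sketched in plain Python. This is a different technique from the server scheduling described above, not a reproduction of it. The names `fetch_page` and `poll`, the CSV layout, and the UTF-8 decoding are all assumptions made for illustration; only the URL comes from the attached process.

```python
import csv
import time
import urllib.request
from datetime import datetime, timezone

URL = "http://www.lydmaskinen.dk/index.php"  # URL taken from the attached process
INTERVAL_SECONDS = 5 * 60                     # five minutes, as asked in the question


def fetch_page(url=URL):
    """Download the page and return its text (UTF-8 decoding is an assumption)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def poll(fetch, extract, csv_path, samples, interval=INTERVAL_SECONDS, sleep=time.sleep):
    """Call fetch() `samples` times, `interval` seconds apart.

    Each call appends one row (UTC timestamp, extracted value) to csv_path.
    """
    with open(csv_path, "a", newline="") as handle:
        writer = csv.writer(handle)
        for i in range(samples):
            timestamp = datetime.now(timezone.utc).isoformat()
            try:
                value = extract(fetch())
            except OSError as error:  # network failure: record it and keep going
                value = f"error: {error}"
            writer.writerow([timestamp, value])
            handle.flush()  # keep the file current if the run is interrupted
            if i < samples - 1:
                sleep(interval)
```

Calling `poll(fetch_page, len, "online_log.csv", samples=12)` would log the page length twelve times over roughly an hour; `len` is only a placeholder and would be swapped for a function that pulls the user count out of the page text.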

    Best,

    Martin

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.2.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="web:get_webpage" compatibility="5.3.002" expanded="true" height="60" name="Get Page" width="90" x="112" y="30">
            <parameter key="url" value="http://www.lydmaskinen.dk/index.php"/>
            <list key="query_parameters"/>
            <list key="request_properties"/>
          </operator>
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.3.002" expanded="true" height="60" name="Extract Content" width="90" x="246" y="30"/>
          <operator activated="true" class="text:documents_to_data" compatibility="6.1.000" expanded="true" height="76" name="Documents to Data" width="90" x="380" y="30">
            <parameter key="text_attribute" value="Data"/>
          </operator>
          <operator activated="true" class="text:generate_extract" compatibility="6.1.000" expanded="true" height="60" name="Generate Extract" width="90" x="514" y="30">
            <parameter key="source_attribute" value="Data"/>
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries">
              <parameter key="registered users" value=" .*users online.*?([0-9]+)\sregistered.*guests.* "/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_op="Get Page" from_port="output" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_op="Generate Extract" to_port="Example Set"/>
          <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
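
The extraction step of the process can be checked outside RapidMiner with a small Python sketch. It assumes the page text contains a sentence of the form "... users online :: N registered, ... and M guests"; that layout, the sample string, and the name `extract_counts` are assumptions for illustration, not taken from the real page. The pattern uses a lazy `.*?` and `[0-9]+` so that counts with more than one digit are captured whole.

```python
import re

# Python counterpart of the regular expression idea used in the process.
# The sentence layout it expects is an assumption about the page text.
ONLINE_PATTERN = re.compile(
    r"users online.*?([0-9]+)\s+registered.*?([0-9]+)\s+guests",
    re.DOTALL,  # let "." cross line breaks in the extracted text
)


def extract_counts(page_text):
    """Return (registered, guests) as integers, or None if the pattern is absent."""
    match = ONLINE_PATTERN.search(page_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical sample text, not copied from the real page.
sample = "In total there are 45 users online :: 13 registered, 2 hidden and 30 guests"
print(extract_counts(sample))  # prints (13, 30)
```

Returning `None` instead of raising makes it easy to log a missing value when the page layout changes during a long monitoring run.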
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany