Downloading a webpage every 5 minutes?

MikeR Member Posts: 2 Contributor I
Hi everybody,

I'm new to this forum, so I hope I have posted this in the right place.

I am doing my bachelor thesis about an online forum, and therefore want to monitor the activity on the forum.
At the bottom of the front page, www.lydmaskinen.dk, the number of people online at that moment is shown.
- Does any of you know a way I can download this information every 5 minutes over a given time period?

I thought about downloading the whole source code of the web page every 5 minutes, and afterwards just manually logging the data in an Excel spreadsheet.
There might of course be a much cleverer way to do this, but I consider that a luxury problem at the moment.

But does anyone know a simple way of doing this?

Thanks,
- Mike(DK)

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,529 RM Data Scientist
    Hello Mike,

    You can use the Web Mining and Text Mining extensions to get the information. It works quite well with a small regular expression.

    Attached is a process that extracts the number of registered users. It's straightforward to get the number of guests in the same way.

    You can run this process automatically on a RapidMiner Server. Then you can store the information directly in a repository and work with it. By the way, there is an academic program that would allow you to get a RapidMiner Server for your thesis. If you need more information, just write an email to me: mschmitz@rapidminer.com

    Best,

    Martin

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.2.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="6.2.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="web:get_webpage" compatibility="5.3.002" expanded="true" height="60" name="Get Page" width="90" x="112" y="30">
            <parameter key="url" value="http://www.lydmaskinen.dk/index.php"/>
            <list key="query_parameters"/>
            <list key="request_properties"/>
          </operator>
          <operator activated="true" class="web:extract_html_text_content" compatibility="5.3.002" expanded="true" height="60" name="Extract Content" width="90" x="246" y="30"/>
          <operator activated="true" class="text:documents_to_data" compatibility="6.1.000" expanded="true" height="76" name="Documents to Data" width="90" x="380" y="30">
            <parameter key="text_attribute" value="Data"/>
          </operator>
          <operator activated="true" class="text:generate_extract" compatibility="6.1.000" expanded="true" height="60" name="Generate Extract" width="90" x="514" y="30">
            <parameter key="source_attribute" value="Data"/>
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <list key="regular_expression_queries">
              <parameter key="registered users" value=" .*users online.*?([0-9]+)\sregistered.*guests.* "/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_op="Get Page" from_port="output" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_op="Documents to Data" to_port="documents 1"/>
          <connect from_op="Documents to Data" from_port="example set" to_op="Generate Extract" to_port="Example Set"/>
          <connect from_op="Generate Extract" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
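
    For readers who want to try the extraction step outside RapidMiner, here is a minimal Python sketch of the same regular-expression idea. The sample footer text, the `extract_count` helper, and the guest pattern are assumptions for illustration; the real wording on the forum page may differ, so the patterns may need adjusting. A lazy `.*?` together with `[0-9]+` is used so that multi-digit counts are captured whole.

    ```python
    import re

    # Hypothetical footer text; the real wording on the forum page may differ.
    SAMPLE = "In total there are 25 users online :: 13 registered, 0 hidden and 12 guests"

    # Lazy ".*?" lets "([0-9]+)" capture the whole number, not just its last digit.
    REGISTERED = re.compile(r"users online.*?([0-9]+)\sregistered.*guests")
    # Assumed analogous pattern for the guest count.
    GUESTS = re.compile(r"users online.*?([0-9]+)\sguests")


    def extract_count(pattern, text):
        """Return the first captured group as an int, or None if there is no match."""
        match = pattern.search(text)
        return int(match.group(1)) if match else None


    registered = extract_count(REGISTERED, SAMPLE)
    guests = extract_count(GUESTS, SAMPLE)
    print(registered, guests)
    ```

    Testing the pattern against a saved copy of the page first is usually the quickest way to find out whether the footer text matches these assumptions.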
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
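
If a RapidMiner Server is not available, the "every 5 minutes" part of the question could also be approximated with a small standalone script. The following is only a sketch using the Python standard library: the `poll` and `fetch_page` helpers, the output file name, and the pattern are assumptions, and the pattern is applied to the raw HTML, so it will need adjusting if tags sit between the number and the word "registered".

```python
import csv
import re
import time
import urllib.request
from datetime import datetime

URL = "http://www.lydmaskinen.dk/index.php"  # URL taken from the process above
# Assumed footer wording; adjust the pattern to the page's actual text.
PATTERN = re.compile(r"users online.*?([0-9]+)\sregistered", re.DOTALL)


def fetch_page(url):
    """Download the page and return its content as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def poll(out, samples, interval_s=300, fetch=fetch_page, sleep=time.sleep):
    """Write one (timestamp, registered users) CSV row per sample to the file object out."""
    writer = csv.writer(out)
    for i in range(samples):
        match = PATTERN.search(fetch(URL))
        # An empty cell marks a sample where the pattern did not match.
        count = match.group(1) if match else ""
        writer.writerow([datetime.now().isoformat(timespec="seconds"), count])
        out.flush()
        if i < samples - 1:
            sleep(interval_s)  # wait before taking the next sample


# Example usage: 288 samples at 5-minute intervals cover 24 hours.
# with open("online_counts.csv", "a", newline="") as f:
#     poll(f, samples=288)
```

The resulting CSV file opens directly in Excel, which avoids the manual logging step. Network errors are not handled in this sketch, so a failed download would stop the run.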