Accessing Wikipedia API using RapidMiner Web Mining Extension

sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
edited February 2020 in Knowledge Base

This is a quick article about how to use the Enrich Data by Webservice operator (found in the Web Mining extension) to pull information about Wikipedia from the Wikimedia REST API. The API exposes many kinds of information, such as page views, formula grabs, unique device counts, etc. Full documentation can be found here: https://wikimedia.org/api/rest_v1

This particular API is VERY easy to use: there is no authentication, and the only limitation is a cap of 200 queries per day. Simply enter the URL, insert the relevant attributes or macros, and set up the JSON paths to organize the output. Boom.

This is an example of a short process that checks the page view count of the RapidMiner Wikipedia page (of course) for the day before the process is executed.

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="generate_data_user_specification" compatibility="7.6.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="85">
<list key="attribute_values">
<parameter key="startdate" value="date_add(date_now(),-1,DATE_UNIT_DAY)"/>
<parameter key="enddate" value="startdate"/>
</list>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="date_to_nominal" compatibility="7.6.001" expanded="true" height="82" name="Date to Nominal" width="90" x="179" y="85">
<parameter key="attribute_name" value="startdate"/>
<parameter key="date_format" value="yyyyMMdd"/>
</operator>
<operator activated="true" class="date_to_nominal" compatibility="7.6.001" expanded="true" height="82" name="Date to Nominal (2)" width="90" x="313" y="85">
<parameter key="attribute_name" value="enddate"/>
<parameter key="date_format" value="yyyyMMdd"/>
</operator>
<operator activated="true" class="web:enrich_data_by_webservice" compatibility="7.3.000" expanded="true" height="68" name="Enrich Data by Webservice" width="90" x="447" y="85">
<parameter key="query_type" value="JsonPath"/>
<list key="string_machting_queries"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries"/>
<list key="namespaces"/>
<list key="index_queries"/>
<list key="jsonpath_queries">
<parameter key="project" value="$..project"/>
<parameter key="article" value="$..article"/>
<parameter key="granularity" value="$..granularity"/>
<parameter key="timestamp" value="$..timestamp"/>
<parameter key="access" value="$..access"/>
<parameter key="agent" value="$..agent"/>
<parameter key="views" value="$..views"/>
</list>
<parameter key="url" value="https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/all-agents/RapidMiner/daily/&lt;%startdate%&gt;/&lt;%enddate%&gt;"/>
<list key="request_properties"/>
</operator>
<connect from_op="Generate Data by User Specification" from_port="output" to_op="Date to Nominal" to_port="example set input"/>
<connect from_op="Date to Nominal" from_port="example set output" to_op="Date to Nominal (2)" to_port="example set input"/>
<connect from_op="Date to Nominal (2)" from_port="example set output" to_op="Enrich Data by Webservice" to_port="Example Set"/>
<connect from_op="Enrich Data by Webservice" from_port="ExampleSet" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
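
For readers who want to see the same steps outside RapidMiner, here is a minimal Python sketch of what the process above does: format yesterday's date as yyyyMMdd, drop it into the page views URL, and pull out the same seven fields that the JSON paths select. The function names are made up for illustration, and SAMPLE_BODY is a hand-written placeholder that only mimics the shape of a response (the values are not real data). No request is actually sent here.

```python
import json
from datetime import date, timedelta

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

# The seven fields selected by the JSON paths in the process above.
FIELDS = ("project", "article", "granularity", "timestamp",
          "access", "agent", "views")


def pageviews_url(article, day, project="en.wikipedia.org"):
    """Build the daily page views URL for a single day (hypothetical helper)."""
    stamp = day.strftime("%Y%m%d")  # same layout as the yyyyMMdd date format
    return f"{BASE}/{project}/all-access/all-agents/{article}/daily/{stamp}/{stamp}"


def extract_items(body):
    """Return one dict per item, keeping only the fields listed in FIELDS."""
    return [{k: item.get(k) for k in FIELDS} for item in json.loads(body)["items"]]


# Yesterday relative to when the script runs, as in the process.
yesterday = date.today() - timedelta(days=1)
url = pageviews_url("RapidMiner", yesterday)

# Placeholder payload with invented values, only to show the response shape.
SAMPLE_BODY = json.dumps({"items": [{
    "project": "en.wikipedia", "article": "RapidMiner",
    "granularity": "daily", "timestamp": "2017092000",
    "access": "all-access", "agent": "all-agents", "views": 0,
}]})
rows = extract_items(SAMPLE_BODY)
```

To run it against the live API, the body would come from an HTTP GET on url (for example with urllib.request.urlopen) instead of SAMPLE_BODY.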

Enjoy!

Scott

Comments

  • sharmar6 Member Posts: 19 Maven

    Thanks Scott..!!! 

  • sharmar6 Member Posts: 19 Maven

    Hi Scott,

    I am using this API but couldn't figure out how to pass a token and then access a different endpoint.

    Could you please advise where to put the token and the query parameter?

    Thanks.

    [Screenshot: Capture.JPG]

  • sgenzer Community Manager

    The -H flag in the curl request means that the value must be sent in a request header. With the Enrich Data by Webservice operator, that is done in the advanced parameter "request properties":

    [Screenshot: Screen Shot 2017-09-21 at 7.03.56 PM.png]

    For that API, it seems they will let you include the token right in the URL instead of in a header. So in the operator, you can just enter the URL the way their documentation shows it:

    [Screenshot: Screen Shot 2017-09-21 at 7.05.49 PM.png]

    Same difference.

     

    Scott
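
As a rough illustration of the two options in the reply above (token in a request header versus token in the URL), here is a small Python sketch using only the standard library. The API in question is not named in this thread, so everything specific here is an assumption: the api.example.com URL, the MY_TOKEN value, the "Authorization: Bearer" header form, and the "token" query parameter name are all placeholders. The requests are only built, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def request_with_header(url, token):
    """Option 1: send the token in a header (what curl's -H flag does)."""
    # Header name and "Bearer" scheme are assumptions; use what the API documents.
    return Request(url, headers={"Authorization": f"Bearer {token}"})


def request_with_query_token(url, token, param="token"):
    """Option 2: append the token to the URL as a query parameter."""
    # The parameter name "token" is an assumption; check the API documentation.
    sep = "&" if "?" in url else "?"
    return Request(f"{url}{sep}{urlencode({param: token})}")


# Placeholder endpoint and token, for illustration only.
header_req = request_with_header("https://api.example.com/posts", "MY_TOKEN")
query_req = request_with_query_token("https://api.example.com/posts", "MY_TOKEN")
```

In the operator, option 1 corresponds to adding the header name and value as an entry under "request properties", and option 2 corresponds to typing the full URL, token included, into the url parameter.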

  • sharmar6 Member Posts: 19 Maven

    Thanks.

    I also need to fetch a JSON response that is on another page, after authentication.

    How do I make RM use GET /other_data after authentication?

  • sgenzer Community Manager

    That's what this operator does: it sends a GET request, which generally returns a JSON document as the response. If you set the query type to Regular Expression and the query expression to .*, you will see the whole response.


    Scott
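
The "look at the whole response first, then narrow down" idea from the reply above can be sketched in Python like this. It is only an analogy to the operator's Regular Expression query type, not a description of how the operator is implemented. The body string is a hand-written placeholder, and the DOTALL flag is my own choice so that .* also spans line breaks in a multi-line response.

```python
import re
from typing import Optional


def first_match(pattern: str, body: str) -> Optional[str]:
    """Return the first match of pattern in body, or None if nothing matches."""
    # DOTALL lets "." match newlines, so ".*" covers a multi-line body too.
    m = re.search(pattern, body, flags=re.DOTALL)
    return m.group(0) if m else None


# Placeholder response body with invented values, for illustration only.
body = '{"items": [{"article": "RapidMiner", "views": 0}]}'

# Step 1: match everything to inspect the raw response.
whole = first_match(r".*", body)

# Step 2: once the structure is known, narrow the query to the part you need.
narrow = first_match(r'"views":\s*\d+', body)
```

Once the raw response shows which keys are present, switching the query type back to JsonPath is usually the tidier way to extract them.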

  • sharmar6 Member Posts: 19 Maven

    The response that comes back after authentication is of no use to me. I need the response from GET /posts, which has the information I am looking for. But if I put the token in the URL, it stops after giving the initial response (the landing page). I need to access another page after getting authenticated. How should I form my URL and request parameters so that I can authenticate and then move on to the other page as well?
