
Parsing JSON with OWC's WebAutomation Extension: Extracting two or more relational example sets

Jana_OWC (Moderator, Member, KB Contributor; Posts: 14; Contributor II)
Hi everyone,

a while ago I posted a tutorial explaining how to use the WebAutomation RapidMiner extension by Old World Computing to parse JSON within RapidMiner (see here). That tutorial dealt with how to extract one ExampleSet from a JSON string; it gets much more interesting, however, when you know how to extract two or more relational ExampleSets! Read on to learn how to do just that.

If you've got any questions, don't hesitate to ask! You can also find the tutorial process in the RapidMiner community samples repository: In RapidMiner, go to Community Samples/Community Partner Materials/Old World Computing/JSON tutorial process with Read Document.

Where We Left Off

We will continue with the information about the three books we used in the first part of this tutorial. What we have extracted so far are title, subtitle, language and edition as well as publication date and publishing company, that is, the properties of the books array. Taking another look at the JSON, however, we find two more arrays nested inside this one: authors and keywords. Both of these can have multiple entries; see, for example, the three authors of the second book.



Simply adding them to our existing example set would be impractical, because books with more than one author force a choice. If we keep only one row per book, we can only store the last entry of each nested array. That would mean storing only Jerome Friedman and ignoring Mr Hastie’s and Mr Tibshirani’s contributions to the data scientist’s bible “Elements of statistical learning”. The only alternative would be to have three rows for the book, but then we would have multiple copies of the master data with its title, subtitle and so on.
To avoid this, we will show how to create a second table holding the authors’ names and how to link it to the master data in the first table.
For this tutorial, we will focus on creating a second table to relate the authors set to our first ExampleSet, but of course it is also possible to create a third table for keywords – we will discuss how to deal with arrays of values in the next part of this tutorial.
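
The trade-off described above can be made concrete outside RapidMiner. The following is only a plain Python sketch, not the extension itself; the JSON key names ("books", "title", "authors", "first name", "last name") are assumptions for illustration and may differ from the tutorial's actual file.

```python
import json

# Hypothetical JSON snippet; key names are assumed, not taken from the tutorial file.
raw = """
{"books": [
  {"title": "The Elements of Statistical Learning",
   "authors": [
     {"first name": "Trevor", "last name": "Hastie"},
     {"first name": "Robert", "last name": "Tibshirani"},
     {"first name": "Jerome", "last name": "Friedman"}
   ]}
]}
"""
books = json.loads(raw)["books"]

# Option 1: one row per book. Each author overwrites the previous one,
# so only the last entry of the nested array survives.
one_row_per_book = []
for book in books:
    row = {"title": book["title"]}
    for author in book["authors"]:
        row["author last name"] = author["last name"]
    one_row_per_book.append(row)

# Option 2: one row per author. Nothing is lost, but the master data
# (here just the title) is copied into every row.
one_row_per_author = [
    {"title": book["title"], "author last name": author["last name"]}
    for book in books
    for author in book["authors"]
]

print(one_row_per_book)
print(one_row_per_author)
```

Neither layout is satisfactory, which is why the rest of this tutorial splits the data into two linked tables instead.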

Extracting a Second Array

This tutorial builds on the process created in the first part. If you're not sure how to get there, click the link in the introduction above to have a look at the previous part. 
To be able to access the authors array within the books array, simply add another Process Array operator to your process.



Basically, this is simply a rerun of what we did before to create the first example set: enter the Process Array operator and add Extract Properties and Commit Row operators to form a second ExampleSet. The tree view makes the similarity even clearer:



In Extract Properties, enter first name and last name as the properties to be extracted, then go back to the first Process Array level. Here, we still need to make our port connections: using Multiply, connect the incoming left par (parse specification) port with both the Extract Properties and the nested Process Array operators. From the nested Process Array, connect to the second, still unused, outgoing par port. Make sure you connect the output ports on all the higher levels and also between the Process Object and the Parse operator. As you can see in the second screenshot below, the Parse operator has multiple par ports to receive multiple parse specifications. For each incoming specification, the operator will generate an individual ExampleSet.


 



Start the process and you should get two ExampleSets. These are, however, not yet related: you will see one ExampleSet showing the authors’ names and a second set with the properties we extracted in the first part of this tutorial, but you will not be able to see which book was written by which author(s). What we need is an ID appearing in both sets.

Establishing a Connection between the ExampleSets

To create an ID relating the two example sets to one another, go to the parameter settings of the first Process Array, select “create id attribute” and give the attribute a name of your choice:



An auto-incremented ID will now be assigned to every object in the array and added as an attribute to the resulting data sets:



As these IDs correspond to the same object in both ExampleSets, you can now compare the two data sets and see which author belongs to which book.
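
To illustrate what the shared ID buys you, here is a minimal plain Python sketch of the same idea (again, not the extension's own code). The sample data, the attribute name "book_id" and the choice to start counting at 1 are all assumptions made for this example; the extension may number differently.

```python
# Placeholder sample data; key names and the first book are made up for illustration.
books = [
    {"title": "Placeholder Book",
     "authors": [{"first name": "Ada", "last name": "Example"}]},
    {"title": "The Elements of Statistical Learning",
     "authors": [
         {"first name": "Trevor", "last name": "Hastie"},
         {"first name": "Robert", "last name": "Tibshirani"},
         {"first name": "Jerome", "last name": "Friedman"},
     ]},
]

book_rows = []    # first table: one row per book (master data)
author_rows = []  # second table: one row per author

# Assign an auto-incremented ID to every object in the books array
# and write it into both tables.
for book_id, book in enumerate(books, start=1):
    book_rows.append({"book_id": book_id, "title": book["title"]})
    for author in book["authors"]:
        author_rows.append({"book_id": book_id, **author})

# Because both tables carry the same ID, they can be related again.
titles_by_id = {row["book_id"]: row["title"] for row in book_rows}
for row in author_rows:
    print(titles_by_id[row["book_id"]], "-", row["first name"], row["last name"])
```

Inside RapidMiner, the same lookup would typically be done by joining the two ExampleSets on the ID attribute.
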
Usually, the JSON will already include an ID value. In that case, you will probably want to use it as the connection between the sets instead of the ID created by the extension. To that end, add another Extract Properties operator before the Multiply operator and select your ID property as an attribute to be generated. If you had previously added the ID as a property to be extracted in the first (now second) Extract Properties operator, make sure to remove it from there. Putting it in front of the Multiply operator adds it to both example sets, as it is now contained in the parse specification that is fed into the Process Array operator for the authors, as you can see here:
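
The effect of extracting the ID before the split can be sketched in plain Python as follows. This is only an analogy to the operator layout, not the extension's implementation; the function name split_books, the key name "id" and the ID values are hypothetical.

```python
def split_books(books, id_key="id"):
    """Split nested book data into a book table and an author table.

    The ID is read once per book, before the data is split, so it ends up
    in both tables (analogous to placing Extract Properties before Multiply).
    """
    book_rows, author_rows = [], []
    for book in books:
        shared = {id_key: book[id_key]}  # extracted once, reused for both tables
        book_rows.append({**shared, "title": book["title"]})
        for author in book["authors"]:
            author_rows.append({**shared, **author})
    return book_rows, author_rows


# Placeholder data with made-up ID values.
sample = [
    {"id": "b-101", "title": "Placeholder Book",
     "authors": [{"first name": "Ada", "last name": "Example"}]},
    {"id": "b-102", "title": "The Elements of Statistical Learning",
     "authors": [
         {"first name": "Trevor", "last name": "Hastie"},
         {"first name": "Robert", "last name": "Tibshirani"},
         {"first name": "Jerome", "last name": "Friedman"},
     ]},
]

book_table, author_table = split_books(sample)
print(book_table)
print(author_table)
```

Note that the ID is not extracted a second time inside the book-level step, which mirrors the advice to remove it from the later Extract Properties operator.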




We hope this helps you with using the WebAutomation Extension by Old World Computing for extracting JSON! We will soon post a third part of this tutorial on how to extract arrays of scalar values.





