Parsing JSON with OWC's WebAutomation Extension: Extracting two or more relational example sets
Jana_OWC
Hi everyone,
a while ago I posted a tutorial explaining how to use the WebAutomation RapidMiner extension by Old World Computing to parse JSON within RapidMiner (see here). That tutorial dealt with how to extract one ExampleSet from a JSON string. It gets much more interesting, however, once you know how to extract two or more relational ExampleSets! Read on to learn how to do just that.
If you've got any questions, don't hesitate to ask! You can also find the tutorial process in the RapidMiner community samples repository: In RapidMiner, go to Community Samples/Community Partner Materials/Old World Computing/JSON tutorial process with Read Document.
Where We Left Off
We will continue the tutorial with the information about three books that we also used in the first part. So far we have extracted title, subtitle, language and edition as well as publication date and publishing company, that is, the properties of the books array. Taking another look at the JSON, however, we find two more arrays nested inside this one: authors and keywords. Both of these can have multiple entries; see, for example, the three authors of the second book.
Simply adding them to our existing example set would be impractical, because books with more than one author would force us to choose between two poor options. If we keep only one row per book, we can only store the last entry of each nested array. That would mean that we would only store Jerome Friedman, ignoring Mr Hastie and Mr Tibshirani’s contribution to the data scientist’s bible “Elements of statistical learning”. The only alternative would be to have three rows for the book, but then we would have multiple copies of the master data with its title, subtitle and so on.
To avoid this, we will show how to create a second table holding the authors’ names and how to link it with the master data of the first table.
For this tutorial, we will focus on creating a second table to relate the authors set to our first ExampleSet, but of course it is also possible to create a third table for keywords – we will discuss how to deal with arrays of values in the next part of this tutorial.
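To make the trade-off concrete outside of RapidMiner, here is a minimal Python sketch of the two impractical options described above. It is only an analogy, not how the extension works internally. The JSON excerpt and its key names (`title`, `authors`, `first_name`, `last_name`) are assumptions for illustration; the tutorial's actual JSON may use different names.

```python
import json

# Hypothetical excerpt; key names are assumptions, not the tutorial's real JSON.
raw = """
{"books": [
  {"title": "The Elements of Statistical Learning",
   "authors": [
     {"first_name": "Trevor", "last_name": "Hastie"},
     {"first_name": "Robert", "last_name": "Tibshirani"},
     {"first_name": "Jerome", "last_name": "Friedman"}
   ]}
]}
"""
books = json.loads(raw)["books"]

# Option 1: one row per book. Each author overwrites the previous one,
# so only the last entry of the nested array survives.
one_row_per_book = []
for book in books:
    row = {"title": book["title"]}
    for author in book["authors"]:
        row["author_last_name"] = author["last_name"]
    one_row_per_book.append(row)

# Option 2: one row per author. Nothing is lost, but the master data
# (here just the title) is copied into every row.
one_row_per_author = [
    {"title": book["title"], "author_last_name": author["last_name"]}
    for book in books
    for author in book["authors"]
]

print(one_row_per_book)
print(one_row_per_author)
```

Option 1 keeps a single author per book, option 2 repeats the title three times. A second, linked table avoids both problems.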
Extracting a Second Array
To be able to access the authors array within the books array, simply add another Process Array operator to your process.
Basically, this is a rerun of what we did before to create the first example set: enter the Process Array operator and add Extract Properties and Commit Row operators to form a second ExampleSet. The tree view makes the similarity even clearer:
In Extract Properties, enter first name and last name as the properties to be extracted and go back to the first Process Array level. Here, we still need to make our port connections: using Multiply, connect the incoming left par (parse specification) port with both the Extract Properties and the nested Process Array operators. From the output of the nested Process Array, make the connection to the second, still unused, outgoing par port. Make sure you connect the output ports on all the higher levels and also between the Process Object and the Parse operator. As you can see in the second screenshot below, the Parse operator has multiple par ports to receive multiple parse specifications. For each incoming specification, the operator will generate an individual ExampleSet.
Establishing a Connection between the ExampleSets
To create an ID relating the two example sets to one another, go to the parameter settings of the first Process Array, select “create id attribute” and give it a name of your choice:
An ID with an auto-incremented number will now be assigned to every object in the array and added as an attribute to the resulting data sets:
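For readers who think in code, the effect of the generated ID can be sketched in plain Python. This is a rough analogy under stated assumptions, not the extension's implementation: the key names, the placeholder first book, the attribute name `book_id` and the choice to start counting at 1 are all made up for illustration.

```python
import json

# Hypothetical sample data: key names and the first book are assumptions.
RAW = """
{"books": [
  {"title": "Placeholder Book A",
   "authors": [{"first_name": "Ada", "last_name": "Example"}]},
  {"title": "The Elements of Statistical Learning",
   "authors": [
     {"first_name": "Trevor", "last_name": "Hastie"},
     {"first_name": "Robert", "last_name": "Tibshirani"},
     {"first_name": "Jerome", "last_name": "Friedman"}
   ]}
]}
"""


def extract_tables(json_text, id_name="book_id"):
    """Return (books_table, authors_table), linked by a generated ID."""
    books_table, authors_table = [], []
    books = json.loads(json_text)["books"]
    # Assumption: numbering starts at 1; the extension may count differently.
    for book_id, book in enumerate(books, start=1):
        # Outer array: one row of master data per book.
        books_table.append({id_name: book_id, "title": book["title"]})
        # Nested array: one row per author, carrying the parent's ID.
        for author in book.get("authors", []):
            authors_table.append({
                id_name: book_id,
                "first_name": author["first_name"],
                "last_name": author["last_name"],
            })
    return books_table, authors_table


books_table, authors_table = extract_tables(RAW)
for row in books_table + authors_table:
    print(row)
```

Every author row carries the ID of the book it belongs to, so the two tables can later be joined on that attribute without duplicating the master data.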
Usually, the JSON will already include an ID value. In this case, you will probably want to use that as the connection between the sets instead of the ID created by the extension. To that end, add another Extract Properties operator before the Multiply operator, and select your ID property as an attribute to be generated. If you had previously added ID as a property to be extracted in the first (now second) Extract Properties operator, make sure to remove it from there. Putting it in front of the Multiply will add it to both example sets, as it is now contained in the parse specification that is fed into the Process Array operator for the authors, as you can see here:
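The same idea, reusing an ID that is already present in the data, can be sketched in Python as follows. Again, this is only an analogy: the `id` key, its value `b-002` and the other key names are hypothetical. The `shared` dictionary plays the role of the properties extracted before the Multiply operator, which therefore end up in both tables.

```python
# Hypothetical records: the "id" key and its value are assumptions.
books = [
    {
        "id": "b-002",
        "title": "The Elements of Statistical Learning",
        "authors": [
            {"last_name": "Hastie"},
            {"last_name": "Tibshirani"},
            {"last_name": "Friedman"},
        ],
    },
]

books_table, authors_table = [], []
for book in books:
    # Extracted once, before the split, so both tables receive it.
    shared = {"id": book["id"]}
    books_table.append({**shared, "title": book["title"]})
    for author in book["authors"]:
        authors_table.append({**shared, "last_name": author["last_name"]})

# Joining on the shared ID recovers which author wrote which book.
titles_by_id = {row["id"]: row["title"] for row in books_table}
joined = [(titles_by_id[row["id"]], row["last_name"]) for row in authors_table]
print(joined)
```

Because the ID comes from the data itself rather than from the position in the array, the link between the two tables stays stable even if the order of the books changes.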
We hope this helps you with using the WebAutomation Extension by Old World Computing for extracting JSON! We will soon post a third part of this tutorial on how to extract arrays of scalar values.