I'm back with another tutorial, concluding the posts on how to parse JSON in RapidMiner using Old World Computing's WebAutomation extension. I hope you found the tutorials useful. If you have any further questions, don't hesitate to ask! Also, if you are using any of our extensions and would like to see a tutorial about certain features, feel free to send me a message here, or contact us on Twitter or LinkedIn.
In the previous posts, we first discussed the basic functions of the WebAutomation Extension and then demonstrated how to extract not just one but multiple relational example sets from a single JSON string. As mentioned there, we have one more feature to show: extracting arrays of scalar values. If you like, you can also open the tutorial process in RapidMiner: you will find it under Partner Materials - Old World Computing in the Community Samples Repository.
As we will continue with our example data from before, let’s first have another look at the JSON:
So far, we have discussed extracting the properties of the books array (title, subtitle, language and so on). We also covered how to extract the information of the nested authors array. As you can see above, however, both the books and the authors array are arrays of objects. Having a closer look at the JSON, you will see that there is one array left which we have not yet processed: keywords.
You will also see that keywords, as opposed to authors, is an array of single string values and not of nested objects. In the following, we will demonstrate how to extract this information into a third table.
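To make the distinction concrete outside of RapidMiner, here is a minimal Python sketch. It is not part of the WebAutomation extension: the helper `array_kind` is hypothetical, and the sample data is invented (only the field names books, title, authors and keywords come from this tutorial; the author field `name` is an assumption).

```python
import json

# Hypothetical sample data: only the field names books, title, authors and
# keywords come from the tutorial; all values and the "name" field are invented.
SAMPLE = """
{
  "books": [
    {
      "title": "Example Book",
      "authors": [{"name": "A. Author"}, {"name": "B. Writer"}],
      "keywords": ["json", "parsing"]
    }
  ]
}
"""


def array_kind(values):
    """Classify a JSON array as 'objects', 'scalar values' or 'mixed'."""
    if all(isinstance(v, dict) for v in values):
        return "objects"
    if all(not isinstance(v, (dict, list)) for v in values):
        return "scalar values"
    return "mixed"


book = json.loads(SAMPLE)["books"][0]
kinds = {name: array_kind(book[name]) for name in ("authors", "keywords")}
print(kinds)  # {'authors': 'objects', 'keywords': 'scalar values'}
```

This is the same choice you make with the array type parameter of the Process Array operator: authors is processed as an array of objects, keywords as an array of scalar values.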
First, here is a reminder of how the inside of the Process Array operator should look right now. As we have discussed before, the structure of the process mirrors the original JSON structure. Therefore, we will continue to work on the level of the books array.
We will now add another Process Array operator, connecting it to Multiply and the third Parse Specification port on the right. Remember to also make the new connections on all higher levels, and between the Process Object and Parse operator, in order to receive your ExampleSet.
Click on the new operator to edit its parameters: set the property name to “keywords” and select “scalar values” as the array type:
Going into the operator, we will build a similar sub-process to the ones we are using to extract the authors and the other properties. The only difference is that instead of the Extract Properties operator, we will now use the Extract Scalar operator provided by the WebAutomation extension. Enter an attribute name – Keywords – and select the correct attribute type, in this case polynominal. Do not forget to add a Commit Row operator to the sub-process to express that every entry should be represented by a row:
Running the process, you should now get three individual example sets: one showing the properties of the books array, one with the authors’ names, and a third one with keywords assigned to the books. The keywords array process is nested within the Process Object operator, which, as you might remember from the previous tutorials, we have set to assign an ID to each JSON object. Thus, the new third ExampleSet will also include an ID corresponding to the other ExampleSets, making relational conclusions possible. (If your data already includes an ID, go here to read up on how to use it as the connecting element.)
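For readers who think in code, the resulting three tables and their shared ID can be sketched in plain Python. This is only an illustration of the relational idea, not what the operators do internally: the sample values, the author field `name`, and the column name `ID` are assumptions, while the attribute name Keywords follows the one entered in Extract Scalar.

```python
# Hypothetical parsed JSON; all values and the author field "name" are invented.
data = {
    "books": [
        {
            "title": "First Book",
            "authors": [{"name": "A. Author"}],
            "keywords": ["json", "parsing"],
        },
        {
            "title": "Second Book",
            "authors": [{"name": "B. Writer"}, {"name": "C. Editor"}],
            "keywords": ["web"],
        },
    ]
}

books, authors, keywords = [], [], []
for book_id, book in enumerate(data["books"], start=1):
    # One row per book object, tagged with a generated ID.
    books.append({"ID": book_id, "title": book["title"]})
    # One row per nested author object, carrying the parent book's ID.
    for author in book["authors"]:
        authors.append({"ID": book_id, "name": author["name"]})
    # One row per scalar keyword, also carrying the parent book's ID.
    for keyword in book["keywords"]:
        keywords.append({"ID": book_id, "Keywords": keyword})

# The shared ID lets us relate the keywords table back to the books table.
titles_by_id = {row["ID"]: row["title"] for row in books}
keywords_with_titles = [
    (titles_by_id[row["ID"]], row["Keywords"]) for row in keywords
]
print(keywords_with_titles)
# [('First Book', 'json'), ('First Book', 'parsing'), ('Second Book', 'web')]
```

In RapidMiner itself, the equivalent of the last step would be joining the ExampleSets on the ID attribute.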
This concludes our tutorials for JSON parsing with the new WebAutomation Extension. You should now be able to use this tool to turn a single nested JSON string into several related ExampleSets within one process. For further help with the extensions, you can also check the tutorials found in the help tab in RapidMiner Studio when selecting one of the extension’s operators. Also be sure to have a look at the other useful functions, such as the JSON request operators, which fetch the data directly from a web service.