Many organizations are storing data in Google spreadsheets because they offer several advantages over offline spreadsheet solutions. To name a few: high availability of data and ease of collaboration based on sharing rights. Its integration with Google Drive and Google Docs allows to fetch spreadsheets (also having a different formats) from these sources.
The extension provides a ‘Read Google Spreadsheet’ operator, which extracts data from a Google Spreadsheet document and converts it to RapidMiner ExampleSet. Hence, it extends the reach of your data mining processes to live documents, which may be regularly updated e.g., if you are collecting feedback, sales or any other data from multiple customers or stakeholders. Thus, this operator enables you to stay up-to-date with newly arriving data, continuously assess your analytics models and adjust business decisions if necessary.
Let’s take an example to bring these concepts in a unified scenario by virtue of which you can learn the enrichment and integration capabilities of RapidMiner. This not only includes data but also third party services.
Please install the following extensions through the RapidMiner Marketplace to ensure you can reproduce the steps presented next:
A Unified Example Scenario for Text Analytics (Extract, Enrich, Process, Interpret):
We consider the simple case of a book (or another product for that matter) review for demonstration purpose only. A good debut of a book may start with large purchases due to initial hype, but for the book stores, publishers or resellers, user feedback is essential to determine the consumer/market sentiment over time. Insights based on sentiment analysis can lead to e.g., better inventory management or creating improved selling propositions.
We take a subset of customer reviews for the book ‘The Martian’ as made on amazon.com. After cleaning the data, the review text (excluding ranking information) is made available as a Google spreadsheet at . The ‘Read Google Spreadsheet’ operator reads this sheet in RapidMiner as shown
in Fig. 1. Simply provide it the url of spreadsheet, the sheet name and a client secret file.
Obtaining the client secret file:
As Google spreadsheets are managed by Google API servers, you need to turn-on the Google Sheets API for your Google account and get a client secret (JSON) file, which contains your authentication credentials. To obtain it, you can follow the following steps:
We have already extracted data from a Google spreadsheet. Now let us set two objectives towards analysing this text data:
To analyse sentiment, one approach is to use a dictionary such as the RapidMiner Wordnet dictionary as explained here . The other approach that we will use is to use a third-party service, that already provides such a classification on whole sentences. In this example, we will use IBM Watson's Natural Language Understanding (NLU) API - a state of the art in natural language processing (NLP) technologies to classify examples in our dataset as positive, negative or neutral. Fig. 2 shows the RapidMiner process that achieves this.
As seen, we filter examples with the length of review text limited to one thousand characters in length. This leaves us with 600 examples. We then stuff this text (found under attribute name ‘review’) in the payload of a webservice request that will be invoked using the ‘Enrich Data by Webservice’ operator (from the Web Mining extension). The operator makes an API call for each example of the ExampleSet provided to it. To get authenticated by the API, just create a free user account at  and you will receive your login credentials. Upon first invocation, the operator prompts for your username and password as shown in Fig. 3.
The main parameters of the ‘Enrich Data by Webservice’ operator are configured to satisfy the API requirements  as follows:
As seen in Fig. 4, we enquire Watson about several features:
In this way, we have enriched our dataset using an external system. This demonstrates how API based systems can be harnessed rather conveniently and hooked with our RapidMiner processes to develop well integrated quasi service-driven data mining workflows – all within RapidMiner Studio!
Other advanced features can be enquired from Watson as well, but our free account is limited in terms of API calls per day, payload size (that’s why we used reviews of limited length earlier) and the number of features requested – all billable items that need careful consideration.
By now we have enriched the data with a third-party sentiment classification service. This data set is made available at  for your reference. What remains to be identified are words that influence these class labels the most. We will use feature weights to determine this, but before that, we prepare an unbiased dataset comprising an equal number of positive and negative reviews. This gives us a reduced dataset of 304 examples, 152 for positive and negative, while ignoring the neutral ones as they are less discriminating and hence less interesting.
Next, we use the Text Processing extension to build a standard operator chain inside the ‘Process Documents from Data’ operator, as seen in Fig. 5. Use TF-IDF for word vector creation and pruning methods to see the impact on number of attributes (tokens) you get. You will notice that percentual pruning in the range [3.0, 30.0] already reduces attributes from 10633 to 223, with no negative effect on the results but noticeable difference in execution speed of the process due to processing complexity. The sub-process performs tokenization, filtering out tokens smaller or larger than certain lengths as well as stop words as these bring no value, change case, apply stem so similar origin words are considered related, and finally we apply n-Grams as 2-Grams. The latter is helpful in analysing writing styles that use adjective declinations e.g., brilliant plot, excellent book, etc. This complex processing is relatively straight forward in RapidMiner.
The complete process uses several steps and is attached with this article for your reference.
Depending on the text, tokenization can result in many attributes (despite pruning). The question is which of these features are rather important? Weights to the rescue! The weight of an attribute gives its importance in relation to the target attribute, also called label. Once we have the weights, we can associate our preference to attributes.
RapidMiner offers more than a dozen operators that can characterize attributes with weights. We assign the role of label to the polarity attribute and split the data in two sets (one containing positive and other negative examples). Then we simply apply ‘Weight by Value Average’ operator on each set to weight attributes against their class. Next, the ‘Select by Weights’ operator can filter out attributes whose weights are above a desired threshold, say 0.6. This gives us most influencing attributes (in terms of their TF-IDF occurrence frequencies). Feature weighing is explained in a community article available at  which also touches on feature selection in RapidMiner.
We can now interpret results of our text analytic process. Three main insights were discovered.
great, enjoy, love, recommend, make, fiction, survive, read_book, book_read, mark, science_fiction
Notice the combinations like read_book and science_fiction, captured by the 2-Gram n-Gram. Similarly, the influential words associated exclusively with negative reviews are:
martian, mar, detail, technic, page, novel, plot, know, found, finish, work, write, movie, weir
A bit tricky are the overlapping words i.e. those that appear in both positive and negative reviews. These are: “good, time, author, interest”. One way to analyse these is to look at the distribution of their TF-IDF values in data, as shown for the word ‘time’ and ‘author’ in Fig. 6.
Except for the Quantity entity, Watson seems quite confident about the entities it detected (relevance > 0.8). The entity Person tops the count and the strongest emotion associated with it is sadness (value > 0.3). The concatenation of entity names (shown as the last column) gives clues in that direction.
In this article, you learned about the new extension for reading Google Spreadsheets as ExampleSets in RapidMiner. Using this as a starting point, a more holistic scenario was presented. We gradually moved from data extraction to data enrichment through a third-party API. We then performed text analytics using out-of-the-box features of RapidMiner. Finally, we got a glimpse of how the amazon.com customers reviewed the book ‘The Martian’. Our limited dataset served as an educating exercise and is of course not meant to influence the book sales or market opinion in any way.
The Spreadsheet Table Extraction extension is developed as part of “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com) which is sponsored by the German ministry of education and research (BMBF).
 A sample Google Spreadsheet, sheet name ‘The Martian’, weblink: https://docs.google.com/spreadsheets/d/1vRJi3Ur3w6-9WhOa0G-vJ6GR4-RWGiBCwrQok5ILsow/edit#gid=1779829...
 A sample Google Spreadsheet, sheet name ‘Sentiments’, weblink: https://docs.google.com/spreadsheets/d/1vRJi3Ur3w6-9WhOa0G-vJ6GR4-RWGiBCwrQok5ILsow/edit#gid=1445561...
 RapidMiner Community article, Sentiment Analysis using Wordnet Dictionary, weblink: http://community.rapidminer.com/t5/Text-Analytics-in-RapidMiner/Sentiment-Analysis-using-Wordnet-Dic...
 IBM API Accounts, weblink: https://myibm.ibm.com/dashboard/
 IBM NLU API reference, weblink: https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/#sentiment
 Endpoint Reference of IBM NLU API, weblink: https://gateway.watsonplatform.net/natural-language-understanding/api/v1/analyze?version=2017-02-27
 RapidMiner Community article, Feature Weighting Tutorial, weblink: http://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/Feature-Weighting-Tutorial/ta-p/...
 Rosette extension for RapidMiner, weblink: https://marketplace.rapidminer.com/UpdateServer/faces/download.xhtml?productId=rmx_rosette_text_tool...
 Aylien extension for RapidMiner, weblink: https://marketplace.rapidminer.com/UpdateServer/faces/download.xhtml?productId=rmx_com.aylien.textap...
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.