RapidMiner

The ‘Spreadsheet Table Extraction’ - Extension Release
Unicorn
‎07-13-2017 07:04 AM
1216 Views
0 Comments

Read data from Google Spreadsheets, enrich using external APIs and analyse text – all in RapidMiner!

Read more...

By: Edwin Yaqub, PhD

In this article, you will be introduced to the ‘Spreadsheet Table Extraction’ - a new extension developed by RapidMiner Research. This article also provides a walk-through on how this extension may fit into your analysis process chain.

 

Motivation:

 

Many organizations are storing data in Google spreadsheets because they offer several advantages over offline spreadsheet solutions. To name a few: high availability of data and ease of collaboration based on sharing rights. Its integration with Google Drive and Google Docs allows to fetch spreadsheets (also having a different formats) from these sources.

 

The extension provides a ‘Read Google Spreadsheet’ operator, which extracts data from a Google Spreadsheet document and converts it to RapidMiner ExampleSet. Hence, it extends the reach of your data mining processes to live documents, which may be regularly updated e.g., if you are collecting feedback, sales or any other data from multiple customers or stakeholders. Thus, this operator enables you to stay up-to-date with newly arriving data, continuously assess your analytics models and adjust business decisions if necessary.

 

Let’s take an example to bring these concepts in a unified scenario by virtue of which you can learn the enrichment and integration capabilities of RapidMiner. This not only includes data but also third party services.

 

Pre-Requisites:

 

Please install the following extensions through the RapidMiner Marketplace to ensure you can reproduce the steps presented next:

  • Web Mining Extension
  • Text Processing Extension
  • Spreadsheet Table Extraction Extension

 

A Unified Example Scenario for Text Analytics (Extract, Enrich, Process, Interpret):

 

We consider the simple case of a book (or another product for that matter) review for demonstration purpose only. A good debut of a book may start with large purchases due to initial hype, but for the book stores, publishers or resellers, user feedback is essential to determine the consumer/market sentiment over time. Insights based on sentiment analysis can lead to e.g., better inventory management or creating improved selling propositions.

 

Extract:

We take a subset of customer reviews for the book ‘The Martian’ as made on amazon.com. After cleaning the data, the review text (excluding ranking information) is made available as a Google spreadsheet at [1]. The ‘Read Google Spreadsheet’ operator reads this sheet in RapidMiner as shown
in Fig.  1. Simply provide it the url of spreadsheet, the sheet name and a client secret file.

Image1.png

 

 

 

 

Obtaining the client secret file:

 

As Google spreadsheets are managed by Google API servers, you need to turn-on the Google Sheets API for your Google account and get a client secret (JSON) file, which contains your authentication credentials. To obtain it, you can follow the following steps:

  1. Visit Google Developers Console at 'https://console.developers.google.com/flows/enableapi?apiid=sheets.googleapis.com'.
  2. Create or select a project from the list box. This also enables API for your project.
  3. Go to 'credentials'. Now you land on the 'Add credentials to your project' page. Click Cancel.
  4. Now select 'OAuth consent screen' tab. Provide email and a product name and click Save.
  5. Now you are at the 'Credentials' tab. Click the 'Create credentials' list box button and select 'OAuth client ID'.
  6. From the list of options, select 'Other', provide a name and click Create.
  7. Now you see your 'OAuth 2.0 client ID' under the Credentials tab. Click the 'Download JSON' link. This gives you your client secret file.

 

Enrich:

We have already extracted data from a Google spreadsheet. Now let us set two objectives towards analysing this text data:

  • Understanding the polarity of users' sentiment towards this book through a third-party service, thereby enriching the data further.
  • Identifying the more frequently used words that influence the polarity classes.

  
To analyse sentiment, one approach is to use a dictionary such as the RapidMiner Wordnet dictionary as explained here [3]. The other approach that we will use is to use a third-party service, that already provides such a classification on whole sentences. In this example, we will use IBM Watson's Natural Language Understanding (NLU) API - a state of the art in natural language processing (NLP) technologies to classify examples in our dataset as positive, negative or neutral. Fig.  2 shows the RapidMiner process that achieves this.

Image2.pngFig. 2 RapidMiner process to read Google spreadsheet, some filtering and enrichment by Webservice invocation
As seen, we filter examples with the length of review text limited to one thousand characters in length. This leaves us with 600 examples. We then stuff this text (found under attribute name ‘review’) in the payload of a webservice request that will be invoked using the ‘Enrich Data by Webservice’ operator (from the Web Mining extension). The operator makes an API call for each example of the ExampleSet provided to it. To get authenticated by the API, just create a free user account at [4] and you will receive your login credentials. Upon first invocation, the operator prompts for your username and password as shown in Fig.  3.

Image3.png

 

 

The main parameters of the ‘Enrich Data by Webservice’ operator are configured to satisfy the API requirements [5] as follows:

  • request method = POST, request properties = Content-Type: application/json
  • query type = Regular Expression
  • attribute type = Nominal
  • Point the url to [6]
  • 4 shows the value for body parameter (the JSON to send as a request) and the regular expression queries we apply to extract results from the response JSON.

 

Image4.png

 

 

As seen in Fig.  4, we enquire Watson about several features:

  • the sentiment (polarity and score) to be assessed from the document i.e. the chunk of text (value of our ‘review’ attribute). If we think how we humans share our experience about something, we often talk about an object or place or a personality in terms of our impressions. Such information can help understand the context of our customer reviews. Like some other systems [8,9] Watson can also extract entities from text as Person, a Quantity, an Organization, a Movie, a Location, a Company, a Broadcaster, a Job Title, etc.
  • Further we can assess emotion (as a score between [0,1]) associated with the entity e.g., disgust, joy, sadness, fear or anger.

 

In this way, we have enriched our dataset using an external system. This demonstrates how API based systems can be harnessed rather conveniently and hooked with our RapidMiner processes to develop well integrated quasi service-driven data mining workflows – all within RapidMiner Studio!

 

Other advanced features can be enquired from Watson as well, but our free account is limited in terms of API calls per day, payload size (that’s why we used reviews of limited length earlier) and the number of features requested – all billable items that need careful consideration.

 

Process:

By now we have enriched the data with a third-party sentiment classification service. This data set is made available at [2] for your reference. What remains to be identified are words that influence these class labels the most. We will use feature weights to determine this, but before that, we prepare an unbiased dataset comprising an equal number of positive and negative reviews. This gives us a reduced dataset of 304 examples, 152 for positive and negative, while ignoring the neutral ones as they are less discriminating and hence less interesting.

 

Next, we use the Text Processing extension to build a standard operator chain inside the ‘Process Documents from Data’ operator, as seen in Fig.  5. Use TF-IDF for word vector creation and pruning methods to see the impact on number of attributes (tokens) you get. You will notice that percentual pruning in the range [3.0, 30.0] already reduces attributes from 10633 to 223, with no negative effect on the results but noticeable difference in execution speed of the process due to processing complexity. The sub-process performs tokenization, filtering out tokens smaller or larger than certain lengths as well as stop words as these bring no value, change case, apply stem so similar origin words are considered related, and finally we apply n-Grams as 2-Grams. The latter is helpful in analysing writing styles that use adjective declinations e.g., brilliant plot, excellent book, etc. This complex processing is relatively straight forward in RapidMiner.

Image5.png

 

 

The complete process uses several steps and is attached with this article for your reference.

 

Depending on the text, tokenization can result in many attributes (despite pruning). The question is which of these features are rather important? Weights to the rescue! The weight of an attribute gives its importance in relation to the target attribute, also called label. Once we have the weights, we can associate our preference to attributes.

 

RapidMiner offers more than a dozen operators that can characterize attributes with weights. We assign the role of label to the polarity attribute and split the data in two sets (one containing positive and other negative examples). Then we simply apply ‘Weight by Value Average’ operator on each set to weight attributes against their class. Next, the ‘Select by Weights’ operator can filter out attributes whose weights are above a desired threshold, say 0.6. This gives us most influencing attributes (in terms of their TF-IDF occurrence frequencies). Feature weighing is explained in a community article available at [7] which also touches on feature selection in RapidMiner.

Interpret:

We can now interpret results of our text analytic process. Three main insights were discovered.

  1. Overall polarity score: This showed that the average positive value is higher than average negative (+0.5 vs -0.34). This sounds good for the author, publishers and resellers.
  2. The bag of influential words that are found exclusively in positive reviews are shown below:

great, enjoy, love, recommend, make, fiction, survive, read_book, book_read, mark, science_fiction

 

 

 

Notice the combinations like read_book and science_fiction, captured by the 2-Gram n-Gram. Similarly, the influential words associated exclusively with negative reviews are:

martian, mar, detail, technic, page, novel, plot, know, found, finish, work, write, movie, weir

 

 

 

 

A bit tricky are the overlapping words i.e. those that appear in both positive and negative reviews. These are: “good, time, author, interest”. One way to analyse these is to look at the distribution of their TF-IDF values in data, as shown for the word ‘time’ and ‘author’ in Fig.  6.

 

 

Image6.png

 

 

  1. The entity analysis based on average values reveals that of the 223 reviews (quite a small dataset), 168 talked about some entities and associated a certain emotion with them as seen in the table below:

 

Image7.png

 

 

Except for the Quantity entity, Watson seems quite confident about the entities it detected (relevance > 0.8). The entity Person tops the count and the strongest emotion associated with it is sadness (value > 0.3). The concatenation of entity names (shown as the last column) gives clues in that direction.

 

Conclusion:

 

In this article, you learned about the new extension for reading Google Spreadsheets as ExampleSets in RapidMiner. Using this as a starting point, a more holistic scenario was presented. We gradually moved from data extraction to data enrichment through a third-party API. We then performed text analytics using out-of-the-box features of RapidMiner. Finally, we got a glimpse of how the amazon.com customers reviewed the book ‘The Martian’. Our limited dataset served as an educating exercise and is of course not meant to influence the book sales or market opinion in any way.

 

Acknowledgments:

 

The Spreadsheet Table Extraction extension is developed as part of “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com) which is sponsored by the German ministry of education and research (BMBF).

 

 

References:

[1] A sample Google Spreadsheet, sheet name ‘The Martian’, weblink: https://docs.google.com/spreadsheets/d/1vRJi3Ur3w6-9WhOa0G-vJ6GR4-RWGiBCwrQok5ILsow/edit#gid=1779829...

[2] A sample Google Spreadsheet, sheet name ‘Sentiments’, weblink: https://docs.google.com/spreadsheets/d/1vRJi3Ur3w6-9WhOa0G-vJ6GR4-RWGiBCwrQok5ILsow/edit#gid=1445561...

[3] RapidMiner Community article, Sentiment Analysis using Wordnet Dictionary, weblink: http://community.rapidminer.com/t5/Text-Analytics-in-RapidMiner/Sentiment-Analysis-using-Wordnet-Dic...

[4] IBM API Accounts, weblink: https://myibm.ibm.com/dashboard/

[5] IBM NLU API reference, weblink: https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/#sentiment

[6] Endpoint Reference of IBM NLU API, weblink: https://gateway.watsonplatform.net/natural-language-understanding/api/v1/analyze?version=2017-02-27

[7] RapidMiner Community article, Feature Weighting Tutorial, weblink: http://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/Feature-Weighting-Tutorial/ta-p/...

[8] Rosette extension for RapidMiner, weblink: https://marketplace.rapidminer.com/UpdateServer/faces/download.xhtml?productId=rmx_rosette_text_tool...

[9] Aylien extension for RapidMiner, weblink: https://marketplace.rapidminer.com/UpdateServer/faces/download.xhtml?productId=rmx_com.aylien.textap...

Regards,
Thomas

Blog: Neural Market Trends

RapidMiner Tutorial Videos here!