Check out our Rosette Text Toolkit extension for RapidMiner and plug Rosette text analytics directly into your RapidMiner workflows. More info here: https://www.rosette.com/
Get up and running with Rosette for RapidMiner Studio with this quick start guide, which covers the installation and setup process. We also demonstrate how to get started extracting and linking entities with Rosette.
Installing RapidMiner and Rosette
If you aren’t already running RapidMiner Studio, download the application from RapidMiner’s website. To download the Rosette Text Toolkit extension, open RapidMiner Studio, navigate to the Extensions menu, and select Marketplace.
A new window will open. Search for “rosette” and select Rosette Text Toolkit from the list of results. Click the Install 1 Packages button at the bottom of the window and follow the click-through instructions to complete the installation.
Once the extension has finished installing, the Rosette operators will be visible in the Extensions folder of the Operators panel.
Getting a Rosette API Key
In order to activate the Rosette Text Toolkit for RapidMiner Studio, you’ll need an API key and a Rosette developer account. Head over to developer.rosette.com and complete the signup process.
You can create an account linked to either your email or your GitHub account. No credit card is required — our default plan gives you 10,000 calls a day for free! If you’re interested in upping your call quota, check out our paid plans.
Once you have completed the signup process and verified your account, click on the API Key tab on the top left of the menu bar to display your key.
Setting up your Rosette API Connection
Back in RapidMiner Studio, input your Rosette API key to start using any of Rosette’s operators. We’ll be looking at the entity extraction operator in the next section, so we’ll use it to set up the Rosette API connection now.
First, locate Extract Entities in the Rosette Text Toolkit folder in the Operators panel and drag it to the Process panel.
You can see the various settings for the Extract Entities operator in the Parameters panel to the right of the Process panel. The first parameter is Connection. Click the Rosette icon to the right of the box.
The Manage Connections window will open. Click the Add Connection button on the bottom left and select Rosette Connection from the Connection type dropdown list. Name your new connection and click the Create button.
Select your new Rosette API connection from the list on the left and enter your Rosette API key in the API KEY box. Use the Test button at the bottom of the window to verify that your connection is working. If you run into any trouble, confirm that you have copied your API key correctly. When you are satisfied that everything is running smoothly, click the Save all changes button to return to the Parameters panel.
Select your new connection from the Connection dropdown list.
Now that you’ve installed the Rosette for RapidMiner extension and set up your API key and connection, you’re almost ready to start analyzing. Last step: download RapidMiner’s Text Processing extension from the RapidMiner Marketplace, a helpful set of operators that let you load, filter, and analyze text from a variety of sources. With that installed, head to RapidMiner Studio, where we’ll use three operators to create a simple entity extraction workflow, or process: Create Document and Documents to Data from Text Processing, and Extract Entities from Rosette. Drag these operators into the Process panel and connect them together, maintaining the order listed above. You can find the operators using the Operators search bar.
Select the Create Document operator. In the parameter panel, check the add label box. Under label type, select text and enter ‘my_text’ for label value. Click the Edit Text button at the top of the panel and copy the text below into the popup window.
“Bill Murray will appear in new Ghostbusters film: Dr. Peter Venkman was spotted filming a cameo in Boston this… http://dlvr.it/BnsFfS.”
Hit the Apply Changes button to save your work.
Now select the Documents to Data operator. In the Parameters panel, enter ‘my_text’ in the text attribute field.
Execute the process using the blue “play” button. The results show five extracted entities. As you can see, Rosette correctly extracted both the names and the location included in the text.
Let’s make our input text a little longer. Add the sentence below to the parameter text and rerun the process.
“Another original Ghostbuster, Dan Akroyd, is also confirmed to have a cameo in the film.”
From the results we can see that Rosette extracts Dan Akroyd’s name as expected. However, eagle-eyed readers may have noticed that “Akroyd” is misspelled. (It should be “Aykroyd.”) This is not uncommon. Name misspellings appear frequently, everywhere from personal blogs to the New York Times online. If you are trying to track a particular entity across a large collection of documents, you want to make sure that you are identifying all possible spellings of that entity’s name. Rosette automatically extracts and links entities with spelling variations and other textual anomalies, unifying them into a single entry.
To demonstrate this functionality, let’s enable Link Entities in the Extract Entities parameter panel.
Then, we’ll add a third line to the parameter text that includes the correct spelling of Dan Aykroyd’s name, like the one below:
“Actually, the correct spelling is Aykroyd.”
When we run the process again, a new QID column appears in the results. Notice that “Dan Akroyd” and “Aykroyd” have the same QID value — Rosette has correctly identified them as the same entity.
QID values are drawn from Wikidata, so if an entity has a Wikidata entry, Rosette should be able to link and resolve it.
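Outside of RapidMiner, the same extraction-and-linking call can be made directly against the Rosette REST API. The sketch below is illustrative only: the endpoint and header names follow Rosette's public documentation, but the exact response shape and the sample entities are assumptions, not the operator's guaranteed output.

```python
import json
from urllib import request

ROSETTE_ENTITIES_URL = "https://api.rosette.com/rest/v1/entities"

def extract_entities(text, api_key):
    """POST text to Rosette's /entities endpoint and return the entity list."""
    req = request.Request(
        ROSETTE_ENTITIES_URL,
        data=json.dumps({"content": text}).encode("utf-8"),
        headers={"X-RosetteAPI-Key": api_key,
                 "Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["entities"]

# A hypothetical response: each entity carries its type, the surface
# mention, and (with linking enabled) a Wikidata QID in "entityId".
sample_response = {"entities": [
    {"type": "PERSON", "mention": "Bill Murray", "entityId": "Q29250"},
    {"type": "LOCATION", "mention": "Boston", "entityId": "Q100"},
]}
people = [e["mention"] for e in sample_response["entities"]
          if e["type"] == "PERSON"]
```

The QID in `entityId` is what surfaces in the QID column of the RapidMiner results.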
QIDs are very useful for machine-reading purposes, but for humans they can be difficult to keep track of. Let’s turn on the Include Entity Name parameter, which will allow us to see the entity names in addition to their QIDs.
Try it Yourself
Now that you’ve got the Rosette Text Toolkit up and running with RapidMiner Studio, you are well equipped to handle a host of text analytics tasks. Incorporate results like the ones above into your pre-existing data processes, and check out our other operators, including Categorization, Sentiment Analysis, Morphological Analysis, Tokenization, Sentence Tagging, Name Translation, and Name Matching.
While you’re at it, keep us posted! We love to hear what our users are working on, and would be thrilled to share your Rosette for RapidMiner story on our blog and here in the RapidMiner Community.
When you do statistically based text analysis, as opposed to full natural language processing, words sometimes carry a different meaning when grouped together. For example, strategy by itself is just a noun; there is no context involved. On the other hand, if it is paired with military, economic, or political, it has a far different meaning: strategy, military strategy, economic strategy, and political strategy are all different ideas. Statistically based text processing will not extract the context of these words, but it will tell you how many times strategy, military, economic, and political show up in your documents or data. This gives you information but lacks context. So the question is: "How do you extract this context via statistically based text processing in RapidMiner?" The answer: Generate n-Grams (Terms).
The operator’s algorithm is quite simple. Generate n-Grams (Terms) checks for words that frequently follow one another. Following the example above, RapidMiner will pick out strategy and military as new attributes, each of which is a single word. Next it notices that military is often followed by strategy, so it creates a new attribute, military_strategy. This selection can also be improved via pruning; there is a note on pruning later. The result is three attributes: strategy, military, and military_strategy. Without the machine understanding the context, it was still able to identify grouped words, so the data scientist can now understand the context in which military strategy appears. The key here is that this bypasses the machine’s need to understand the language and pushes it onto the user, while still maintaining groups of associated words within the data set.
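The behavior described above can be sketched in a few lines of Python. This is not RapidMiner's implementation, just a minimal illustration of how underscore-joined terms get added next to the single-word attributes:

```python
from collections import Counter

def generate_ngrams(tokens, max_length=2):
    """Keep the original tokens and add an underscore-joined term for
    every run of up to max_length adjacent tokens."""
    terms = list(tokens)
    for n in range(2, max_length + 1):
        terms += ["_".join(tokens[i:i + n])
                  for i in range(len(tokens) - n + 1)]
    return Counter(terms)

freqs = generate_ngrams(["our", "military", "strategy"])
# "military_strategy" now sits alongside "military" and "strategy"
```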
- You will need the Text Processing extension in order to use both the Process Documents operator and the Generate n-Grams operator. There are a few other operators you will need from this extension as well.
- A guide to installing extensions, including the Text Processing extension, can be found here.
- A text document. You can either create one or read in a text file. This guide covers the first option.
Step 1: Generating the Document
All that is needed here is to place a Create Document operator, then go to the Parameters panel and click the 'Edit Text' button. This will pull up a window that allows you to type or paste the text to be pushed through the Process Documents operator for text processing. Any amount of text can be added, but these two sentences should do the trick:
"The distinction between our strategy and theirs is that ours is a true military strategy. Theirs is a poor excuse of military strategy that can be summed up by hit and run tactics with a side of cowardice."
This sample text follows the example we talked about above, which will allow RapidMiner to find the connection between military and strategy.
Step 2: Process Documents
The next step is to add the Process Documents operator. Once it's in place, notice the sub-process icon in the bottom right, which denotes that there is another level to the operator. Double-clicking will bring you into that level. Once there, a Tokenize operator is needed, with the "non letters" option selected in the Parameters tab. Together, these two operators generate a word whenever a run of letter characters is bounded by non-letters. For example, in ' space.', the leading blank and the period are the two non-letters that delimit the word space. Next, a Transform Cases operator set to lowercase is also needed. This ensures that "The" and "the" are pulled out as the same word.
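As a rough Python equivalent of this Tokenize-plus-Transform-Cases pair (a sketch of the behavior, not the operators' actual code), splitting on any run of non-letters and lowercasing gives:

```python
import re

def tokenize(document):
    """Split on runs of non-letter characters, then lowercase each token."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", document) if t]

tokens = tokenize("The distinction between our strategy. And theirs!")
```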
Step 3: Generate n-Grams
The last operator needed is Generate n-Grams (Terms). Here, a max length of two is a sufficient setting for the length parameter; there is not a large amount of text, so there is no need for a length greater than two. The last task is to connect the wordlist and example set outputs to the result ports on the right. Once everything is hooked up, the process can be run by pressing the run button or F11.
Notice that strategy, military, and military_strategy were all pulled out as unique words. This is the desired result. There are also term frequencies associated with each attribute.
On the Process Documents operator, there is a parameter for pruning. If this is set to absolute pruning with a minimum of two, RapidMiner will only keep words that show up in the document two times or more. This cuts down the final result to show only frequent n-grams. For this case, a low prune generates the desired result, but pruning can become an extremely tedious task once there are thousands of words being processed.
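Absolute pruning is easy to picture with a small sketch (my own illustration of the rule, not RapidMiner's code): terms whose total count falls below the minimum simply disappear from the wordlist.

```python
from collections import Counter

def prune_absolute(term_counts, minimum=2):
    """Keep only terms whose total count meets the minimum."""
    return {t: c for t, c in term_counts.items() if c >= minimum}

counts = Counter(["strategy", "strategy", "military", "military",
                  "military_strategy", "military_strategy", "cowardice"])
kept = prune_absolute(counts)  # "cowardice" (count 1) is dropped
```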
RapidMiner provides multiple ways to do sentiment analysis. A very commonly used and powerful solution is to train a model on historical information (a training set) and then use it to build a predictive model. Historical information may be available if, in the past, content was manually coded into different sentiment values. If not, you will have to do a preparation step in which a good sample is manually classified as positive or negative sentiment. This is a one-time effort, and a good training set will lead to better models and better predictions.
Please use this example along with the provided sample process (attached as a zip file with this article).
An example of the training set we will use today is shown below. (It is also included in the attached zip file.)
The process to build a model from this data involves at least the following operators:
- Read Excel (To read the sample data)
- Nominal to Text (To specify which column is a text column, since RapidMiner's "Process Documents..." operators work only on text data)
- Process Documents from Data (This is the meta process for most text processing capabilities)
- Tokenize (This will be used to tokenize the content into words, n-grams, etc., as needed)
The actual process for processing the training text will look like this:
Inside the "Process Documents from Data" operator we will have one step for the basic process, i.e., Tokenize.
We will later work on improving this sub-process if needed.
The output of "Process Documents from Data" will be your tokenized example set as well as a wordlist.
Now we can build a cross-validation step using our tokenized example set. We will also need to add the "Set Role" operator to specify our label (i.e., target) variable.
The process should look something like this.
To know more about validation, please look at these links
Inside the validation operator we can use any of the learners. For text mining use cases, Naive Bayes is often good and fast. You can also try SVM or Neural Nets, but that increases the computational complexity of the solution.
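To see what the learner inside the validation operator is doing, the sketch below trains a tiny multinomial Naive Bayes classifier in plain Python on token lists. It is a stand-in for the operator, and the training examples are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Count token and class frequencies from (tokens, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> token counts
    label_counts = Counter()
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict_nb(model, tokens):
    """Pick the label maximizing log prior + smoothed log likelihood."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_score = None, -math.inf
    for lab in label_counts:
        score = math.log(label_counts[lab] / total)
        denom = sum(word_counts[lab].values()) + len(vocab)  # Laplace smoothing
        for t in tokens:
            score += math.log((word_counts[lab][t] + 1) / denom)
        if score > best_score:
            best, best_score = lab, score
    return best

# Made-up, already-tokenized training documents for illustration.
train = [(["love", "this", "film"], "positive"),
         (["great", "fun", "film"], "positive"),
         (["boring", "waste"], "negative"),
         (["terrible", "boring", "plot"], "negative")]
model = train_nb(train)
label = predict_nb(model, ["boring", "film"])
```

Because training and prediction are just frequency counting, Naive Bayes stays fast even with large wordlists, which is why it is a common first choice for text.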
The validation step provides the model as well as information about the model's performance: "mod" provides the model and "ave" provides the performance.
In our case, for the basic example using Naive Bayes, the accuracy confusion matrix looks like this:
When using SVM, our confusion matrix looks like this:
We will explore how to improve the text processing in a later article, but for now let's assume this is a good model.
To use this predictive model, we apply a similar process to the actual data set and then apply the model to the tokenized dataset.
One additional step we need is to pass the wordlist from the training "Process Documents from Data" operator to the scoring "Process Documents from Data" operator.
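Why does the scoring side need the training wordlist? Because the model's attributes are the training vocabulary; scoring documents must be counted against exactly those words. A minimal sketch of the idea (my own illustration, not the operator's internals):

```python
from collections import Counter

def build_wordlist(training_docs):
    """The training pass: fix the vocabulary (wordlist)."""
    return sorted({t for doc in training_docs for t in doc})

def vectorize(doc, wordlist):
    """The scoring pass: count only words already in the training
    wordlist, so scoring rows have the same attributes as training."""
    counts = Counter(doc)
    return [counts[w] for w in wordlist]

wordlist = build_wordlist([["great", "film"], ["boring", "film"]])
vec = vectorize(["film", "film", "unseen", "great"], wordlist)
# "unseen" is ignored because it was not in the training wordlist
```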
Your process will look something like this.
The output from Apply Model will have three special columns, as seen in the screenshot below:
Prediction(Sentiment) - The predicted class
You can then add additional text processing operators as needed for your use case to improve your model.
A more detailed "Process Documents from Data" sub-process with more preprocessing will look something like the one below.
Please ensure that you apply the same steps on the scoring side to get correct results. Using Building Blocks is helpful here.
We have updated the Web Mining extension. The new version can be found here:
This extension is one of the most frequently downloaded, and we have had lots of requests to update it. The development team replaced both “Crawl Web” and “Process Documents from Web” operators with new versions. This new extension supports binary content crawling, and it also supports crawling password protected pages via basic authentication. This will make it much more applicable to modern and more secure websites.
RapidMiner provides powerful text mining capabilities for working with unstructured data. Unstructured data comes in various formats, e.g., short comments like tweets, which can be analyzed in their entirety, or long documents and paragraphs, which sometimes need to be broken into smaller units. The following article provides techniques to split longer texts into smaller units of sentences, but the general concepts can be used for other use cases too.
Please find attached a zip file containing working examples that go along with the content here.
The extensions we are using in this example are
- RapidMiner Text Mining Extension (Download from the marketplace or here)
- RapidMiner Web Mining Extension (Download from the marketplace or here)
The data we are using in this example is an RSS feed of Google News for the S&P 500 index.
After dropping all columns except Content and Id, the data looks like this. As you will notice, the data is HTML rather than simple text, so we will need to handle that too.
Now let's look at the two possible ways to approach this split-into-sentences use case.
Method 1 (Using Tokenize into Sentences)
This method uses the RapidMiner Tokenize operator to split text into sentences. The actual process looks something like this.
As with most RapidMiner text processing, the core logic happens inside the "Process Documents from Data" operator.
Here is what the inside of "Process Documents" looks like:
Here is how the Tokenize settings look:
The output of this step will be new columns with each sentence in each of the documents.
At this point you have successfully split each HTML document into sentences.
Depending on your needs this may be enough, or you may additionally use RapidMiner's data prep capabilities to convert this to other formats as needed. The example provided uses the De-Pivot operator to arrange all the sentences into one column that can be used for further processing.
Method 2 (Using "Cut Document" operator)
The top process will look something like this.
The inside of "Process Documents from Data" looks like this:
The settings for the Cut Document operator look like this:
The regular region queries we are using are shown below. They split sentences not only on the full stop, but also on some common conjunctions.
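As a standalone illustration of the splitting idea (the pattern below is a hypothetical example in the same spirit, not the exact queries from the attached process):

```python
import re

# Split after sentence-final punctuation, and also on a couple of
# common conjunctions (hypothetical list for illustration).
SPLIT_PATTERN = r"(?<=[.!?])\s+|\s+(?:but|however)\s+"

def cut_document(text):
    parts = re.split(SPLIT_PATTERN, text, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]

parts = cut_document(
    "The index rose sharply. Analysts were cautious however traders kept buying.")
```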
The output of this process will look something like this.
An additional advantage of this method is that it allows more control over how the documents are split; also, using the ID, you can trace the origin of the individual fragments.
RapidMiner's text mining capabilities provide several methods for sentiment analysis. One popular method when dealing with English text uses the WordNet dictionary and the relevant operators from the RapidMiner WordNet extension. This article gives an overview of doing sentiment analysis using RapidMiner and the WordNet dictionary.
You will also need the "Text Processing" Extension from here
You will need to download the wordnet dictionary from here
Setup steps for wordnet dictionary
The WordNet dictionary file has the extension "gz". You will need a utility like 7-Zip to extract it. Once you have the "WordNet-3.0.tar" file, extract that further using the same 7-Zip tool. You should then have a folder "WordNet-3.0" with subfolders like dict, doc, include, etc.
Once you have done this you should be ready to build a text mining process with Rapidminer and using the Wordnet Dictionary.
In the screenshot below we are searching Twitter, then changing the data type of the column we want to use for text processing, and then passing the dataset (ExampleSet) to "Process Documents from Data". You can replace the Search Twitter step with any data source of your choice, like a database, Excel files, etc. If you would like to use files from a folder, you can also use the "Process Documents from Files" operator, or in the case of email, the "Process Documents from Mail Store" operator.
Then double-click on the "Process Documents from Data" operator to build your text processing steps. You will add your standard text processing steps, like Tokenize, Transform Cases, Filter Stopwords, Filter Tokens, etc., based on your specific needs. The two operators you need to get the sentiment score are "Open WordNet Dictionary" and "Extract Sentiment (English)", both coming from the WordNet extension.
Configure the "Open WordNet Dictionary" operator to select directory in the "resource type" parameter, and then configure the directory parameter to point to the ....\WordNet-3.0\dict folder.
Please explore the additional help provided with the "Extract Sentiment (Dictionary)" operator to understand the various parameters.
You can also use the WordNet operators for Synonyms, Hypernyms, and Hyponyms to improve your process.
This process adds a new column "sentiment" that provides a numeric value for the sentiment: negative sentiments are scored less than zero and positive sentiments are scored greater than zero.
One can use the sentiment score and the "Generate Attributes" operator to flag documents as Positive, Neutral, Negative, etc., based on the actual score value.
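The flagging logic in Generate Attributes amounts to a simple threshold on the score. A sketch of the idea (the neutral band is a hypothetical parameter for illustration, not one of the operator's settings):

```python
def flag_sentiment(score, neutral_band=0.0):
    """Map a numeric sentiment score onto a class label, treating
    scores within +/- neutral_band of zero as Neutral."""
    if score > neutral_band:
        return "Positive"
    if score < -neutral_band:
        return "Negative"
    return "Neutral"

labels = [flag_sentiment(s) for s in [0.75, -0.4, 0.0]]
```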
See the attached process for the complete example.
You can open the process in RapidMiner Studio using File(Menu) >> Import Process.
This article talks about a sample process to find word frequency in unstructured text mining.
The basic operators you need for building a process like this are
- Some data source (In the example we are using Twitter; click here to see details about how to use Twitter)
- Nominal to Text (This is to change the data type for the Process Documents operator to work on; please note that only "Text" data type columns are processed by the text mining extension)
- One of the "Process Documents..." operators, depending on what your data source is
- Tokenize (Splits documents into sequence of tokens)
Please see the "basic word frequency.rmp" file attached with this article to see a working example
Your process will look like this:
The inside of Process Documents from Data will look like this:
The output will look something like this. (Please note that your words may differ for the exact same process, since it is pulling live Twitter data.) The word frequency, or WordList, output is delivered via the "wor" port of the "Process Documents from Data" operator.
Total Occurrences - Tells you how many times the word appeared across all the examples.
Document Occurrences - Tells you the number of individual documents the word appeared in.
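The difference between the two counts is easy to verify in a few lines (an illustration with made-up documents):

```python
from collections import Counter

docs = [["market", "rally", "market"], ["market", "slump"]]

# Total Occurrences: every appearance counts.
total = Counter(t for doc in docs for t in doc)

# Document Occurrences: each document counts a word at most once.
document = Counter(t for doc in docs for t in set(doc))
```

Here "market" has a total occurrence of 3 but a document occurrence of 2.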
As you will notice in the output, there are several unwanted words: the same word handled as two different words because of a difference in case, common English words that you do not care about, or specific words that you are not interested in. All of these cases can be handled by enriching the steps inside "Process Documents from Data". Your improved "Process Documents from Data" sub-process may look something like the one below.
Here are the reasons for using these operators
- Filter Stopwords (English): This operator removes common English words like a, and, then...
- Transform Cases: Converts everything to one case, i.e., lower or upper.
- Filter Tokens (By Length): Removes words shorter than or longer than a configured number of characters.
- Filter Stopwords (Dictionary): This operator provides the ability to drop certain words. The list can be provided as a simple text file with each word to ignore on a new line. See the attached sample file.
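Chained together, the filters behave like the pipeline sketched below (the stopword lists and length limits are made-up stand-ins for whatever you configure in the operators):

```python
import re

STOPWORDS = {"a", "and", "then", "the"}   # Filter Stopwords (English)
CUSTOM_STOPWORDS = {"rt", "http"}         # Filter Stopwords (Dictionary)

def clean_tokens(text, min_len=3, max_len=25):
    tokens = re.split(r"[^A-Za-z]+", text)        # Tokenize (non letters)
    tokens = [t.lower() for t in tokens if t]     # Transform Cases
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [t for t in tokens
              if min_len <= len(t) <= max_len]    # Filter Tokens (By Length)
    tokens = [t for t in tokens if t not in CUSTOM_STOPWORDS]
    return tokens

tokens = clean_tokens("RT: The market and then a rally http...")
```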
The attached presentation, given at Wisdom 2016 by Elian Carsenat of NamSor, discusses extending text analytics to the automatic identification of geodemographics and gender to enhance segmentation.