RapidMiner

Splitting text into sentences

by RMStaff ‎07-03-2016 10:02 AM - edited ‎07-13-2016 05:14 AM

RapidMiner provides powerful text mining capabilities for working with unstructured data. Unstructured data comes in various formats: short comments like tweets, which can be analyzed in their entirety, or long documents and paragraphs, which sometimes need to be broken into smaller units. The following article provides techniques to split longer texts into smaller units of sentences, but the general concepts can be used for other use cases too.

 

Please find attached a zip file containing a working example that goes along with the content here.

 

The extensions we are using in this example are 

The data we are using in this example is an RSS feed of Google News for the S&P 500 index.

After dropping all columns except Content and Id, the data looks like this. As you will notice, the data is HTML rather than plain text, so we will need to handle that too.

original data.png

Now let's look at the two possible ways to approach this split-into-sentences use case.
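Before splitting, the HTML content has to be reduced to plain text. RapidMiner has operators for this inside its text processing chain; conceptually, the cleanup amounts to something like the following Python sketch (an illustrative analog, not the RapidMiner implementation):

```python
import re
from html import unescape

def html_to_text(html: str) -> str:
    """Drop HTML tags and decode entities, leaving plain text."""
    no_tags = re.sub(r"<[^>]+>", " ", html)    # remove tags like <p>, <b>
    text = unescape(no_tags)                   # decode entities, e.g. &amp; -> &
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

print(html_to_text("<p>S&amp;P 500 <b>rises</b> today.</p>"))
# S&P 500 rises today.
```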

Method 1 (Using Tokenize into Sentences)

This method uses the RapidMiner Tokenize operator to split text into sentences. The overall process looks something like this.

 

cut documents basic process.png

 

As with most RapidMiner text processing, the core logic happens inside the "Process Documents from Data" operator.

Here is what the inside of "Process Documents from Data" looks like:

tokenize sentences.png

Here is how the Tokenize settings look:

tokenize settings.png

The output of this step will be new columns, one per sentence, for each of the documents.

tokenize output.png
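Conceptually, tokenizing into sentences breaks a document at sentence-ending punctuation. The following Python sketch shows the idea with a naive regex-based splitter (an illustrative analog; RapidMiner's Tokenize operator uses its own linguistic sentence model):

```python
import re

def tokenize_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

doc = "Markets rose today. Analysts were surprised! What happens next?"
print(tokenize_sentences(doc))
# ['Markets rose today.', 'Analysts were surprised!', 'What happens next?']
```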

At this point you have successfully split each HTML document into sentences.

Depending on your needs this may be sufficient, or you can use RapidMiner's data prep capabilities to convert the result into other formats. The example provided uses the De-Pivot operator to arrange all the sentences into one column that can be used for further processing.
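De-pivoting takes the wide output (one row per document, one column per sentence) and stacks the sentence columns into a single column of long-format rows. A minimal Python sketch of that reshaping (column names like `sentence_1` are hypothetical, for illustration only):

```python
# Wide table: one row per document, one column per sentence.
rows = [
    {"id": 1, "sentence_1": "Stocks fell.", "sentence_2": "Bonds rallied."},
    {"id": 2, "sentence_1": "Oil was flat.", "sentence_2": None},
]

# De-pivot: stack all sentence_* columns into a single "sentence" column,
# dropping empty cells, while keeping the document id on every row.
long_rows = [
    {"id": row["id"], "sentence": value}
    for row in rows
    for key, value in row.items()
    if key.startswith("sentence_") and value is not None
]

print(long_rows)
# [{'id': 1, 'sentence': 'Stocks fell.'}, {'id': 1, 'sentence': 'Bonds rallied.'},
#  {'id': 2, 'sentence': 'Oil was flat.'}]
```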

 

Method 2 (Using "Cut Document" operator)

The top-level process will look something like this.

 

simpleprocess.png

The inside of "Process Documents from Data" looks like this:

cut documents.png

The settings for the Cut Document operator look like this:

cut document settings.png

The regular region queries we are using are shown below. They split sentences not only on the full stop, but also on some common conjunctions.

 

settings for region.png

The output of this process will look something like this.

 

split documents 07132016.png
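To make the idea of splitting on both full stops and conjunctions concrete, here is a Python sketch with an illustrative regex. The exact queries used in the example process are those shown in the screenshot above; the pattern and conjunction list below ("and", "but") are assumptions for demonstration only:

```python
import re

# Break on sentence-ending punctuation OR on common conjunctions,
# keeping each resulting fragment.
SPLIT_RE = re.compile(r"(?<=[.!?])\s+|\s+(?:and|but)\s+", re.IGNORECASE)

text = "The index rose but tech shares lagged. Traders waited and volumes thinned."
fragments = [f.strip() for f in SPLIT_RE.split(text) if f.strip()]
print(fragments)
# ['The index rose', 'tech shares lagged.', 'Traders waited', 'volumes thinned.']
```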

 

An additional advantage of this method is that it allows more control over how the documents are split, and by using the ID you can trace each individual fragment back to its origin.
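That traceability boils down to carrying the source document's ID alongside a fragment number on every output row. A small Python sketch of the bookkeeping (function and field names are hypothetical):

```python
import re

def split_with_ids(docs: dict[str, str]) -> list[tuple[str, int, str]]:
    """Split each document into fragments, emitting (doc_id, fragment_no, text)
    so every fragment can be traced back to its source document."""
    out = []
    for doc_id, text in docs.items():
        parts = [p.strip() for p in re.split(r"(?<=[.!?])\s+", text) if p.strip()]
        out.extend((doc_id, i, part) for i, part in enumerate(parts, start=1))
    return out

docs = {"news_1": "Stocks fell. Bonds rallied.", "news_2": "Oil was flat."}
print(split_with_ids(docs))
# [('news_1', 1, 'Stocks fell.'), ('news_1', 2, 'Bonds rallied.'),
#  ('news_2', 1, 'Oil was flat.')]
```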