RapidMiner

Text filtering

Contributor

Dear All,

I am new to RapidMiner and do not really know how to get started with the following problem:

I have the following data:
    - One file (pdf, txt or html) with a collection of 1000 different news articles.
    - A list with about 30 keywords.
I want to extract all articles that match at least one of the keywords.

My questions are:
1. What do I have to do so that RapidMiner can tell where an article starts and ends? When I import my news articles with the operator „Read Data“, the whole file seems to be treated as „one article“.

2. What kind of process do I need to set up to extract only those articles that contain at least one of the keywords? Specifically, which operator would work best? I tried „Filter Documents (by content)“ but I don’t understand where I should enter my keywords.


Thank you so much!

Best,
Carl

3 REPLIES
Community Manager

Re: Text filtering

Hi Carl,


Did you get a chance to read through this part of the Community? http://community.rapidminer.com/t5/Text-Analytics-in-RapidMiner/tkb-p/Text

If all your documents are in one file and you want to separate them, you will need to use the Cut Document operator to slice them into separate entities.
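
To illustrate what "slicing" means here, this is a rough sketch of the same idea in plain Python rather than in RapidMiner. It assumes each article begins with a marker line such as "Document 12 of 1000"; that marker, the ARTICLE_START pattern and the split_articles helper are made up for illustration, so adapt the pattern to whatever actually separates the articles in your file.

```python
import re

# Hypothetical delimiter: assumes every article starts with a line such as
# "Document 12 of 1000". Change the pattern to match your own file.
ARTICLE_START = re.compile(r"^Document \d+ of \d+[ \t]*$", re.MULTILINE)


def split_articles(text):
    """Split one large text into a list of article bodies."""
    parts = ARTICLE_START.split(text)
    # Drop empty chunks, such as the empty string before the first marker.
    return [part.strip() for part in parts if part.strip()]


sample = (
    "Document 1 of 3\nOil prices rose sharply.\n"
    "Document 2 of 3\nThe central bank cut interest rates.\n"
    "Document 3 of 3\nLocal team wins the cup.\n"
)
articles = split_articles(sample)
```

Whatever tool you use, the key step is finding a pattern that reliably marks the boundary between two articles.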

Regards,
Thomas - Community Manager
LinkedIn: Thomas Ott
Elite III

Re: Text filtering

After you have separated the documents as @Thomas_Ott describes, the next step is probably to process the documents and create a word vector. In your case, binary term occurrences may be helpful, since that creates a simple 0/1 indicator for each token (probably individual words in your case, although you can also generate n-grams for phrases of more than one word). You can then cross-reference those indicators against your keyword list to identify which documents contain any of the key terms. You may also need some token replacement or stemming if you have synonymous terms or variations, but it should be fairly straightforward.
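
As a rough illustration of the binary term occurrence idea, here is a minimal sketch in plain Python. It is not how RapidMiner implements it; the function names (tokenize, binary_occurrences, filter_articles) and the sample data are invented for the example, and it assumes single-word keywords with no stemming.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def binary_occurrences(articles, keywords):
    """Return one row per article: 1 if the keyword occurs as a token, else 0."""
    lowered = [keyword.lower() for keyword in keywords]
    rows = []
    for article in articles:
        tokens = set(tokenize(article))
        rows.append([1 if keyword in tokens else 0 for keyword in lowered])
    return rows


def filter_articles(articles, keywords):
    """Keep only the articles that contain at least one keyword."""
    rows = binary_occurrences(articles, keywords)
    return [article for article, row in zip(articles, rows) if any(row)]


# Made-up sample data for illustration only.
articles = [
    "Oil prices rose sharply.",
    "The central bank cut interest rates.",
    "Local team wins the cup.",
]
keywords = ["oil", "bank", "inflation"]
matches = filter_articles(articles, keywords)
```

Because the match is on exact tokens, "price" would not match "prices" here, which is where stemming or token replacement comes in. Multi-word keywords would need n-grams or a phrase search instead.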

 

Brian T., Lindon Ventures - www.lindonventures.com
Analytics Consulting by Certified RapidMiner Analysts