Text filtering

caknobla Member Posts: 1 Contributor I
edited November 2018 in Help

Dear All,

I am new to RapidMiner and have a problem that I do not really know how to approach:

I have the following data:
    - One file (pdf, txt or html) with a collection of 1000 different news articles.
    - A list with about 30 keywords.
I want to extract all articles that match at least one of the keywords.

My questions are:
1. What do I have to do so that RapidMiner can tell where an article starts and ends? When I import my news articles with the „Read Data“ operator, it seems that the whole file is treated as „one article“.

2. What kind of process do I need to set up to extract only those articles that contain at least one of the keywords? Specifically, which operator would work best? I tried „Filter Documents (by content)“, but I don’t understand where I should enter my keywords.


Thank you so much!

Best,
Carl

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Hi Carl,


    Did you get a chance to read through this part of the Community? http://community.rapidminer.com/t5/Text-Analytics-in-RapidMiner/tkb-p/Text

    If all your documents are in one file and you want to separate them, you will need to use the Cut Document operator to slice them into separate entities.
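
As an illustration of what the cutting step does, here is a minimal Python sketch outside RapidMiner. It assumes the articles in the file are separated by a line consisting only of equals signs; that delimiter and the name `split_articles` are hypothetical, since the thread does not say how the articles are delimited. Whatever marker actually separates the articles would go into the pattern instead.

```python
import re


def split_articles(raw_text, delimiter_pattern=r"^=+[ \t]*$"):
    """Split one large text into articles at lines matching delimiter_pattern.

    The default pattern (a line made up only of '=' characters) is an
    assumption for this sketch, not something RapidMiner prescribes.
    """
    parts = re.split(delimiter_pattern, raw_text, flags=re.MULTILINE)
    # Drop surrounding whitespace and discard empty chunks.
    return [part.strip() for part in parts if part.strip()]


sample = "First article text.\n=====\nSecond article text.\n=====\nThird."
articles = split_articles(sample)
print(len(articles), "articles found")
```

Each element of `articles` can then be treated as its own document in the later filtering step.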

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    After you have dealt with the separation of the documents as @Thomas_Ott describes, you will probably next want to process the documents and create a word vector. In your case, binary term occurrences may be helpful, since they create a simple 0/1 indicator for each token (probably individual words in your case, although you can also use n-grams for phrases of more than one word). You can then cross-reference those indicators against your keyword list to identify which documents contain any of the key terms. You may also need to do some token replacement or stemming if you have synonymous terms or variations, but it should be fairly straightforward.
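
The binary term occurrence idea can be sketched in plain Python as follows. This is not RapidMiner's implementation, only an illustration of the 0/1 cross-referencing; the function names, the simple lowercase tokenizer, and the sample data are all assumptions made for the example. It handles single-word keywords only and does no stemming.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def binary_occurrences(documents, keywords):
    """Return one {keyword: 0 or 1} row per document."""
    keys = [keyword.lower() for keyword in keywords]
    rows = []
    for doc in documents:
        tokens = tokenize(doc)
        rows.append({key: int(key in tokens) for key in keys})
    return rows


def filter_by_keywords(documents, keywords):
    """Keep only the documents that contain at least one keyword."""
    rows = binary_occurrences(documents, keywords)
    return [doc for doc, row in zip(documents, rows) if any(row.values())]


docs = [
    "The central bank raised interest rates.",
    "Local team wins the cup.",
    "Inflation worries investors.",
]
keywords = ["inflation", "bank"]
matches = filter_by_keywords(docs, keywords)
```

Here `matches` holds the first and third sample documents, because each contains one of the two keywords as a whole word.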

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • jana_janarthani Member Posts: 1 Contributor I

    Hi all,

    I'm new to RapidMiner and need your help. I need to extract keywords from one news article and then compare them with other news articles to pick the best one.

    Can you help me?

    Thank you.

    prasanth

     

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @jana_janarthani - welcome to the community. That question is not quite defined enough for us to answer here. May I suggest you begin by looking at the library of support materials that we have?

     

    https://community.rapidminer.com/t5/Getting-Started-Forum/Essential-RapidMiner-Resources-for-New-Users/m-p/41212#M825

     

    Scott

     

     
