
Rule-based approach to text mining

BeataP Member Posts: 2 Learner I
Dear Community, 
I work in the food and pharma industry and would like to learn how to extract information from scientific papers. I have no experience in text mining, but I have tried to understand the capabilities of RapidMiner. I familiarised myself with the text mining operators, including the Rosette tools, and with some relevant processes kindly shared by members of this community.

I came across the attached article, which presents a nice method for extracting characteristics of epidemiological studies, such as study design, population, exposure and outcome. Can RapidMiner do rule-based text mining similar to the approach described in the paper?

Thank you very much in advance. 
Beata


Answers

  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    @BeataP you can develop a similar approach in RapidMiner. First, you'd have to process the abstracts, which you may have collected in a folder. There are a number of ways to do so; one is to scan the folder for text files and process their text, i.e. change all text to lower case, tokenise it, stem words to their roots, remove stop words, etc. As a result, each word becomes an attribute of your new representation. You may end up with thousands of such attributes, so you will often need to reduce the dimensionality of the representation to have fewer words to deal with. The new examples can then be used to build a predictive model, e.g. your abstracts may be classified into several categories, which you may need to label yourself, e.g. by placing each abstract into a different sub-directory. Many models could be used for this, e.g. decision trees, from which you can generate rules if you wish. Another possibility is to create a clustering model to group your abstracts. So yes, there are many tools in RapidMiner to help you analyse medical abstracts.
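
    For orientation, here is what that pipeline looks like outside RapidMiner, as a minimal Python (scikit-learn + NLTK) sketch - the sample abstracts and study-design labels below are made up purely for illustration:

        import re
        from nltk.stem import PorterStemmer
        from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
        from sklearn.tree import DecisionTreeClassifier, export_text

        # made-up sample data: abstract texts and their study-design labels
        abstracts = [
            "A prospective cohort study of dietary fat exposure and cancer outcomes.",
            "A randomised controlled trial of vitamin D supplementation in adults.",
            "A retrospective cohort study of occupational solvent exposure.",
            "A double-blind randomised trial of a low-sodium diet.",
        ]
        labels = ["cohort", "trial", "cohort", "trial"]

        stemmer = PorterStemmer()

        def stem_tokens(text):
            # lower-case, tokenise, drop stop words, stem each token to its root
            return [stemmer.stem(t) for t in re.findall(r"[a-z]+", text.lower())
                    if t not in ENGLISH_STOP_WORDS]

        # every remaining stemmed term becomes a numeric (TF-IDF) attribute
        vectorizer = TfidfVectorizer(tokenizer=stem_tokens, lowercase=False,
                                     token_pattern=None)
        X = vectorizer.fit_transform(abstracts)

        # a decision tree classifier, from which readable rules can be printed
        tree = DecisionTreeClassifier(random_state=0).fit(X, labels)
        print(export_text(tree, feature_names=list(vectorizer.get_feature_names_out())))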
  • BeataP Member Posts: 2 Learner I
    Thank you @jacobcybulski for your quick reply! Much appreciated. It is great to hear that RM can do it. I have a few questions (sorry, they are all basic):

    1. Re: processing the abstracts - one abstract per PDF, or all abstracts in one PDF/Word file? I understand the abstracts don't have to be in Excel.

    2. What do you mean by 'reduce dimensionality'? Does it mean I would have to apply more filters? Can you please suggest some?

    3. So once I have an example set, I need to do classification and then either: 
    option A - decision tree
    option B - clustering 
    I appreciate there are other options as well, but I will focus on what you suggested (I am a complete newbie and need to learn more about text mining).

    4. Could you share some relevant processes that I could build on, OR provide an order of operators to use, OR provide a simplified example process similar to what I want to achieve? I am not expecting a ready-to-use solution; I just don't know how to do step 3 above. I can only confidently (I think) do steps 1 & 2 above.

    Many thanks
  • jacobcybulski Member, University Professor Posts: 391 Unicorn
    If your abstract database can export abstracts into a single comma- or tab-separated file, with the names of authors, titles, publisher, dates and an abstract (without new lines), then this would be ideal: all you have to do is read the file in with the Read CSV operator and then process the abstract text in one of the fields. For example, in the past I used the free reference management system Zotero to export all my references and abstracts via a custom-made bibliographic format, and I ended up with a nicely formatted CSV file with the abstracts embedded inside. RapidMiner can also read the BibTeX format!
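
    As a sketch of that route in Python (the file name and column names here are made up - adjust them to whatever your export actually contains):

        import pandas as pd

        # hypothetical Zotero-style export: one row per paper,
        # abstract text in an "abstract" column, optional label column
        refs = pd.read_csv("references.csv")    # use sep="\t" for a tab-separated file
        texts = refs["abstract"].fillna("").tolist()
        labels = refs["study_design"]           # only if your export has a label column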

    If, however, you exported the abstracts as lots of separate text files, then plain text would be the easiest to deal with; Excel is also fine. I am not sure how successfully you will be able to process PDFs (the Text Processing extension can do so), as some PDFs are images, so the text may be difficult to extract. If you have a directory full of text files, you can use the Loop Files operator to read them all in, and you can Append them into a single example set. If your aim is classification, then each directory may represent a different class, as in the sketch below.
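
    A minimal sketch of that directory-per-class layout (the folder and class names are made up; scikit-learn's load_files plays the role of Loop Files + Append here):

        from sklearn.datasets import load_files

        # hypothetical layout: abstracts/cohort/*.txt, abstracts/trial/*.txt, ...
        # each sub-directory name becomes the class label
        corpus = load_files("abstracts", encoding="utf-8", decode_error="ignore")
        texts = corpus.data
        labels = [corpus.target_names[i] for i in corpus.target]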

    Once you Process Documents (e.g. "from Data", as by that stage all your abstracts will be in an example set), you will most likely end up with a TF-IDF representation, in which every (standard) term becomes a new attribute; you may have 10,000 such attributes, and the representation will be very sparse (mainly zeros). So it is a good idea to reduce the number of attributes to something manageable. One way is to use percentual pruning in Process Documents from Data (e.g. prune terms that appear in fewer than 3% or more than 40% of documents) - a good first cull of attributes. If you are classifying and have a defined nominal label, use Weight by Information Gain and then Select by Weights to reduce the number of attributes further. There are a great many weighting operators, including some for numerical labels. If you do not have a label and are clustering your abstracts, you can always use PCA or SVD to shrink the numbers. This is commonly referred to as dimensionality reduction.
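
    Continuing from the texts and labels loaded in either sketch above, the same three reduction steps look roughly like this in Python (the pruning thresholds, k and the number of components are just the example figures from the text, not recommendations):

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.feature_selection import SelectKBest, mutual_info_classif
        from sklearn.decomposition import TruncatedSVD

        # first cull: prune terms occurring in <3% or >40% of documents
        vectorizer = TfidfVectorizer(min_df=0.03, max_df=0.40)
        X = vectorizer.fit_transform(texts)

        # with a nominal label: keep only the most informative terms
        # (mutual information is the information-gain analogue here;
        # k must not exceed the number of terms left after pruning)
        X_top = SelectKBest(mutual_info_classif, k=200).fit_transform(X, labels)

        # without a label: project the sparse matrix onto a few
        # latent dimensions with truncated SVD (PCA-style reduction)
        X_svd = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)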

    Check out the example of text classification in the RapidMiner Academy:
    https://academy.rapidminer.com/learn/video/automatic-classification-of-documents

    I'll see if I have an example to post here.