Rule-based approach to text mining
Dear Community,
I work in the food and pharma industry and would like to learn how to extract information from scientific papers. I have no experience in text mining, but I have tried to understand the capabilities of RapidMiner. I have familiarised myself with the text mining operators, including the Rosette tools, and with some relevant processes kindly shared by members of this community.
I came across the attached article, which presents a nice method for extracting characteristics of epidemiological studies such as study design, population, exposure and outcome. Can RapidMiner do rule-based text mining similar to that described in the paper?
Thank you very much in advance.
Beata
Answers
1. Re: processing the abstracts. Should it be one abstract per PDF, or all abstracts in one PDF/Word file? I understand the abstracts don't have to be in Excel.
2. What do you mean by 'reduce dimensionality'? Does it mean I would have to apply more filters? If so, can you please suggest some?
3. So once I have an example set, I need to do classification and then either:
option A - decision tree
option B - clustering
I appreciate there are other options as well, but I will focus on what you suggested (I am a complete newbie, and need to learn more about text mining)
4. Could you share some relevant processes that I could build on, OR outline the order of operators to use, OR provide a simplified example process similar to what I want to achieve? I am not expecting a ready-to-use solution; I just don't know how to do step 3 above. I can only confidently (I think) do steps 1 & 2 above.
Many thanks
If, however, you exported the abstracts as lots of separate text files, then plain text would be easiest to deal with; Excel is also fine. I am not sure how successfully you will be able to process PDFs (the Text Processing extension can read them), as some PDFs are just images, so the text may be difficult to extract. If you have a directory full of text files, you can use the Loop Files operator to read them all in and then Append them into a single example set. If your aim is classification, then each directory may represent a different class.
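To make the "one directory per class" layout concrete, here is a minimal sketch in plain Python rather than a RapidMiner process. It only illustrates the idea behind Loop Files followed by Append; the function name `load_abstracts`, the `.txt` extension and the folder layout are assumptions made for the example.

```python
from pathlib import Path


def load_abstracts(root):
    """Read every .txt file one level below `root` into a list of rows.

    Assumed (hypothetical) layout: root/<class name>/<abstract>.txt,
    so the parent folder name is used as the class label.
    """
    rows = []
    for path in sorted(Path(root).glob("*/*.txt")):
        rows.append({
            "label": path.parent.name,  # folder name acts as the class
            "file": path.name,
            # errors="replace" keeps going if a file is not clean UTF-8
            "text": path.read_text(encoding="utf-8", errors="replace"),
        })
    return rows


# Example call (the folder name is made up):
# rows = load_abstracts("abstracts")
```

Each row then plays the role of one example, with the folder name as the label attribute and the file content as the text attribute.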
Once you Process Documents (e.g. "from Data", as by that stage all your abstracts will be in an example set), you will most likely end up with a TF-IDF representation, in which every (standard) term becomes a new attribute. You may have 10,000 such attributes, and the representation will be very sparse (mainly zeros), so it is a good idea to reduce the number of attributes to something manageable. One way is to use the percentual pruning option of Process Documents from Data (e.g. prune those terms which appear in less than 3% or more than 40% of documents) - it is a good first cull of attributes. If you are classifying and you have a defined nominal label, use Weight by Information Gain and then Select by Weights to reduce the number of attributes further. There are a great many weighting operators, including ones for numerical labels. If you do not have a label and are clustering your abstracts, then you can always use PCA or SVD to shrink the number of attributes. This is commonly referred to as dimensionality reduction.
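For readers who want to see what pruning and TF-IDF do to the data, here is a small sketch in plain Python. It is not how RapidMiner implements it: the tokenizer is deliberately crude, the formula is the textbook tf × log(N/df) without any vector normalisation, and the names `tokenize` and `tfidf_with_pruning` are made up for this example. The 3% and 40% thresholds are the ones mentioned above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Very crude tokenizer (an assumption): lowercase runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_with_pruning(docs, min_share=0.03, max_share=0.40):
    """Return (vocabulary, rows) with one TF-IDF row per document.

    Terms found in less than `min_share` or more than `max_share`
    of the documents are pruned before the weights are computed.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    vocab = sorted(
        t for t, df in doc_freq.items()
        if min_share <= df / n_docs <= max_share
    )
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty text
        rows.append([
            (counts[t] / total) * math.log(n_docs / doc_freq[t])
            for t in vocab
        ])
    return vocab, rows


docs = [
    "cohort study of diet",
    "case control study of diet",
    "cohort study of smoking",
    "trial of aspirin",
    "study of aspirin",
]
vocab, rows = tfidf_with_pruning(docs)
print(vocab)  # "study" and "of" are too frequent and get pruned
```

Even in this toy set, most cells in `rows` are zero, which is the sparsity described above; with thousands of abstracts the pruning thresholds are what keep the attribute count manageable.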
Check out the text classification example in RapidMiner Academy:
https://academy.rapidminer.com/learn/video/automatic-classification-of-documents
I'll see if I have an example to post here.