Good afternoon everyone,
I have the RapidMiner community edition installed. I would like to use it to scan a set of plain text documents and find out which of them contain source code (in the form of reserved words). I imagine this is something I could do with a Support Vector Machine, but I am not sure how I would implement it in RapidMiner. Could anyone point me in the right direction? Thank you.
Sure, here's a conceptual approach to what you would need to do.
As I said, this is a fairly conceptual workflow but it should cover all the basics you need to tackle your problem.
Thank you very much for your post. I just wanted to clarify a few things about your answer. The documents I have are a mixture of source code and normal plain text. What I want to be able to do is automatically categorise those files which contain at least some source code. Presumably this will look for things like reserved words and so forth. Do I need to add any document tags beyond just adding a binomial label of contains source code/doesn't contain source code? Thank you.
As long as you give the learner properly labeled cases for training, the model should be able to figure out which words (tokens) are characteristic of source code and which ones are not. So you will need to do the text preprocessing I describe, but you don't need to tell the model explicitly which tokens are associated with source code. If you did that, you would be using a deterministic approach (a series of rules) rather than a machine learning algorithm.
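To make the idea concrete outside of RapidMiner, here is a rough Python sketch of learning from labeled documents alone. It is only an illustration under several assumptions: the training texts are made up, the tokenizer is a simple regular expression of my own choosing, and I have swapped in a small naive Bayes classifier for the SVM so that the example needs nothing beyond the standard library. The names tokenize, train and predict are hypothetical helpers, not part of any tool.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then keep runs of letters/underscores plus a few punctuation
    # characters that tend to show up in code. This regex is an assumption.
    return re.findall(r"[a-z_]+|[{}();=]", text.lower())


def train(docs):
    # docs: list of (text, label) pairs; label is True when the document
    # contains at least some source code. No token is marked as "code" by hand.
    counts = {True: Counter(), False: Counter()}
    n_docs = {True: 0, False: 0}
    for text, label in docs:
        counts[label].update(tokenize(text))
        n_docs[label] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, n_docs, vocab


def predict(model, text):
    # Multinomial naive Bayes with add-one (Laplace) smoothing.
    # Tokens never seen in training are ignored.
    counts, n_docs, vocab = model
    total_docs = n_docs[True] + n_docs[False]
    scores = {}
    for label in (True, False):
        total_tokens = sum(counts[label].values())
        score = math.log(n_docs[label] / total_docs)
        for tok in tokenize(text):
            if tok in vocab:
                score += math.log(
                    (counts[label][tok] + 1) / (total_tokens + len(vocab))
                )
        scores[label] = score
    return scores[True] > scores[False]


# Made-up training set: only the binomial label is supplied for each document.
training = [
    ("public static void main(String[] args) { int x = 0; return; }", True),
    ("for (int i = 0; i < n; i++) { if (x) return i; }", True),
    ("Notes from the meeting: please see below. while (running) { poll(); }", True),
    ("The meeting is scheduled for Monday afternoon in the main office.", False),
    ("Please return the signed forms to the office by Friday.", False),
    ("Thank you for your email, I will reply after the holiday.", False),
]

model = train(training)
print(predict(model, "int total = 0; while (total < 10) { total++; }"))
print(predict(model, "Thank you for the meeting on Monday."))
```

Note that the third training document mixes ordinary prose with a code fragment and is still labeled as containing source code, which matches the "at least some source code" case. With a real collection you would want far more labeled examples than this, plus a proper train/test split to check how well the model generalizes.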