Anomaly detection experiment
I'm undertaking my final-year project on machine learning for cyber security and am a complete beginner to RapidMiner. I want to create a process that demonstrates how effective machine learning techniques are at detecting both signatures and anomalies in an IDS. For this I am using the KDD99 Cup dataset, for which I have labelled and unlabelled sets. The aim is to create a classifier that will train on this data and be able to spot anomalies. I have downloaded the anomaly detection extensions but am also not too sure how to use them.
Additionally, since the data is already labelled, I would like to know whether it would be better to have the results name the specific attack that happens (e.g. smurf, SQL attack, etc.) or to simply output 'malicious' or 'benign', and how to do this.
yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 362 RM Data Scientist
KDD99 is a widely used dataset for anomaly detection. I would suggest using binary labels as a good starting point (attack vs. normal), because there may not be sufficient cases in several categories of attack for multinominal classification. Watch out for the unbalanced classes.
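Outside RapidMiner, the same idea can be sketched in scikit-learn. This is a minimal illustration, not the real KDD99 data: the features and the ~5% attack rate are synthetic, and `class_weight="balanced"` is one common way to handle the imbalance mentioned above.

```python
# Sketch: binary labels (attack vs. normal) with class weighting on
# imbalanced, synthetic KDD99-like data (NOT the real dataset).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 4))
y = (rng.random(n) < 0.05).astype(int)  # ~5% attacks: heavy imbalance
X[y == 1] += 2.0                        # make attacks separable for the demo

# class_weight="balanced" reweights the rare attack class inversely to
# its frequency, so the learner doesn't just predict "normal" everywhere
clf = LogisticRegression(class_weight="balanced").fit(X, y)
print("training accuracy:", clf.score(X, y))
```

The same reweighting idea is available in most learners (tree ensembles, SVMs) via a similar parameter.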
I did a quick Google search, and some researchers have summarized the accuracy of different learners in a paper.
They mentioned that with the raw data, you may need to be careful about duplicated records.
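Dropping duplicates is a one-liner in pandas; a small sketch with made-up column names (the real KDD99 schema has many more fields):

```python
# Sketch: removing duplicate records before training, since repeated
# rows can bias both training and evaluation (columns are illustrative).
import pandas as pd

df = pd.DataFrame({
    "duration": [0, 0, 0, 12],
    "protocol": ["tcp", "tcp", "tcp", "udp"],
    "label":    ["normal", "normal", "normal", "attack"],
})
deduped = df.drop_duplicates()
print(len(df), "->", len(deduped))  # 4 -> 2
```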
You can build SVM, Decision Tree, Random Forest, Naive Bayes, GBT, etc. models in RapidMiner for binominal classification and evaluate the performance (AUC, accuracy, recall, F-measure, ...) of your own models. If you are interested in unsupervised learning algorithms, you may take a look at the outlier detection operators and the Anomaly Detection extension from the Marketplace, for instance LOF or HBOS. Some cutting-edge fraud detection algorithms (e.g. Isolation Forest) are also available by combining the power of any R/Python libraries.
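As one way to use those Python libraries, here is a hedged sketch of the workflow above: a few supervised classifiers compared by AUC, plus Isolation Forest as the unsupervised anomaly scorer. The data is synthetic and the model choices are just examples, not a recommendation.

```python
# Sketch: compare supervised binominal classifiers and an unsupervised
# anomaly detector on synthetic KDD99-like data (assumed shapes/labels).
import numpy as np
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 5))
y = (rng.random(600) < 0.1).astype(int)  # ~10% attacks
X[y == 1] += 1.5
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

# supervised learners: trained with labels, evaluated by AUC
for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(random_state=0),
              GaussianNB()):
    auc = roc_auc_score(yte, model.fit(Xtr, ytr).predict_proba(Xte)[:, 1])
    print(type(model).__name__, round(auc, 3))

# unsupervised: Isolation Forest never sees the labels; lower
# score_samples means "more anomalous", hence the minus sign
iso = IsolationForest(random_state=0).fit(Xtr)
iso_auc = roc_auc_score(yte, -iso.score_samples(Xte))
print("IsolationForest", round(iso_auc, 3))
```

The labels are only used to *score* the unsupervised model, which mirrors how you would validate an anomaly detector against the labelled KDD99 set.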
I'd also suggest building models for the different attacks vs. everything else.
As the properties of attacks are probably very different, some algorithms won't find them if they need to match too many properties. E.g. if DDoS attacks come with large bandwidth but user enumeration attacks don't, many learners won't find this attribute helpful for detecting attacks in general, but they'll identify it for DDoS. (Just some examples off the top of my head.)
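The one-model-per-attack idea can be sketched like this. Everything here is synthetic and the attack names and features (a "bandwidth-like" feature 0 for smurf, a different feature for enumeration) are illustrative assumptions, chosen only to show that each per-attack model keys on its own discriminative feature.

```python
# Sketch: one binary model per attack type vs. everything else
# (one-vs-rest), on synthetic data with made-up attack signatures.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
labels = np.array(["normal"] * 300 + ["smurf"] * 30 + ["enum"] * 30)
X = rng.normal(size=(len(labels), 3))
X[labels == "smurf", 0] += 3.0  # smurf stands out on feature 0
X[labels == "enum", 1] += 3.0   # enumeration stands out on feature 1

models = {}
for attack in ("smurf", "enum"):
    y = (labels == attack).astype(int)  # this attack vs. everything else
    models[attack] = LogisticRegression(class_weight="balanced").fit(X, y)

# each per-attack model puts its weight on the feature relevant to it
for attack, m in models.items():
    print(attack, np.round(m.coef_, 1))
```

A single "any attack vs. normal" model would have to split its attention across both signatures, which is exactly the dilution effect described above.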
Anomaly detection process: based on what I already know, for an anomaly system to work you need a data source (the KDD99 dataset), a preprocessing stage (Process Documents from Data with embedded tokenization and Transform Cases to create TF-IDF word vectors), then a normal-profile learning phase (rule building etc., though I'm not sure which operators would work on this dataset), and finally something to detect the anomalies.
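The preprocessing stage described above (tokenize, lowercase, build TF-IDF vectors) can be sketched outside RapidMiner with scikit-learn as a stand-in for the Process Documents from Data operator. The sample "records" below are made up; lowercasing plays the role of Transform Cases.

```python
# Sketch: tokenization + TF-IDF word vectors, mirroring the
# Process Documents from Data stage (sample records are invented).
from sklearn.feature_extraction.text import TfidfVectorizer

records = [
    "tcp http SF normal",
    "icmp ecr_i SF smurf",
    "tcp private S0 neptune",
]
vec = TfidfVectorizer(lowercase=True)  # lowercasing ~ Transform Cases
tfidf = vec.fit_transform(records)     # tokenization + TF-IDF weighting
print(tfidf.shape, sorted(vec.vocabulary_))
```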
I've already installed the anomaly detection extensions, which as I understand it house a variety of algorithms already, but I'm not really sure how to implement their operators; I just keep getting errors and it's really frustrating.
For metrics I want to see the false-positive rate and the number of attacks actually detected. The data is labelled normal or attack, but I also have unlabelled normal data as well; every time I use it, however, the process asks me to set a special attribute.
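For reference, the two metrics mentioned above fall straight out of a confusion matrix; a small sketch with made-up predictions (1 = attack):

```python
# Sketch: false-positive rate and detection rate (recall on attacks)
# from a confusion matrix; y values here are invented for illustration.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]  # ground truth, 1 = attack
y_pred = [0, 1, 0, 0, 1, 1, 0, 0, 1, 0]  # model output

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)             # benign traffic wrongly flagged
detection_rate = tp / (tp + fn)  # attacks actually caught (recall)
print(f"FPR={fpr:.2f}, detection rate={detection_rate:.2f}")
```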
Any help would be greatly appreciated.
Thanks for your feedback @fwood201! The Anomaly Detection extension has a bunch of "unsupervised" learning algorithms that generate anomaly scores for the input numeric data.
If you run a "supervised" learning method for classification, you have to specify which attribute is your prediction target (in RapidMiner, we call the ground truth the "label").
In the "Set Role" operator, you can define the special attributes. Attributes contain information about your examples. Some types of information are special, providing information not suitable to be used as learning input. This could, for example, be the real label, found by humans for this particular example. You don't want to use the real label as an input variable for learning; otherwise the result will be pretty simple: examples of label A get label A. So special attributes are not used for learning.
The role type then defines their purpose:
- The Id attribute is used for identifying examples
- The label attribute is used to store the real label
- The weight attribute is used to give an example a weight if it is very important. The learner will then pay this example more attention to predict it correctly.
- The cluster attribute stores which cluster this example has been assigned to.
- The prediction attribute stores a prediction made by a model applier or something else.
The other special types are very... special and only used in a few applications. You can ignore them for now.
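To make the role idea concrete outside RapidMiner: in a pandas/scikit-learn workflow, "special attributes" are simply columns you exclude from the learning inputs. Column names below are illustrative, not the real KDD99 schema.

```python
# Sketch of attribute roles: id and label are "special" (not features),
# prediction is added after applying the model (columns are invented).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "id":        [1, 2, 3, 4],                               # role: id
    "duration":  [0, 2, 0, 9],                               # regular
    "src_bytes": [181, 239, 235, 0],                         # regular
    "label":     ["normal", "normal", "normal", "attack"],   # role: label
})

X = df.drop(columns=["id", "label"])  # only regular attributes feed learning
y = df["label"]                       # the label is the target, not an input
model = DecisionTreeClassifier().fit(X, y)
df["prediction"] = model.predict(X)   # role: prediction
print(df[["id", "label", "prediction"]])
```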
Can you share your process? Maybe we can inspect the errors.
Here it is. I also tried using the 'Outlier Detection' template from the Samples section and fed the unlabelled KDD data into it. The process has now been running for over 16 hours... is this normal, or is it not working?
Something doesn't look right with the Process Documents from Data operator. Do a few things: toggle on pruning in the Process Documents operator, and double-check that the string values you are feeding into it are in the RapidMiner text data type. You might need a Nominal to Text operator to convert them.