Best practices?
Good day!
I've finally found RapidMiner -- the software I was looking for. The declarative approach to editing configurations is great! I'm mostly a programmer, but I want to brush up on my math.
I'd like to ask whether there are some best practices for the case I have:
I have 116 examples, each with one numeric label (or nominal class) and 67 attributes (float values).
The label is a country rating derived from experts, and the attributes are country-level indicators such as 'Extent of business Internet use', 'Internet users', etc.
The goals I want to achieve:
1. Train a model on that data and get some error estimates.
2. Probe the model with hand-generated values to get answers to questions like: what should the rating be if we increase the attribute 'Extent of business Internet use' and decrease some other attribute?
Is this really possible?
I have performed cross-validation with kNN and SVM learners and get an accuracy of around 66% (in the case of nominal labels).
How can I try to achieve better accuracy? Maybe by feature selection or some other data preprocessing?
Are there any best practices for such a case?
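To make the setup concrete, here is roughly what I am doing, sketched in Python/scikit-learn rather than as a RapidMiner process (the file name, column names and index positions are placeholders, not my real data):

    # Sketch of the setup above: 116 examples, 67 float attributes, one nominal label.
    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    data = pd.read_csv("country_indicators.csv")   # placeholder file: 67 attributes + "rating"
    X = data.drop(columns=["rating"]).values       # 67 float attributes
    y = data["rating"].values                      # nominal label (expert rating class)

    # Goal 1: train models and estimate the error via cross-validation.
    for name, learner in [("kNN", KNeighborsClassifier(n_neighbors=5)),
                          ("SVM", SVC(kernel="rbf"))]:
        model = make_pipeline(StandardScaler(), learner)
        scores = cross_val_score(model, X, y, cv=10)
        print(f"{name}: mean accuracy = {scores.mean():.2f}")

    # Goal 2: probe the fitted model with a hand-generated example, e.g. one country
    # with 'Extent of business Internet use' increased and another attribute decreased.
    model.fit(X, y)
    probe = X[0].copy()
    probe[0] += 1.0    # pretend column 0 is 'Extent of business Internet use'
    probe[1] -= 1.0    # and column 1 is some other attribute
    print(model.predict(probe.reshape(1, -1)))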
Answers
Uuh, this data situation is quite hard. Here are some thoughts and ideas:
1. For validation, I recommend not simply running CV once, but at least 10 times with different random seeds, to get a more reliable result (see the sketch after this list).
2. As far as I understand your goal, you want to learn more about the influence of the different attributes on the label rather than classify new data (is this correct?). For this task I suggest:
- Use what we call a symbolic classifier, like a Decision Tree or a Rule Learner. The resulting model is much easier for humans to understand.
- Calculate attribute weights (I recommend InfoGainRatio) to learn how much "decision power" each attribute has regarding the label. But be careful not to remove too many features: since you have such a small amount of data, you risk overfitting, i.e. the model will generalize poorly to new, unseen data from the same domain.
- Try some clustering algorithms to see if there are any structures in the data you have not noticed before. Splitting the data by cluster and analysing each part may lead to new insights.
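Not as a RapidMiner process, but here is a small sketch in Python/scikit-learn of point 1 and the attribute-weight idea (the data loading is a placeholder, and I use mutual information as a simple stand-in for InfoGainRatio):

    # Repeated cross-validation with different seeds, plus information-based attribute weights.
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import cross_val_score, StratifiedKFold
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.feature_selection import mutual_info_classif

    data = pd.read_csv("country_indicators.csv")   # placeholder file
    X = data.drop(columns=["rating"]).values
    y = data["rating"].values

    # 1. Run 10-fold CV ten times with different random seeds and look at the spread.
    tree = DecisionTreeClassifier(max_depth=4)      # shallow symbolic model, easy to read
    accuracies = []
    for seed in range(10):
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        accuracies.append(cross_val_score(tree, X, y, cv=cv).mean())
    print(f"accuracy: {np.mean(accuracies):.2f} +/- {np.std(accuracies):.2f}")

    # 2. Attribute weights: which attributes carry the most information about the label?
    weights = mutual_info_classif(X, y, random_state=0)
    columns = data.drop(columns=["rating"]).columns
    for name, w in sorted(zip(columns, weights), key=lambda t: -t[1])[:10]:
        print(f"{name}: {w:.3f}")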
I am just curious: in my answer I have assumed that your label has around 10 different values. Is this correct? How many different values does the label take?
Hope this was helpful
Steffen
and multiple types of classifications and numerical labels.
But for now I want to achieve some results at least for one nominal label set and one numerical one. So I have two cases:
nominal label + numerical 67 attributes
numerical label + numerical 67 attributes
In both cases the attributes are the same. Will it work well for my 67 float attributes? What is the preferred learner if the label is numerical? Yes, you are completely right; is there any way to estimate that risk? For the case where the label is nominal, I have 5 classes (different labels). Is this too few?
Among trees I prefer the classic C4.5 algorithm, called "Decision Tree" in RapidMiner and J-48 in Weka (which is also available within RapidMiner). For a numerical label, try Regression Models => operator "KlassificationByRegression" with a regression operator of your choice as the inner operator (Linear, Logistic, ...). With such a small amount of data, I would simply try which one works best. The risk is estimated by validation. First, the higher the variance of the accuracy estimated via CV, the weaker the generalisation power of the model is likely to be. Second, another method of estimating the generalisation power goes like this:
Sample a part of the data before doing anything, for example 20 objects. Then use the rest of the data to build whatever fancy model you like, perform feature selection et cetera, and validate via CV to get a first estimate of accuracy. Then, after you have created the final model you think is best, apply it to the held-out sample. This gives an estimate of how accurate your model will be on unseen data.
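In code, the procedure looks roughly like this (a Python/scikit-learn sketch, not a RapidMiner process; the file and column names are placeholders):

    # Hold out ~20 examples first, build and tune the model on the rest,
    # and only touch the holdout once at the very end.
    import pandas as pd
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    data = pd.read_csv("country_indicators.csv")   # placeholder file
    X = data.drop(columns=["rating"]).values
    y = data["rating"].values

    # Set aside 20 examples before doing anything else.
    X_rest, X_holdout, y_rest, y_holdout = train_test_split(
        X, y, test_size=20, stratify=y, random_state=0)

    # Model building, feature selection, tuning etc. happen on X_rest only;
    # CV on that part gives a first estimate of accuracy.
    model = DecisionTreeClassifier(max_depth=4, random_state=0)
    cv_estimate = cross_val_score(model, X_rest, y_rest, cv=10).mean()

    # Apply the final model once to the untouched holdout sample.
    model.fit(X_rest, y_rest)
    holdout_estimate = accuracy_score(y_holdout, model.predict(X_holdout))
    print(f"CV estimate: {cv_estimate:.2f}, holdout estimate: {holdout_estimate:.2f}")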
Besides this, I think there are methods to calculate the error directly for regression models, but I do not know how, or whether this is possible in RapidMiner. No, no, I was just guessing the number. Indeed, a smaller number of classes is better (if there are still enough examples for every class), since it increases the amount of information available for each class.
Some words at the end:
As I mentioned above: since you want to learn more about the current data set, I recommend that you...
- calculate attribute weights
- look at each class separately to check the distributions of the important attributes
- calculate the Correlation Matrix (see the sketch after this list)
- perform cluster analysis to see if there are more structures...
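A quick sketch of the class-distribution and correlation-matrix points in Python/pandas (placeholder file and column names; the attribute name is just an example taken from your post, and attribute weights were sketched further above):

    import pandas as pd

    data = pd.read_csv("country_indicators.csv")   # placeholder file

    # Correlation matrix of the numerical attributes.
    corr = data.drop(columns=["rating"]).corr()
    print(corr.round(2))

    # Distribution of one (assumed) important attribute within each class.
    print(data.groupby("rating")["Extent of business Internet use"].describe())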
Long text, gotta go to sleep. Hope I could help you; in fact I am still a struggling student, but I try to share what I already know.
Sometimes I feel like the little doorman, catching the less complicated questions and keeping the visitors busy until the great Dark Masters of Data Mining (aka Ingo and Tobias) arrive bringing real wisdom.
greetings
Steffen
I finally figured out that for my task (finding hidden relations in data) the cycle of training and examination is not the only method.
Another stupid question: does this procedure make sense?
- I take an unlabeled dataset of some aggregate data about countries (some economic and ecological ratings) -- about 4 numeric attributes
- perform cluster analysis (for example using KMeans)
- give clusters descriptive names (for example, 'the countries I like', 'the countries I would like to live in', 'bad countries' etc...)
- use these cluster names as labels for another, unlabeled dataset. This dataset contains the data about the countries I would like to investigate (the countries are the same as for the ratings!)
- perform Decision Tree learning to get information gain ratios and a visualization, to see which attributes are important in making a country 'good' (a rough sketch of this pipeline follows below)
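Roughly, I imagine the pipeline like this (a Python/scikit-learn sketch rather than a RapidMiner process; the files, column names and cluster-name assignment are placeholders):

    # Cluster countries on a few summary ratings, name the clusters, then train
    # a decision tree on a richer attribute set using the cluster names as labels.
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Steps 1-2: cluster the small aggregate dataset (about 4 numeric ratings).
    ratings = pd.read_csv("country_ratings.csv", index_col="country")   # placeholder file
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    clusters = kmeans.fit_predict(StandardScaler().fit_transform(ratings))

    # Step 3: give the clusters descriptive names (which cluster gets which name
    # has to be decided by inspecting the cluster centroids).
    names = {0: "countries I like", 1: "countries I would like to live in", 2: "bad countries"}
    labels = pd.Series(clusters, index=ratings.index).map(names)

    # Steps 4-5: train a decision tree on the detailed attributes of the same
    # countries, using the cluster names as labels.
    details = pd.read_csv("country_details.csv", index_col="country")   # placeholder file, 67 attributes
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(details.loc[ratings.index], labels)
    print(export_text(tree, feature_names=list(details.columns)))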
As far as I can see, yes!
However, did I get this right: you want to create a classifier using the cluster names as labels for training? This is a widely used strategy; you are on the right track!
One remark: be careful with KMeans. KMeans finds exactly the number of clusters you tell it to look for, no fewer and no more. There are no stupid questions!
greetings
Steffen
In other words: I'd like to determine which countries I should consider good using the experts' ratings, and then try to answer the question 'why are they so good or so bad' using objective numerical data as attributes and the labels (cluster names) obtained in the first stage. I'm considering using decision trees (thanks again!). As far as I understand, they use the information gain ratio, so the most significant attributes will be closer to the root.
Does it sound meaningful? Yes, that's a gotcha. Is it common to reduce the attribute dimensionality to obtain a visualization? Will it help in understanding the data? I just feel like a complete beginner in Data Mining, and I don't really like being one.
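For the visualization part, something like this is what I have in mind (PCA here just as one possible reduction, not necessarily the best choice; file and column names are placeholders):

    # Project the 67 attributes onto 2 dimensions and plot, colored by label.
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    data = pd.read_csv("country_indicators.csv")   # placeholder file
    X = StandardScaler().fit_transform(data.drop(columns=["rating"]))
    points = PCA(n_components=2).fit_transform(X)

    plt.scatter(points[:, 0], points[:, 1], c=pd.factorize(data["rating"])[0])
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()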
http://databionic-esom.sourceforge.net/
If you get annoyed by converting your data to the lrn format (http://databionic-esom.sourceforge.net/user.html#Data_files____lrn_), there is another implementation in RapidMiner that does the same thing (called SOM). (I cannot suggest this one, since the ESOM was created at my home university :P)
Regarding cluster validation: here is a discussion about this (Rapid-i: Universal Cluster Validation).
Hope this was helpful
Steffen
I can't find one feature in RapidMiner -- I can't attach descriptive text labels to the points being visualized in 3D. Is this possible at all?
Sorry for not contributing to this, as I find it a great discussion, but I simply have to go off-topic for a second:
@Steffen:
Do you know Fabian Mörchen then?
Cheers,
Ingo
Another off-topic question: if I'm interested in discovering dependencies between attributes and class labels, what options can I try besides calculating the information gain ratio and regression methods?
- Covariance/correlation matrix to compare numerical values
- Transition matrix to compare nominal values (discretize the interesting attributes first)
- Looking at different standard plots (scatter plots, histograms), colored by label, to search manually for patterns, since you now know which attributes are primarily interesting (see the sketch after this list)
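A small sketch of the matrix and plot ideas in Python/pandas (placeholder file; the two attribute names are just examples taken from the first post):

    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv("country_indicators.csv")   # placeholder file

    # Nominal vs. nominal: discretize an interesting attribute, then cross-tabulate
    # it against the class label (a simple stand-in for a transition matrix).
    binned = pd.cut(data["Extent of business Internet use"], bins=3,
                    labels=["low", "medium", "high"])
    print(pd.crosstab(binned, data["rating"]))

    # Numerical attributes, colored by label: scatter plot of two attributes.
    codes = pd.factorize(data["rating"])[0]
    plt.scatter(data["Extent of business Internet use"], data["Internet users"], c=codes)
    plt.xlabel("Extent of business Internet use")
    plt.ylabel("Internet users")
    plt.show()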
If you have some domain knowledge (or at least you know what the different attributes mean, combined with some common sense), it will ease the process.
Hope this was helpful
Steffen
Thank you for all the help!