Data mining

andre5007 · April 2021

Hi guys, I'm working on a task and have hit a problem that I don't know how to start on. I'm really new to RapidMiner, and I would like to know if anyone could help me.
I have to estimate Feature 8, which is the number of maintenance interventions the device has had. What can I do?
Thanks
André

yyhuang · April 2021

Association and correlations are not meant for predictions. They are more like “descriptive” models. What is your purpose here?
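
As a rough illustration of that distinction, the sketch below computes a Pearson correlation (one number describing how two columns move together) and then fits a least-squares line (which can produce an estimate for a row it has not seen). It is plain Python, not a RapidMiner process, and the toy numbers are made up rather than taken from the CSVs in this thread.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: a single descriptive number in [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Made-up toy data: device age in years vs. number of interventions.
age = [1, 2, 3, 4, 5]
interventions = [1, 3, 5, 7, 9]

r = pearson(age, interventions)            # describes the relationship
slope, intercept = fit_line(age, interventions)
estimate = slope * 6 + intercept           # estimates a value for an unseen age
print(r, estimate)
```

The correlation says how strongly the two columns are related; only the fitted line gives a number for a new device.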

kayman · April 2021

Could you share a bit more information, like a current input format example and the expected/wanted outcome?

andre5007 · April 2021

Hey Kayman, I hope everything is fine with you.
I have these two CSV files, and both have several feats.
Feat1- model, Feat2- power measure, Feat3- something that this object has or does not have (1 = has, 0 = does not), Feat4- a feature whose meaning I don't know, Feat5- device installation date, Feat6/7- the latitude and longitude, and Feat8- the number of maintenance interventions.
In the Training CSV I have values for Feat 8, and in the Test CSV I do not.
My goal is to estimate Feat 8 for the Test set.
How can I do this?
If you need more information, please tell me.
Thanks

Telcontar120 · April 2021

I believe this is a duplicate thread. I have responded in the other question. You should probably resolve one; duplicate threads are not generally helpful for the community.

yyhuang · April 2021

Hi @andre5007,

I worked on your training data a bit to build regression trees based on clean features. The predictive model performs pretty well with 10-fold cross validation. The RMSE is as follows:

Image: https://us.v-cdn.net/6030995/uploads/editor/1c/tydqywt4hmvx.png

My process is attached for your reference.
Cheers,
YY
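
For readers who want to see what "10-fold cross validation" and "RMSE" mean outside RapidMiner, here is a minimal sketch in plain Python. It is not the attached process: to stay short it swaps the regression tree for a tiny k-nearest-neighbours regressor, and the data is synthetic. All function names are made up for this illustration.

```python
import random
from math import sqrt

def knn_predict(train_X, train_y, row, k=3):
    """Average the targets of the k training rows closest to `row`."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], row)),
    )
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def cross_val_rmse(X, y, folds=10, k=3, seed=0):
    """Shuffle once, split into `folds` parts, and score each held-out part."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(folds):
        test_idx = idx[f::folds]              # every `folds`-th shuffled row
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        tX = [X[i] for i in train_idx]
        ty = [y[i] for i in train_idx]
        preds = [knn_predict(tX, ty, X[i], k) for i in test_idx]
        scores.append(rmse([y[i] for i in test_idx], preds))
    return sum(scores) / len(scores), scores

# Synthetic stand-in data (NOT the thread's CSVs):
# the target depends on two features plus some noise.
rng = random.Random(42)
X = [[rng.uniform(0, 10), rng.uniform(0, 10)] for _ in range(200)]
y = [2 * a + b + rng.gauss(0, 0.5) for a, b in X]

mean_rmse, fold_rmse = cross_val_rmse(X, y)
print(f"mean RMSE over {len(fold_rmse)} folds: {mean_rmse:.3f}")
```

Each row is used for testing exactly once and is never part of the training data that scores it, which is why the averaged RMSE is a fairer estimate than an error measured on the training rows.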

andre5007 · April 2021

I was testing the process you built and it started giving the following error.
When I imported the CSV I set all fields to nominal, and that error went away. Was that the right solution?

Then it gave this error.
I couldn't solve this one. Do you think you can tell me what to do?

andre5007 · April 2021

How do you import the CSV?

When you import the CSV, do you make any changes?
Thanks
André

yyhuang · April 2021

Hi @andre5007, you have to run the sub-process for data loading first. To run it, just right-click it to enable it, and make sure this data loading step is executed before the modeling.

Image: https://us.v-cdn.net/6030995/uploads/editor/m3/rq6pto3c6syg.png

I used the CSV files you posted in another thread. They are attached here as well.

Cheers,
YY

andre5007 · April 2021

Hi @yyhuang,

Now I get it, yes.

Sorry for so many questions; I'm really quite new to this area.

Could you explain, for example, why the IDs show up as '?'?

Image: https://us.v-cdn.net/6030995/uploads/editor/fe/718h6qu7tz7u.png

And what can I understand from this?

Image: https://us.v-cdn.net/6030995/uploads/editor/x8/bxeep1uxrddk.png

yyhuang · April 2021

Oh, good catch. Thanks @andre5007. Firstly, the id column name gets messed up in the data loading step, which usually happens in Read CSV. You can fix the column names with the "Rename" operator. Later, the id values get lost after applying the grouped model, due to a potential bug (grouped model with target encoding). To work around that, you can keep the id as a regular attribute before scoring and set it as a special role after scoring, as shown in the attached process.
PS: feat1 could potentially cause some data leakage if we apply target encoding to a categorical attribute with so many values. I don't have the context here, but you can try to drop it by configuring "Target Encoding".

Image: https://us.v-cdn.net/6030995/uploads/editor/cq/zyr5vvpk79kz.png

PPS: you can round the predictions after scoring if you prefer integers.
HTH!
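
To make the leakage remark about feat1 concrete, here is a small sketch in plain Python. It assumes the simplest form of target encoding (replace each category with the mean target of its training rows); RapidMiner's Target Encoding may work differently, for example with smoothing. The model names and counts are invented.

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Map each category to the mean target of the rows that carry it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}

# Invented example: almost every row has its own model name.
feat1 = ["A17", "B02", "C55", "D09", "D09"]
feat8 = [4, 0, 7, 2, 3]

encoding = target_encode(feat1, feat8)
encoded = [encoding[c] for c in feat1]

# A category seen only once is encoded as its own target value,
# so for those rows the "feature" simply repeats the answer.
leaked = [c for c, t in zip(feat1, feat8)
          if feat1.count(c) == 1 and encoding[c] == t]
print(encoded)
print(leaked)
```

With many distinct values, most categories appear only a handful of times, so the encoded column carries the training target almost directly and the model looks better in training than it will on the Test set.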

andre5007 · April 2021

I've been studying a bit and looking at the best models I could use, and I realized that at first I was using a classification model.

From what I understand, you used a regression model here.

But after studying, a doubt arose: wouldn't the best model for my problem be the associations & correlations model? That model answers questions like "what happens together? What changes together?", while regression answers questions like "how much or how many? How many will happen?".

I say this because, from what I understand, each feat has an influence on the result. For example, feat 3 is something that the object has or does not have (1 = has, 0 = has not), and it could influence feat 8. From what I understand (which is very little), the process you built did not take this into account, right?

Sorry for the question but I'm really just trying to understand as best I can.

Thanks
André

yyhuang · April 2021

What are you trying to predict here? Could you explain the meaning of feat1-feat8?

andre5007 · April 2021

I have two CSVs, and both have various feats.

Feat1- model, Feat2- power measure, Feat3- something that this object has or does not have (1 = has, 0 = has not), Feat4- a feature whose meaning I do not know, Feat5- installation date of the device, Feat6/7- the latitude and longitude, and Feat8- the number of maintenance interventions.

In the Training CSV I have values for feat 8, and in the Test CSV I do not.

My goal is to predict Feat 8 for the Test set.

I was told I could use RapidMiner for this job.

I was also told that, in order to know which model to use, I would first have to relate the feats to each other.

For example by doing this I can tell that there is an outlier.

How can I do this?
I hope it makes sense

If you need more information, please let me know

Thanks
André

yyhuang · April 2021

Good progress on the detection of outliers by GPS coordinates, @andre5007! You can use visualization plots and statistical distributions to identify the outliers and exclude them from training.

Image: https://us.v-cdn.net/6030995/uploads/editor/pr/dgn32x6vtoq0.png

Image: https://us.v-cdn.net/6030995/uploads/editor/a3/of8bzqm8wdwb.png

According to your definition, the model is predicting "Feat 8, which is the number of maintenance interventions."
I would stick to regression models (KNN, regression tree, Random Forest, GLM, and GBT are good choices for regression) because you are predicting a numerical target. If the target is categorical, say true/false or broken/normal, then go with classification.

Besides visualization for data exploration and outlier detection, you can also use some of the outlier detection models (e.g. Tukey test for exponential distribution...)

Image: https://us.v-cdn.net/6030995/uploads/editor/xm/d1id9mdy86ha.png
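
As one concrete example of a statistical check, the sketch below applies Tukey's fences (flag values more than 1.5 × IQR outside the quartiles) to a list of latitudes. This is a generic illustration in plain Python, not the operator shown in the screenshot, and the coordinates are made up.

```python
from statistics import quantiles

def tukey_outliers(values, k=1.5):
    """Return (low_fence, high_fence, outliers) using Tukey's IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return low, high, [v for v in values if v < low or v > high]

# Made-up latitudes: most devices cluster together, one is far away.
latitudes = [41.1, 41.2, 41.0, 41.3, 41.2, 41.1, 41.4, 12.5]

low, high, outliers = tukey_outliers(latitudes)
print(outliers)
```

Rows flagged this way can be reviewed and, if they really are errors, excluded from the training data before the regression model is built.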

andre5007 · April 2021

Hi @yyhuang
I fully understand why you use the regression method and why the classification method is not the best, but I'm still at a loss as to why you don't use, for example, the associations & correlations method. Is there a reason?

I'm not saying this because I doubt your knowledge; I believe it is much greater than mine.

Thanks for your attention

André

andre5007 · April 2021

The attached dataset records a total of 30554 "observations" - 15277 grouped in the Training set and 15277 grouped in the Test set.

The aim with this dataset is to extract the dependency that may exist between the "independent" characteristics (Features 1 to 7) and the number of maintenance interventions that the device has had (dependent variable, Feature 8) since it was put into operation (Feature 5).

The dataset refers to various types of devices that are considered to operate under the same conditions except for those features recorded in the table. Features 6 and 7 refer to the location - Latitude and Longitude - in which the device operates.

I now have to perform all the Data Mining steps that allow me to estimate Feature 8 for the Test set and get its 'estimation vector'. From what I understand and from what I've been looking into, I can't get that from RapidMiner.

Thanks

André
