Best approach to use for some data analysis??
neilduggan
Hi
First post on here – I’m looking for some direction on what analysis method to use for my thesis. I hope this post isn’t too long-winded and / or in the wrong location!!
I'm pretty new to RapidMiner and data analytics in general but I wanted to check if the approach I'm considering seems suitable based on the data types I have or is there something different which would be worth exploring before I start going too deep into it?
My thesis is essentially an analysis of a dairy farm model (grass growth & grass consumption) which outputs whether the farm has a deficit / balanced supply / surplus of grass & silage production during the year.
I am trying to assess whether it is possible to predict early in the year what the outcome will be for the entire year and if so, whether it will be possible to mitigate against a deficit or surplus.
I plan to categorically label my data set as below such that each data point will have the following associated with it:
- Month name: 1-12
- Month status: Deficit / Balanced / Surplus
- Year status: Deficit / Balanced / Surplus
- Rainfall associated with the Nitrogen application date in each month - 5 labels from Wet to Dry
- Temperature associated with the Nitrogen application date in each month - 5 labels from Cool to Warm
From this, I want to assess:
1. Whether there are Rainfall & Temperature combinations in the early months which can reliably predict a deficit / surplus later in the year? Does association rule mining sound like it would be suitable for this??
I was planning to set Year status as the label, combine the attributes for Month/Temp/Rainfall into a single attribute (or should these be left as individual attributes?) and see if I can come up with association rules for these. Or would something like a decision tree or Apriori be better?
Then I'm going to go back to the model and, for deficit and surplus years, adjust the amount of nitrogen applied (by a percentage based on how severe the surplus / deficit is) and see if it can bring the farm back to a balanced situation. This will generate new data for the deficit & surplus years with an additional label (again categorical) of how much the nitrogen adjustment is.
The second thing I want to assess is:
2. If certain "deficit" or "surplus" weather conditions occur early in the year, can adjustments be relied upon to bring the year back into balance? Again, does association rule mining sound like it would be suitable for this??
Ultimately, the output I'd like to generate is something like:
If "Temperature = Cool to average" and "Rainfall = average" Then "increase nitrogen rate by 10%"
Apologies for the long and detailed mail, I hope it makes sense.
Any direction on other techniques which would be worth considering would be greatly appreciated
Thanks
Neil
Answers
Is this all the data you have? It feels like there could be more measurements that might be useful for prediction. In general, 2-5 attributes are really not that many. You can easily handle a few dozen variables.
In general I would NOT use association rule mining for this. What you have is a supervised categorical learning problem (classification). You can tackle that with a lot of different algorithms (like Random Forest, SVM or Neural Net) and need to find out what works best for you.
Have you thought about using a time-series-like model? You would create a data set with attributes like
#Deficits Last Three Months
#Rainfall==Wet Last Three Months
etc. Maybe you can even add things like Average(Rainfall) if you can assign useful numbers. Then you would predict "Is there a deficit in three months?"
I am not sure how this works together with your Month/Year deficit variables, but I think it is worth the thought.
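A minimal sketch of the windowing idea in plain Python, outside RapidMiner. The function name `window_features`, the field names and the twelve months of data are all made up for illustration; the window and horizon of three months follow the suggestion above.

```python
def window_features(months, window=3, horizon=3):
    """Turn a calendar-ordered list of monthly records into training rows.

    Each row counts events in the `window` months before position t and
    records whether the month `horizon` steps after the window is a deficit.
    """
    rows = []
    for t in range(window, len(months) - horizon + 1):
        past = months[t - window:t]          # the last `window` months
        target = months[t + horizon - 1]     # the month to predict
        rows.append({
            "deficits_last_n": sum(m["status"] == "Deficit" for m in past),
            "wet_last_n": sum(m["rainfall"] == "Wet" for m in past),
            "deficit_in_horizon": target["status"] == "Deficit",
        })
    return rows


# Hypothetical data for one modelled year (not taken from the thread).
statuses = ["Balanced", "Deficit", "Deficit", "Balanced", "Surplus", "Surplus",
            "Balanced", "Deficit", "Balanced", "Balanced", "Deficit", "Balanced"]
rainfall = ["Wet", "Wet", "Average", "Dry", "Dry", "Wet",
            "Wet", "Average", "Dry", "Wet", "Average", "Dry"]
months = [{"status": s, "rainfall": r} for s, r in zip(statuses, rainfall)]

rows = window_features(months)
for row in rows:
    print(row)
```

Each printed row would then be one example for a classifier, with `deficit_in_horizon` as the label.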
Dortmund, Germany
Thanks for your response, much appreciated.
I have a few additional attributes e.g. day of application, N application rate, but I am trying to make the outputs of the analysis pretty generic so that they can be applied to a wide range of applications if possible. I thought that the best way to do this would be to deal in as few variables as possible – in this case time (month) and weather (temp & rainfall). Additionally, as the data being analysed is coming from a modelled dairy farm, it’s quite a slow-moving system so I wanted to look at the effect of macro factors like the weather. In the second part of the analysis, I will consider the Nitrogen application rate increase / decrease.
I will read up some literature about Random Forest, SVM & NNs, maybe run some trials on these and see how it goes. I’m sure I’ll be back with some additional questions!!
Thanks
Neil
I've been looking at Random Forest DTs and I have an additional question - the "rules" I am hoping to determine from the data would be of a positive form rather than a negative form e.g. "If(Rain = wet) AND (Temp = Cool) THEN (Grass = Surplus)" rather than "If(Rain not equals dry) AND (Temp not equals Warm) THEN (Grass = Surplus)"
Will it be possible to get the type of rule I'm looking for using a Decision Tree? Or would SVM / NN be more suitable??
Thanks
A standard Decision Tree (no Random Forest, no Boosting) is exactly what you need for this. You might be aware that data mining models are in general not that easily interpretable, but in return they are usually more powerful.
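To illustrate why a single tree on nominal attributes reads as positive rules, here is a small ID3-style sketch in plain Python. It is not RapidMiner's Decision Tree operator, and the six data rows are invented. The point is only that a multiway split on a nominal attribute produces branches of the form "attribute = value", so every path from root to leaf is a rule of the "If (Rain = wet) AND (Temp = Cool)" kind.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def build_tree(rows, label, attributes):
    """Return a leaf (class name) or a node (attribute, {value: subtree})."""
    labels = [r[label] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority

    def gain(attr):
        # Information gain of a multiway split on a nominal attribute.
        remainder = 0.0
        for value in set(r[attr] for r in rows):
            subset = [r[label] for r in rows if r[attr] == value]
            remainder += len(subset) / len(rows) * entropy(subset)
        return entropy(labels) - remainder

    best = max(attributes, key=gain)
    if gain(best) <= 0:
        return majority  # no useful split: this is the "stump" case
    rest = [a for a in attributes if a != best]
    return (best, {v: build_tree([r for r in rows if r[best] == v], label, rest)
                   for v in sorted(set(r[best] for r in rows))})


def extract_rules(tree, conditions=()):
    """Walk every root-to-leaf path and write it as an IF ... THEN rule."""
    if not isinstance(tree, tuple):
        lhs = " AND ".join(f"({a} = {v})" for a, v in conditions) or "(always)"
        return [f"IF {lhs} THEN (Grass = {tree})"]
    attr, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + ((attr, value),)))
    return rules


# Invented toy data, only to show the shape of the output.
data = [
    {"Rain": "Wet", "Temp": "Cool", "Grass": "Surplus"},
    {"Rain": "Wet", "Temp": "Warm", "Grass": "Surplus"},
    {"Rain": "Dry", "Temp": "Cool", "Grass": "Balanced"},
    {"Rain": "Dry", "Temp": "Warm", "Grass": "Deficit"},
    {"Rain": "Average", "Temp": "Cool", "Grass": "Balanced"},
    {"Rain": "Average", "Temp": "Warm", "Grass": "Balanced"},
]
tree = build_tree(data, "Grass", ["Rain", "Temp"])
rules = extract_rules(tree)
print("\n".join(rules))
```

If the nominal attributes were dummy-coded to 0/1 columns first, the same tree would show thresholds on those columns instead, which reads more like the negative form.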
Cheers,
Martin
Again, thanks for the response Martin
I've adjusted the data as I think I had it wrong previously. My data points now have the following attributes:
- ID
- Rainfall Month 1
- Temp Month 1
- Rainfall Month 2
- Temp Month 2
- etc up to Month 12
- Deficit / Balanced / Surplus for the year as a whole - this is the label.
Basically, each data point has a label and a number of attributes (all are nominal / polynominal) and I want to determine which attributes are most important in determining the label.
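For reference, the reshaping from one record per month to one record per year could be sketched like this in plain Python. The helper name `to_wide`, the input field names and the sample values are assumptions for illustration; only the output column names follow the list above.

```python
def to_wide(monthly_rows, year_labels):
    """Pivot one-record-per-month data into one record per year."""
    wide = {}
    for row in monthly_rows:
        record = wide.setdefault(row["year"], {"ID": row["year"]})
        record[f"Rainfall Month {row['month']}"] = row["rainfall"]
        record[f"Temp Month {row['month']}"] = row["temp"]
    for year, record in wide.items():
        record["Year status"] = year_labels[year]  # the label column
    return list(wide.values())


# Hypothetical input: two years, two months each.
monthly_rows = [
    {"year": 2001, "month": 1, "rainfall": "Wet", "temp": "Cool"},
    {"year": 2001, "month": 2, "rainfall": "Average", "temp": "Cool"},
    {"year": 2002, "month": 1, "rainfall": "Dry", "temp": "Warm"},
    {"year": 2002, "month": 2, "rainfall": "Dry", "temp": "Average"},
]
year_labels = {2001: "Surplus", 2002: "Deficit"}

wide_rows = to_wide(monthly_rows, year_labels)
for record in wide_rows:
    print(record)
```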
I've tried using a DT but I'm getting no results :-( Even with the Confidence parameter reduced to 0.001, I still get no tree.
Is DT the wrong method to use? Or does it sound like there's simply no patterns in the data??
Or is there potentially something wrong with my data?? I've reviewed the data and it seems to be pretty much as I'd expect
Any advice appreciated
Thanks
Neil
One additional question - for operators which don't accept polynominal attributes, is it ok to change these to integer? For example, in Rainfall, Wet is 5 and Dry is 1 but these numbers are essentially nominal - is it ok to assign them as integer??
Thanks
Neil
What do you mean by "no result"? Is there just a stump and no tree? In that case I would try to remove pruning and prepruning. If there is still no possible split, then it looks like your data has no patterns to find. Are you sure you connected the model port to the result port?
For the Nominal question: If it is a wet/dry decision it is fine to map them to 0/1. If you have more than 2 cases like dry - wet - saturated it might even make sense to map them to 3 integers. Of course this implies that saturated is 2x more than wet (or so). The other option is of course to go with Dummy Coding (Nominal to Numerical operator).
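The two options can be sketched side by side in plain Python. This is only an illustration of the idea, not the Nominal to Numerical operator itself, and the function names and the `rain = ...` column names are made up.

```python
def ordinal_encode(values, order):
    """Map each level to its position in `order` (implies an ordering)."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


def dummy_encode(values, levels):
    """One 0/1 column per level; no ordering or spacing is implied."""
    return [{f"rain = {level}": int(v == level) for level in levels}
            for v in values]


levels = ["dry", "wet", "saturated"]
observed = ["wet", "dry", "saturated"]

as_integers = ordinal_encode(observed, levels)
as_dummies = dummy_encode(observed, levels)
print(as_integers)   # [1, 0, 2]
print(as_dummies[0]) # {'rain = dry': 0, 'rain = wet': 1, 'rain = saturated': 0}
```

The integer version is compact but tells a distance-based learner that "saturated" is as far from "wet" as "wet" is from "dry". The dummy version avoids that assumption at the cost of more columns.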
Cheers,
Martin
Yes, just a stump and no tree - I'll try removing pruning & pre-pruning to see if that gives any improvement.
At the moment I have 5 categories in the wet / dry attribute so I'm assigning them 1-5. This represents the 20 / 40 / 60 / 80 / 100th percentiles of my rainfall data so 2 is not double 1 - this being the case, is it valid to assign numerical values in this way?? Or is it acceptable?
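As a rough sketch of this kind of percentile bucketing (plain Python, with made-up rainfall values and a hypothetical helper name), each value gets a bucket from 1 (driest) to 5 (wettest) based on its rank in the data:

```python
def quantile_buckets(values, n_buckets=5):
    """Assign each value a bucket 1..n_buckets by its rank in the data.

    With n_buckets=5 the buckets correspond to the 20/40/60/80/100th
    percentile bands. The numbers are ranks, so only their order matters.
    """
    n = len(values)

    def bucket(v):
        rank = sum(1 for x in values if x <= v)   # 1..n
        return (rank * n_buckets + n - 1) // n    # integer ceiling

    return [bucket(v) for v in values]


# Hypothetical monthly rainfall totals in mm.
rain_mm = [55.0, 12.0, 80.0, 30.0, 95.0, 5.0, 60.0, 41.0, 70.0, 20.0]
buckets = quantile_buckets(rain_mm)
print(buckets)  # [3, 1, 5, 2, 5, 1, 4, 3, 4, 2]
```

Because the buckets are ranks, bucket 2 is "the next band up from 1" and not "twice as wet". A learner that splits on thresholds, such as a decision tree on a numerical attribute, only uses that ordering, while learners that rely on distances would also treat the bands as evenly spaced.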
I'm looking at my data again to see what / how I can adjust it - in particular, I'm looking at adjusting the 5 rainfall / temperature buckets. In general, would it be better to create more divisions, e.g. 10 categories? Or fewer?
Thanks
Neil
There is no real theory on what is best. If there were one, we wouldn't need tools like RapidMiner and data scientists like us to solve the problems. Getting superb results usually depends on finding the best preprocessing.
In your case I would use 1-5 (20-100) because your values have a natural ordering. Of course it could be that another preprocessing is better.
The same is true for the number of rainfall and temperature buckets. Usually it is just "try it out and measure your performance".
Best,
Martin
That seems to work alright, I'm getting pretty decent results now
Thanks very much for your help Martin, I really appreciate it
Neil
This is wonderful to hear! I would really appreciate it if you could send me a link to your work once it is published. You can contact me either here or via mail: mschmitz at rapidminer.com
Cheers,
Martin