Best approach to use for some data analysis??

neilduggan · July 2015

Hi

First post on here – I’m looking for some direction on what analysis method to use for my thesis. I hope this post isn’t too long-winded and / or in the wrong location!!

I'm pretty new to RapidMiner and data analytics in general but I wanted to check if the approach I'm considering seems suitable based on the data types I have or is there something different which would be worth exploring before I start going too deep into it?

My thesis is essentially an analysis of a dairy farm model (grass growth & grass consumption) which outputs whether the farm has a deficit / balanced supply / surplus of grass & silage production during the year.

I am trying to assess whether it is possible to predict early in the year what the outcome will be for the entire year and if so, whether it will be possible to mitigate against a deficit or surplus.

I plan to categorically label my data set as below such that each data point will have the following associated with it:

- Month name: 1-12
- Month status: Deficit / Balanced / Surplus
- Year status: Deficit / Balanced / Surplus
- Rainfall associated with the Nitrogen application date in each month - 5 labels from Wet to Dry
- Temperature associated with the Nitrogen application date in each month - 5 labels from Cool to Warm

From this, I want to assess:

1. Whether there are Rainfall & Temperature combinations in the early months which can reliably predict a deficit / surplus later in the year? Does association rule mining sound like it would be suitable for this??

I was planning to set Year status as the label and combine the attributes for Month/Temp/Rainfall into a single attribute (or should these be left as individual attributes?) and see if I can come up with association rules for these. Or would something like a decision tree or Apriori be better?

Then I'm going to go back to the model and, for deficit and surplus years, adjust the amount of nitrogen applied (by a percentage based on how severe the surplus / deficit is) and see if it can bring the farm back to a balanced situation. This will generate new data for the deficit & surplus years with an additional label (again categorical) of how much the nitrogen adjustment is.

The second thing I want to assess is:

2. If certain "deficit" or "surplus" weather conditions occur early in the year, can adjustments be relied upon to bring the year back into balance? Again, does association rule mining sound like it would be suitable for this??

Ultimately, the output I'd like to generate is something like:

If "Temperature = Cool to average" and "Rainfall = average" Then "increase nitrogen rate by 10%"

Apologies for the long and detailed mail, I hope it makes sense.

Any direction on other techniques which would be worth considering would be greatly appreciated

Thanks

Neil

MartinLiebig · July 2015

Hi Neil,

is this all the data you have? It feels a bit like there could be more measurements which might be useful for prediction. In general, 2-5 attributes are really not that many. You can easily handle a few dozen variables.

In general I would NOT use association rule mining for this. What you have is a supervised learning problem with a categorical label, i.e. classification. You can treat that with a lot of different algorithms (like Random Forest, SVM or Neural Net) and need to find out what works best for you.

Have you thought about using a time-series-like model? You would create a data set with attributes like
#Deficits Last Three Months
#Rainfall==Wet Last Three Months

etc. Maybe you can even add stuff like Average(Rainfall) if you can assign useful numbers. Then you would predict "Is there a deficit in three months?"
I am not sure how this fits together with your Month/Year deficit variables, but I think it is worth a thought.
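
A minimal sketch of the windowed attributes described above, in plain Python rather than RapidMiner. The monthly records, the field names and the three-month horizon are assumptions made up for illustration; they are not Neil's data.

```python
# Hypothetical monthly records: (month, rainfall label, month status).
# The values are invented purely to illustrate the feature construction.
months = [
    (1, "Wet", "Deficit"),
    (2, "Wet", "Deficit"),
    (3, "Average", "Balanced"),
    (4, "Dry", "Deficit"),
    (5, "Wet", "Surplus"),
    (6, "Dry", "Balanced"),
    (7, "Dry", "Deficit"),
]

WINDOW = 3   # look back over the last three months (current month included)
HORIZON = 3  # predict the status three months ahead


def build_windowed_rows(records, window=WINDOW, horizon=HORIZON):
    """Turn a monthly sequence into one row per month with look-back counts."""
    rows = []
    # i is the "current" month: it needs a full window behind it
    # and a known target `horizon` months ahead of it.
    for i in range(window - 1, len(records) - horizon):
        past = records[i - window + 1 : i + 1]
        target_status = records[i + horizon][2]
        rows.append({
            "month": records[i][0],
            "deficits_last_3": sum(1 for _, _, status in past if status == "Deficit"),
            "wet_last_3": sum(1 for _, rain, _ in past if rain == "Wet"),
            "label_deficit_in_3_months": target_status == "Deficit",
        })
    return rows


rows = build_windowed_rows(months)
for row in rows:
    print(row)
```

Each resulting row is one example for a classifier: the counts are the attributes and the last field is the label.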

neilduggan · July 2015

Hi Martin

Thanks for your response, much appreciated.

I have a few additional attributes e.g. day of application, N application rate, but I am trying to make the outputs of the analysis pretty generic so that they can be applied to a wide range of applications if possible. I thought that the best way to do this would be to deal in as few variables as possible – in this case time (month) and weather (temp & rainfall). Additionally, as the data being analysed is coming from a modelled dairy farm, it’s quite a slow-moving system so I wanted to look at the effect of macro factors like the weather. In the second part of the analysis, I will consider the Nitrogen application rate increase / decrease.

I will read up some literature about Random Forest, SVM & NNs, maybe run some trials on these and see how it goes. I’m sure I’ll be back with some additional questions!!

Thanks

Neil

neilduggan · July 2015

Hi

I've been looking at Random Forest DTs and I have an additional question - the "rules" I am hoping to determine from the data would be of a positive form rather than a negative form e.g. "If(Rain = wet) AND (Temp = Cool) THEN (Grass = Surplus)" rather than "If(Rain not equals dry) AND (Temp not equals Warm) THEN (Grass = Surplus)"

Will it be possible to get the type of rule I'm looking for using a Decision Tree? Or would SVM / NN be more suitable??

Thanks

MartinLiebig · July 2015

Hi,

a standard Decision Tree (no Random Forest, no Boosting) is exactly what you need for this. Be aware that most other data mining models are not that easily interpretable, though they are usually more powerful in return.

Cheers,
Martin
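
To illustrate why a single tree gives rules in the positive form: assuming the tree splits a nominal attribute into one branch per value, every root-to-leaf path reads as a chain of "attribute = value" conditions. The sketch below walks a small hand-written tree (a made-up example, not one learned from the farm data) and prints each path as a rule.

```python
# A hand-written toy tree over nominal attributes (NOT learned from real data).
# Inner nodes are (attribute, {value: subtree}); leaves are class labels.
toy_tree = ("Rain", {
    "Wet": ("Temp", {"Cool": "Surplus", "Warm": "Balanced"}),
    "Dry": "Deficit",
})


def tree_to_rules(node, conditions=()):
    """Walk every root-to-leaf path and return it as an IF ... THEN ... rule."""
    if isinstance(node, str):  # leaf: emit the rule for the path so far
        premise = " AND ".join(f"({attr} = {value})" for attr, value in conditions)
        return [f"IF {premise} THEN (Grass = {node})"]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules


rules = tree_to_rules(toy_tree)
for rule in rules:
    print(rule)
```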

neilduggan · July 2015

Hi Martin

Again, thanks for the response Martin

I've adjusted the data as I think I had it wrong previously. My data points now have the following attributes:

- ID
- Rainfall Month 1
- Temp Month 1
- Rainfall Month 2
- Temp Month 2
- etc up to Month 12
- Deficit / Balanced / Surplus for the year as a whole - this is the label.

Basically, each data point has a label and a number of attributes (all are nominal / polynominal) and I want to determine which attributes are most important in determining the label.

I've tried using a DT but I'm getting no results :-( Even with the Confidence parameter reduced to 0.001, I still get no tree.

Is DT the wrong method to use? Or does it sound like there are simply no patterns in the data??

Or is there potentially something wrong with my data?? I've reviewed the data and it seems to be pretty much as I'd expect

Any advice appreciated

Thanks

Neil
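
One rough way to check whether any single attribute carries signal about the label is to compute its information gain by hand. A tree learner that pre-prunes on a minimal gain will typically return just a stump when no attribute clears its threshold. The sketch below is plain Python with invented rows and hypothetical attribute names (`rain_m1`, `temp_m1`, `year_status`); it is not RapidMiner's implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, label="year_status"):
    """Entropy of the label minus the weighted entropy after splitting on attribute."""
    base = entropy([row[label] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[label])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Invented rows: here rainfall in month 1 fully determines the label
# and temperature in month 1 tells us nothing.
rows = [
    {"rain_m1": "Wet", "temp_m1": "Cool", "year_status": "Surplus"},
    {"rain_m1": "Wet", "temp_m1": "Warm", "year_status": "Surplus"},
    {"rain_m1": "Dry", "temp_m1": "Cool", "year_status": "Deficit"},
    {"rain_m1": "Dry", "temp_m1": "Warm", "year_status": "Deficit"},
]

gains = {attr: information_gain(rows, attr) for attr in ("rain_m1", "temp_m1")}
for attr, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{attr}: {gain:.3f}")
```

If every attribute comes out close to zero on the real data, that would be consistent with a learner finding no split worth making.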

neilduggan · July 2015

Hi

One additional question - for operators which don't accept polynominal attributes, is it ok to change these to integer? For example, in Rainfall, Wet is 5 and Dry is 1 but these numbers are essentially nominal - is it ok to assign them as integer??

Thanks

Neil

MartinLiebig · July 2015

hi,

What do you mean by "no result"? Is there just a stump and no tree? In this case I would try to remove pruning and prepruning. If there is still no possible split, then it looks like your data has no patterns to find. Are you sure you connected the model port to the result port?

For the Nominal question: If it is a wet/dry decision it is fine to map them to 0/1. If you have more than 2 cases like dry - wet - saturated it might even make sense to map them to 3 integers (0/1/2). Of course this implies an ordering and equal spacing, e.g. that saturated is 2x more than wet (or so). The other option is of course to go with Dummy Coding (Nominal to Numerical operator).

Cheers,
Martin
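
The difference between the two encodings can be sketched in a few lines of plain Python. Only Wet = 5 and Dry = 1 come from the thread; the three middle label names are hypothetical placeholders.

```python
# Assumed ordering of the five rainfall labels, driest first.
# Only "Dry" (1) and "Wet" (5) are from the thread; the rest are placeholders.
RAIN_LEVELS = ["Dry", "Fairly dry", "Average", "Fairly wet", "Wet"]


def ordinal_code(label, levels=RAIN_LEVELS):
    """Map a label to 1..len(levels); implies an order and equal spacing."""
    return levels.index(label) + 1


def dummy_code(label, levels=RAIN_LEVELS):
    """One 0/1 indicator per level; makes no assumption about order or spacing."""
    return [1 if label == level else 0 for level in levels]


for label in ("Dry", "Average", "Wet"):
    print(label, ordinal_code(label), dummy_code(label))
```

The ordinal code keeps one column per attribute but lets a numeric learner treat level 4 as "twice" level 2; dummy coding avoids that at the cost of five columns per attribute.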

neilduggan · July 2015

Hi Martin (thanks again for your reply)

Yes, just a stump and no tree - I'll try removing pruning & pre-pruning to see if that gives any improvement.

At the moment I have 5 categories in the wet / dry attribute so I'm assigning them 1-5. This represents the 20 / 40 / 60 / 80 / 100th percentiles of my rainfall data so 2 is not double 1 - this being the case, is it valid to assign numerical values in this way?? Or is it acceptable?

I'm looking at my data again to see what / how I can adjust it - particularly I'm looking at adjusting the 5 rainfall / temperature buckets. In general, would it be better to create additional divisions i.e. 10 categories? Or fewer?

Thanks

Neil

MartinLiebig · July 2015

Hi,

there is no real theory on what is best. If there were one, we wouldn't need tools like RapidMiner and data scientists like us to solve the problems. Getting superb results usually depends on the best preprocessing.
In your case I would use 1-5 (20-100) because your values have a natural ordering. Of course it could be that another preprocessing is better.

The same is true for the number of rainfall and temperature buckets. Usually it is just "try it out and measure your performance".

Best,
Martin
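
For trying out different bucket counts, percentile-based binning can be sketched with the Python standard library (3.8+). The rainfall figures below are invented, and the choice of the "inclusive" quantile method is an assumption; the actual cut points used in the thesis are not given in the thread.

```python
from bisect import bisect_left
from statistics import quantiles

# Invented monthly rainfall totals in mm, for illustration only.
rainfall_mm = [12.0, 30.5, 45.0, 51.2, 60.0, 72.3, 80.1, 95.4, 110.0, 130.7]


def make_bucketer(values, n_buckets=5):
    """Return a function mapping a value to a bucket 1..n_buckets by percentile."""
    # n_buckets - 1 cut points, e.g. the 20/40/60/80th percentiles for 5 buckets.
    cuts = quantiles(values, n=n_buckets, method="inclusive")

    def bucket(x):
        # Number of cut points strictly below x, shifted to start at 1.
        return bisect_left(cuts, x) + 1

    return bucket


bucket5 = make_bucketer(rainfall_mm, n_buckets=5)
bucket10 = make_bucketer(rainfall_mm, n_buckets=10)

for value in rainfall_mm:
    print(value, bucket5(value), bucket10(value))
```

Re-running the learner with each bucketing and comparing a validation measure is one way to follow the "try it out and measure your performance" advice.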

neilduggan · July 2015

Hi

That seems to work alright, I'm getting pretty decent results now

Thanks very much for your help Martin, I really appreciate it

Neil

MartinLiebig · July 2015

Hi,

this is wonderful to hear! I would really appreciate it if you could send me a link to your work once it is published. You can contact me either here or via mail: mschmitz at rapidminer.com

Cheers,
Martin

neilduggan · August 2015

I certainly will Martin
