I am doing a predictive analysis on employee attrition.

joseph · June 2016

I want to know which employees have a higher chance of leaving within a month, quarter, or year. I have prepared the employee data as listed below and would like to know how to take it further. I am using RapidMiner for the first time, please help.

Empl no, Age, Total gross salary, compensation increment: integer

Name, gender, role, Job level, qualification, practice unit, city, appraisal rating for last 5 years, onsite/offshore, role maturity: polynominal

Date of Birth, Date of joining, Date of confirmation, last revision date, last role change date: date time

Previous experience, total experience, Loss of pay duration, utilization: real

Regards

Joseph

MartinLiebig · June 2016

Hi Joseph!

Great to see you here! Your problem is very well suited for predictive analytics. In fact, our tutorials cover a very similar use case. Have you had a look at http://docs.rapidminer.com/studio/getting-started/ ? The key question is whether you have a label, meaning historical data where you know the true outcome.

If you have any specific question I am very happy to answer it.

Best,

Martin

joseph · June 2016

Thanks Martin for the reply. I have gone through a couple of videos, but I am not finding the right way to manage it. How would I know which attribute in the entire list of data is the label?

Regards

Joseph

MartinLiebig · June 2016

Do you have the historic information on whether each employee has left the company or not?

This would be the label.

~martin

joseph · June 2016

hello Martin,

I have added an attribute called status, which tells whether the employee is working, and I have set it as the label. But when I run the process and click on results, I see a question mark against a couple of employees. Why?

Regards

Joseph

joseph · June 2016

Hello Martin,

All the employees are active in the system as of today. I have created a column called status, marked all employees as active, and set it as the label. When I run it, the results tab shows a question mark in the status column for a couple of employees. I need to know why.

MartinLiebig · June 2016

Joseph,

to do supervised learning you would need to have "quitters" as well. Then the algorithm would learn the rules to distinguish between quitters and stayers.

If you cannot get the data with the quitters, you can go unsupervised (or semi-supervised), but this is way harder than the supervised task. Is there any way that you can get the quitter data?

Is there also maybe a chance for you to attend a RapidMiner training? We cover the topic very extensively there.

As for the "?": this indicates a missing value. It might be generated if you divided by 0 or something similar. You can replace missing values with the Replace Missing Values operator.

~Martin

Telcontar120 · June 2016

Joseph, you may also be interested to look at the template process for "churn modeling" which is very similar to the idea of predicting employee attrition. You can easily access that sample process when you first start up RapidMiner (see the attached screenshot). That process also includes information about defining the label that you need to do the predictive modeling for this type of problem.

[Attached screenshot: churn modeling.PNG]

joseph · June 2016

Dear Mark, Brian,

I appreciate the effort you are putting in to help me out. I have a question. I have got the quitters data now. I have employees who are labelled as active, on notice, and inactive. Will this help me predict who will be quitting from the active list? What should I do to get this?

Second, in my earlier discussion I had mentioned the question mark against some of the employees. What I have done is check the option of replace missing value with errors. Is that okay?

Regards

Joseph

joseph · June 2016

Sorry Martin, I misspelled your name. The question was for you and Brian.

Regards

Joseph

Telcontar120 · June 2016

1) If you have data now for quitters, then you can put together a dataset from some time period in the past when you had both quitters and non-quitters. Your label will likely be defined as a binary categorical variable (quitters vs non-quitters). But you need to make sure that the attribute values are from a period in time that precedes the time of the label, otherwise you will be "peeking into the future" with predictor values that actually occur after the time of your outcome variable. Once you have that dataset defined, you will be able to train a model to predict quitters (using the data for quitters and non-quitters), and then you will be able to apply the model to people who are active today to predict how likely they are to quit in the future.

2) If the missing value is for the actual label, then you are probably better off filtering those examples out altogether so they do not bias your model. If the missing values are for any of the other attributes, then as long as there are not too many of them, you should in general be able to use the Replace Missing Values operator with minimal impact.
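
To make the "no peeking into the future" point in 1) more concrete, here is a minimal Python sketch (outside RapidMiner, just to show the idea). The field names, the dates, and the one-year horizon are all made-up assumptions: attributes are computed as of a cutoff date, and the label only describes what happened after that date.

```python
from datetime import date, timedelta

# Hypothetical employee records; field names and dates are illustrative only.
employees = [
    {"emp_no": 1, "joined": date(2012, 3, 1), "left": date(2015, 8, 15)},
    {"emp_no": 2, "joined": date(2013, 6, 1), "left": None},
    {"emp_no": 3, "joined": date(2014, 1, 10), "left": date(2014, 11, 30)},
    {"emp_no": 4, "joined": date(2015, 9, 1), "left": None},
]

def build_snapshot(records, cutoff, horizon_days=365):
    """Return one row per employee who was active on `cutoff`.

    Attributes use only information known on the cutoff date;
    the label says whether the employee left within the horizon after it.
    """
    window_end = cutoff + timedelta(days=horizon_days)
    rows = []
    for r in records:
        if r["joined"] > cutoff:
            continue  # not hired yet on the cutoff date
        if r["left"] is not None and r["left"] <= cutoff:
            continue  # already gone on the cutoff date
        quit_in_window = r["left"] is not None and r["left"] <= window_end
        rows.append({
            "emp_no": r["emp_no"],
            "tenure_days": (cutoff - r["joined"]).days,  # known at cutoff
            "label": "quitter" if quit_in_window else "stayer",
        })
    return rows

snapshot = build_snapshot(employees, cutoff=date(2015, 1, 1))
for row in snapshot:
    print(row)
```

Employees who had already left before the cutoff, or who joined after it, do not appear in the snapshot at all; they can still show up in a snapshot built for a different cutoff date.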

joseph · June 2016

Dear Brian,

As per your response, if I look at my data, the status of the employee is set as the label (which tells me whether the employee is active, inactive, or on notice) and there are no blank records for the label. 1) Is there an issue in this?

2) Can you please clarify your statement "But you need to make sure that the attribute values are from a period in time that precede the time of the label"?

3) I have gone through the demo "customer churn". In the label they have defined customers as loyal or churn, and some records are blank. On what basis are the customers defined as loyal?

Regards

Joseph

MartinLiebig · June 2016

Hi!

1) Is there an issue in this.

This sounds very good. You need to decide if you want to do a three-class or a two-class problem. Do you really need the notice option? Or is it fine to build a classifier to distinguish between active and not active?
The reason I ask is that it makes the problem algorithmically easier to solve if you only have two classes. Often you have problems with labels like "positive, neutral, negative" where you simply classify negative and positive and ignore neutral during learning. If you need all three classes, you can of course also do this in RapidMiner.

On what basis the customers are defined as loyal.

In the demo we assume that customers who stayed with us (at least until now) can be called loyal.

~Martin

joseph · June 2016

Thanks Martin for your guidance. I tried it with three classes but was not very happy with the result. As you mentioned, I will exclude employees who are "on notice".

My question here is: can I classify all my active employees as loyal? Will this help me get the right prediction?

Regards

Joseph

MartinLiebig · June 2016

Good question... I would say it might help, but it could be a lengthy discussion.

On the one hand, more data is always better (for nonlinear algorithms at least). So if you can get more data on the loyal side, it is for sure helpful. On the other hand, you generate an unbalanced data set, with way more active than non-active employees. You need to handle this. Furthermore, you might introduce a tricky thing: active employees who will quit. So the label becomes uncertain.
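
One common way to handle an unbalanced label is to give the rare class a larger weight during training (downsampling the majority class is another). The following is only a sketch in plain Python with made-up counts, not a RapidMiner process; it computes weights inversely proportional to class frequency, which many learners accept as example or class weights.

```python
from collections import Counter

# Made-up label column: many more active than inactive employees.
labels = ["active"] * 90 + ["inactive"] * 10

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

weights = balanced_class_weights(labels)
print(weights)
```

With these counts each inactive example counts ten times as much as an active one, so both classes contribute equally in total.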

My advice for you would be to first think about optimizing your algorithms etc. What did you use as an algorithm? Have you performed feature selection? What about feature generation? I would assume there is a lot of potential to get better.

Is there the option for you to post the process? Possibly you cannot post the data, right?

~Martin

bhupendra_patil · June 2016

Hi Joseph,

The ? mark represents missing values.

Depending on which column has missing values, you can choose to handle it differently.

E.g. if missing values are present in the label column itself, then potentially you can treat those rows as the ones you will predict the value for.

If missing values are present in a numeric column, then based on your understanding you can choose to replace them with a value like the average, min, max, zero or something. If, let's say, the value was missing for country, and most employees are based in "USA", then you can safely assume USA.

So how to handle missing values will come from your business understanding.
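
As a rough illustration of these rules outside RapidMiner, here is a small Python sketch. The column names and values are invented: rows with a missing label are set aside as the ones to predict, numeric gaps are filled with the average, and nominal gaps with the most frequent value.

```python
from collections import Counter
from statistics import mean

# Invented example rows; None plays the role of "?" in RapidMiner.
rows = [
    {"emp_no": 1, "salary": 50000.0, "city": "Pune", "status": "active"},
    {"emp_no": 2, "salary": None, "city": "Pune", "status": "inactive"},
    {"emp_no": 3, "salary": 70000.0, "city": None, "status": "active"},
    {"emp_no": 4, "salary": 60000.0, "city": "Chennai", "status": None},
]

def handle_missing(rows, label, numeric, nominal):
    """Split rows by label availability and fill attribute gaps.

    Fill values are computed from the labelled rows only and then
    applied to both groups.
    """
    labelled = [dict(r) for r in rows if r[label] is not None]
    unlabelled = [dict(r) for r in rows if r[label] is None]
    for col in numeric:
        fill = mean(r[col] for r in labelled if r[col] is not None)
        for r in labelled + unlabelled:
            if r[col] is None:
                r[col] = fill  # replace with the average
    for col in nominal:
        seen = Counter(r[col] for r in labelled if r[col] is not None)
        fill = seen.most_common(1)[0][0]
        for r in labelled + unlabelled:
            if r[col] is None:
                r[col] = fill  # replace with the most frequent value
    return labelled, unlabelled

labelled, unlabelled = handle_missing(
    rows, label="status", numeric=["salary"], nominal=["city"]
)
```

Whether the average or the most frequent value is a sensible replacement for a given column is exactly the business judgment described above; the sketch only shows the mechanics.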

For attrition information, what you can do is, let's say, get data for the last 5 years or something like that, then create a summary at the employee level. This will be a combination of aggregates, joins, etc. So basically you will summarize your employees' history like

name, tenure length, age, salary, max rating, min rating, average rating, max salary raise, average raise, average bonus, commuting distance... etc

There are endless possibilities for which attributes describe your employees. Knowing the business, you will have an understanding of what these could be; eventually RapidMiner will help you understand which are good indicators and which are not.

Once you have one row representing one employee, you should create a new column describing current employment status. That is your label.

So anyone who left is marked as "departed", anyone who is still working is marked as "current", and then you change the role of this column to Label.

Then you can use various learning algorithms to find the patterns and learn from them.

Once learned, you can then apply the model to the current list of employees and predict who will potentially depart.

One thing you may want to consider: instead of creating one row per employee, create one row per employee per year,

so it will look something like

Emp ID   Year   Name   Age        Tenure    Active
1        2014   John   24 years   1 year    Y
1        2015   John   25 years   2 years   Y
1        2016   John   26 years   3 years   N
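
A small Python sketch of how such employee-per-year rows could be generated. The function and its parameters are hypothetical, and it assumes the Active flag becomes "N" in the year the employee leaves, as in the layout above.

```python
def employee_year_rows(emp_id, name, birth_year, join_year, last_year, left_year=None):
    """Expand one employee into one row per year of service.

    Rows run from the year after joining up to the year the employee left,
    or up to `last_year` if the employee is still with the company.
    """
    end_year = left_year if left_year is not None else last_year
    rows = []
    for year in range(join_year + 1, end_year + 1):
        rows.append({
            "emp_id": emp_id,
            "year": year,
            "name": name,
            "age": year - birth_year,
            "tenure": year - join_year,
            "active": "N" if year == left_year else "Y",
        })
    return rows

# Reproduces the three example rows for John (birth and join years are assumed).
rows = employee_year_rows(1, "John", birth_year=1990, join_year=2013,
                          last_year=2016, left_year=2016)
for row in rows:
    print(row)
```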

Edit: fixed some Typos

joseph · June 2016

Dear Martin, Bhupendra,

I have attached the process flow which I am using to predict employee attrition, but it is giving me a wrong prediction. Here is what I have done so far:

1) Prepared the employee data, where the target has the values "active" and "inactive" and is marked as the label.

2) I have missing values in Date of confirmation and last role change date, so I used Select Attributes with invert selection to exclude these 2 attributes.

3) I have converted all polynominal attributes with "Nominal to Numerical", because I will be using logistic regression.

4) I then applied Set Role, where I entered the attribute name and gave it the target role label. Under "set additional roles" I clicked on edit list, entered the empl. no. as the attribute name, and assigned ID as its target role.

5) Then I used Split Data with the ratios 0.7 and 0.3.

6) After that I applied Logistic Regression, connected it to Apply Model, then used the Performance (Classification) operator and connected it to the results.

Can you tell me where i have gone wrong or what needs to be done to get the right process and the right prediction?

Regards

Joseph

MartinLiebig · June 2016

Hi Joseph,

the process seems to be fine as a baseline. I would use a cross-validation instead of splitting it. Afterwards you need to iterate through different algorithms, parameters and feature selections together with feature generation to improve your results.
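
To show what cross-validation does differently from a single 70/30 split, here is a minimal Python sketch of stratified k-fold splitting. The label counts are made up, and inside RapidMiner the cross-validation operator does this for you. Every example is held out exactly once, and each fold keeps roughly the same class mix.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign example indices to k folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal out round-robin
    return folds

# Made-up label column: 80 active and 20 inactive employees.
labels = ["active"] * 80 + ["inactive"] * 20
folds = stratified_folds(labels, k=5)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # Train on train_idx, evaluate on test_idx, then average the k scores.
    print(len(train_idx), len(test_idx))
```

Averaging the performance over the k folds gives a more stable estimate than one split, which matters when there are few quitters in the data.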

~Martin

joseph · June 2016

Hi Martin,

Thanks for the reply. There are 6 sub operators under feature selection and 5 under Feature Generation. Which sub operator to use to generate the prediction?

Are there any training institutes in India (BLR, CHN, HYD, PUN, TVM) which will help me to understand more on the usage of rapidminer?

Regards

Joseph
