Confused about how to approach my data: start with clustering, go straight to prediction, or is there a better idea?

Gonfiaf_Zuraik Member Posts: 9 Contributor II
edited December 2018 in Help

Dear all,

 

I am working with a dataset that contains more than 8456 rows and 26 columns. The data is about projects that take place in Europe; each row is a project.

These are the columns:

Office Office Country Competence Executive competence Classification Enquiry date Creation date Confirmation date Proposal Date Final invoice sent date Intermediary Customer ID Customer Event Group name Reference code Start date End date Project manager Main contact Via sales contact Project location Project country Heard About Us Source Market Client Kind Client Sector Region Market Lead Sent to Event Frequency Pipeline Future Projects Initial Pax Estimated turnover Estimated costs Estimated profit % Status Pax Net turnover Net costs Gross profit Gross profit % Net profit Net profit % Agency commissions Supplier commissions Cancellation/Rejection reason Cancellation date Remarks Controlled Financial Regime Currency Exchange Rate Payment status % Required(Net) Required Invoiced To invoice Receipt To pay Custom invoices Balance carried forward Comments to low margin Debits Assets Balance TO Inv. TO Acc. TO Total Cost Eff. Cost Man. Cost Acc. Cost Total

 

For privacy reasons I cannot share the data itself, so I created some imaginary data just for illustration:

Office Office Country Competence Executive competence Classification Enquiry date Creation date Confirmation date Proposal Date Final invoice sent date Intermediary Customer ID Customer Event Reference code Start date End date Project manager Project location Project country Heard About Us Source Market Client Kind Client Sector Region Initial Pax Estimated turnover Estimated costs Estimated profit % Status Pax Net turnover Net costs Gross profit Gross profit % Net profit Net profit % Agency commissions Supplier commissions Cancellation/Rejection reason Cancellation date Remarks Controlled Financial Regime Currency Exchange Rate Payment status % Required(Net) Required Invoiced To invoice Receipt To pay Custom invoices Balance carried forward Debits Assets Balance TO Inv. TO Acc. TO Total Cost Eff. Cost Man. Cost Acc. Cost Total
Saint Louis Senegal BL Saint Louis Unknown 22.02.2016 08.04.2016 08.04.2016 23.02.2016 08.04.2016   11896 Customer2 zina 2016 code e1 2 15.04.2016 16.04.2016 Maya Saint Louis 1 hall Senegal   BL Agency Other   35 0 0 0 Completed 35 1.950 1.486 463 24 122 6 0 0         Input/Output EUR 1 100 1.950 2.321 2.321 0 2.321 0 0 0 0 0 0 1.950 0 1.950 0 0 1.487 1.487
Saint Louis Senegal BL Saint Louis Other 08.06.2016 08.07.2016 08.07.2016 14.06.2016 25.07.2016   43 Customer3   code e1 3 07.07.2016 07.07.2016 Maya Saint Louis Senegal   BL Agency Other   0 200 0 100 Completed 0 297 9 288 97 236 79 0 0         Input/Output EUR 1 100 297 354 354 0 354 0 0 0 0 0 0 297 0 297 0 0 9 9
Saint Louis Senegal BL Saint Louis Embassy 19.05.2016 20.05.2016 04.08.2016 04.08.2016 04.08.2016   1978 Customer4 leab 2016 code e1 4 11.09.2016 16.09.2016 Laura Saint Louis Senegal   BL Agency     32 12.000 0 100 Completed 32 9.614 7.416 2.197 23 515 5 0 0         Input/Output EUR 1 100 9.614 11.441 11.441 0 11.441 0 0 0 0 0 0 9.614 0 9.614 0 0 7.417 7.417
Saint Louis Senegal BL Saint Louis Embassy 20.05.2016 21.05.2016 28.06.2016 28.06.2016 04.08.2016   1978 Customer5 leab 2016 code e1 5 12.09.2016 16.09.2016 Laura Saint Louis Senegal   BL Agency     12 4.500 0 100 Completed 12 4.550 3.526 1.024 22 227 5 0 0         Input/Output EUR 1 100 4.550 5.415 5.415 0 5.415 0 0 0 0 0 0 4.550 0 4.550 0 0 3.526 3.526
Saint Louis Senegal BL Saint Louis Unknown 21.03.2016 01.04.2016 15.06.2016 01.04.2016 28.11.2016   807 Customer6 festival 2016 code e1 6 23.09.2016 25.09.2016 Martin Saint Louis Senegal   BL Agency     20 18.000 0 100 Completed 20 11.276 9.676 2.104 19 130 1 0 503         Input/Output EUR 1 100 11.277 12.815 12.815 0 12.815 0 0 0 0 0 0 11.277 0 11.277 0 0 9.676 9.676
Saint Louis Senegal BL Saint Louis Unknown 28.06.2016 29.06.2016 10.08.2016 10.08.2016 14.09.2016   43 Customer7   code e1 7 04.10.2016 05.10.2016 Laura Saint Louis Senegal   BL Agency Other   30 6.000 0 100 Completed 30 4.789 3.778 1.011 21 173 4 0 0         Input/Output EUR 1 100 4.790 5.700 5.700 0 5.700 0 0 0 0 0 0 4.790 0 4.790 0 0 3.779 3.779
Saint Louis Senegal BL Saint Louis Unknown 05.08.2016 06.08.2016 10.08.2016 10.08.2016 10.08.2016   2374 Customer8   code e1 8 04.10.2016 06.10.2016 Laura Saint Louis Senegal   BL Agency Other   2 1.500 0 100 Completed 2 2.007 1.753 254 13 -97 -5 0 0         Input/Output EUR 1 100 2.008 2.228 2.228 0 2.228 0 0 0 0 0 0 2.008 0 2.008 0 0 1.753 1.753
Saint Louis Senegal BL Saint Louis Incentive 01.09.2016 02.09.2016 29.11.2016 06.09.2016 02.11.2016   535 Customer9   code e1 9 19.10.2016 20.10.2016 Larissa Saint Louis Senegal   BL Agency Other   15 2.700 0 100 Completed 15 2.240 1.736 503 22 111 5 0 0         Input/Output EUR 1 100 2.240 2.666 2.666 0 2.666 0 0 0 0 0 0 2.240 0 2.240 0 0 1.737 1.737
Saint Louis Senegal BL Saint Louis Incentive 22.09.2016 12.10.2016 23.11.2016 14.10.2016 07.11.2016   43 Customer10   code e1 10 19.10.2016 20.10.2016 Maya Saint Louis Senegal   BL Agency Other   25 1.000 0 100 Completed 25 2.360 1.433 926 39 513 22 0 0         Input/Output EUR 1 100 2.360 2.808 2.808 0 2.808 0 0 0 0 0 0 2.360 0 2.360 0 0 1.434 1.434
Saint Louis Senegal BL Saint Louis Incentive 05.07.2016 06.07.2016 11.01.2017 12.07.2016 04.11.2016   535 Customer11   code e1 11 21.10.2016 22.10.2016 Larissa Saint Louis Senegal   BL Agency Other   24 4.500 3.500 22 Completed 24 7.513 6.404 1.109 15 -206 -3 0 0         Input/Output EUR 1 100 7.514 8.791 8.791 0 8.791 0 0 0 0 0 0 7.514 0 7.514 0 0 6.405 6.405

 

 

With this data, I want to run analyses and build predictions/classifications to gain new insights and contribute something. The data comes from a company, and I am using it as the basis for my master's thesis.

I need to set up a data mining process, for example predicting next year's net turnover, or clustering the projects to get new insights.

I am fairly new to RapidMiner and I am struggling to choose the right starting point.

 

I thought about generating two new columns at the beginning (inside Turbo Prep): one column called

"Year" = the year of each project

and another column

"Project's length" = the number of days each project lasts
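For anyone who wants to prototype these two derived columns outside RapidMiner, here is a minimal pandas sketch. It assumes the "Start date" and "End date" columns use the dd.mm.yyyy format seen in the imaginary sample; the rows, the helper name `add_date_features`, the output column name, and the choice to count both the first and last day are illustrative assumptions, not part of the original data or workflow.

```python
import pandas as pd

# Hypothetical rows for illustration; only the column names come from the post.
projects = pd.DataFrame({
    "Start date": ["15.04.2016", "07.07.2016", "11.09.2016"],
    "End date": ["16.04.2016", "07.07.2016", "16.09.2016"],
})


def add_date_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with 'Year' and 'Project length (days)' added."""
    out = df.copy()
    # Parse the dd.mm.yyyy strings into proper datetime values.
    start = pd.to_datetime(out["Start date"], format="%d.%m.%Y")
    end = pd.to_datetime(out["End date"], format="%d.%m.%Y")
    # Year is taken from the start date (assumption: the project "belongs"
    # to the year in which it starts).
    out["Year"] = start.dt.year
    # Assumption: both the first and last day count, so a project that
    # starts and ends on the same day has length 1.
    out["Project length (days)"] = (end - start).dt.days + 1
    return out


features = add_date_features(projects)
print(features)
```

If a same-day project should count as 0 days instead, drop the `+ 1`.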

 

Could you please tell me whether I can reach a satisfying result with the attributes I have? Do you have any ideas? I am stuck in the middle, with too much data and too many dilemmas in my head, which prevents me from concentrating and choosing the right approach.

That's why I need some fresh ideas, some motivation, and some recommendations, please.

 

I thought about clustering first and getting insights from the resulting clusters, and then building on that with a decision tree model that predicts, for example, next year's net turnover. (It could be something other than predicting the turnover if you have another idea; I'm open to everything.)

 

I tried Auto Model and clustering, but I'm not getting any useful results. I guess there might be two reasons for this:

1. I do not know exactly how to approach this procedure, and I am missing something.

or

2. The data I have is not good enough for this type of approach.

 

Any help, please?

 

@sgenzer @jczogalla @David_A @mschmitz @stevefarr @Pavithra_Rao

 

 

Many thanks in advance.

 

Kind regards,
Jana

 


Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
    You could start with some simple exploratory data analysis to see the relationships between your attributes.  How about some simple weighting by correlation or by information gain?
    You could also use clustering to see what kind of patterns are in the data.  You should also look for outliers.
    Another option would be to reformulate your target label, since predicting a continuous numerical value (like net turnover) is sometimes more difficult.  Could you redefine it as a classification problem, by setting a threshold level of net turnover and then assigning a class (either above that level or below it)?
    Without seeing your actual data, it is almost impossible to say whether there is enough predictive power in your attributes to do a good job predicting your outcome.  But these are a few other things you should try.

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
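Two of the suggestions above, weighting attributes by correlation and turning net turnover into a two-class label, can be sketched in a few lines of pandas. This is only an illustration under assumptions: the values are made up, the "Project length (days)" column is a hypothetical derived attribute, the median is just one possible threshold, and the "High turnover" label name is invented here. RapidMiner has its own operators for these steps; this is not their implementation.

```python
import pandas as pd

# Made-up values for illustration only; not the real company data.
df = pd.DataFrame({
    "Pax": [35, 0, 32, 12, 20, 30],
    "Project length (days)": [2, 1, 6, 5, 3, 2],
    "Net turnover": [1950, 297, 9614, 4550, 11276, 4789],
})

# Weight by correlation: absolute Pearson correlation of each numeric
# attribute with the target, sorted from strongest to weakest.
corr_weights = (
    df.corr()["Net turnover"]
    .drop("Net turnover")
    .abs()
    .sort_values(ascending=False)
)
print(corr_weights)

# Reformulate the target: assumption here is a median split, which gives
# two classes of equal size; a business-defined threshold may suit better.
threshold = df["Net turnover"].median()
df["High turnover"] = df["Net turnover"] > threshold
print(df[["Net turnover", "High turnover"]])
```

Correlation only captures linear relationships between numeric columns, so a low weight here does not prove that an attribute is useless.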
  • M_Martin RapidMiner Certified Analyst, Member Posts: 125 Unicorn
    Hi: In addition to the great advice from Telcontar120, perhaps it would also be a good idea to ask the people who gave you the data (if you haven't already) how they collected the data, the meanings of all of the data fields, and what they are hoping you might find and why, and how whatever you find out will actually be used.  This might help you formulate and set goals as to what exactly you would like to learn or need to learn from exploring the data. If there's anyone you could talk to who has experience managing or has worked with people involved in some of the projects, this might give you some ideas.
    If they just gave you the data and said "Find something interesting", you would certainly want to try and discover some interesting relationships between the various data fields which you could then talk about with the people who gave you the data, which might lead to you learning more about the meanings of all of the data fields or what your colleagues would like you to concentrate on.
    You may also want to check for missing and NULL data values in the various data fields, and look for any inconsistencies in the data values, because if the data is not entered in a consistent manner, this could make it more difficult for RapidMiner to find interesting relationships between the data fields.  It's usually helpful to get a sense of the minimum, average, median, and maximum values for the numeric data fields and how evenly (or unevenly) the data for each data field is distributed.
    Hope this helps, good luck, and best wishes, Michael Martin
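The data-quality checks described above (missing values, basic statistics, and how evenly values are distributed) could look roughly like this in pandas. The sample values are invented and only the column names are borrowed from the post, so treat it as a sketch of the idea rather than a recipe for the real dataset.

```python
import numpy as np
import pandas as pd

# Invented sample with a few deliberate gaps; not the real data.
df = pd.DataFrame({
    "Classification": ["Unknown", "Other", "Embassy", None, "Incentive"],
    "Pax": [35, 0, 32, np.nan, 15],
    "Net turnover": [1950.0, 297.0, 9614.0, 4550.0, np.nan],
})

numeric_cols = ["Pax", "Net turnover"]

# Number of missing (NaN/None) values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Minimum, average, median, and maximum of each numeric column
# (missing values are skipped by default).
summary = df[numeric_cols].agg(["min", "mean", "median", "max"])
print(summary)

# Skewness as a rough indicator of how unevenly values are distributed:
# values far from 0 suggest a long tail on one side.
skewness = df[numeric_cols].skew()
print(skewness)

# Frequency of each category, including missing entries, which helps to
# spot inconsistent spellings in nominal fields.
category_counts = df["Classification"].value_counts(dropna=False)
print(category_counts)
```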