What kind of DM problem am I?

mobmob Member Posts: 37 Contributor I
I have a dataset comprising a summary of workflow events with a class label. Each row represents the path a record takes through the workflow, so rows do not necessarily contain identical sequences.

I'm tasked with outlining improvements that can be made based on the data, but I'm finding it difficult to identify the data mining task to carry out.

Association rules would only tell me what commonly or rarely occurs in relation to attribute values.
Clustering is subjective, so it only really ends up describing what's similar or dissimilar in the dataset, not what improvements should be made because of that.
Classification allows me to identify what class label should be applied based on past events.

Besides pre-processing and outlier identification, I'm not sure how to define the problem as a data mining question.
Any advice?

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,046  RM Data Scientist
    Hey,

    I think you have two different problems.
    The first one is to get your data into a format that is usable for data mining, either with aggregate/pivot/etc. or maybe with the Process Mining Extension.

    The second is to find improvements. My first try would be a simple feature selection. Feature selection answers the question "Which attributes are causing the problems?", and then you can look at those attributes and work on them with domain knowledge.
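
One way to picture the feature-selection step is a simple filter that ranks attributes by information gain against the class label. This is only a sketch in plain Python, not a RapidMiner process; the attribute names (`stage_b_skipped`, `region`) and the toy rows are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, label="label"):
    """Reduction in label entropy after splitting the rows on one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[label])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[label] for row in rows]) - remainder


# Hypothetical workflow summary rows; the column names are assumptions.
rows = [
    {"stage_b_skipped": "yes", "region": "north", "label": "problem"},
    {"stage_b_skipped": "yes", "region": "south", "label": "problem"},
    {"stage_b_skipped": "no", "region": "north", "label": "ok"},
    {"stage_b_skipped": "no", "region": "south", "label": "ok"},
]

# Highest-gain attributes first: these are the ones to review with domain knowledge.
ranking = sorted(
    ((information_gain(rows, a), a) for a in ("stage_b_skipped", "region")),
    reverse=True,
)
```

In this toy data `stage_b_skipped` separates the labels completely while `region` carries no information, so it is the first attribute to look at. A high rank shows association with the label, not necessarily a cause.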

    By the way - if you have labeled data you most likely do not want to use clustering. If you can work supervised, work supervised.

    Cheers,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • mobmob Member Posts: 37 Contributor I
    Thanks for the help, Martin.

    After reducing the dataset down to the interesting attributes, would you suggest classification with a decision tree to get a visual of the composition of the class labels? Are there other DM processes that would produce a list of improvements, or does that really require domain knowledge to review the dataset? In which case, is it not really just a traditional query-based analysis of the dataset (paring it down to a subset you are interested in and comparing the returned rows manually)?
  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,046  RM Data Scientist
    Hi,

    It is very hard to give direct advice on what to do for your specific company. However, what you get is: the attributes important for a churn decision/defect/profit are X, Y, Z. Usually, when you look at them, you directly have an idea of what to do, e.g. get more younger customers, change level 35, or produce more slowly.

    What also comes to mind is to use the resulting model as part of a simulation. You could use it to simulate decisions. This, however, would involve a lot of manual work (and domain knowledge).
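
A rough sketch of what such a simulation could look like: score the rows as they are, apply a candidate decision to every row, and score them again. The `toy_model` function below is a hypothetical stand-in for whatever trained model you end up with, and the attributes `approvals` and `channel` are invented for the example.

```python
def predicted_problem_rate(model, rows):
    """Share of rows the model flags as a problem (model returns 0 or 1)."""
    return sum(model(row) for row in rows) / len(rows)


def simulate(model, rows, changes):
    """Compare the predicted problem rate before and after applying a decision."""
    changed = [{**row, **changes} for row in rows]
    return predicted_problem_rate(model, rows), predicted_problem_rate(model, changed)


def toy_model(row):
    # Hypothetical stand-in for a trained classifier: 1 = predicted problem.
    return 1 if row["approvals"] >= 3 and row["channel"] == "paper" else 0


rows = [
    {"approvals": 3, "channel": "paper"},
    {"approvals": 4, "channel": "paper"},
    {"approvals": 1, "channel": "paper"},
    {"approvals": 3, "channel": "online"},
]

# Simulated decision: move every record to the online channel.
before, after = simulate(toy_model, rows, {"channel": "online"})
```

The model only reflects correlations in past data, so a simulated improvement is a hypothesis to check with domain experts rather than a guaranteed effect.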

    As a side note: I would not use a decision tree for that. You could use a random forest + weight by tree importance if you want to.

    Cheers,
    Martin
  • JEdward RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 563   Unicorn
    I have a dataset similar to this one, and this is the approach I took.

    It had this format.

    id | created-date | stageA date | stageA data | stageB date | stageB data | stageC date | stageC data | end date | end data

    First I summarised the data: how long does it take on average to reach each stage, etc. From this I was able to highlight any areas with strange patterns, for example: "Are there certain stages that take longer for some products than others?"
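
As an illustration of that summary step, the sketch below computes the mean number of days from creation to each stage, split by product. It is plain Python rather than a RapidMiner process, and the `product` column, the field names, and the dates are assumptions made up for the example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical wide-format records; None means the stage was never reached.
records = [
    {"id": 1, "product": "X", "created": date(2024, 1, 1),
     "stageA": date(2024, 1, 3), "stageB": date(2024, 1, 10)},
    {"id": 2, "product": "X", "created": date(2024, 1, 2),
     "stageA": date(2024, 1, 6), "stageB": None},
    {"id": 3, "product": "Y", "created": date(2024, 1, 1),
     "stageA": date(2024, 1, 2), "stageB": date(2024, 1, 4)},
]


def mean_days_to_stage(records, stages=("stageA", "stageB")):
    """Mean days from creation to each stage, keyed by (product, stage)."""
    durations = defaultdict(list)
    for record in records:
        for stage in stages:
            if record[stage] is not None:
                elapsed = (record[stage] - record["created"]).days
                durations[(record["product"], stage)].append(elapsed)
    return {key: mean(days) for key, days in durations.items()}


summary = mean_days_to_stage(records)
```

Here product X takes noticeably longer than product Y to reach both stages, which is the kind of pattern worth raising with the process owners.
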
    I then looked at whether certain attributes or attribute values are more important than others (dummy coding can work well for this, as can decision trees).
    Make sure you have a lot of domain knowledge here, as some fields might need to be removed because they are an indication of the outcome and not of what influenced it. (For example, with label = churn, the most important attribute and value is status = cancelled; that doesn't really tell you much, so remove it.)
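
A quick, hedged way to spot candidates for such outcome fields is to check how well each attribute alone predicts the label. The sketch below flags attributes whose values map perfectly onto the label; the column names and rows are invented, and the final decision to drop a field still needs domain knowledge.

```python
from collections import defaultdict


def label_purity(rows, attribute, label="label"):
    """Share of rows whose label matches the majority label of their attribute value."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        counts[row[attribute]][row[label]] += 1
    majority = sum(max(by_label.values()) for by_label in counts.values())
    return majority / len(rows)


# Hypothetical rows mirroring the churn example above.
rows = [
    {"status": "cancelled", "email": "missing", "label": "churn"},
    {"status": "cancelled", "email": "missing", "label": "churn"},
    {"status": "active", "email": "present", "label": "stay"},
    {"status": "active", "email": "present", "label": "stay"},
    {"status": "active", "email": "missing", "label": "stay"},
    {"status": "active", "email": "present", "label": "stay"},
]

# Attributes that reproduce the label perfectly deserve a manual look.
suspects = [a for a in ("status", "email") if label_purity(rows, a) == 1.0]
```

Identifier-like attributes with a unique value per row will also score 1.0, so treat the result as a list of fields to review, not an automatic removal list.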

    The next step was to go more in-depth into the dataset. I used RapidMiner to loop through the dataset and turn the workflow summary back into a workflow, similar to this pattern:

    id | stage | date | data
    id | A      | date | data
    id | B      | date | data
    id | C      | date | data
    id | End    | date | data
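
The same reshaping can be sketched outside RapidMiner in a few lines of Python. The column naming convention (`"<stage> date"`, `"<stage> data"`) follows the summary format shown earlier; the sample row and the ISO date strings are assumptions for illustration.

```python
def to_event_log(rows, stages):
    """Turn one wide row per record into one row per (record, stage) event."""
    events = []
    for row in rows:
        for stage in stages:
            when = row.get(f"{stage} date")
            if when is None:
                continue  # the record never reached this stage
            events.append({
                "id": row["id"],
                "stage": stage,
                "date": when,
                "data": row.get(f"{stage} data"),
            })
    # ISO date strings sort chronologically, so this orders each record's path.
    events.sort(key=lambda event: (event["id"], event["date"]))
    return events


# Hypothetical wide-format row; stageB and stageC were skipped.
wide = [
    {"id": 1,
     "stageA date": "2024-01-03", "stageA data": "ok",
     "stageB date": None, "stageB data": None,
     "end date": "2024-01-10", "end data": "closed"},
]

events = to_event_log(wide, stages=("stageA", "stageB", "stageC", "end"))
```

Skipped stages simply produce no event row, which is what lets a process mining tool see that different records took different paths.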

    This then allows me to use the process mining extension to analyse the workflow and to see and represent any problems.
    Examples of things you might find: email address is empty => customer more likely to churn; customer purchases sweets => chance of churn decreases for 3 months.
    Solution for the business: send everyone without an email address an offer on sweets if they provide one. :)