Optimize Auto Model towards Sensitivity
Some context on my problem: I have an imbalanced dataset with 3k observations, of which about 5% are successful companies and 95% are unsuccessful ones. The underlying definitions of success and failure are not relevant here, as the dataset contains only the labels 0 (failure) and 1 (success). For every company, I have about 150 features that were recorded at time t1. The label was determined at a later time t2, because at t1 it is still unclear whether the company will become successful.
Goal: Based on the information available at t1, I want to predict whether a company will be a success or a failure at t2. The model should serve as a pre-selection tool that helps venture capital investors decide which companies to focus on, i.e., which have the highest likelihood of success. In venture capital, only a very small number of portfolio companies account for the majority of a fund's return. Most companies fail and return nothing. The return distribution resembles a Pareto distribution, where 20% of companies account for 80% of returns. Consequently, the investor cannot afford to miss any of the success cases. This means that while it's okay to wrongly classify a failure as a success (false positive), it's not okay to wrongly classify a success as a failure (false negative), i.e., I need to optimise the model towards sensitivity.
Problem: After running Auto Model, I have 2 questions: 1) With the default settings, Naive Bayes is the only model with a sensitivity other than 0 (it reaches 87.5%). How can I optimize all models towards sensitivity? 2) How can I limit the number of success predictions? Once I optimize towards sensitivity (avoiding false negatives), a model could simply predict every company as a success and end up with 100% sensitivity. Is it possible to cap the number of success predictions at a specific threshold, e.g., 20% of the sample size?
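To make question 2 more concrete, here is a minimal Python sketch of what I mean by capping the success predictions. This is not RapidMiner functionality, only an illustration outside the tool. It assumes each model outputs a confidence score for "success" per company. The function names `top_fraction_predictions` and `sensitivity` and the toy data are made up for this example.

```python
import math


def top_fraction_predictions(scores, fraction=0.2):
    """Label the highest-scoring companies as 1 (success), the rest as 0.

    At most floor(fraction * n) companies are labelled as success, so the
    number of positive predictions never exceeds the cap.
    """
    k = math.floor(fraction * len(scores))
    # Indices sorted by score, highest first (ties keep their input order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    preds = [0] * len(scores)
    for i in order[:k]:
        preds[i] = 1
    return preds


def sensitivity(y_true, y_pred):
    """Sensitivity (recall) = TP / (TP + FN) for the positive class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        return float("nan")  # no actual successes in the data
    return tp / (tp + fn)


# Toy data (hypothetical): 10 companies, 2 of them actual successes.
toy_scores = [0.9, 0.1, 0.8, 0.3, 0.2, 0.4, 0.05, 0.6, 0.7, 0.15]
toy_labels = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]

toy_preds = top_fraction_predictions(toy_scores, fraction=0.2)
print("predicted successes:", sum(toy_preds))
print("sensitivity under the 20% cap:", sensitivity(toy_labels, toy_preds))
```

With a cap like this, sensitivity can no longer be pushed to 100% by predicting every company as a success, so models can be compared on the sensitivity they achieve within the same budget of success predictions.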
Really looking forward to your help, and thanks in advance!