Predictive Modeling for Imbalanced Big Data in SAS Enterprise Miner and R
A variety of predictive models can handle binary targets, ranging from traditional logistic regression to modern neural networks. However, when the target variable represents a rare event, these models may not be appropriate because they assume that the distribution of the target variable is balanced. This article studies the impact of several resampling methods on conventional predictive models. The resampling techniques are oversampling of the rare events, undersampling of the common events, and the synthetic minority over-sampling technique (SMOTE). Three predictive models (decision trees, logistic regression, and rule induction) are applied to the resampled data using SAS Enterprise Miner (EM) software. The data set consists of home mortgage applications and includes a target variable whose rare event occurs at a rate of 0.8%. The authors varied the percentage of the rare event from the original 0.8% up to 50% and monitored the performance of the three predictive models to see which one worked best.
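To make the resampling idea concrete, the following is a minimal Python sketch of random oversampling and random undersampling to a chosen rare-event rate. It is an illustration only, not the procedure used in SAS Enterprise Miner. The function names (`oversample_to_rate`, `undersample_to_rate`, `rare_rate`) and the toy data are hypothetical; the toy data mirrors only the 0.8% rare-event rate and not the mortgage data itself. SMOTE, which synthesizes new rare cases rather than duplicating existing ones, is not shown.

```python
import random


def _split(rows, is_rare):
    """Separate rows into (rare, common) lists using the is_rare predicate."""
    rare = [r for r in rows if is_rare(r)]
    common = [r for r in rows if not is_rare(r)]
    return rare, common


def oversample_to_rate(rows, is_rare, target_rate, seed=0):
    """Duplicate rare rows (drawn with replacement) until they make up
    approximately target_rate of the returned data."""
    if not 0 < target_rate < 1:
        raise ValueError("target_rate must be strictly between 0 and 1")
    rng = random.Random(seed)
    rare, common = _split(rows, is_rare)
    # Solve n_rare / (n_rare + n_common) = target_rate for n_rare.
    needed = round(target_rate * len(common) / (1 - target_rate))
    extra = max(0, needed - len(rare))
    return common + rare + rng.choices(rare, k=extra)


def undersample_to_rate(rows, is_rare, target_rate, seed=0):
    """Keep every rare row and a random subset of common rows so that rare
    rows make up approximately target_rate of the returned data."""
    if not 0 < target_rate < 1:
        raise ValueError("target_rate must be strictly between 0 and 1")
    rng = random.Random(seed)
    rare, common = _split(rows, is_rare)
    # Solve n_rare / (n_rare + n_common) = target_rate for n_common.
    keep = min(len(common), round(len(rare) * (1 - target_rate) / target_rate))
    return rare + rng.sample(common, keep)


def rare_rate(rows, is_rare):
    """Fraction of rows that are rare events."""
    return sum(1 for r in rows if is_rare(r)) / len(rows)


# Hypothetical toy data: 10,000 rows, 80 of them rare (0.8%).
rows = [{"id": i, "target": 1 if i < 80 else 0} for i in range(10_000)]
is_rare = lambda r: r["target"] == 1

over = oversample_to_rate(rows, is_rare, target_rate=0.5)
under = undersample_to_rate(rows, is_rare, target_rate=0.5)
```

Calling either function with a sequence of target rates between 0.008 and 0.5 would give a series of revised data sets comparable in spirit to the range described above. Note the trade-off: oversampling enlarges the data with repeated rare cases, while undersampling discards most of the common cases.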