The Resampled Data in Imbalanced Classification
Information Age Publishing Inc.
Contemporary Perspectives in Data Mining, Volume 4
Classical positive-negative classification models often fail to detect positive observations in data with a very low positive rate. This problem is common in many domains, such as finance (fraud detection and bankruptcy detection), business (product organization), and healthcare (rare diagnosis). A popular solution is to balance the data either by random undersampling (RUS), which randomly removes a number of negative observations, or by random oversampling (ROS), which randomly reuses a number of positive observations. In this study, we discuss a generalization of RUS and ROS in which the dataset is balanced so that the number of positive observations matches the number of negative observations. We also propose a data-driven method to determine the size of the resampled data that most improves classification models.
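
To make the resampling idea concrete, the following is a minimal Python sketch, not the procedure of this study. It assumes one plausible reading of the generalization: pick a per-class size, undersample any class that is larger than that size, and oversample any class that is smaller. The function name balanced_resample, the parameter per_class, and the toy data are hypothetical and introduced only for illustration. The data-driven choice of the resampled size is not implemented here; per_class is simply supplied by hand.

```python
import random


def balanced_resample(positives, negatives, per_class, seed=0):
    """Return a balanced sample with `per_class` observations from each class.

    Hypothetical helper (an assumption, not the authors' method): a class
    with at least `per_class` observations is undersampled without
    replacement; a smaller class is kept whole and topped up by randomly
    reusing its own observations.
    """
    rng = random.Random(seed)

    def draw(group):
        if per_class <= len(group):
            # Undersampling: keep a random subset, no observation repeated.
            return rng.sample(group, per_class)
        # Oversampling: keep every observation, then reuse some at random.
        extra = rng.choices(group, k=per_class - len(group))
        return list(group) + extra

    return draw(positives), draw(negatives)


# Toy data with a 5% positive rate.
positives = [("pos", i) for i in range(5)]
negatives = [("neg", i) for i in range(95)]

# RUS as a special case: shrink the negatives to the number of positives.
rus_pos, rus_neg = balanced_resample(positives, negatives, per_class=len(positives))

# ROS as a special case: reuse positives until they match the negatives.
ros_pos, ros_neg = balanced_resample(positives, negatives, per_class=len(negatives))

# An intermediate size: negatives are undersampled, positives oversampled.
mid_pos, mid_neg = balanced_resample(positives, negatives, per_class=20)
```

In this sketch, RUS and ROS are the two extremes of a single size parameter, and any value in between yields a balanced dataset that both discards some negative observations and reuses some positive ones.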