Unbalanced data problem


The unbalanced data problem is the problem that machine learning algorithms face when modelling data with unbalanced labels: one or more minority classes have very few instances compared to the rest of the dataset, and the cost of misclassifying them is much higher than for the majority class.

Most learning systems are not prepared to cope with unbalanced data, and several techniques have been proposed to rebalance the classes. This package implements some of the most well-known techniques and proposes a racing algorithm to adaptively select the most appropriate strategy for a given unbalanced task (a simplified stand-in for this selection step is sketched at the end of this section).

The most general basic methods are the following:

  • Sampling: oversampling or undersampling in order to correct the class imbalance (see the resampling sketch after this list).
    • Random undersampling: fast, but important information may be discarded along with the removed majority examples.
    • Neighborhood cleaning rule (NCR): finds each example whose class label differs from the class of at least two of its three nearest neighbors. If the example belongs to the majority class, it is removed; otherwise, its nearest neighbors that belong to the majority class are removed.
    • Condensed Nearest Neighbor (CNN): selects a subset of instances that correctly classifies the original dataset with a one-nearest-neighbor rule. The goal is to eliminate majority-class examples that lie far from the decision border, keeping a smaller, more balanced set of informative instances.
    • Tomek links: the objective is to remove noisy and borderline instances, which makes the class boundary easier for discriminative algorithms to learn.
    • One-sided selection (OSS): an undersampling method resulting from the application of Tomek links followed by the application of Condensed Nearest Neighbor.
    • Random oversampling: replicates minority-class examples in order to rebalance the classes; exact copies encourage overfitting and hurt generalization.
    • SMOTE: generates synthetic minority examples by interpolating between each minority instance and its nearest minority-class neighbors (a jittered form of replication).
    • Borderline-SMOTE: a variant of SMOTE that generates synthetic examples only for the minority instances lying close to the class border.
  • Adapting learning algorithms (by changing decision thresholds or objective functions); see the classifier sketch after this list:
    • Boosting: a meta-algorithm that combines weak learners, adaptively reweighting the training points so that each round focuses on the examples that previous learners got wrong.
      • Standard Boosting: sequentially fits weak learners, reweighting the training set at each round regardless of class.
      • AdaBoost: the classic boosting algorithm; examples misclassified in one round receive a higher weight in the next.
      • SMOTEBoost: adds SMOTE-generated synthetic minority examples at each boosting round.
      • RUSBoost: randomly undersamples the majority class at each boosting round; a cheaper alternative to SMOTEBoost.
  • Ensemble methods: a suitably weighted combination of several learning algorithms.
  • All combined: any of the above methods can be combined, e.g. sampling together with an ensemble.
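
As a concrete illustration of the sampling strategies above, the sketch below runs several of them on a synthetic two-class problem and prints the resulting class counts. It uses the Python library imbalanced-learn, which is an assumption for illustration only (it is not the package described in this post), and an arbitrary synthetic dataset.

```python
# A minimal sketch, assuming imbalanced-learn (not the package described
# here) and an arbitrary 90/10 synthetic dataset.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import (
    RandomUnderSampler,
    NeighbourhoodCleaningRule,
    CondensedNearestNeighbour,
    TomekLinks,
    OneSidedSelection,
)
from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("original:", Counter(y))

samplers = [
    RandomUnderSampler(random_state=0),         # random undersampling
    NeighbourhoodCleaningRule(),                # NCR
    CondensedNearestNeighbour(random_state=0),  # CNN
    TomekLinks(),                               # Tomek links
    OneSidedSelection(random_state=0),          # OSS
    RandomOverSampler(random_state=0),          # random oversampling
    SMOTE(random_state=0),                      # SMOTE
    BorderlineSMOTE(random_state=0),            # Borderline-SMOTE
]
for sampler in samplers:
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```

Note how the undersamplers shrink the majority class by different amounts (NCR and Tomek links only clean the boundary), while the oversamplers grow the minority class up to parity.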
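
For the algorithm-level and ensemble approaches, the sketch below compares plain AdaBoost against RUSBoost and a balanced bagging ensemble on the same kind of data. Again imbalanced-learn is assumed for illustration; it ships RUSBoost but not SMOTEBoost, so RUSBoost stands in for the boosting-plus-sampling family, and the model and metric choices are arbitrary.

```python
# A minimal sketch, assuming scikit-learn and imbalanced-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.ensemble import RUSBoostClassifier, BalancedBaggingClassifier

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

classifiers = [
    AdaBoostClassifier(random_state=0),         # standard boosting
    RUSBoostClassifier(random_state=0),         # boosting + undersampling
    BalancedBaggingClassifier(random_state=0),  # ensemble + undersampling
]
for clf in classifiers:
    clf.fit(X_tr, y_tr)
    # F1 on the minority class is more informative than accuracy here.
    print(type(clf).__name__, round(f1_score(y_te, clf.predict(X_te)), 3))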

It is important to consider how noisy the data are, how many instances are available (or, better, how much information each instance carries), and how heterogeneous the data are with respect to the model being fitted. Method selection should be guided by this information, as sketched below. Two properties that directly condition these values are the overlap between the classes and the data density.
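
The package's racing algorithm performs this selection adaptively, discarding poor candidates early. As a simplified stand-in (an exhaustive cross-validated comparison, not racing), the sketch below scores a few candidate strategies and picks the best one; the candidates, classifier, and scoring metric are illustrative assumptions, not prescribed by the post.

```python
# A minimal sketch of strategy selection by cross-validation. Unlike a
# racing algorithm, every candidate is evaluated in full.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)

candidates = {
    "no resampling": [],
    "random undersampling": [("rus", RandomUnderSampler(random_state=0))],
    "SMOTE": [("smote", SMOTE(random_state=0))],
}

scores = {}
for name, steps in candidates.items():
    # imblearn's Pipeline applies the sampler during fit only, so the
    # cross-validation test folds keep their original class ratio.
    pipe = Pipeline(steps + [("clf", DecisionTreeClassifier(random_state=0))])
    scores[name] = cross_val_score(pipe, X, y, cv=5, scoring="f1").mean()

best = max(scores, key=scores.get)
print(best, scores)
```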
