Learning from streaming data with concept drift and imbalance: an overview
The primary focus of machine learning has traditionally been on learning from data assumed to be sufficient and representative of the underlying fixed, yet unknown, distribution. Such restrictions on the problem domain paved the way for development of elegant algorithms with theoretically provable performance guarantees. As is often the case, however, real-world problems rarely fit neatly into such restricted models. For instance class distributions are often skewed, resulting in the “class imbalance” problem. Data drawn from non-stationary distributions is also common in real-world applications, resulting in the “concept drift” or “non-stationary learning” problem which is often associated with streaming data scenarios. Recently, these problems have independently experienced increased research attention, however, the combined problem of addressing all of the above mentioned issues has enjoyed relatively little research. If the ultimate goal of intelligent machine learning algorithms is to be able to address a wide spectrum of real-world scenarios, then the need for a general framework for learning from, and adapting to, a non-stationary environment that may introduce imbalanced data can be hardly overstated. In this paper, we first present an overview of each of these challenging areas, followed by a comprehensive review of recent research for developing such a general framework.