Big Data and Real-Time Analytics
The advances in big data technology are opening up new ways to collect and transport large amounts of data more efficiently. This revolution has boosted research and development of real-time algorithms and methods. Traditionally, machine learning algorithms were not designed for real-time processing. In fact, data science competitions (e.g. the Netflix Prize, Kaggle) were criticised because the winning algorithms were often so computationally expensive that they were impractical to use. This is rooted in the perception that accuracy is more important than the speed of the algorithm, since the original setup of data mining was offline, often computed in batches. This started to change with the appearance of big data, as more algorithms were rethought in a scalable way. Most of the time, scalability alone does not compromise the accuracy of an algorithm, as the computation is essentially the same. Real-time processing of big data analytics has brought a more fundamental change, as it restricts the computational complexity of the algorithms that can be employed. A real-time streaming algorithm should meet the following criteria: it should process one example at a time, inspect it at most once, use a limited amount of memory, work in a limited amount of time, and be ready to predict at any time.
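For illustration, these criteria can be expressed as a minimal learner interface. The sketch below is ours, not an established API; the names StreamingLearner, learn_one and predict_one are assumptions made for the example.

```python
from abc import ABC, abstractmethod

class StreamingLearner(ABC):
    """Hypothetical interface capturing the streaming criteria: one example
    at a time, seen at most once, bounded memory, bounded per-example time,
    and predictions available at any moment."""

    @abstractmethod
    def learn_one(self, x, y) -> None:
        """Update the model with a single example, then discard it."""

    @abstractmethod
    def predict_one(self, x):
        """Return a prediction using only the current model state."""

class RunningMeanRegressor(StreamingLearner):
    """Toy instance: predicts the running mean of the targets seen so far,
    using O(1) memory and O(1) time per example."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def learn_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n  # incremental mean update

    def predict_one(self, x):
        return self.mean  # ready to predict at any time (0.0 before any data)
```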
To address these requirements, streaming algorithms are designed so that the learned model is constantly updated to reflect the incoming examples from the stream. After processing an incoming example, the algorithm must be able to produce predictions regardless of data sparsity. Cutting-edge methodology for streaming data has come from a number of diverse directions: online learning, randomised linear algebra, distributed optimisation for cloud computing, and multi-class classification in the presence of noisy and spurious data. Generally speaking, these methods are lightweight, but some parts of the prediction may be based on pre-computed models. In fact, the offline-online cycle is a good compromise between traditional machine learning and real-time analysis: the heavier part of the model is computed offline and then refined by the online part of the method as new observations arrive.
Incremental learning algorithms represent an approach that fits well with the requirements posed by real-time analytics. In essence, these algorithms have an offline core model that is trained on historical data, into which new observations are rolled incrementally. To keep the models fast, this incremental step only partially updates the model, based on the concept drift detected in the stream, and the next full update takes place offline at a scheduled time. This enables the system to react to new observations quickly, which is the compromise between speed and accuracy. It is important to note that, depending on the type of algorithm employed, it may be possible to update the model fully online, in which case there is no need to maintain an offline part at all. In fact, the main criterion that makes an incremental algorithm an online learning algorithm is whether it can update the model and produce predictions in real time.
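The offline-online cycle can be sketched with scikit-learn's SGDClassifier, whose partial_fit method performs exactly this kind of incremental update; the split between "historical" and "stream" data below is of course synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Offline phase: train the core model on historical data in batch.
X_hist = rng.normal(size=(1000, 5))
y_hist = (X_hist[:, 0] + X_hist[:, 1] > 0).astype(int)
model = SGDClassifier(loss="log_loss").fit(X_hist, y_hist)

# Online phase: roll new observations into the model one at a time.
correct = 0
for _ in range(200):
    x = rng.normal(size=(1, 5))
    y = int(x[0, 0] + x[0, 1] > 0)
    correct += int(model.predict(x)[0] == y)  # ready to predict at any time
    model.partial_fit(x, [y])                 # cheap incremental refinement
print(f"online accuracy: {correct / 200:.2f}")
```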
Real-time analytics has been employed in a wide variety of scenarios, including social media, personalisation, finance and various scientific disciplines. However, tools that can deal with large volumes of data in real time are still scarce and are mainly in-house solutions.
Classification:
The Hoeffding tree is an incremental decision-tree induction algorithm; the Hoeffding option tree extends it with option nodes that allow several alternative splits to be explored at a node. Hoeffding trees exploit the fact that a small sample can often be enough to choose an optimal splitting attribute: the Hoeffding bound quantifies, with a chosen confidence, how many examples are needed before the apparently best split is very likely the truly best one.
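The split decision rests on the Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / 2n). A minimal sketch of that test follows; the function names are our own, and the gain range R depends on the split criterion used.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the observed mean of n samples of a
    variable with range value_range lies within epsilon of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, n, value_range=1.0, delta=1e-7):
    """Split once the gap between the two best attributes exceeds epsilon,
    i.e. the apparently best attribute is very likely the truly best one."""
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)
```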
Naïve Bayes is a very simple and computationally light classifier; both the model update and the classification of new samples can be done in real time. Naïve Bayes is a good example of incremental learning without an offline component, as the model can produce predictions without historical data and improves as new observations arrive.
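scikit-learn's GaussianNB exposes this behaviour directly through partial_fit; the two-class Gaussian stream below is a toy example of learning and predicting one observation at a time with no offline phase.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
nb = GaussianNB()
nb.partial_fit(rng.normal(size=(1, 3)), [0], classes=[0, 1])  # first example

correct = 0
for _ in range(500):
    y = int(rng.integers(0, 2))
    x = rng.normal(loc=2.0 * y, size=(1, 3))  # class-conditional Gaussian
    correct += int(nb.predict(x)[0] == y)     # predict before updating
    nb.partial_fit(x, [y])                    # one-example update
print(f"streaming accuracy: {correct / 500:.2f}")
```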
Clustering:
StreamKM++ computes a small weighted sample (a coreset) of the data stream and uses k-means++ as a randomised seeding technique to choose the initial values for the cluster centres.
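k-means++ seeding picks each new centre with probability proportional to its squared distance from the centres chosen so far. A minimal numpy sketch of the seeding step (applied here to a plain array rather than a weighted coreset):

```python
import numpy as np

def kmeans_pp_seed(X, k, rng=None):
    """Choose k initial centres: the first uniformly at random, each next
    one with probability proportional to its squared distance to the
    nearest centre chosen so far."""
    rng = rng or np.random.default_rng()
    centres = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centres)
```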
D-Stream uses an online component that maps each input data record into a grid and an offline component that computes the grid densities and clusters the grids based on density. The algorithm adopts a density-decaying technique to capture the dynamic changes of a data stream.
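The decayed density update, D <- lambda^(t - t_last) * D + 1, follows the form used by D-Stream; the surrounding class below is our own sketch of the online component, with the offline density clustering omitted.

```python
from collections import defaultdict

class GridDensity:
    """Online component sketch: map records to grid cells and keep a
    decayed density per cell, so stale cells fade away over time."""

    def __init__(self, cell_size=1.0, decay=0.998):
        self.cell_size = cell_size
        self.decay = decay
        self.cells = defaultdict(lambda: (0.0, 0))  # cell -> (density, last_t)

    def insert(self, point, t):
        cell = tuple(int(v // self.cell_size) for v in point)
        density, last_t = self.cells[cell]
        # Older mass fades geometrically; the new record adds weight 1.
        self.cells[cell] = (density * self.decay ** (t - last_t) + 1.0, t)
```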
Regression:
Incremental LDA updates the least-squares solution of LDA when new samples arrive. The advantage of this method is that it performs a full update of the model, which yields the exact least-squares solution of batch LDA.
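The general mechanism behind such exact full updates is recursive least squares, where the inverse covariance term is updated with the Sherman-Morrison formula instead of being recomputed from scratch. A generic sketch of that pattern, not the specific incremental LDA algorithm:

```python
import numpy as np

class RecursiveLeastSquares:
    """Exact online least squares: after each update, w matches the batch
    least-squares solution over all examples seen so far (given the prior)."""

    def __init__(self, dim, prior=1e6):
        self.w = np.zeros(dim)
        self.P = np.eye(dim) * prior  # approximate inverse of X^T X

    def update(self, x, y):
        # Sherman-Morrison rank-one update of P = (X^T X)^{-1}.
        k = self.P @ x / (1.0 + x @ self.P @ x)
        self.w += k * (y - x @ self.w)
        self.P -= np.outer(k, x @ self.P)
```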
SAIRT is an incremental version of binary regression trees. It adapts the induced model when facing data streams with unknown dynamics, such as gradual and abrupt function drift, changes in certain regions of the function, noise, and virtual drift. It monitors the usefulness of nodes and forgets examples from selected regions, storing the remaining ones in local windows associated with the leaves of the tree.
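The local-window idea can be illustrated with a bounded per-leaf buffer; this is a generic sketch of the pattern, not SAIRT's actual data structures.

```python
from collections import deque

class LeafWindow:
    """Per-leaf buffer sketch: keep only the most recent examples for a
    region, so stale observations are forgotten when the function drifts."""

    def __init__(self, maxlen=100):
        self.window = deque(maxlen=maxlen)  # old examples fall off the left

    def add(self, x, y):
        self.window.append((x, y))

    def predict(self):
        # Local model: the mean target of the examples currently in scope.
        if not self.window:
            return 0.0
        return sum(y for _, y in self.window) / len(self.window)
```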