
SpaceTime’s distributed machine learning library, Lateo, provides an easy-to-use framework for building, training, and implementing hierarchical dynamic Bayesian networks. Using models such as hidden Markov models and hierarchical mixture models, Lateo gives us a generalizable machine learning engine for anomaly detection, asset failure time prediction, change point detection, and more, at scale.

When SpaceTime developed Lateo, we had a number of platforms we could build on. We considered MPI and other distributed processing frameworks, but ultimately chose to implement our algorithms on Apache Spark for its expressiveness, multi-language support, and extensibility, and because it has become the platform of choice for large-scale data science projects.

Lateo is our proprietary library, written in Scala, and it is used in our industrial Internet of Things analytics applications. While Spark’s MLlib is widely used, Lateo is a more flexible tool for ML and modeling in general: it can construct complex estimators for any data type and data structure, and it includes algorithms critical to our work that are not available in MLlib today. That lets us model fairly complex systems in real time. Still, we wanted the same advantages you get with MLlib: multi-language support, and the speed and scalability of distributed computing across Spark clusters. Unlike some other ML libraries for Spark, Lateo distributes its computation, so we can take advantage of the power of large clusters in the cloud. Further, we make extensive use of Kafka in our platform for analytics on real-time data streams, so Kafka’s integration with Spark was essential.

In some of our scalability tests, the advantage of distributed computing on Spark became clear. On memory-bound problems, we scaled the training of a high-dimensional hierarchical Bayesian network from 1 to 250 nodes in a Spark cluster with only about a 50% drop in weak scaling efficiency relative to ideal linear scaling. More importantly, the loss of efficiency flattened out dramatically after 100 nodes, which suggests that we could scale beyond 250 nodes without much further loss and could expect excellent efficiency across different data set sizes and model sizes. This result validated the efficiency of our implementation of hidden Markov models with topic mixtures. For CPU-bound problems, we also found that we could tune the number of nodes in a cluster to get the best efficiency on large data sets.
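For readers unfamiliar with the term, weak scaling keeps the amount of work per node fixed while nodes are added, so the ideal runtime stays constant. The sketch below shows how the efficiency figure is usually computed. The runtimes are hypothetical numbers chosen only to illustrate a curve that flattens after 100 nodes; they are not SpaceTime's measurements.

```python
def weak_scaling_efficiency(t_one_node: float, t_n_nodes: float) -> float:
    """Weak scaling efficiency: single-node runtime divided by N-node runtime.

    Work per node is held constant, so a perfectly scaling system
    keeps the same runtime and scores 1.0.
    """
    if t_one_node <= 0 or t_n_nodes <= 0:
        raise ValueError("runtimes must be positive")
    return t_one_node / t_n_nodes


# Hypothetical runtimes in seconds, keyed by node count (illustrative only).
timings = {1: 100.0, 50: 160.0, 100: 190.0, 250: 200.0}

efficiency = {
    nodes: weak_scaling_efficiency(timings[1], runtime)
    for nodes, runtime in timings.items()
}

for nodes, value in efficiency.items():
    print(f"{nodes:>4} nodes: efficiency {value:.2f}")
```

With these made-up timings, most of the efficiency loss happens before 100 nodes, which is the shape of curve described above.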

One common data science problem that Lateo handles elegantly is modeling right-censored data sets, where some assets have not yet failed by the end of the observation period, so only a lower bound on their lifetime is known. As noted, we use Lateo’s hidden Markov model on Spark to predict failure times for different types of assets. An example we can discuss, using public disk drive data from Backblaze, is below.
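To make right-censoring concrete, here is a minimal sketch that uses a much simpler model than Lateo's: a constant failure rate (exponential lifetimes). It is not Lateo's method or API. The data values and function names are hypothetical. The point is that drives still running contribute their observed time to the estimate even though no failure was seen.

```python
import math


def censored_exponential_mle(durations, failed):
    """Maximum-likelihood failure rate for right-censored exponential lifetimes.

    durations[i] is the time drive i was observed; failed[i] is True if the
    failure was observed and False if the drive was still running (censored).
    The estimate is (number of failures) / (total observed time).
    """
    if len(durations) != len(failed):
        raise ValueError("durations and failed must have the same length")
    total_time = sum(durations)
    if total_time <= 0:
        raise ValueError("total observed time must be positive")
    return sum(1 for f in failed if f) / total_time


def log_likelihood(rate, durations, failed):
    """Failures contribute log(rate) - rate*t; censored drives contribute -rate*t."""
    return sum((math.log(rate) if f else 0.0) - rate * t
               for t, f in zip(durations, failed))


# Hypothetical observation times in days; three drives are still running.
durations = [120.0, 300.0, 365.0, 365.0, 365.0]
failed = [True, True, False, False, False]

rate = censored_exponential_mle(durations, failed)

# Dropping the censored drives overstates the failure rate.
naive_rate = 2 / (120.0 + 300.0)

print(f"censoring-aware rate: {rate:.5f} per day")
print(f"naive rate:           {naive_rate:.5f} per day")
```

Ignoring the surviving drives makes the fleet look far less reliable than it is, which is why a failure-time model needs to account for censoring explicitly.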

[Figure: SpaceTime Insight disk drive failure analytics: predicting drive failure with our distributed machine learning library]

Typically, software for monitoring drive health is condition-based. Using one or more data points from the drive’s SMART monitoring system, the software applies predetermined thresholds to warn drive owners that a drive may fail. Better than nothing, but not truly predictive. We used our Lateo ML library to build a predictive drive failure model. Using the same SMART sensor data, our unsupervised machine learning model discovers from the data the states that drives are in at any given time, and the probabilities that a drive will transition from one state to another. Ultimately we’re able to build a full model of a drive’s path to failure and predict the probability of multiple possible failure times for each drive. The result allows us not only to build a better model for predicting when a drive will fail, but also to better model the real business problem associated with asset failure.
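The idea of turning states and transition probabilities into a distribution over failure times can be sketched with a plain Markov chain. This is an illustration only, not Lateo's implementation. The three states, the transition matrix, and the function name are hypothetical. In a real hidden Markov model, the starting belief would be inferred from the SMART readings rather than given.

```python
def failure_time_distribution(belief, transition, failed_state, horizon):
    """Return P(drive first enters failed_state at step k) for k = 1..horizon.

    belief[i] is the current probability of being in state i, and
    transition[i][j] is the probability of moving from state i to state j.
    failed_state must be absorbing, so the growth of its probability mass
    at each step is exactly the chance of failing at that step.
    """
    current = list(belief)
    n = len(current)
    distribution = []
    for _ in range(horizon):
        following = [
            sum(current[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
        distribution.append(following[failed_state] - current[failed_state])
        current = following
    return distribution


# Hypothetical states: 0 = healthy, 1 = degraded, 2 = failed (absorbing).
transition = [
    [0.95, 0.05, 0.00],
    [0.00, 0.80, 0.20],
    [0.00, 0.00, 1.00],
]

# Assume the model currently believes this drive is degraded.
belief = [0.0, 1.0, 0.0]

dist = failure_time_distribution(belief, transition, failed_state=2, horizon=3)
for step, p in enumerate(dist, start=1):
    print(f"P(failure at step {step}) = {p:.3f}")
```

Unlike a threshold alarm, the output is a probability for each candidate failure time, which is what a downstream replacement decision needs.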

In our applications we use other machine learning models to optimize the prescription for drive replacement, balancing the cost of false negatives (missed failures) against the cost of replacing the asset prematurely. This extends asset life as much as possible and increases the return on capital expenditures. We discussed this topic in more detail in a recent white paper.
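The trade-off can be illustrated with a simple expected-cost calculation. This is a toy stand-in, not the machine learning models mentioned above. The cost model, the numbers, and the function name are all assumptions made for the example: an unplanned failure costs a fixed amount, and replacing early costs a fixed amount per step of the planning horizon given up.

```python
def best_replacement_step(failure_dist, cost_failure, cost_per_lost_step):
    """Pick the planned replacement step with the lowest expected cost.

    failure_dist[k-1] is P(failure at step k) for k = 1..H. Replacing just
    before step t costs cost_failure if the drive has already failed, and
    otherwise cost_per_lost_step for each of the H + 1 - t steps given up.
    t = H + 1 means no planned replacement inside the horizon.
    Returns (best_step, expected_costs_by_step).
    """
    horizon = len(failure_dist)
    costs = {}
    failed_before = 0.0  # P(failure at a step earlier than t)
    for t in range(1, horizon + 2):
        lost_steps = horizon + 1 - t
        costs[t] = (failed_before * cost_failure
                    + (1.0 - failed_before) * cost_per_lost_step * lost_steps)
        if t <= horizon:
            failed_before += failure_dist[t - 1]
    best = min(costs, key=costs.get)
    return best, costs


# Hypothetical failure-time probabilities and costs (illustrative only).
failure_dist = [0.05, 0.10, 0.30, 0.30]
best_step, costs = best_replacement_step(
    failure_dist, cost_failure=100.0, cost_per_lost_step=5.0
)

for step, cost in costs.items():
    print(f"replace before step {step}: expected cost {cost:.2f}")
print(f"best step: {best_step}")
```

With these numbers, replacing immediately wastes too much drive life and waiting too long risks an expensive failure, so the minimum sits in between.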

Please contact us if you’d like a demo of the disk drive analytics, or to learn more about our distributed machine learning library Lateo, or any other SpaceTime solution.

 
