MLOps: Big Data + Machine Learning + Deep Learning
Process industries such as manufacturing, oil and gas, and energy monitor their assets to achieve efficient and optimized operations. Recovering from events like asset failures and maintenance shutdowns can cost a fortune and is highly undesirable from a business and operational perspective. With the advent of Industry 4.0, it is now possible to gather and analyze data across assets, enabling faster, more flexible, and more efficient processes that produce higher-quality goods at optimized costs.
Manufacturing sites are usually vast and may need thousands of IoT sensors for monitoring. These sensors generate huge amounts of data that cannot be managed effectively by manual processes alone. This is where AI and MLOps become an effective ally in digital transformation.
Advanced AI models can accurately learn and forecast hidden patterns in huge datasets and solve complex use cases like multivariate anomaly detection and root cause analysis, going beyond simple numeric outliers to proactively detect abnormal conditions and relate them to operational impact.
Most industrial assets operate 24/7 and hence require constant monitoring and real-time feedback. Depending on the use case, a large number of AI models may need to be trained and deployed. Thus, building scalable AI pipelines for real-time streaming data is one of the most challenging engineering problems today.
Challenges in Streaming Data + Deep Learning
Deep learning algorithms are compute-heavy and require CPUs/GPUs/TPUs for training. Preferred frameworks for writing them are TensorFlow (from Google), PyTorch (from the open-source community), and Keras (integrated with TensorFlow and Python).
We need to process continuous real-time streaming data and perform analytics with minimal latency. Speed, performance, and fault tolerance are the key factors in choosing the right big-data technology for a streaming use case.
Major big-data technologies for real-time streaming like Apache Flink and Apache Kafka are Java-based and hence difficult to integrate with TensorFlow. Technologies like Apache Spark are heavily used in machine learning but are based on a micro-batch model and aren't well suited to processing large streams of data with real-time analytics. Our best shot is Apache Flink.
Analytics with Apache Flink comes with its own challenges. It's a JVM-based framework that is not compatible with deep learning frameworks by default. One option for training models would be to use Deeplearning4j and perform continuous training over batches of streaming data in Flink, but that would require reinventing many algorithms that have been developed over the years.
Another option is what we’ll discuss next.
Training and tracking Eugenie’s deep learning models
Most heavy industries in the energy, oil and gas, water, and power sectors use SCADA systems to monitor the entire site and automate regulatory tasks.
Eugenie's training modules can collect historical data from SCADA systems and existing databases and use it to understand the expected behavior of the machines.
The data collected from automation control systems is sequential, so we can leverage advanced sequence-to-sequence models such as modified long short-term memory (LSTM) recurrent neural networks to uncover hidden patterns and forecast future patterns in the data.
Using modified LSTMs, we can also detect abnormal behavior in live machines after training on data from normally operating machines. For solving such challenging problems, it's imperative to choose a framework that is lightweight, efficient, and has strong community support. That's the primary reason we use TensorFlow and Keras to develop novel deep learning models at Eugenie.
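To make the approach concrete, here is a minimal sketch of a windowed LSTM forecaster in Keras. The window size, layer sizes, single-feature setup, and synthetic sine-wave signal are illustrative assumptions, not Eugenie's actual architecture or data.

```python
# Minimal windowed LSTM forecaster (illustrative, not Eugenie's model).
import numpy as np
import tensorflow as tf

WINDOW = 30   # number of past time steps fed to the model (assumed)
FEATURES = 1  # univariate sensor signal for simplicity

model = tf.keras.Sequential([
    tf.keras.Input(shape=(WINDOW, FEATURES)),
    tf.keras.layers.LSTM(32),         # summarizes the window into a state vector
    tf.keras.layers.Dense(FEATURES),  # forecasts the next time step
])
model.compile(optimizer="adam", loss="mse")

# Train on sliding windows cut from a historical (e.g. SCADA) signal.
signal = np.sin(np.linspace(0, 20, 500)).astype("float32")
X = np.stack([signal[i:i + WINDOW] for i in range(len(signal) - WINDOW)])[..., None]
y = signal[WINDOW:][..., None]
model.fit(X, y, epochs=2, batch_size=32, verbose=0)

# A large gap between forecast and observation flags a potential anomaly.
forecast = model.predict(X[-1:], verbose=0)
```

In production, the comparison between the forecast and the actual incoming reading is what drives anomaly detection.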
Training the models requires high computational power, for which we prefer GPUs and TPUs with up-to-date CUDA drivers. While training any deep learning model, it's important to log the associated metadata, epoch losses, and accuracy metrics. These logs enable us to track model performance and perform A/B testing; for this, we at Eugenie prefer MLflow.
MLflow is an open-source platform for managing the complete machine learning lifecycle. It provides an intuitive dashboard from which we can monitor model performance. After evaluation, the model artifacts are deployed using TensorFlow Serving.
Serving Eugenie’s model artifacts with gRPC
After training our deep learning models, we need to expose an interface through which the models can make predictions on test data. Most of the time, the training and test datasets live together, but in our use case the test data flows in as real-time streams inside the Flink application, while model training happens on an entirely different stack using TensorFlow and Keras.
Model servers simplify the task of deploying the saved model artifacts of any trained model. There are many active open-source communities building advanced model servers; Seldon Core, Kubeflow, and MLflow, to mention a few. We're looking forward to Seldon Core's multi-model support for our deployment strategy.
In the machine learning lifecycle, apart from continuous integration (CI) and continuous delivery (CD), we also need to handle continuous training (CT) of models. With continuous training, the weights and biases get updated every time we train on a new batch, giving rise to multiple versions of the same model.
With large and systematic delivery systems, it's very important to version the models and serve the latest one; and since there are many models, we also need multi-model deployment.
Deep learning models are, at their core, suitable sets of weights and biases in the network's nodes. Those weights and biases can be saved in the protobuf-based SavedModel format for future use and then loaded and served via TensorFlow Serving.
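A minimal sketch of exporting a trained model to the protobuf-based SavedModel format that TensorFlow Serving loads. The directory layout `<model_name>/<version>` is what Serving expects; the model, paths, and version number "1" are illustrative.

```python
# Exporting a toy model to SavedModel format (saved_model.pb + variables).
import os
import tempfile
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])

# Serving watches <name>/ and picks the highest numeric version directory.
export_dir = os.path.join(tempfile.mkdtemp(), "forecaster", "1")
tf.saved_model.save(model, export_dir)  # writes saved_model.pb in protobuf format

saved = os.listdir(export_dir)
```

Pointing TensorFlow Serving at the `forecaster` directory then makes version 1 available over its REST and gRPC interfaces.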
TensorFlow Serving is a high-performance serving system designed by Google for deploying machine learning models in production environments. Inputs to deep learning models can be large vectors, which makes payloads heavy and slows down remote calls; that is bad for a streaming use case where new data arrives every second.
gRPC is an open-source remote procedure call system based on HTTP/2 that uses protocol buffers to make payloads smaller, simpler, and faster to process.
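To illustrate why a binary wire format (as protocol buffers use) shrinks payloads versus JSON for a large input vector, here is a stdlib-only comparison. The `struct` module stands in for binary encoding; this is not protobuf itself, just a size comparison under the same float32 assumption.

```python
# JSON vs binary encoding size for a 1024-float model input (illustrative).
import json
import struct

vector = [0.123456789] * 1024  # a model input of 1024 floats

json_payload = json.dumps({"instances": [vector]}).encode("utf-8")
binary_payload = struct.pack(f"{len(vector)}f", *vector)  # 4 bytes per float32

ratio = len(json_payload) / len(binary_payload)  # JSON is several times larger
```

The gap grows with vector size and request rate, which is why the article favors gRPC for streaming predictions.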
TensorFlow Serving exposes both REST and gRPC interfaces for getting model predictions. Essentially, it handles the following tasks:
• Versioning of models
• Multi-model deployment
• Deploying a new model without downtime, allowing continuous training (CT) in parallel
• Exposing a gRPC interface, which is faster than REST over HTTP/1.1
In the development and testing environments, we train our models on GPUs. Hence, we set up nvidia-docker and deploy TensorFlow Serving via its GPU image from Docker Hub.
For the production environment, we use Kubeflow and TensorFlow Serving in our on-prem Kubernetes cluster, which exposes REST and gRPC interfaces for usage.
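For a sense of what a request to the REST interface looks like, this sketch builds (but does not send) a predict request body in TensorFlow Serving's documented `instances` format. The model name `eugenie_forecaster`, the host/port, and the window contents are hypothetical.

```python
# Building a predict request body for TensorFlow Serving's REST API.
import json

window = [[0.1], [0.2], [0.3]]  # one (time_steps, features) input window
body = json.dumps({"instances": [window]})

# Default REST port is 8501; the model name here is hypothetical.
url = "http://localhost:8501/v1/models/eugenie_forecaster:predict"

# A real client would POST `body` to `url`; the response carries a
# "predictions" key with one forecast per instance.
```

The gRPC path uses the same model name and version resolution, but sends a binary `PredictRequest` instead of JSON.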
At Eugenie, we constantly engage in research to improve TensorFlow Serving's performance. Co-locating the model servers with the Flink analytics jobs can also help optimize speed and performance.
Stream processing in Eugenie with model servers
Streams in Apache Flink can be processed with respect to event time, ingestion time, or processing time.
Event time is when the actual event occurred, i.e., when the corresponding sensor generated the data. Processing time is the system time of the machine running the Flink job. Ingestion time is the time at which an event enters the Flink job.
For asset monitoring use cases, it's most suitable to process the streams based on event time, since we want to understand the behavior of the sensors according to when readings were actually produced.
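The distinction matters because network records arrive out of order. This pure-Python illustration (not Flink code) buckets readings into 10-second tumbling windows by event time, so a late-arriving reading still lands in the window where it belongs.

```python
# Event-time windowing despite out-of-order arrival (pure-Python illustration).
from collections import defaultdict

WINDOW_SECONDS = 10

# (event_time_seconds, value) pairs, arriving out of order over the network
arrivals = [(3, 0.5), (14, 0.7), (8, 0.6), (21, 0.9), (11, 0.8)]

windows = defaultdict(list)
for event_time, value in arrivals:
    # Tumbling window: bucket by the window's start time.
    window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
    windows[window_start].append(value)

# The window [0, 10) correctly receives the late t=8 reading,
# which processing-time windowing would have misplaced.
```

Flink achieves the same effect at scale with watermarks, which bound how long it waits for late events before closing a window.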
Eugenie's deep learning models for time series analysis are based on a windowing approach: following our hypothesis, we pass a set of historical data points as input to get future forecasts.
Windows in Flink are processed in parallel, which gives high performance and speed. Flink is versatile and can process both unbounded and bounded streams. In our tech stack, Apache Kafka serves as an infinite source of streaming data and Flink acts as a consumer; it processes the infinite streams further by filtering and windowing them into finite streams.
The window operation in Flink can be either count-driven or time-driven. Hence, we can reuse the same windowing concept used for training the models: we build count windows in Flink and, inside the window function, call the model's predict function via the model server's exposed gRPC interface; this is how we get the predictions back.
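The mechanics can be sketched in pure Python: a count-driven sliding window fills a buffer of fixed size and, each time it fires, hands the window to the model server. The `predict` stub below stands in for the remote gRPC/REST call to TensorFlow Serving; the window size and stream values are illustrative.

```python
# Count-driven sliding window feeding a (stubbed) model server call.
WINDOW_COUNT = 4

def predict(window):
    # Placeholder for the remote predict call; here, forecast = last value.
    return window[-1]

def count_windows(stream, size):
    buffer = []
    for reading in stream:
        buffer.append(reading)
        if len(buffer) == size:   # window is full: fire, then slide by one
            yield list(buffer)
            buffer.pop(0)

stream = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
forecasts = [predict(w) for w in count_windows(stream, WINDOW_COUNT)]
```

In the real pipeline, Flink's window function plays the role of `count_windows` and each forecast is compared against the next observed reading to flag anomalies.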
Finally, we use the generated forecasts to analyze the streaming data in Flink and push the relevant metadata (anomalies) back to Kafka, which can then serve other clients as well.
As per Gartner, the AIOps platform market is estimated at between $300 million and $500 million per year, and by 2023, 40% of DevOps teams will use AIOps capabilities to augment their infrastructure and asset monitoring tools. AI/MLOps tools like Eugenie can play a vital role in boosting the operational efficiency of heavy-industry companies with powerful anomaly detection, root cause analysis, and deep learning capabilities.
To know more about Eugenie, register for a product demo at https://eugenie.ai/contact/