Introduction

In this article, I’d like to walk you through how adopting ML Ops practices can help an engineering team looking to apply ML to solve a problem.

The case

So, meet Riley, an engineer at Nicr. Nicr (as in “nicer”, but hipper) is a small startup developing a new model of interaction in social networks. Of course, I just made both Riley and Nicr up completely. But bear with me; it will be worth it. Plus, you might just end up with an inspiring idea for a brand new social network after reading this article! That can’t be too bad now, can it?

OK, don’t quote me on that last part.

The product

We’ll keep it simple. Nicr has built an online social platform where users can:

  • Create posts
  • Comment on posts
  • Up-vote or down-vote posts and comments.

But there’s a twist: as a user is about to submit their post or comment, the product team at Nicr wants them to receive automatically generated feedback scoring the “niceness” of their text as a numerical score from 0 to 10. The user will then be able to decide whether to post their content as is, modify it, delete it or ignore the niceness feedback altogether. The product team at Nicr has dubbed this the nice-a-meter.

They’re quite excited about it.

The implementation

Remember Riley? Tasked with implementing this special feature, Riley is quick to conclude her chances of success lie in the realm of Machine Learning. Being an experienced engineer who has always held an ear to the ground of tech, she knows well enough that sentiment analysis is a thing. Knowing how far ML has come and having played around with some of it on her own, she’s also pretty sure she’ll easily find tools and information to help her out. Indeed, a few queries to her favorite search engines yield a slew of promising tutorials, articles and libraries.

Soon thereafter, having gone through Hugging Face’s great quick tour and a visit or two to Stack Overflow, Riley has a working snippet of code that’s able to process an arbitrary string and return a sentiment score. To hook it up, she deploys a server application with this code and its dependencies to her IaaS of choice and exposes it through a simple HTTP API offering a POST /sentiment-analysis endpoint. Her client apps asynchronously send requests to this server for each comment or post about to be submitted. Issues of scale, should they arise, would be managed by creating a server image, scaling it horizontally and load balancing.
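The article doesn’t show Riley’s actual snippet, but a minimal sketch of such a service could look something like the following, assuming a FastAPI app wrapping the Transformers sentiment pipeline. The mapping from the classifier’s output to the 0 to 10 nice-a-meter scale is purely an illustrative assumption.

```python
# Minimal sketch of a nice-a-meter service: a Hugging Face sentiment pipeline
# behind a small FastAPI app. The 0-10 mapping below is a hypothetical choice.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
classifier = pipeline("sentiment-analysis")  # downloads a default English sentiment model


class TextIn(BaseModel):
    text: str


@app.post("/sentiment-analysis")
def score(payload: TextIn):
    result = classifier(payload.text)[0]  # e.g. {"label": "POSITIVE", "score": 0.98}
    confidence = result["score"]
    # Confident-positive maps towards 10, confident-negative towards 0.
    niceness = confidence * 10 if result["label"] == "POSITIVE" else (1 - confidence) * 10
    return {"label": result["label"], "niceness": round(niceness, 1)}
```

Run it locally with something like `uvicorn main:app`, and the client apps simply POST each draft post or comment to /sentiment-analysis and display the returned niceness score.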

So far so good.

The problem

Nicr’s social network grows, its teams grow and the platform gains a healthy base of loyal users. Problems arise, as they always do and always will; but they’re swiftly curbed by Nicr’s teams filled with fresh energy and motivation.

At a certain point, however, one particularly pernicious problem starts to make itself known. With conspicuously increasing frequency, users are submitting reports of unexpected nice-a-meter results on their content. Riley, now head of Nicr’s engineering team, takes a look at the reported cases to get a sense of the issue. With an idea in mind of what could be going wrong, she and her team sit down to have it out with the nice-a-meter themselves. Things do seem a bit off, and soon enough they form a solid hypothesis of the root of the problem.

As Nicr’s platform grew and its community became tighter, the language its users employed evolved with it. New and idiosyncratic ways of conveying sentiment emerged and flourished. Riley’s initial out-of-the-box Hugging Face model, however, stayed back in time: it had never “seen” this language during its training. Existing expressions had taken on meanings it could not reasonably be expected to associate with them, and entirely new words, previously unused in human interaction, had entered the conversation. In other words, Riley’s ML model had caught a strong case of model drift: it had stagnated, its training clearly outdated with respect to the evolving nature of the data it was employed to process.

An initial solution

Broad strokes: if a model is outdated, then an update is in order. A few questions therefore immediately come to mind:

  • How might we train our model on our own data?

  • How might we validate that the new model is better than the old one?

Having been in charge of shipping the nice-a-meter initially, Riley takes on this task. Digging back into search engines and Hugging Face’s Transformers docs, she learns that their model can be fine-tuned with custom datasets, and that once such a dataset is ready it can also be used to validate the new model’s performance on train/test splits or with cross-validation. In need of a dataset, Riley mines the logs for samples that include interesting cases of misclassification to be manually labeled, and the tagging work is divided among the team.
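The article doesn’t include Riley’s data-preparation code, but the gist of this step might look something like the sketch below: manually tagged log samples are split into a training set and a held-out test set. The file names and column layout are assumptions made purely for illustration.

```python
# Hypothetical sketch: turning manually tagged log samples into train/test splits.
# The CSV layout ("text", "label") and file names are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

labeled = pd.read_csv("tagged_samples.csv")  # columns: text, label (0 = negative, 1 = positive)

train_df, test_df = train_test_split(
    labeled,
    test_size=0.2,               # hold out 20% for validating the fine-tuned model
    stratify=labeled["label"],   # keep class balance similar across both splits
    random_state=42,
)

train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
```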

As for the fine-tuning code, it becomes clear to Riley that this will not be as breezy as setting up the out-of-the-box model was; a decent amount more research and trial and error is required. But, before long, her fine-tuning code is ready to go. With the dataset all cooked up, a new model is trained, to encouraging results: running some metrics on the test split with both the old model and the new one shows a significant improvement, validating the team’s suspicions and their efforts towards a fix. The new model is deployed in much the same way as the first one, with Riley manually uploading it to a server’s disk, creating a new image and rolling it out using her cloud computing service’s general-purpose deployment functionalities.
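Again, Riley’s fine-tuning code isn’t shown in the article, but assuming she used the Transformers Trainer API on the splits prepared above, it might look roughly like this sketch. The starting checkpoint, hyperparameters and accuracy metric are illustrative choices, not details from the story.

```python
# Rough fine-tuning sketch using the Hugging Face Trainer API; all specifics are assumptions.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import accuracy_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed starting model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Load the splits prepared in the previous sketch and tokenize them.
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds)}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nice-a-meter-v2",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())              # metrics on the held-out test split
trainer.save_model("nice-a-meter-v2")  # the artifact Riley then uploads and deploys by hand
```

Evaluating the untouched out-of-the-box model on the same held-out split gives the baseline to beat; that comparison is what tells the team the fine-tuned model is actually an improvement.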

What about the next time?

Nevertheless, a third and important question had also come to Riley’s mind when she understood what had happened:

  • What about the next time?

Just as the nature of data in the wild had changed enough at that point to cause drift, so it is bound to do again. Would they just wait for user reports to pour in again and repeat the exact same process? They’d have to manually prepare a dataset, spend time dividing and managing supervision work, manually re-train and manually re-deploy. Monitoring would again be left to indirect means such as mining subsequent user-submitted reports. Whenever this problem arises again, they’d have to scramble to figure out what’s causing the new drift before even having a chance at working on a fix. Needless to say, all the while they’d be allowing a degradation of service to take place that they might not be able to afford, and crucial work hours would be allocated away from other important bugs and features.

And right as Riley is sitting at her desk pondering these worrying questions, a Slack message from the product team makes its way to her screen: “Hey Riley! Listen, looks like spam is starting to become a problem. I’ve heard you can use Machine Learning for that?”

What about new algorithms?

So yes, Machine Learning is probably the best solution for spam classification. Any data scientist’s introductory “hello world” classifier, it’s also pretty much a solved problem. However, spam may be idiosyncratic to a platform as well, and if Nicr’s experience with sentiment analysis is any indication, it will only become more so with time. To make things worse, it now appears that spammers have been using images to post their spam on Nicr, with obfuscated text in images and watermarked logos of sketchy online businesses. So much of what they had done for their sentiment analysis problem would not be of use: they would face new challenges in tagging images and handling image datasets, managing the performance and hardware issues that come with running ML on images, and more. Should they invest time and resources in creating and standardizing processes to monitor, re-train and deploy models?

It couldn’t be any clearer to Riley that a better solution is needed.

Enter ML Ops

Let’s take a step back and have a look at the bigger picture. Riley and Nicr’s fictional case makes the following point: things get really messy as the use of Machine Learning is extended in application and in time within an organization. Scaling ML models, keeping training data up to date and ensuring models are in tip-top shape is hard, and it becomes harder as the number of models increases and the kinds of problems they’re applied to diversify.

That’s where ML Ops, or Machine Learning Operations, comes in. Think about it this way: nowadays, you wouldn’t dream of scaling a software development effort without practicing proper DevOps. Just as with DevOps for development, a means of standardizing processes and incorporating reliable and flexible automation is the key to success in scaling with Machine Learning.