What is shadow testing for optimization models and decision algorithms?

You’re pretty sure your staging model is ready for production. Pretty sure... But you want to test it under real-world conditions without real-world impact. Shadow testing gives your “understudy” algorithm the stage to prove its mettle.

This post is part of a series of blog posts capturing our thinking on the various types of optimization model testing. Tell us what you think by sending us a message, and make sure to watch our tech talk on this topic!

Shadow testing for decision models uses experimentation and deployment techniques to help decision algorithm developers understand model performance using real-time (or online) data — without impacting production workflows.

For example, let’s say you want to roll out a new vehicle routing algorithm that limits the number of stops a vehicle can service. You can provide that new model and your production model (that doesn't limit stops) with the same online data inputs simultaneously so that they both execute under production conditions. The key factor here is that your new model is running in “the shadows” and therefore disconnected from production systems. 
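To make the "running in the shadows" part concrete, here's a minimal Python sketch of that fan-out. The solve_production and solve_candidate functions are hypothetical stand-ins for your two routing models; only the production result ever reaches downstream systems, while the candidate's output (or failure) stays contained.

```python
import copy
import logging

logger = logging.getLogger("shadow")

# Placeholder solvers: swap in your real production and candidate routing
# models. They exist here only to keep the sketch self-contained.
def solve_production(online_input: dict) -> dict:
    return {"model": "production", "routes": []}

def solve_candidate(online_input: dict) -> dict:
    return {"model": "candidate", "routes": []}

def run_with_shadow(online_input: dict) -> dict:
    """Run both models on the same online input; only production drives operations."""
    production_result = solve_production(online_input)

    # Shadow path: same input, but its result (and any failure) stays contained.
    try:
        shadow_result = solve_candidate(copy.deepcopy(online_input))
        logger.info("captured shadow result: %s", shadow_result)
    except Exception:
        logger.exception("shadow run failed; production flow is unaffected")

    return production_result
```

In practice you'd likely run the shadow solve asynchronously or on separate infrastructure so it can't slow down the production response either.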

Shadow testing is sometimes associated with terms such as shadow traffic, dark launching, canary testing, shadow deployment, and even blue-green deployment. Regardless of the name you give it, shadow testing is a useful way to de-risk algorithm rollout and build confidence in deploying changes to your production systems. Let’s explore this type of testing a bit further.

What’s an example of shadow testing?

Imagine you work at a farm share company that delivers fresh produce (carrots, onions, beets, apples) from local farms to customers’ homes. Your company wants to incorporate more temperature-sensitive products such as goat cheese and milk. Only some of the vehicles in the company’s fleet have refrigeration capabilities. How do operations change if the business upgrades the entire fleet to be cold chain-ready?

You’ve run acceptance tests to see how the model changes using a standard set of inputs for low, medium, and high order volumes. With each test, you validated KPIs for unassigned stops, number of vehicles used, and total time on road — checking for pass/fail results.
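(For reference, an acceptance-style check like that can be as small as a few threshold comparisons. The sketch below is illustrative only; the KPI names and thresholds are invented for the example, not prescribed values.)

```python
# Illustrative acceptance-style KPI check with made-up thresholds.
def check_kpis(kpis: dict) -> bool:
    checks = {
        "unassigned_stops": kpis["unassigned_stops"] == 0,
        "vehicles_used": kpis["vehicles_used"] <= 12,
        "total_time_on_road_hours": kpis["total_time_on_road_hours"] <= 90.0,
    }
    for name, passed in checks.items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
    return all(checks.values())

check_kpis({"unassigned_stops": 0, "vehicles_used": 11, "total_time_on_road_hours": 84.5})
```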

Now you want to gain another level of confidence using real-time production data to find and evaluate edge cases or operational performance issues. It’s time for a shadow test. You’ve got your staging model running alongside your production model. Both use the same production inputs and run at the same frequency (every hour), but only the production model impacts operations. Staging only churns out results to analyze. You set this test to run for two weeks.

During those two weeks, you monitored the results regularly. You saw your staging model run longer than expected in a few places (but you figured out why), you saw both staging and production yield subpar results under some conditions (weekends are tricky), and staging mostly outperformed production across the KPIs you care about (with just a few edge cases where it didn’t).
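A summary like that can come from a small script over the captured runs. The sketch below assumes each run was stored as a record with a model field ("production" or "staging"), a run duration, and a couple of KPIs; the field names are illustrative rather than a prescribed schema.

```python
from statistics import mean

def summarize(runs: list[dict]) -> None:
    """Average run time and KPIs per model from captured shadow test records."""
    for model in ("production", "staging"):
        rows = [r for r in runs if r["model"] == model]
        print(
            f"{model}: avg run time {mean(r['run_seconds'] for r in rows):.1f}s, "
            f"avg unassigned stops {mean(r['unassigned_stops'] for r in rows):.2f}, "
            f"avg vehicles used {mean(r['vehicles_used'] for r in rows):.1f}"
        )

# Toy records just to show the shape of the data.
summarize([
    {"model": "production", "run_seconds": 42.0, "unassigned_stops": 1, "vehicles_used": 12},
    {"model": "staging", "run_seconds": 55.0, "unassigned_stops": 0, "vehicles_used": 11},
])
```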

Based on these results, you’re in a strong position to confidently update the entire vehicle fleet to support refrigerated goods delivery. You get buy-in from the product team and inform other business leaders of the results. Pass the cheese! 

Why do shadow testing?

Shadow testing is important for understanding model performance under real-world conditions over time — because the world isn’t static and things change, especially in relation to the input set you used when you developed your model. Folks in the machine learning world call this data drift. In essence, you want confidence that you aren’t going to break production when you roll out new model changes everywhere. 

By putting a decision model through the shadow test “paces” — same run frequency, input variations, and operational hiccups as production — it’s easier to make an informed decision about a new model rollout and de-risk as needed. There are certain times when you know something different is going to happen, but you don't have insight into what that something is going to be. For example, Valentine’s Day and Super Bowl Sunday tend to be strange days that break the mold in the food delivery space.

How often are you hitting edge cases? Does one bad Tuesday afternoon raise your eyebrow in concern? Is one region getting unexpected results compared to others? Did the system botch the generation of a new input so that no solutions came back? Shadow testing streamlines the process of answering these questions and moving forward on them.

When do you need shadow testing?

You’re ready for shadow testing when you want to test your decision model or algorithm against real-world conditions without the real-world impact. (No need to blindly ship to production. Save your YOLO moments for another project.) Shadow testing is one of the final checks to perform before moving model changes to production. It’s likely that you’re pretty confident about your updated model, but want to really make sure you understand the edges of its performance compared to the production algorithm.

This tends to mean that you’ve already performed batch experiments or acceptance tests to quickly suss out results using a standard set of input files. But now, you’re ready to use online data from production systems over a defined period of time such as days or weeks. It’s likely you’re running multiple shadow tests on several markets or regions you operate in so that you’re accounting for variations within each domain. 
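If it helps to picture that, kicking off per-region tests might look something like the loop below. The launch_shadow_test function and the region names are hypothetical stand-ins for however your team actually starts a test (a script, a scheduler entry, or a platform call).

```python
from datetime import date, timedelta

def launch_shadow_test(**config) -> None:
    # Hypothetical stand-in: in practice this would register the test with
    # your scheduler or testing platform.
    print("launching shadow test:", config)

start = date.today()
end = start + timedelta(weeks=2)

for region in ["pacific-northwest", "new-england", "midwest"]:
    launch_shadow_test(
        region=region,
        baseline="routing-model:production",
        candidate="routing-model:staging",
        frequency_minutes=60,
        start=start.isoformat(),
        end=end.isoformat(),
    )
```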

How is shadow testing traditionally done?

To execute shadow tests, you need two decision models to compare. Both need access to online data inputs. They can run on the same infrastructure or not. For example, some teams prefer to run their production systems in one place and their testing workloads in another. The important piece is that the candidate model must not impact production systems.

Shadow tests run over a period of time, so you need intentional start and end dates or a defined number of runs to complete. That doesn’t mean you can’t end a shadow test early, but ideally you give it days or weeks to run, depending on the nature of your operations. Lastly, you need a way to capture and compare model output to analyze and share the results, ideally through a UI.
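The capture piece is often as simple as appending each paired result to a store until the window closes, then analyzing from there. Here's one hedged sketch that writes JSON Lines; fetch_online_input and the two solve functions are hypothetical placeholders for your data feed and models, and a scheduler (cron or similar) would call record_run on each cycle.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

RESULTS = Path("shadow_results.jsonl")

# Hypothetical placeholders for your online data feed and the two models.
def fetch_online_input() -> dict:
    return {"stops": [], "vehicles": []}

def solve_production(online_input: dict) -> dict:
    return {"model": "production", "routes": []}

def solve_candidate(online_input: dict) -> dict:
    return {"model": "candidate", "routes": []}

def record_run(run_id: int) -> None:
    """Capture one paired production/staging result for later comparison."""
    online_input = fetch_online_input()
    record = {
        "run_id": run_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "production": solve_production(online_input),
        "staging": solve_candidate(online_input),
    }
    with RESULTS.open("a") as f:
        f.write(json.dumps(record) + "\n")

# A scheduler would invoke this once per cycle, e.g. hourly for two weeks
# (24 * 14 = 336 runs), or until the end date passes.
record_run(run_id=0)
```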

Teams that want to run shadow tests often need to build out and manage bespoke code and workflows to accomplish all of this. From standing up dedicated infrastructure to stitching together results, fiddling with feature flag tooling, and being fastidious about which environment is actually impacting production operations (don’t mix them up!) — traditional shadow testing methods are an investment. 

What’s next?

Our first cut at a shadow testing experience in Nextmv is on the way. We’re excited about it because it minimizes manual setup and management, takes the willies out of the infrastructure work, and ties into the larger testing framework we’re actively building out and improving all the time.

If you want to see testing in action, check out this testing walkthrough for a vehicle routing problem (VRP) and then try it yourself with a free Nextmv account. We welcome feedback in our community forum or on a call with our team.
