In organizations across every industry, teams monitor the systems and applications in their tech stacks to ensure they are performing as expected. This practice of monitoring, understanding performance, and detecting root causes has become a critical part of integrating services, giving rise to observability tools like Grafana and Datadog that focus on centralizing logs, metrics, and traces.
In the decision science space, we often monitor the performance of a model in production to make sure that it’s up and running. But what happens when decision quality starts regressing or, worse, the model fails? How do you first get notified about this performance degradation? Is it from business users in the field after it has already caused issues in your real-world operations? How do teams investigate what’s causing the issue? How do you determine whether it’s a problem with the model itself?
Let’s take a look at a few examples inspired by real events:
“Four years ago the data science team handed the algorithm to an engineer, and it got recoded… and implemented. Two weeks ago, they realized that it’s been broken for three and a half years.” Source
A change was made to a parameter setting that brought down a whole line of business. We only caught the mistake after seeing a spike in unassigned stops in production. Read more
On-time delivery suddenly plummeted to zero. We didn’t realize the value of our objective function had started to change drastically before it affected our real-time operations.
These are all situations where practicing observability would have helped teams prevent and minimize that “broken” time. But efficiently implementing observability can be challenging in the operations research space due to a lack of transparency and standardization across the lifecycle of decision models – from research and development through to production.
Why are decision model workflows challenging to observe?
Traditionally, decision models are developed and tested on the local machines of operations researchers, data scientists, and other modelers. This means that operators, eng teams, and even other modelers often do not have any line of sight into model updates. There are no logs, code branches, or versions to review.
When local development wraps, the models are then packaged up and handed off to DevOps teams for deployment to production. A new environment can mean new issues for a model – and with little transparency across teams, the “It was working on my machine…” conversation begins.
After the initial production launch, the teams exclaim: “Hey, we got it off the ground!” The DevOps teams monitor the model-as-a-service for uptime and general performance, but it’s harder to monitor model-level metrics without an understanding of the model itself (e.g., “Are the parameters configured correctly?”). And without a system of record for the model, their trail for investigating root cause gets complicated quickly (e.g., requiring scripts to inspect the impact of parameters), likely resulting in an urgent call to the modeler.
What is observability in the context of decision science?
The exact strategy and methodology for implementing observability depend on the systems you’re observing, how you’ve built them, and the types of questions you need to answer about them. OpenTelemetry, the open source project adopted by tech industry leaders, gives this high-level explanation:
"Observability lets you understand a system from the outside by letting you ask questions about that system without knowing its inner workings. Furthermore, it allows you to easily troubleshoot and handle novel problems, that is, “unknown unknowns”. It also helps you answer the question “Why is this happening?”’
For operations research and decision science, monitoring service-level events such as outages and run-time issues makes sense. But decision models have special characteristics that require an additional layer of monitoring. Perhaps the most critical is decision quality. Similar to what we discussed in this CI/CD post, you have to address the question, “Is this decision good for the business today?”
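As a concrete illustration, here’s a minimal sketch of what that extra layer can look like if you already emit service telemetry with OpenTelemetry: record the objective value and a degradation counter as metrics, right next to your latency and error-rate signals. The meter and metric names, the threshold, and the model version attribute are illustrative assumptions, not a prescribed schema.

```python
# Minimal sketch: recording decision-quality signals alongside standard
# service telemetry using the OpenTelemetry metrics API.
# Metric names, the threshold, and the attributes are illustrative assumptions.
# Note: a configured MeterProvider/exporter is still required for these
# measurements to be shipped anywhere; the default provider is a no-op.
from opentelemetry import metrics

meter = metrics.get_meter("decision.model")

# Histogram of objective values per run, so dashboards can show drift over time.
objective_hist = meter.create_histogram(
    name="decision.objective_value",
    description="Objective function value per solve",
)

# Counter for runs that fall below a business-level acceptance threshold.
degraded_runs = meter.create_counter(
    name="decision.degraded_runs",
    description="Solves whose objective fell below the accepted threshold",
)

def record_run(objective_value: float, model_version: str, threshold: float) -> None:
    """Record decision-quality telemetry for one solve."""
    attributes = {"model.version": model_version}
    objective_hist.record(objective_value, attributes=attributes)
    if objective_value < threshold:
        degraded_runs.add(1, attributes=attributes)
```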
Now that we’ve outlined the challenges and nuances, let’s bring it all back together and outline a few angles (or pillars) to consider when creating observable decision science systems.
Model / decision quality observability:
This is where we monitor performance of the model itself.
Example: The value of our objective function is degrading drastically. What changes were recently made to the model?
Answer questions such as:
- Are we hitting our other business-level KPIs?
- What component of the objective function is causing it to degrade?
- What was the last change made to the model – and who made it?
- How long have we seen this drop in performance?
- What configuration was run?
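One lightweight way to make these questions answerable is to emit a structured record for every solve that captures the objective breakdown, the model version, and the configuration that was actually used. The Python sketch below shows one hypothetical format; the field names, component breakdown, and example values are assumptions to adapt to your own model.

```python
# Sketch: emit one structured record per solve so later investigation can
# answer "what changed, when, and under which configuration?"
# Field names and the objective breakdown are illustrative assumptions.
import json
import sys
from datetime import datetime, timezone

def log_run_record(
    model_version: str,
    config: dict,
    objective_total: float,
    objective_components: dict,
    kpis: dict,
) -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,          # which version produced this run
        "config": config,                        # parameters used for the solve
        "objective": {
            "total": objective_total,
            "components": objective_components,  # e.g., travel time, lateness
        },
        "kpis": kpis,                            # business-level KPIs tracked alongside
    }
    json.dump(record, sys.stdout)
    sys.stdout.write("\n")

# Example usage with made-up numbers:
log_run_record(
    model_version="v1.4.2",
    config={"solve_duration_s": 30, "unassigned_penalty": 100000},
    objective_total=12345.0,
    objective_components={"travel_time": 9800.0, "lateness": 2545.0},
    kpis={"on_time_delivery": 0.97, "unassigned_stops": 2},
)
```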
Infrastructure / service observability:
This is where we monitor performance from a service perspective.
Example: We’re not able to connect to our model. Is our decision service online?
Answer questions such as:
- Why are we getting a connection error?
- Why is the model running for so long?
- Why aren’t we seeing decisions in our downstream systems?
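On the service side, a scheduled probe that checks reachability and latency of the decision service covers the basics. The Python sketch below assumes a hypothetical health endpoint and thresholds; in practice this kind of check usually lives in your existing monitoring stack.

```python
# Sketch: a basic liveness and latency probe for a decision service.
# The endpoint URL, timeout, thresholds, and alert hook are illustrative assumptions.
import time
import urllib.request

HEALTH_URL = "https://decisions.example.com/health"  # hypothetical endpoint
TIMEOUT_S = 5
MAX_LATENCY_S = 2.0

def alert(message: str) -> None:
    # Placeholder: forward to your paging or chat tool of choice.
    print(f"ALERT: {message}")

def check_service() -> None:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=TIMEOUT_S):
            latency = time.monotonic() - start
            if latency > MAX_LATENCY_S:
                alert(f"decision service is slow: health check took {latency:.2f}s")
    except OSError as exc:
        # Covers connection errors, timeouts, and HTTP error responses.
        alert(f"cannot reach decision service: {exc}")

if __name__ == "__main__":
    check_service()
```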
Data / input observability:
This is where we monitor the data coming into the model.
Example: Solution quality is tanking. Has something about the operation changed? Is there more demand?
Answer questions such as:
- Is our input data complete? Are we missing data?
- What’s the quality of the input data used for testing and production?
- Is recent input data represented in our tests? Are we using outdated data?
- Are data points falling outside of known bounds? (e.g., outside a range of latitude and longitude or outside a range of values)
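Many of these input checks can run as a gate before the solve even starts. Here’s a small Python sketch of that idea; the input shape (stops with latitude, longitude, and quantity) and the bounds are illustrative assumptions to adapt to your own schema.

```python
# Sketch: validate decision-model input before solving.
# The input shape and field names are illustrative assumptions.
from typing import Any

LAT_RANGE = (-90.0, 90.0)
LON_RANGE = (-180.0, 180.0)

def validate_input(stops: list[dict[str, Any]]) -> list[str]:
    """Return a list of human-readable issues; an empty list means the input looks sane."""
    issues: list[str] = []
    if not stops:
        issues.append("no stops in input")
    for i, stop in enumerate(stops):
        for field in ("id", "lat", "lon", "quantity"):
            if field not in stop:
                issues.append(f"stop {i}: missing field '{field}'")
        lat, lon = stop.get("lat"), stop.get("lon")
        if lat is not None and not (LAT_RANGE[0] <= lat <= LAT_RANGE[1]):
            issues.append(f"stop {i}: latitude {lat} out of range")
        if lon is not None and not (LON_RANGE[0] <= lon <= LON_RANGE[1]):
            issues.append(f"stop {i}: longitude {lon} out of range")
    return issues

# Example: fail fast (or alert) before the solve starts.
problems = validate_input([{"id": "s1", "lat": 95.0, "lon": 13.4, "quantity": 2}])
if problems:
    print("input validation failed:", problems)
```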
When considering how to answer these types of questions, following a standardized workflow for developing, testing, shipping, and monitoring your decision models is a good place to start.
How does a DecisionOps platform like Nextmv help?
With DecisionOps, model development, testing, and deployment happen collaboratively in a shared space. Teams can see when a new version was created, review experiment results, and update models used in production through CI/CD. Both the model developers and the other stakeholders have insight throughout the entire workflow to better understand how a model is performing at any given time.
- Create a system of record with a centralized view of all models and runs. Allow your entire team to easily access run metadata, history, and logs of your decision app. See who pushed the last executable, when a new version was cut, and incorporate CI/CD.
- Test and tune your models with experimentation features. Perform acceptance tests, scenario tests (simulation), switchback tests, and shadow tests to understand how model changes impact KPIs and derisk rollout to live operations.
- Collaborate with your team and key stakeholders. Point to experiment results and the individual runs in question by sharing a link that includes everything from statistics and charts to inputs.
- Set up alerts via webhooks. Get notified when your model isn’t performing as expected to increase visibility and minimize downtime. (A minimal receiver sketch follows this list.)
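To give a feel for the webhook side, below is a minimal Python sketch of a receiver that turns incoming alert payloads into notifications. The payload fields (run_id, status) are assumptions for illustration; refer to the platform’s webhook documentation for the actual schema it sends.

```python
# Sketch: a tiny webhook receiver that turns alert payloads into notifications.
# The payload fields below are assumptions; check your platform's webhook docs
# for the actual schema it sends.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Hypothetical fields: run_id and status.
        run_id = payload.get("run_id", "unknown")
        status = payload.get("status", "unknown")
        # Forward to chat, paging, or logging here; printing keeps the sketch simple.
        print(f"alert received: run {run_id} reported status '{status}'")
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlertHandler).serve_forever()
```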
Get started
If you want more insight into how your models are performing (from development through to production), DecisionOps is the first step.
Sign up for a free Nextmv account to get started. Have questions? We’d love to help. Reach out via our community forum or contact us directly.