Training
Intermediate
Backend Software Engineer: Fix critical Python bug
Time available: 45 minutes
Skills you'll learn
Incident Management and Response Fundamentals
Troubleshooting Production Issues
Training scores won't be added to your skill profile.
Your Role
On-Call Backend Software Engineer, Reporting Service
Your Goal
Fix a crashing Python aggregator during peak traffic.

Simulation Details: 
Simulation Title: "Backend Software Engineer: Fix critical Python bug"
Simulation Short Description: "Diagnose and fix a production-breaking Python bug under pressure."
Skills assessed in this simulation: "Incident Management and Response Fundamentals, Troubleshooting Production Issues"

Northbeam Analytics is a venture-backed B2B SaaS provider that delivers scheduled and on-demand analytics reports to roughly 180 mid-market and enterprise customers. Report delivery is production-critical: about two-thirds of enterprise tenants run an end-of-day close workflow that must receive reports during a narrow local time window. The failing service is a stateless Python microservice (“reports-service”) that runs in containers behind an internal API gateway and autoscales from a small steady-state footprint to many pods during predictable top-of-hour bursts. Those bursts drive concurrency from a few hundred to nearly a thousand in-flight requests, so request timeouts and crash loops surface immediately as customer-facing 500s.

The reporting pipeline is correctness-sensitive: it reads event data, aggregates rollups (group-by totals and derived metrics), and returns payloads downstream. Customers reconcile totals against ledgers, so any fix must preserve exact aggregation semantics for valid numeric inputs. Observability is standardized: structured JSON logs and a small set of core metrics are available (request 500 rate, TypeError count, p95/p99 latency, and container restarts per minute). SRE and Product are accountable for both operational safety and customer-facing guidance.
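Because the logs are structured JSON, errors can be tallied by pipeline stage with a few lines of Python. The sketch below is illustrative only: the field names (error_type, stage), the function name, and the sample lines are assumptions made up for this example, not the actual reports-service log schema.

```python
import json
from collections import Counter


def count_errors_by_stage(log_lines, error_type="TypeError"):
    """Count log records of one error type, grouped by pipeline stage.

    Assumes each line is a JSON object with hypothetical "error_type"
    and "stage" fields. Lines that are not JSON objects are ignored.
    """
    counts = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON lines
        if not isinstance(record, dict):
            continue
        if record.get("error_type") == error_type:
            counts[record.get("stage", "unknown")] += 1
    return counts


# Made-up sample lines, for illustration only.
SAMPLE_LOGS = [
    '{"level": "error", "error_type": "TypeError", "stage": "merge"}',
    '{"level": "error", "error_type": "TypeError", "stage": "merge"}',
    '{"level": "error", "error_type": "TimeoutError", "stage": "fetch"}',
    "not json at all",
]

if __name__ == "__main__":
    print(count_errors_by_stage(SAMPLE_LOGS).most_common())
```

A tally like this is one way to check whether the TypeError count concentrates in a single stage before committing to a hypothesis.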

You are the on-call backend engineer for reports-service during an active incident. The service returns intermittent 500s during top-of-hour bursts because an unhandled TypeError in the aggregation hot loop causes worker exits and container restarts. You will work in the Code IDE on a single collaborative asset (a small Python project with reportsservice.py, aggregationmodule.py, and testreportsservice.py) using the single editor provided. All coordination happens one-on-one via chat or voice with two NPCs: María González, Senior Product Manager, and Rahul Mehta, the SRE on duty. NPCs cannot edit code; they can only read or comment on it and provide context in one-on-one conversations.
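One plausible way an unhandled TypeError can arise in an aggregation loop is a non-numeric value reaching an addition. The sketch below illustrates that general failure mode only. The function name naive_merge and the event fields (tenant, metric, value) are hypothetical, and the real code in aggregationmodule.py may fail for a different reason.

```python
from collections import defaultdict

# Hypothetical event shapes; all field names are assumptions.
CLEAN_EVENTS = [
    {"tenant": "t1", "metric": "revenue", "value": 10},
    {"tenant": "t1", "metric": "revenue", "value": 5},
]
MIXED_EVENTS = CLEAN_EVENTS + [
    {"tenant": "t1", "metric": "revenue", "value": None},  # explicit null
]


def naive_merge(events):
    """Sum values per (tenant, metric) with no type checks (illustrative)."""
    totals = defaultdict(int)
    for event in events:
        # int += None raises TypeError, which is unhandled here.
        totals[(event["tenant"], event["metric"])] += event.get("value")
    return dict(totals)


def raises_type_error(events):
    """Return True if naive_merge fails with TypeError on these events."""
    try:
        naive_merge(events)
    except TypeError:
        return True
    return False
```

Clean payloads pass through this loop untouched, which is consistent with a bug that appears only when a rare malformed event shows up in high-volume bursts.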

Start by aligning with María: quantify customer impact, confirm non-negotiable correctness constraints, agree on which short-term degradations (if any) are acceptable, and set a clear update cadence and a definition of “safe enough to ship” (stack-trace-aligned root cause, deterministic repro, minimal reversible patch). Then pair with Rahul while you investigate in the IDE: he will share sanitized stack traces and metric summaries so you can form a focused hypothesis about a data-shape/type mismatch that triggers a TypeError only under bursty volume. You have three technical tasks. First, reproduce the failure deterministically by adding a focused regression test that includes mixed-shape payloads (e.g., missing optional metrics, explicit nulls, numeric-as-string). Second, implement the smallest defensible change in aggregationmodule.py (or its call site if truly necessary) that prevents the crash without altering correct totals or adding meaningful hot-path overhead. Third, validate everything by running the updated tests in the IDE.
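A regression test of the kind described here might look like the following sketch. It is not the project's actual test file: aggregate_totals is a hypothetical placeholder standing in for whatever aggregationmodule.py really exposes, the event fields are assumed, and the policy it encodes (skip missing or null values, count numeric strings) is an assumption that would need María's sign-off.

```python
import unittest


def aggregate_totals(events):
    """Placeholder for the function under test (hypothetical name).

    In the real project this would be imported from aggregationmodule.py.
    Assumed policy: skip missing/null values, coerce numeric strings.
    """
    totals = {}
    for event in events:
        value = event.get("value")
        if isinstance(value, str):
            try:
                value = int(value)
            except ValueError:
                continue
        if not isinstance(value, (int, float)):
            continue
        totals[event["metric"]] = totals.get(event["metric"], 0) + value
    return totals


CLEAN = [{"metric": "revenue", "value": 10}, {"metric": "revenue", "value": 5}]


class MixedShapePayloadTest(unittest.TestCase):
    def test_clean_payload_totals_unchanged(self):
        self.assertEqual(aggregate_totals(CLEAN), {"revenue": 15})

    def test_mixed_shapes_do_not_raise(self):
        malformed = {
            "missing value": {"metric": "revenue"},
            "explicit null": {"metric": "revenue", "value": None},
            "numeric string": {"metric": "revenue", "value": "7"},
        }
        for label, event in malformed.items():
            with self.subTest(shape=label):
                try:
                    totals = aggregate_totals(CLEAN + [event])
                except TypeError as exc:
                    self.fail(f"TypeError on {label}: {exc}")
                # Valid numeric inputs must still be fully counted.
                self.assertGreaterEqual(totals["revenue"], 15)

    def test_numeric_string_is_counted(self):
        events = CLEAN + [{"metric": "revenue", "value": "7"}]
        self.assertEqual(aggregate_totals(events), {"revenue": 22})
```

Run against the unpatched code, the mixed-shape test should fail with the same TypeError seen in the stack traces; run with python -m unittest after the patch, all three tests should pass. Using subTest keeps each payload shape reported separately, so a partial fix is easy to spot.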

After tests pass, return to María with a business-friendly explanation of the root cause and your confidence level, based on stack-trace alignment and the deterministic repro. Then align with Rahul on a cautious rollout: conservative canary exposure, concrete metrics to watch, and explicit rollback triggers tied to TypeError count, request 500 rate, container restarts per minute, and p95 latency. Remember that all communication is one-on-one, that you must not use external tools, and that only the single in-IDE editor and codebase are available for reproducing and fixing the issue.
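Explicit rollback triggers are easier to agree on when they are written down as concrete comparisons. The sketch below shows one way to express them for the four metrics named above. The class and function names are hypothetical, and every threshold value is a placeholder to be replaced with numbers agreed with Rahul.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CanaryMetrics:
    type_error_count: int
    request_500_rate: float      # fraction of requests, 0.0 to 1.0
    restarts_per_minute: float
    p95_latency_ms: float


# Placeholder limits only; real values come from the rollout agreement.
PLACEHOLDER_LIMITS = CanaryMetrics(
    type_error_count=0,
    request_500_rate=0.01,
    restarts_per_minute=0.0,
    p95_latency_ms=800.0,
)


def rollback_reasons(observed, limits=PLACEHOLDER_LIMITS):
    """Return one message per metric that exceeds its limit."""
    reasons = []
    for field in fields(CanaryMetrics):
        seen = getattr(observed, field.name)
        limit = getattr(limits, field.name)
        if seen > limit:
            reasons.append(f"{field.name}={seen} exceeds limit {limit}")
    return reasons


if __name__ == "__main__":
    canary = CanaryMetrics(3, 0.002, 1.0, 450.0)
    print(rollback_reasons(canary) or "canary healthy")
```

An empty list means the canary can keep its current exposure; any entry is a reason to roll back rather than debate.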

You confirm impact and product guardrails with María and define what “safe enough to ship” looks like for Product.

You obtain sanitized stack traces and metric summaries from Rahul and form an evidence-backed hypothesis that points to the aggregation merge stage.

You add a deterministic regression test in testreportsservice.py that reproduces the TypeError using mixed-shape event payloads, and you keep existing clean-payload tests passing.

You implement a minimal, reversible code change to aggregationmodule.py (or its call site if required) that prevents the TypeError while preserving exact aggregation behavior for valid numeric inputs and avoiding meaningful hot-path overhead.

You validate the fix by running tests in the IDE and get agreement from Rahul on a conservative canary rollout and explicit rollback triggers (TypeError count, 500 rate, restarts per minute, p95 latency).

You brief María with a concise, evidence-based summary of root cause, confidence level, and safe customer messaging aligned to the agreed rollout plan.

All work is done through one-on-one conversations with María and Rahul and by editing the single collaborative code asset in the provided editor.
Helpful for
On-Call Backend Software Engineer, Reporting Service, Site Reliability Engineer, Senior Product Manager, Analytics Reporting