Problem
Your microservices architecture has started experiencing intermittent request timeouts that cascade across services. Investigation reveals a distributed deadlock pattern: Service A holds a lock on Resource X and waits for Service B, while Service B holds a lock on Resource Y and waits for Service A.
The Scenario
You have three services involved in an order fulfillment flow:
- Order Service: Manages orders and orchestrates the fulfillment process.
- Inventory Service: Manages product stock levels with row-level database locks.
- Payment Service: Manages payment processing and holds payment authorization locks.
The fulfillment flow requires:
- For new orders: Lock inventory (reserve items) -> Process payment -> Confirm inventory
- For refunds: Lock payment (reverse charge) -> Release inventory -> Confirm refund
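The two flows above acquire the same pair of locks in opposite orders, which is the precondition for the deadlock described next. A minimal sketch of checking for that, assuming each flow can be reduced to an ordered list of the locks it takes (the flow names, lock names, and the `conflicting_pairs` helper are hypothetical, introduced only for illustration):

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Assumed simplification: each flow is the ordered list of locks it acquires.
FLOWS: Dict[str, List[str]] = {
    "new_order": ["inventory", "payment"],  # lock inventory, then payment
    "refund": ["payment", "inventory"],     # lock payment, then inventory
}

def conflicting_pairs(flows: Dict[str, List[str]]) -> List[Tuple[str, str, str, str]]:
    """Return (flow_a, flow_b, lock_x, lock_y) wherever two flows
    acquire the same two locks in opposite orders."""
    conflicts = []
    for (name_a, order_a), (name_b, order_b) in combinations(flows.items(), 2):
        shared = sorted(set(order_a) & set(order_b))
        for x, y in combinations(shared, 2):
            a_takes_x_first = order_a.index(x) < order_a.index(y)
            b_takes_x_first = order_b.index(x) < order_b.index(y)
            if a_takes_x_first != b_takes_x_first:
                conflicts.append((name_a, name_b, x, y))
    return conflicts

conflicts = conflicting_pairs(FLOWS)
```

An empty result would mean every pair of flows agrees on acquisition order. Here the check reports one conflict between the order and refund flows.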
The Deadlock: During a flash sale, these sequences interleave:
- Thread 1 (new order): Locks Inventory for Product A, then calls Payment Service (which is busy)
- Thread 2 (refund for Product A): Locks Payment authorization, then calls Inventory Service to release stock
- Thread 1 waits for Payment (blocked by Thread 2's payment lock)
- Thread 2 waits for Inventory (blocked by Thread 1's inventory lock)
- Neither thread can proceed, so both stay blocked until their lock timeouts expire after 30 seconds.
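The interleaving above forms a cycle in a wait-for graph: Thread 1 waits on Thread 2, and Thread 2 waits on Thread 1. The sketch below models that graph and finds the cycle with a depth-first search. The node names and the `find_cycle` helper are illustrative assumptions. In a real system the edges would have to be collected from each service's lock and call state, which this example does not attempt.

```python
from typing import Dict, List, Optional

# Hypothetical wait-for graph: each key waits on every node in its value list.
WaitForGraph = Dict[str, List[str]]

def find_cycle(graph: WaitForGraph) -> Optional[List[str]]:
    """Return one cycle as a list of nodes (first node repeated at the end),
    or None if the graph has no cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color: Dict[str, int] = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # nxt is already on the current path, so we closed a cycle.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None

# Edges taken from the scenario: Thread 1 is blocked by Thread 2's payment
# lock, and Thread 2 is blocked by Thread 1's inventory lock.
scenario: WaitForGraph = {"thread-1": ["thread-2"], "thread-2": ["thread-1"]}
cycle = find_cycle(scenario)
```

A cycle in this graph separates a deadlock from a merely slow dependency: a slow Payment Service would produce a wait edge but no path back to the waiter.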
Your Task
- Detect: How would you detect that a distributed deadlock is occurring (not just slow responses)?
- Diagnose: What specific logs, metrics, and traces would you examine?
- Fix: Design at least two architectural solutions to prevent distributed deadlocks.
- Monitor: Build a detection system for distributed deadlocks.
Constraints
- Services communicate via HTTP (REST APIs).
- Each service uses its own PostgreSQL database.
- You cannot merge the services into a monolith.
- The fix must handle the flash sale scenario (high concurrency on the same resources).
- Lock timeouts are currently set to 30 seconds per service.