Problem
Design a background job processing system that reliably executes asynchronous tasks such as sending emails, generating reports, processing images, and syncing data with third-party APIs. The system must guarantee that every job is processed at least once, even in the face of worker crashes.
Requirements
- Job Submission: Producers can enqueue jobs with a type, payload, priority, and optional scheduled execution time (delayed jobs).
- Job Processing: Worker processes pick up jobs and execute them. Multiple workers can run concurrently.
- Retry Logic: Failed jobs are retried with exponential backoff. After a configurable maximum number of retries, the job is moved to a dead-letter queue (DLQ).
- Priority Queues: Support at least 3 priority levels (high, normal, low). Higher-priority jobs are processed before lower-priority ones.
- Job Status Tracking: Provide an API to query the current status of any job (pending, active, completed, failed, dead).
- Concurrency Control: Limit the number of concurrent jobs of a specific type (e.g., max 5 concurrent email sends).
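As a rough illustration of the submission, priority, and retry requirements above, here is a minimal in-memory sketch. The names `Priority`, `InMemoryQueue`, and `next_retry_delay`, and the default base delay, cap, and retry limit, are assumptions made for illustration; they are not part of the problem statement. A real design would back this with durable storage.

```python
import itertools
from enum import IntEnum
from typing import Optional


class Priority(IntEnum):
    # Lower value is served first (hypothetical encoding).
    HIGH = 0
    NORMAL = 1
    LOW = 2


class InMemoryQueue:
    """Toy queue: not durable and not thread-safe, for illustration only."""

    def __init__(self):
        self._jobs = []
        self._seq = itertools.count()

    def enqueue(self, job_type, payload, priority=Priority.NORMAL, run_at=0.0):
        # run_at is the earliest execution time; 0.0 means "run immediately".
        job_id = next(self._seq)
        self._jobs.append((int(priority), run_at, job_id, job_type, payload))
        return job_id

    def dequeue(self, now):
        # Only jobs whose scheduled time has passed are eligible.
        ready = [j for j in self._jobs if j[1] <= now]
        if not ready:
            return None
        # Order by priority, then scheduled time, then submission order.
        job = min(ready, key=lambda j: j[:3])
        self._jobs.remove(job)
        return job


def next_retry_delay(failures, base_s=1.0, cap_s=3600.0, max_retries=5) -> Optional[float]:
    """Delay before the next retry, or None when the job should go to the DLQ.

    failures is the number of failed attempts so far (1 after the first failure).
    """
    if failures > max_retries:
        return None
    # Exponential backoff: base, 2*base, 4*base, ... capped at cap_s.
    return min(base_s * 2 ** (failures - 1), cap_s)
```

With the defaults, the delays after successive failures are 1, 2, 4, 8, and 16 seconds, and the sixth failure sends the job to the DLQ. Adding random jitter to each delay is a common refinement that this sketch leaves out.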
Constraints
- The system should handle 500 jobs per second at peak.
- Jobs may take anywhere from 100 ms to 30 minutes to complete.
- Worker crashes should not cause job loss (at-least-once delivery).
- The system should support at least 100 concurrent workers.
- Job payloads can be up to 1 MB in size.
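The constraints above imply some back-of-envelope numbers that are worth working out before choosing a storage backend. The sketch below assumes each worker runs one job at a time and uses exactly 100 workers; both are simplifying assumptions, since the constraints only give a lower bound on worker count. The function names are hypothetical.

```python
import math

PEAK_JOBS_PER_S = 500   # from the constraints
MAX_PAYLOAD_MB = 1      # from the constraints
WORKERS = 100           # assumed: the stated minimum, one job per worker


def worst_case_ingest_mb_per_s(rate=PEAK_JOBS_PER_S, payload_mb=MAX_PAYLOAD_MB):
    # Upper bound: every job at peak carries a maximum-size payload.
    return rate * payload_mb


def max_mean_duration_s(workers=WORKERS, rate=PEAK_JOBS_PER_S):
    # Little's law (L = lambda * W): with L workers busy and arrival rate
    # lambda, the mean job duration W cannot exceed L / lambda.
    return workers / rate


def workers_needed(rate, mean_duration_s):
    # Same law rearranged: concurrent jobs in flight = rate * duration.
    return math.ceil(rate * mean_duration_s)


print(worst_case_ingest_mb_per_s())  # 500 MB/s if every payload is 1 MB
print(max_mean_duration_s())         # 0.2 s mean duration at 100 workers
print(workers_needed(500, 2))        # 1000 workers if jobs average 2 s
```

Under these assumptions, 100 single-job workers keep up with the peak rate only while the mean job duration stays at or below 0.2 seconds, so a design likely needs either more workers, several concurrent jobs per worker, or a queue deep enough to absorb the peak.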
What to Design
- The job lifecycle and state machine
- The storage backend and why you chose it
- How workers claim jobs without double-processing
- The retry and backoff strategy
- How you detect and handle stalled/stuck jobs
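One possible shape for the lifecycle, claiming, and stall-detection items above is sketched here. It is an in-memory stand-in, not a recommended backend: the transition table, the `JobStore` class, and the 60-second lease are all assumptions for illustration, and the `threading.Lock` stands in for whatever atomic conditional update the chosen storage provides. The status names come from the status-tracking requirement.

```python
import threading
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    ACTIVE = "active"
    COMPLETED = "completed"
    FAILED = "failed"
    DEAD = "dead"


# Assumed transition table; a design may choose a different one.
ALLOWED = {
    Status.PENDING: {Status.ACTIVE},                                   # claimed
    Status.ACTIVE: {Status.COMPLETED, Status.FAILED, Status.PENDING},  # done, error, stalled
    Status.FAILED: {Status.PENDING, Status.DEAD},                      # retry or DLQ
    Status.COMPLETED: set(),
    Status.DEAD: set(),
}


class JobStore:
    def __init__(self, lease_s=60.0):
        self._lock = threading.Lock()
        self._jobs = {}  # job_id -> {"status", "worker", "heartbeat"}
        self.lease_s = lease_s

    def add(self, job_id):
        with self._lock:
            self._jobs[job_id] = {"status": Status.PENDING, "worker": None, "heartbeat": None}

    def status(self, job_id):
        with self._lock:
            return self._jobs[job_id]["status"]

    def _move(self, job, new_status):
        # Caller must hold the lock.
        if new_status not in ALLOWED[job["status"]]:
            raise ValueError(f"illegal transition {job['status']} -> {new_status}")
        job["status"] = new_status

    def claim(self, job_id, worker, now):
        # Check-and-set under one lock, so only one worker can win the claim.
        with self._lock:
            job = self._jobs[job_id]
            if job["status"] is not Status.PENDING:
                return False
            self._move(job, Status.ACTIVE)
            job["worker"], job["heartbeat"] = worker, now
            return True

    def heartbeat(self, job_id, worker, now):
        # Long jobs renew the lease periodically instead of using one long timeout.
        with self._lock:
            job = self._jobs[job_id]
            if job["status"] is Status.ACTIVE and job["worker"] == worker:
                job["heartbeat"] = now
                return True
            return False

    def requeue_stalled(self, now):
        # Active jobs whose lease expired go back to pending for another worker.
        requeued = []
        with self._lock:
            for job_id, job in self._jobs.items():
                if job["status"] is Status.ACTIVE and now - job["heartbeat"] > self.lease_s:
                    self._move(job, Status.PENDING)
                    job["worker"] = None
                    requeued.append(job_id)
        return requeued
```

Requeuing on lease expiry is what gives at-least-once behavior: a worker that was merely slow, not dead, may finish a job that another worker has already picked up again, so job handlers should be idempotent.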