These are the most common failure patterns in Celery deployments. Sluice helps you detect and diagnose each one.
1. Silent task stalls
A task enters the active state and never completes — no success, no failure, no error. The worker appears healthy but the task hangs forever.
Common causes:
- No task_time_limit configured (Celery default: no timeout)
- Task blocked on an external service that’s unresponsive (database, API, DNS)
- Deadlock in task code
What Sluice shows: Jobs stuck in the active state with increasing duration. Filter by state=active and sort by startedAt ascending to find stalled jobs.
Fix: Set task_time_limit and task_soft_time_limit in your Celery config:
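A minimal sketch, assuming a Celery app backed by Redis; the limit values are illustrative, not recommendations — pick values above your longest expected task runtime:

```python
from celery import Celery

# Assumption: the broker URL is illustrative; point it at your own Redis instance.
app = Celery("tasks", broker="redis://localhost:6379/0")

app.conf.task_soft_time_limit = 240  # raises SoftTimeLimitExceeded inside the task, allowing cleanup
app.conf.task_time_limit = 300       # hard limit: the worker kills and replaces the task's process after 300 s
```

The soft limit gives the task a chance to catch the exception and clean up; the hard limit guarantees the stall can never outlive the timeout.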
2. Worker OOM kills
The Linux OOM killer terminates a worker process. The task that was running is lost — no failure event, no traceback, just silence.
What Sluice shows: The job’s state is active, the worker transitions to offline, and the job never reaches completed or failed. A gap between worker-offline and the job’s last state change is a strong OOM signal.
Fix:
- Set worker_max_memory_per_child to recycle workers before they hit system limits (see the sketch after this list)
- Use task_time_limit to kill tasks that grow unbounded
- Profile your tasks for memory usage
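A minimal sketch of the first two items; the thresholds are illustrative and should be tuned to your host’s memory budget and task profile:

```python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # assumption: illustrative Redis broker URL

app.conf.worker_max_memory_per_child = 200_000  # KiB (~200 MB): recycle the child process once it exceeds this
app.conf.task_time_limit = 300                  # hard-kill tasks that grow unbounded
```

Note that worker_max_memory_per_child recycles the child after the current task finishes, so it complements rather than replaces a hard time limit.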
3. Visibility timeout duplicates
When a task takes longer than Redis’s visibility_timeout (default: 1 hour), the broker assumes the worker died and redelivers the message. The original worker is still running the task — resulting in duplicate execution.
What Sluice shows: Two active events for the same externalId. If you see the same task running concurrently on different workers, it’s a visibility timeout issue.
Fix: Increase visibility_timeout so it exceeds your longest-running task, and only use acks_late=True with careful retry configuration.
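A minimal sketch, assuming a Redis broker; the 4-hour window is illustrative and should comfortably exceed your longest task:

```python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # assumption: illustrative Redis broker URL

# visibility_timeout is in seconds; make it longer than your longest-running task.
app.conf.broker_transport_options = {"visibility_timeout": 14400}  # 4 hours, illustrative
```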
4. Prefetch blindness
Workers prefetch tasks from the broker into memory. These tasks are invisible — the broker reports them consumed, but they haven’t started executing. If the worker dies, all prefetched tasks are lost.
What Sluice shows: Jobs linger in queued through a long pause before transitioning to active, or the gap between task-received and task-started is unusually large.
Fix:
Set worker_prefetch_multiplier = 1 so workers only grab one task at a time.
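A minimal sketch; the trade-off is slightly lower throughput for very short tasks:

```python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # assumption: illustrative Redis broker URL

# Each worker process reserves only the task it can execute right now,
# so a crashed worker loses at most the task it was running.
app.conf.worker_prefetch_multiplier = 1
```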
5. Broker disconnects
When the Redis broker goes down temporarily, workers lose their connection. Celery automatically reconnects, but event consumers may not resume correctly — leading to a gap in monitoring data.
What Sluice shows (agent path): The agent logs reconnection attempts and resumes event subscription after reconnecting. Events during the disconnect window are lost (Redis PUB/SUB limitation), but the agent reconciles state by scanning keys on reconnection.
What Sluice shows (SDK path): The SDK’s event forwarding continues normally after the worker reconnects to Redis. The worker handles reconnection — the SDK captures events from the worker’s perspective.
Fix: The agent handles this automatically with exponential backoff. For prolonged Redis outages, ensure your Celery broker_connection_retry setting is True (default).
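A minimal sketch of the relevant connection-retry settings; these mirror Celery’s defaults and are spelled out only for clarity:

```python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # assumption: illustrative Redis broker URL

app.conf.broker_connection_retry = True             # retry the broker connection after a drop (default)
app.conf.broker_connection_retry_on_startup = True  # Celery 5.3+: also retry while the worker is starting up
app.conf.broker_connection_max_retries = 100        # default retry budget; None retries forever
```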