These are the most common failure patterns in Celery deployments. Sluice helps you detect and diagnose each one.
1. Silent task stalls
A task enters the active state and never completes — no success, no failure, no error. The worker appears healthy but the task hangs forever.
Common causes:
- No task_time_limit configured (Celery default: no timeout)
- Task blocked on an external service that’s unresponsive (database, API, DNS)
- Deadlock in task code
What Sluice shows: Jobs stuck in the active state with increasing duration. Filter by state=active and sort by startedAt ascending to find stalled jobs.
Fix: Set task_time_limit and task_soft_time_limit in your Celery config:
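A minimal sketch, assuming a Celery app backed by Redis; the limit values are illustrative, not recommendations — pick values above your longest expected task runtime:

```python
from celery import Celery

# Assumption: the broker URL is illustrative; point it at your own Redis instance.
app = Celery("tasks", broker="redis://localhost:6379/0")

app.conf.task_soft_time_limit = 240  # raises SoftTimeLimitExceeded inside the task, allowing cleanup
app.conf.task_time_limit = 300       # hard limit: the worker kills and replaces the task's process after 300 s
```

The soft limit gives the task a chance to catch the exception and clean up; the hard limit guarantees the stall can never outlive the timeout.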
2. Worker OOM kills
The Linux OOM killer terminates a worker process. The task that was running is lost — no failure event, no traceback, just silence.
What Sluice shows: The job’s state is active, the worker transitions to offline, and the job never reaches completed or failed. A gap between worker-offline and the job’s last state change is a strong OOM signal.
Fix:
- Set worker_max_memory_per_child to recycle workers before they hit system limits (see the sketch after this list)
- Use task_time_limit to kill tasks that grow unbounded
- Profile your tasks for memory usage
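A minimal sketch of the first two items; the thresholds are illustrative and should be tuned to your host’s memory budget and task profile:

```python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # assumption: illustrative Redis broker URL

app.conf.worker_max_memory_per_child = 200_000  # KiB (~200 MB): recycle the child process once it exceeds this
app.conf.task_time_limit = 300                  # hard-kill tasks that grow unbounded
```

Note that worker_max_memory_per_child recycles the child after the current task finishes, so it complements rather than replaces a hard time limit.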
3. Visibility timeout duplicates
When a task takes longer than Redis’s visibility_timeout (default: 1 hour), the broker assumes the worker died and redelivers the message. The original worker is still running the task — resulting in duplicate execution.
What Sluice shows: Two active events for the same externalId. If you see the same task running concurrently on different workers, it’s a visibility timeout issue.
Fix: Increase visibility_timeout so it exceeds your longest-running task, and only use acks_late=True with careful retry configuration.
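A minimal sketch, assuming a Redis broker; the 4-hour window is illustrative and should comfortably exceed your longest task:

```python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # assumption: illustrative Redis broker URL

# visibility_timeout is in seconds; make it longer than your longest-running task.
app.conf.broker_transport_options = {"visibility_timeout": 14400}  # 4 hours, illustrative
```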
4. Prefetch blindness
Workers prefetch tasks from the broker into memory. These tasks are invisible — the broker reports them consumed, but they haven’t started executing. If the worker dies, all prefetched tasks are lost.
What Sluice shows: Jobs linger in queued through a long pause before transitioning to active, or the gap between task-received and task-started is unusually large.
Fix:
Set worker_prefetch_multiplier = 1 so workers only grab one task at a time.
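A minimal sketch; the trade-off is slightly lower throughput for very short tasks:

```python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # assumption: illustrative Redis broker URL

# Each worker process reserves only the task it can execute right now,
# so a crashed worker loses at most the task it was running.
app.conf.worker_prefetch_multiplier = 1
```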
5. Broker disconnects
When the Redis broker goes down temporarily, workers lose their connection. Celery automatically reconnects, but event consumers may not resume correctly — leading to a gap in monitoring data.
What Sluice shows (agent path): The agent logs reconnection attempts and resumes event subscription after reconnecting. Events during the disconnect window are lost (Redis PUB/SUB limitation), but the agent reconciles state by scanning keys on reconnection.
What Sluice shows (SDK path): The SDK’s event forwarding continues normally after the worker reconnects to Redis. The worker handles reconnection — the SDK captures events from the worker’s perspective.
Fix: The agent handles this automatically with exponential backoff. For prolonged Redis outages, ensure your Celery broker_connection_retry setting is True (default).
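A minimal sketch of the relevant connection-retry settings; these mirror Celery’s defaults and are spelled out only for clarity:

```python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # assumption: illustrative Redis broker URL

app.conf.broker_connection_retry = True             # retry the broker connection after a drop (default)
app.conf.broker_connection_retry_on_startup = True  # Celery 5.3+: also retry while the worker is starting up
app.conf.broker_connection_max_retries = 100        # default retry budget; None retries forever
```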