Sidekiq Background Jobs Patterns

Sidekiq is deceptively simple to start with — drop a job class in app/jobs, call perform_async, and work moves to the background. But the gap between "it works in development" and "it works reliably at scale in production" is where most teams lose days or weeks to subtle failures.

This guide covers the patterns and trade-offs that matter once your background jobs are carrying real weight: queue design that scales, retry behaviour and how to configure it without creating thundering herds, idempotency as a non-negotiable design constraint, job prioritisation strategies, error handling that surfaces problems instead of hiding them, dead job management, Redis configuration specifically for Sidekiq, monitoring job health, and the deploy-time concerns that bite you exactly once before you learn to plan for them.

It connects to the broader Rails deployment topic, because background jobs are a deployment concern as much as a code concern — they run on your infrastructure, compete for your database connections, and fail in ways that are invisible to your users until the downstream consequences appear. I have run Sidekiq in production for applications processing anywhere from a few hundred jobs per day to several million, and the patterns in this guide come from things that actually broke.
Queue design
Sidekiq processes jobs from one or more named queues. The default setup — a single default queue for everything — works until it does not, and the failure mode is instructive.
Imagine you have two types of jobs: sending password reset emails (must complete within seconds) and generating monthly PDF reports (takes 30 seconds each). Both run on the default queue. When the report generation job runs for 200 users at the end of the month, it fills the queue. Password reset emails queue behind the reports and users wait minutes for an email that should arrive in seconds.
The fix is queue separation by latency requirement:
```yaml
# config/sidekiq.yml
:queues:
  - [critical, 6]
  - [default, 3]
  - [low, 1]
```
The numbers are weights, not strict priorities. A weight of 6 means Sidekiq checks the critical queue six times for every one time it checks low. This prevents starvation of low-priority queues while giving high-priority work preferential treatment.
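Sidekiq's weighted fetch can be modelled roughly as follows: each queue name appears weight-many times in a candidate list, which is shuffled before each fetch, so higher-weighted queues are checked first more often without ever excluding the others. A sketch of that model (not Sidekiq's actual internals, just the proportional behaviour):

```ruby
# Each queue appears weight-many times, so after a shuffle the critical
# queue lands at the front of the fetch order most often.
weights = { "critical" => 6, "default" => 3, "low" => 1 }
candidates = weights.flat_map { |queue, weight| [queue] * weight }

candidates.size               # => 10
candidates.count("critical")  # => 6
candidates.shuffle.uniq       # one possible fetch order for this pass
```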
A practical queue structure for most Rails applications:
- critical: password resets, payment confirmations, two-factor codes — anything a user is actively waiting for
- default: standard application work — webhooks, notification delivery, data syncing
- low: reports, analytics, batch processing, cleanup tasks — work that can wait minutes or hours
Do not create a queue per job class. That scales poorly and makes configuration unwieldy. Group by latency requirement, not by function. A "send email" job and a "create audit log" job that both need to run within seconds belong in the same queue, even though they do unrelated things.
One thing I have seen trip up multiple teams: creating a queue in your job class but forgetting to add it to sidekiq.yml. Sidekiq only processes queues it is told to process. Jobs enqueued to an unlisted queue sit there silently, forever, until someone notices the work is not happening. Check your queue list against your job classes periodically.
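A lightweight guard against this is to diff the queues your jobs reference against the queues sidekiq.yml lists. A sketch, with illustrative queue names; in a real app you would collect `used` from your job classes (for example by scanning descendants of Sidekiq::Job for their configured queue):

```ruby
require "yaml"

# Queues Sidekiq is configured to process...
config = YAML.safe_load(<<~YML)
  queues:
    - [critical, 6]
    - [default, 3]
    - [low, 1]
YML
configured = config["queues"].map { |name, _weight| name }

# ...versus queues jobs actually enqueue to ("reports" was never added).
used = %w[critical default low reports]

orphaned = used - configured  # => ["reports"], jobs there sit unprocessed
```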
Retry behaviour
Sidekiq retries failed jobs with exponential backoff by default: 25 retries over approximately 21 days. The formula is (retry_count ** 4) + 15 + (rand(10) * (retry_count + 1)) seconds. The first retry happens within roughly 15-25 seconds. The 25th retry happens about 21 days after the initial failure.
This default is generous. Too generous for most applications. A job that fails 25 times is almost certainly not going to succeed on the 26th attempt. Meanwhile, it has been consuming retry slots and Redis memory for three weeks.
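The schedule is easy to sanity-check by evaluating the formula quoted above with the jitter term omitted:

```ruby
# Sidekiq's documented backoff with the random jitter dropped:
# delay = (retry_count ** 4) + 15 seconds
def backoff(retry_count)
  (retry_count ** 4) + 15
end

backoff(0)                   # first retry: 15 seconds after the failure
backoff(9)                   # 10th retry: ~1.8 hours after the 9th
total = (0..24).sum { |c| backoff(c) }
(total / 86_400.0).round(1)  # ~20.4 days across all 25 retries
```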
Tune retries per job class based on the failure mode you expect:
```ruby
class PaymentWebhookJob
  include Sidekiq::Job
  sidekiq_options retry: 10 # external service might be temporarily down

  def perform(webhook_id)
    # ...
  end
end

class ReportGenerationJob
  include Sidekiq::Job
  sidekiq_options retry: 3 # if it fails three times, it needs human attention

  def perform(report_id)
    # ...
  end
end
```
Jobs that call external APIs benefit from more retries because transient network failures are common. Jobs that process local data and fail due to bugs benefit from fewer retries because retrying a bug does not fix it — it just delays the notification that something is wrong.
The randomised jitter in the backoff formula prevents thundering herds — if 500 jobs fail simultaneously (say, because a database went down for a minute), the retries spread across time rather than all hitting at once when the database comes back. Do not override the backoff formula to use fixed intervals unless you have a specific reason and understand this trade-off.
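To see the jitter's effect, simulate 500 jobs that all fail at the same retry count; with the rand term from the formula quoted earlier, the retries land across a window of seconds rather than at a single instant:

```ruby
# The documented backoff formula, jitter included.
def retry_delay(retry_count)
  (retry_count ** 4) + 15 + (rand(10) * (retry_count + 1))
end

delays = Array.new(500) { retry_delay(3) }
delays.minmax     # spread across roughly 96..132 seconds
delays.uniq.size  # many distinct landing times, not one spike
```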
Idempotency
Idempotency means running a job twice produces the same result as running it once. This is not a nice-to-have in Sidekiq. It is a requirement.
Sidekiq uses Redis for job storage. Redis is fast but it is not a transactional database. Under certain conditions — a worker crash after completing work but before acknowledging the job, a Redis failover, a deploy that kills workers holding in-progress jobs — Sidekiq will re-execute a job that already partially or fully completed. The guarantee is at-least-once delivery, not exactly-once.
Making jobs idempotent:
Use database uniqueness constraints. If a job creates a record, add a unique index on the natural key so a duplicate execution raises an error you can catch gracefully instead of creating duplicate data.
Check before acting. Before sending an email, check whether it was already sent. Before charging a payment, check whether the charge already exists. Before updating a status, check whether it is already in the target state.
```ruby
class SendInvoiceEmailJob
  include Sidekiq::Job

  def perform(invoice_id)
    invoice = Invoice.find(invoice_id)
    return if invoice.email_sent?

    InvoiceMailer.send_invoice(invoice).deliver_now
    invoice.update!(email_sent_at: Time.current)
  end
end
```
Use idempotency keys for external API calls. Stripe, PayPal and most payment providers accept an idempotency key that ensures a request is processed only once, regardless of how many times you send it. Always pass one.
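One simple approach, sketched below with a hypothetical helper, is to derive the key deterministically from the job's identity, so every retry of the same logical operation sends the same key and the provider deduplicates for you:

```ruby
require "digest"

# Hypothetical helper: a stable key from the job class and its arguments.
# Retries of the same job produce the same key; different jobs do not.
def idempotency_key(job_class, *args)
  Digest::SHA256.hexdigest("#{job_class}:#{args.join(':')}")[0, 32]
end

idempotency_key("ChargeInvoiceJob", 42)  # same value on every retry
```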
The cost of non-idempotent jobs is real. Double-charged customers, duplicate emails, corrupted data aggregations. And the trigger is not exotic — a routine deploy that restarts workers is enough to cause duplicate execution of in-progress jobs.
Job prioritisation
Prioritisation is not just about queue ordering. It is about deciding what work matters when your system is under pressure.
When everything is running smoothly, prioritisation is invisible — all jobs process quickly regardless of queue weight. Prioritisation becomes critical during backlogs: traffic spikes, outage recovery, or when a slow external dependency causes jobs to pile up.
Principles that hold up in practice:
User-facing work first. Any job where a user is actively waiting for the result (password reset email, real-time notification, payment confirmation) gets the highest priority. The user's experience degrades immediately when these are delayed.
Data integrity second. Jobs that maintain consistency between systems (webhook delivery, sync operations, audit logging) get medium priority. Delays are acceptable but data loss is not.
Analytics and reporting last. Aggregations, report generation, and non-time-sensitive batch processing can wait. If the system is under pressure, these can pause entirely without user impact.
Do not use Sidekiq's strict priority ordering (-q critical -q default -q low without weights) unless you genuinely want lower queues to starve completely while higher queues have work. In most applications, weighted ordering is more appropriate because even low-priority work should make progress.
Error handling
Sidekiq's default error handling is to catch the exception, log it, and schedule a retry. That is fine for transient errors but inadequate for errors that need human attention.
Layer your error handling:
Application-level rescue in the job. Catch specific exceptions you know how to handle. Let everything else propagate to Sidekiq's retry mechanism.
```ruby
class ExternalApiJob
  include Sidekiq::Job
  sidekiq_options retry: 5

  def perform(record_id)
    record = Record.find(record_id)
    ExternalApi.sync(record)
  rescue ExternalApi::RateLimitError
    # Re-raise to trigger retry with backoff
    raise
  rescue ExternalApi::InvalidDataError => e
    # This won't fix itself on retry — log and move on
    Rails.logger.error("Invalid data for record #{record_id}: #{e.message}")
    record.update!(sync_status: 'failed', sync_error: e.message)
    # Don't re-raise — no point retrying
  end
end
```
Death handlers for dead jobs. When a job exhausts its retries and moves to the dead set, Sidekiq fires a death handler. Use it to alert your team.
```ruby
Sidekiq.configure_server do |config|
  config.death_handlers << ->(job, ex) do
    ErrorNotifier.notify(
      "Sidekiq job #{job['class']} died after #{job['retry_count']} retries",
      exception: ex,
      job_args: job['args']
    )
  end
end
```
Error tracking integration. Your error tracker (Sentry, Honeybadger) should capture Sidekiq failures automatically — most provide a Sidekiq middleware for this. Make sure the integration includes job arguments and queue name in the error context so you can debug without guessing.
Dead job management
Dead jobs are jobs that exhausted their retries without succeeding. By default, Sidekiq keeps 10,000 dead jobs for 6 months. These are not gone — they are sitting in Redis, consuming memory, waiting for you to either retry them manually or delete them.
Ignoring the dead set is how teams lose data. A bug in a webhook processing job causes 3,000 jobs to die over a week. Nobody checks. Three weeks later, someone notices that partner data is stale. The dead jobs contain the webhook payloads needed to fix it, but only if someone looks.
Operationalise dead job management:
- Monitor dead set size. Alert when it grows above a baseline. A sudden spike usually means a new bug or an external dependency failure.
- Review dead jobs weekly. Categorise them: bug (needs a code fix then bulk retry), external failure (retry once the dependency is back), data issue (needs manual investigation).
- Bulk retry after fixes. After deploying a fix for a bug that caused job deaths, retry the dead jobs from the Sidekiq Web UI or programmatically with Sidekiq::DeadSet.new.each(&:retry). But only if the jobs are idempotent — and now you see why that section came first.
- Purge old dead jobs intentionally. Do not let them accumulate to the 10,000 limit and silently push out newer dead jobs. Set a review cadence and clear what you have triaged.
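Triage can be scripted. A sketch that buckets dead jobs by error class; the entries here are plain hashes shaped like the job hashes Sidekiq stores, whereas in a console you would iterate Sidekiq::DeadSet.new:

```ruby
# Illustrative entries with the fields Sidekiq records for a dead job.
dead_jobs = [
  { "class" => "WebhookJob", "error_class" => "NameError" },
  { "class" => "WebhookJob", "error_class" => "Net::ReadTimeout" },
  { "class" => "ReportJob",  "error_class" => "NameError" },
]

# NameError buckets point at a bug to fix and then bulk retry;
# timeout buckets point at an external dependency to wait out.
counts = dead_jobs.group_by { |j| j["error_class"] }
                  .transform_values(&:count)
# => {"NameError"=>2, "Net::ReadTimeout"=>1}
```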
Redis configuration for Sidekiq
Sidekiq stores all job data — enqueued jobs, in-progress jobs, scheduled jobs, retry sets, dead sets and statistical counters — in Redis. Your Redis configuration directly affects Sidekiq's reliability and performance.
Memory. Sidekiq's Redis usage scales with your job volume and argument size. A job with small arguments (a single integer ID) uses roughly 1-2 KB in Redis. A job with large serialised arguments can use much more. For most applications, 256 MB to 1 GB of Redis memory is sufficient. Monitor usage with INFO memory and set maxmemory with an eviction policy of noeviction — Sidekiq must never have its data evicted.
```conf
# redis.conf
maxmemory 512mb
maxmemory-policy noeviction
```
Persistence. Use Redis with AOF (Append Only File) persistence enabled if job loss on restart is unacceptable. RDB snapshots are faster but can lose up to the last snapshot interval of data. AOF with appendfsync everysec is a reasonable balance between durability and performance.
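In redis.conf terms, that looks like the following (these are the standard Redis directives for enabling AOF with per-second fsync):

```conf
# redis.conf
appendonly yes
appendfsync everysec
```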
Connection pooling. Each Sidekiq server process opens multiple connections to Redis (default: concurrency + 5). If you run 4 Sidekiq processes with 10 threads each, that is 60 Redis connections from Sidekiq alone. Add your Rails application's connections for enqueuing jobs, and you need to ensure your Redis server can handle the total connection count.
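The arithmetic is worth writing down when capacity planning; a sketch using the figures above:

```ruby
# Connections per Sidekiq server process: concurrency + 5 (Sidekiq's
# documented server requirement), times the number of processes.
processes   = 4
concurrency = 10
sidekiq_connections = processes * (concurrency + 5)  # => 60
```

Add the web/application processes' enqueue-side pools on top of this figure when sizing the Redis server's maxclients.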
Separate Redis instances. Do not share a Redis instance between Sidekiq and volatile caching. A cache eviction policy (allkeys-lru) will destroy your job data. Use a dedicated Redis instance for Sidekiq with noeviction and a separate instance for caching.
Monitoring job health
You cannot manage what you do not measure. Sidekiq provides metrics that tell you whether your background processing is healthy, degraded or failing.
Key metrics to monitor:
- Queue latency: the time between when a job is enqueued and when it starts executing. High latency means workers cannot keep up with job volume. This is the single most important metric because it directly represents how far behind you are.
- Queue depth: the number of jobs waiting in each queue. A growing queue that never shrinks means you need more workers or faster jobs.
- Processing rate: jobs processed per second. This should be roughly stable during normal operation. A drop indicates worker problems.
- Failure rate: the percentage of jobs that fail. A sudden spike indicates a new bug or external dependency issue.
- Dead set size: the number of jobs that exhausted retries. Growth here means unresolved problems.
- Redis memory usage: approaching the memory limit means risk of job loss or write failures.
Sidekiq Pro and Enterprise include built-in metrics. For the open-source version, export metrics using the Sidekiq API:
```ruby
stats = Sidekiq::Stats.new
stats.processed  # total jobs processed
stats.failed     # total jobs failed
stats.enqueued   # current total enqueued across all queues
stats.dead_size  # current dead set size

Sidekiq::Queue.new("default").latency  # seconds since oldest job was enqueued
```
Pipe these into your monitoring system (Prometheus, Datadog, Grafana) and set alerting thresholds. Queue latency above 30 seconds deserves investigation. Dead set growth above 10 per day deserves a review.
Deploy-time considerations
Deploying a new version of your application while Sidekiq is processing jobs creates a window where old code and new code coexist. This is where a surprising number of job-related bugs originate.
Graceful shutdown. When you deploy, send Sidekiq a TSTP signal (quiet) to stop fetching new jobs, wait for in-progress jobs to finish (up to the timeout), then send TERM to shut down. The default timeout is 25 seconds. If your longest-running job takes 60 seconds, increase the timeout or redesign the job to checkpoint its progress.
```ini
# systemd service file
# TSTP quiets Sidekiq (stop fetching new jobs); after ExecStop returns,
# systemd sends TERM and Sidekiq has up to TimeoutStopSec to finish.
# Keep TimeoutStopSec above Sidekiq's own shutdown timeout (default 25s).
ExecStop=/bin/kill -TSTP $MAINPID
TimeoutStopSec=30
```
Job argument compatibility. If you rename a job class or change its arguments in a deploy, any jobs enqueued by the old code will fail when the new code tries to deserialise them. The safe pattern: deploy argument changes in two phases. First, deploy new code that accepts both old and new argument formats. Second, after all old jobs have drained, remove the old format support.
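Phase one can look like this sketch (class and argument names are illustrative): the job tolerates both shapes until old-format jobs drain, remembering that Sidekiq round-trips arguments through JSON, so hashes arrive with string keys.

```ruby
# Hypothetical phase-one job: accept the old positional flag and the new
# options-hash format at the same time.
class SyncUserJob
  # include Sidekiq::Job  # in the real app

  def perform(user_id, options = nil)
    options = { "force" => options } unless options.is_a?(Hash)
    [user_id, options]  # stand-in for the real work
  end
end

SyncUserJob.new.perform(7, true)                 # old enqueue format
SyncUserJob.new.perform(7, { "force" => true })  # new enqueue format
```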
Database migrations and jobs. If your deploy includes a migration that adds a column, and a job in the same deploy references that column, the job may execute before the migration runs. Run migrations before restarting Sidekiq, or write jobs defensively so they handle missing columns gracefully.
Queue draining before shutdown. For major changes (job class renames, queue restructuring), consider draining all queues before deploying. This eliminates the old-code/new-code coexistence problem entirely. It is slow but safe.
What usually goes wrong
Not making jobs idempotent. Then a deploy or Redis hiccup causes duplicate execution and corrupted data. This is the number one Sidekiq mistake across every codebase I have reviewed.
Putting everything in one queue. Then a batch of slow report jobs blocks password reset emails. Users complain. You add queues under pressure, which is not when you make your best architectural decisions.
Passing large objects as job arguments. Serialising an entire ActiveRecord object into a Sidekiq job argument bloats Redis memory and creates stale-data bugs — the object's state at enqueue time may differ from its state at execution time. Pass IDs and re-fetch from the database.
Ignoring the dead set for weeks. Failed jobs accumulate silently until someone notices missing data downstream.
Sharing Redis with volatile caching. The cache eviction policy deletes Sidekiq's job data. Angry debugging follows.
Not accounting for deploy-time job compatibility. A renamed job class causes NameError for every in-flight job. The dead set fills with jobs that need manual re-enqueuing after the rename is deployed.
Checklist summary
- Separate queues by latency requirement: critical, default, low
- Tune retry counts per job class based on expected failure modes
- Make every job idempotent — check before acting, use database constraints, use idempotency keys for external APIs
- Set up death handlers to alert on dead jobs
- Monitor queue latency, depth, failure rate and dead set size
- Use a dedicated Redis instance for Sidekiq with a noeviction policy
- Configure graceful shutdown in your process manager with adequate timeout
- Deploy argument changes in two phases to avoid deserialisation failures
- Run database migrations before restarting Sidekiq
- Pass record IDs as job arguments, not serialised objects
- Review the dead set weekly and bulk retry after deploying fixes
Frequently asked questions
How many Sidekiq threads should I run?
Start with the default of 10 threads per process. If your jobs are I/O-bound (API calls, email delivery, database queries), you can increase to 15-25 threads. If jobs are CPU-bound (PDF generation, image processing), reduce to 5 or fewer. Monitor CPU and memory usage per process — if a Sidekiq process exceeds 80% CPU, you have too many threads.
Should I use Sidekiq Pro or Enterprise?
Sidekiq Pro is worth it once you rely on background jobs for business-critical operations. Reliable fetch (super_fetch), batches, and the improved Web UI justify the cost. Enterprise adds rate limiting, periodic jobs and rolling restarts, which matter at higher scale. The open-source version is excellent for starting out, but do not wait until you hit a limitation to evaluate Pro.
How do I handle jobs that take longer than the shutdown timeout?
Either increase the shutdown timeout to accommodate the longest job, or redesign long jobs to checkpoint progress. A job that processes 10,000 records can save its position every 100 records and resume from the checkpoint if restarted. This makes the job restartable without repeating work.
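A sketch of the checkpoint pattern; the checkpoint store here is a plain hash standing in for a database column or Redis key:

```ruby
# Process records in slices, persisting a cursor after every slice so a
# restarted job resumes where it left off instead of starting over.
class CheckpointedBatch
  BATCH = 100

  def initialize(store)
    @store = store  # stand-in for a persisted checkpoint
  end

  def perform(ids)
    cursor = @store.fetch(:cursor, 0)
    ids.drop(cursor).each_slice(BATCH) do |slice|
      # real processing of the slice would go here
      cursor += slice.size
      @store[:cursor] = cursor  # checkpoint after each slice
    end
    cursor
  end
end

CheckpointedBatch.new({}).perform((1..250).to_a)               # => 250
CheckpointedBatch.new({ cursor: 100 }).perform((1..250).to_a)  # resumes at 101
```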
What happens if Redis goes down while Sidekiq is running?
Sidekiq workers lose the ability to fetch new jobs and acknowledge completed ones. In-progress jobs continue executing, but their completion may not be recorded. When Redis comes back, some jobs may re-execute. This is another reason idempotency is non-negotiable.
Can I use Sidekiq with SQLite instead of Redis?
No. Sidekiq requires Redis. If you want a database-backed job queue, look at GoodJob (PostgreSQL-backed) or SolidQueue (included in Rails 8). Both eliminate the Redis dependency at the cost of different performance characteristics and feature sets.
Related reading
- Rails Deployment — the parent topic covering the full deployment surface including background job infrastructure
- Deploy Ruby on Rails on a VPS — server setup including Sidekiq process management with systemd
- Debugging Production Rails Issues — diagnosing background job failures in the context of production incidents
- PostgreSQL Indexing for Rails — indexing strategies that keep your database-heavy jobs fast